Re: [boost] Boost.Unicode (was Re: Boost.Locale)

15 Dec 2010


      On Wed, Dec 15, 2010 at 2:15 PM, Mathias Gaunard
<mathias.gaunard@ens-lyon.org> wrote:
...
On 15/12/2010 08:20, Matus Chochlik wrote:
The interface is modeled after that of standard algorithms, and therefore it
takes an output iterator to write the output to, rather than creating a
container directly.
// ws is a std::string (utf-8) or std::wstring (utf-16 or utf-32).
std::basic_string<TCHAR> out;
utf_transcode<TCHAR>(ws, std::back_inserter(out));
AnotherWinapiFunc(..., out.c_str(), ...);
Assuming TCHAR is either char or wchar_t, this should work out of the box.
The fact it takes an output iterator is quite practical, as you can easily
do two passes for example, one to count how many characters you need, and
one to copy that data.
Or you can just grow the container as you add elements, as
std::back_inserter does.
Something like
convert_to<std::basic_string<TCHAR>>(utf_transcode<TCHAR>(ws)).c_str()
would also work, but that's maybe a bit verbose.
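
A minimal sketch of the two-pass idea described above, for the UTF-8 to UTF-16 case. It does not use Boost.Unicode; utf8_to_utf16, counting_iterator and to_utf16_two_pass are hypothetical names chosen for illustration, and the input is assumed to be well-formed UTF-8 (no validation is done). The same algorithm is run once into an output iterator that only counts, then once into a buffer of exactly that size.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <string>

// Hypothetical helper (not Boost.Unicode): decodes well-formed UTF-8 and
// writes UTF-16 code units to any output iterator.
template <class OutputIt>
OutputIt utf8_to_utf16(const std::string& in, OutputIt out) {
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b = static_cast<unsigned char>(in[i]);
        // Sequence length taken from the lead byte; input assumed well formed.
        const std::size_t len = b < 0x80 ? 1 : b < 0xE0 ? 2 : b < 0xF0 ? 3 : 4;
        assert(i + len <= in.size());
        std::uint32_t cp = len == 1 ? b : b & (0xFF >> (len + 1));
        for (std::size_t k = 1; k < len; ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(in[i + k]) & 0x3F);
        i += len;
        if (cp < 0x10000) {
            *out++ = static_cast<char16_t>(cp);
        } else {
            cp -= 0x10000;  // encode as a surrogate pair
            *out++ = static_cast<char16_t>(0xD800 + (cp >> 10));
            *out++ = static_cast<char16_t>(0xDC00 + (cp & 0x3FF));
        }
    }
    return out;
}

// Output iterator that discards values and counts the writes.
struct counting_iterator {
    using iterator_category = std::output_iterator_tag;
    using value_type = void;
    using difference_type = std::ptrdiff_t;
    using pointer = void;
    using reference = void;

    std::size_t* count;  // shared counter, so copies made by out++ still count

    counting_iterator& operator*() { return *this; }
    counting_iterator& operator++() { return *this; }
    counting_iterator operator++(int) { return *this; }
    template <class T>
    counting_iterator& operator=(const T&) { ++*count; return *this; }
};

// Convenience wrapper: pass 1 counts code units, pass 2 fills the buffer.
inline std::u16string to_utf16_two_pass(const std::string& in) {
    std::size_t n = 0;
    utf8_to_utf16(in, counting_iterator{&n});
    std::u16string out(n, u'\0');
    utf8_to_utf16(in, out.begin());
    return out;
}
```

Passing std::back_inserter(some_u16string) to utf8_to_utf16 instead gives the grow-as-you-go variant, and a wrapper like to_utf16_two_pass is one possible shape for the "returns a container" convenience form.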
My point is that many times you need to do this kind of conversion once or
twice in the whole application code, i.e. when you have to use a library that
does not play well with the WinAPI's character type switching. Besides
the low-level tools, which are cool if you need efficiency, it would be good
to have some syntactic-sugar wrappers on top of them for situations where
clarity and brevity of the code are more important.

Currently I use a wrapper around the MultiByteToWideChar / WideCharToMultiByte
functions that does the conversion and cleans up afterwards, and I copy the code
whenever I start a new project, but I was looking for something more
"standardized" and portable.
...
...
Another thing is some kind of adaptor for std::(w)string providing
begin()/end() functions returning an iterator that traverses the code points
instead of the UTF-XY "chars", i.e. in C++0x:
std::string s = get_utf8_string();
auto as = adapt(s);
auto i = as.begin(), e = as.end();
while(i != e)
{
   char32_t c = *i;
Replace adapt(s) by utf_decode(s)
Great, this is what I've been looking for.
...
...
   ...
   *i = transform(c);
No, you can't do that.
Data accessed like this is immutable.
It's not impossible to make them mutable (a bit complicated in the code
though, the range concepts don't support inserting/erasing elements), but
it's probably not a good idea because it would be O(n) worst case.
That's a valid point, so the more efficient alternative to in-place
transformation is to use another container for the output and
an inserter.
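
A minimal sketch of that alternative: decode each code point, apply a transformation, and re-encode into a second string through std::back_inserter. This is not Boost.Unicode code; decode_one, encode_utf8 and transform_code_points are hypothetical names, and the input is assumed to be well-formed UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <string>

// Hypothetical helper: decodes the code point starting at s[i] and advances i.
// No validation; the input is assumed to be well-formed UTF-8.
inline std::uint32_t decode_one(const std::string& s, std::size_t& i) {
    const unsigned char b = static_cast<unsigned char>(s[i]);
    const std::size_t len = b < 0x80 ? 1 : b < 0xE0 ? 2 : b < 0xF0 ? 3 : 4;
    assert(i + len <= s.size());
    std::uint32_t cp = len == 1 ? b : b & (0xFF >> (len + 1));
    for (std::size_t k = 1; k < len; ++k)
        cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3F);
    i += len;
    return cp;
}

// Hypothetical helper: writes cp as 1 to 4 UTF-8 bytes to an output iterator.
template <class OutputIt>
OutputIt encode_utf8(std::uint32_t cp, OutputIt out) {
    assert(cp <= 0x10FFFF);
    if (cp < 0x80) {
        *out++ = static_cast<char>(cp);
    } else if (cp < 0x800) {
        *out++ = static_cast<char>(0xC0 | (cp >> 6));
        *out++ = static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        *out++ = static_cast<char>(0xE0 | (cp >> 12));
        *out++ = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        *out++ = static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        *out++ = static_cast<char>(0xF0 | (cp >> 18));
        *out++ = static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        *out++ = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        *out++ = static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Applies f to every code point of `in` and collects the result in a new
// string, so the source is never modified while it is being traversed.
template <class F>
std::string transform_code_points(const std::string& in, F f) {
    std::string out;
    out.reserve(in.size());
    auto sink = std::back_inserter(out);
    for (std::size_t i = 0; i < in.size();)
        sink = encode_utf8(f(decode_one(in, i)), sink);
    return out;
}
```

Because the output is appended rather than spliced into the source, the whole pass stays linear even when a replacement has a different encoded length than the original.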
...
If you really want to do that, you can already do it using i.base() and
next(i).base(), which gives you the range of the character in terms of
original std::string iterators, so you can use std::string::replace.
Still, it could be useful if the transformation was made only
on a small subset of characters in the range or if the original and
the replacement byte sequence had equal length, which tends
to happen for characters from the same "script".
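
A minimal sketch of the in-place variant discussed here, using byte offsets into the std::string rather than Boost.Unicode's i.base(). The names utf8_len and replace_code_point are hypothetical, and the input is assumed to be well-formed UTF-8. The string is walked one code point at a time, and std::string::replace is called on the byte range of each match; when the replacement has the same encoded length, no bytes after it need to move.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: byte length of the UTF-8 sequence introduced by lead
// byte b. Assumes b really is a lead byte of well-formed UTF-8.
inline std::size_t utf8_len(unsigned char b) {
    return b < 0x80 ? 1 : b < 0xE0 ? 2 : b < 0xF0 ? 3 : 4;
}

// Replaces every occurrence of one code point with another, in place.
// `from` and `to` each hold the UTF-8 encoding of a single code point.
inline void replace_code_point(std::string& s,
                               const std::string& from,
                               const std::string& to) {
    std::size_t i = 0;
    while (i < s.size()) {
        const std::size_t len = utf8_len(static_cast<unsigned char>(s[i]));
        assert(i + len <= s.size());
        if (s.compare(i, len, from) == 0) {
            // [i, i + len) is the byte range of the current code point.
            // The tail of the string shifts only if to.size() != len.
            s.replace(i, len, to);
            i += to.size();
        } else {
            i += len;
        }
    }
}
```

Replacing Greek alpha (CE B1) with capital alpha (CE 91) keeps the string the same size, which is the equal-length case mentioned above; replacing a two-byte character with an ASCII one still works but shifts the remainder of the string each time.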
...
...
   ++i;
}
I have just scrolled through the docs for Boost.Unicode some time ago
so maybe it is already there and I've missed it. If so, links to some
examples showing this would be appreciated.
Of course it's there, transcoding between UTF encodings is the most basic
feature.
Yes, I know that Boost.Unicode does the transcoding (it would be a sad
Unicode library if it didn't :-)). I was asking about the syntactic sugar
functions and the possibly mutating iteration. But thanks for your response :)