
On Wed, Dec 15, 2010 at 2:15 PM, Mathias Gaunard <mathias.gaunard@ens-lyon.org> wrote:
On 15/12/2010 08:20, Matus Chochlik wrote:
The interface is modeled after that of standard algorithms, and therefore it takes an output iterator to write the output to, rather than creating a container directly.
// ws is a std::string (utf-8) or std::wstring (utf-16 or utf-32).
std::basic_string<TCHAR> out;
utf_transcode<TCHAR>(ws, std::back_inserter(out));
AnotherWinapiFunc(..., out.c_str(), ...);
Assuming TCHAR is either char or wchar_t, this should work out of the box.
The fact that it takes an output iterator is quite practical: you can, for example, do two passes, one to count how many characters you need and one to copy the data. Or you can just grow the container as you add elements, as std::back_inserter does.
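To make the two-pass idea concrete, here is a minimal sketch. It does not use Boost.Unicode: decode_utf8, counting_iterator and decode_two_pass are hypothetical names made up for this illustration, and the decoder assumes well-formed UTF-8 and does no validation.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <string>

// Hypothetical stand-in for a transcoder: decodes UTF-8 into UTF-32 code
// points and writes them to `out`. Assumes well-formed input; no validation.
template <typename OutputIt>
OutputIt decode_utf8(const std::string& in, OutputIt out) {
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char b = static_cast<unsigned char>(in[i]);
        std::size_t len = b < 0x80 ? 1 : b < 0xE0 ? 2 : b < 0xF0 ? 3 : 4;
        assert(i + len <= in.size());  // a truncated sequence breaks the assumption
        // Keep only the payload bits of the lead byte.
        char32_t cp = len == 1 ? b : (b & (0xFF >> (len + 1)));
        for (std::size_t k = 1; k < len; ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(in[i + k]) & 0x3F);
        *out++ = cp;
        i += len;
    }
    return out;
}

// Output iterator that discards every value and only counts the writes.
struct counting_iterator {
    using iterator_category = std::output_iterator_tag;
    using value_type = void;
    using difference_type = std::ptrdiff_t;
    using pointer = void;
    using reference = void;

    std::size_t* count;

    counting_iterator& operator*() { return *this; }
    counting_iterator& operator++() { return *this; }
    counting_iterator operator++(int) { return *this; }
    template <typename T>
    counting_iterator& operator=(const T&) { ++*count; return *this; }
};

// Pass 1 counts the code points, pass 2 writes into exactly-sized storage.
inline std::u32string decode_two_pass(const std::string& in) {
    std::size_t n = 0;
    decode_utf8(in, counting_iterator{&n});
    std::u32string out(n, U'\0');
    decode_utf8(in, out.begin());
    return out;
}
```

The same decode_utf8 call also works with std::back_inserter(out) if growing the container incrementally is acceptable; the algorithm itself does not need to know which strategy the caller picked.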
Something like
convert_to<std::basic_string<TCHAR>>(utf_transcode<TCHAR>(ws)).c_str()
would also work, but that's maybe a bit verbose.
My point is that you often need this kind of conversion only once or twice in the whole application code, i.e. when you have to use a library that does not play well with the WinAPI's character type switching. Besides the low-level tools, which are cool if you need efficiency, it would be good to have some syntactic-sugar wrappers on top of them for situations where clarity and brevity of the code matter more. Currently I use a wrapper around the MultiByteToWideChar / WideCharToMultiByte functions that does the conversion and cleans up afterwards, and I copy the code whenever I start a new project, but I was looking for something more "standardized" and portable.
Another thing is some kind of adaptor for std::(w)string providing begin()/end() functions that return an iterator traversing the code points instead of the utf-XY "chars", e.g. in C++0x:
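As a rough illustration of the kind of sugar meant here, a value-returning wrapper over an output-iterator call could look like the sketch below. Both transcode_to and widen_ascii are hypothetical names, not part of Boost.Unicode or the WinAPI, and the stand-in "transcoder" only widens bytes, so it is correct for ASCII input only.

```cpp
#include <cassert>
#include <iterator>
#include <string>

// Hypothetical stand-in for an output-iterator transcoder. It only widens
// bytes, which is correct for ASCII input alone; a real transcoder would
// decode multi-byte sequences.
template <typename Char, typename OutputIt>
OutputIt widen_ascii(const std::string& in, OutputIt out) {
    for (unsigned char b : in) {
        assert(b < 0x80);  // limitation of this stand-in, not of real transcoders
        *out++ = static_cast<Char>(b);
    }
    return out;
}

// Hypothetical convenience wrapper: runs the iterator-based call and returns
// the result by value, so it can be used inside a larger expression.
template <typename Char>
std::basic_string<Char> transcode_to(const std::string& in) {
    std::basic_string<Char> out;
    out.reserve(in.size());  // enough for this stand-in: one unit per input byte
    widen_ascii<Char>(in, std::back_inserter(out));
    return out;
}
```

With such a wrapper a call site shrinks to something like AnotherWinapiFunc(..., transcode_to<TCHAR>(s).c_str(), ...). The temporary string lives until the end of the full expression, so the pointer stays valid for the duration of the call but must not be stored.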
std::string s = get_utf8_string();
auto as = adapt(s);
auto i = as.begin(), e = as.end();
while(i != e) {
    char32_t c = *i;
Replace adapt(s) by utf_decode(s)
Great, this is what I've been looking for.
... *i = transform(c);
No, you can't do that. Data accessed like this is immutable.
It's not impossible to make them mutable (though it is a bit complicated in the code, since the range concepts don't support inserting/erasing elements), but it's probably not a good idea because each write would be O(n) in the worst case.
That's a valid point, so the more efficient alternative to in-place transformation is to use another container for the output and an inserter.
If you really want to do that, you can already do it using i.base() and next(i).base(), which gives you the range of the character in terms of original std::string iterators, so you can use std::string::replace.
Still, it could be useful if the transformation was made only on a small subset of characters in the range or if the original and the replacement byte sequence had equal length, which tends to happen for characters from the same "script".
++i; }
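For reference, the std::string::replace route suggested above can be sketched without any library support by working on byte offsets instead of i.base() and next(i).base(). The helper names utf8_len, encode_utf8 and replace_code_point are made up for this example, and the code assumes well-formed UTF-8 and a valid code point.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Length in bytes of the UTF-8 sequence that starts with lead byte `b`.
// Assumes `b` really is a lead byte.
inline std::size_t utf8_len(unsigned char b) {
    return b < 0x80 ? 1 : b < 0xE0 ? 2 : b < 0xF0 ? 3 : 4;
}

// Encodes one code point as UTF-8. Assumes `cp` is a valid scalar value.
inline std::string encode_utf8(char32_t cp) {
    std::string s;
    if (cp < 0x80) {
        s += static_cast<char>(cp);
    } else if (cp < 0x800) {
        s += static_cast<char>(0xC0 | (cp >> 6));
        s += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        s += static_cast<char>(0xE0 | (cp >> 12));
        s += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        s += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        s += static_cast<char>(0xF0 | (cp >> 18));
        s += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        s += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        s += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return s;
}

// Replaces the code point starting at byte offset `pos` with `cp` and
// returns the byte offset of the following code point.
inline std::size_t replace_code_point(std::string& s, std::size_t pos,
                                      char32_t cp) {
    std::size_t old_len = utf8_len(static_cast<unsigned char>(s[pos]));
    assert(pos + old_len <= s.size());
    std::string enc = encode_utf8(cp);
    s.replace(pos, old_len, enc);  // the size changes only if the lengths differ
    return pos + enc.size();
}
```

When the old and new sequences have the same byte length, as is common within one script, the string keeps its size; otherwise the tail of the string has to move, which is the O(n) cost mentioned above.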
I only skimmed the docs for Boost.Unicode some time ago, so maybe it is already there and I've missed it. If so, links to some examples showing this would be appreciated.
Of course it's there, transcoding between UTF encodings is the most basic feature.
Yes, I know that Boost.Unicode does the transcoding (it would be a sad Unicode library if it didn't :-)). I was asking about the syntactic sugar functions and the possibly mutating iteration. But thanks for your response :)