
What is the concrete limitation of the codecvt specification that prevents creating a codecvt facet that converts UTF-16 to/from UTF-8? I just re-read 22.2.1.5 but wasn't able to see it.
Good to hear. Yes, I agree it's very important. boost::detail::utf8_codecvt_facet fails that test, at least on Windows, but I'm wondering: what is the fundamental restriction that means it can't be patched to support them?
From my memory, the standard requires the following:

- The conversion must work one wide character at a time, i.e. the implementation should work even if only one wide character is given (and, BTW, MSVC indeed converts one character at a time).
- There is absolutely no information given about std::mbstate_t, which is supposed to hold the intermediate data between conversions, so there is actually no way to pass anything between sequential calls to std::codecvt<...>::in/out.

So even if I observe the first surrogate of a pair, there is no way to pass this information to the next call, and thus I lose it. This is exactly the reason you can't implement UTF-8 <-> UTF-16 conversion using a codecvt facet. On the other hand, there is no such limitation for UTF-32 encodings, as there is no information to preserve between calls.

Additional note: it is also not possible to convert stateful encodings like UTF-7, as there is no way to move the state around. So generally std::codecvt is not well designed to be derived from; the only way to do stream conversion correctly is to redesign this facet, but in that case you can't use it with the std::iostreams library.
For the original (non-compliance) point I raised, it would be interesting to see how well codecvt< char32_t, char, std::mbstate_t > is going to be implemented under Windows :)
There is no problem implementing it correctly.
BTW, I see some interesting additions to codecvts in n3090, 22.5. Any plans to implement them in Boost.Locale?
On the same wave: once char32_t/char16_t become available, hopefully these facets will be implemented. But today it is impossible to implement a UTF-16 codecvt facet. My personal opinion: avoid wide characters and any "Unicode" character types, because they are the best way to fool yourself about "Unicode" support; in reality they do not provide any advantage over plain char and UTF-8 encoding. So, unless you are using the Win32 API, avoid wide characters. However, too many programmers would disagree with me, especially Windows programmers who grew up on the "Unicode" and "Wide" API. So Boost.Locale fully supports wide characters.
The non-iterator interface is a real pain in using codecvt, I admit.
I think the best interface would be something like a boost::iostreams filter, but I think this should rather be part of the iostreams library than of localization. Also, it should not pass through a wide encoding in the middle when converting UTF-8 to ISO-8859-8. But that is a different story. For simple string conversion boost::locale provides from_utf/to_utf, which work correctly with UTF-8/16/32. Artyom