
What is the concrete limitation of the codecvt specification that prevents creating a codecvt facet that converts UTF-16 to/from UTF-8? I just re-read 22.2.1.5 but wasn't able to see it.
Good to hear. Yes, I agree it's very important. boost::detail::utf8_codecvt_facet fails that test, at least on Windows, but I'm wondering: what is the fundamental restriction that means it can't be patched to support them?
From my memory, the standard requires the following:

- The conversion must work one wide character at a time, i.e. the implementation should work even if only one wide character is given (and, BTW, MSVC indeed converts one character at a time).
- There is absolutely no information given about std::mbstate_t, which is supposed to hold the intermediate data between conversions, so there is actually no way to pass anything between sequential calls to std::codecvt<...>::in/out.

So even if I observe the first surrogate of a pair, there is no way to pass this information to the next call, and thus I lose it. This is exactly the reason you can't implement UTF-8 <-> UTF-16 conversion using a codecvt facet. On the other hand, there is no such limitation for UTF-32 encodings, as there is no information to preserve between calls.

Additional note: it is also not possible to convert stateful encodings like UTF-7, as there is no way to move the state around. So generally std::codecvt is not well designed to be derived from; the only way to do stream conversion correctly is to redesign this facet, but in that case you can't use it with the std::iostreams library.
For the original (non-compliance) point I raised, it would be interesting to see how well codecvt< char32_t, char, std::mbstate_t > is going to be implemented under Windows :)
There is no problem implementing it correctly.
BTW, I see some interesting additions to codecvts in n3090, 22.5. Any plans to implement them in Boost.Locale?
On the same wave: once char32_t/char16_t become available, hopefully these facets will be implemented. But today it is impossible to implement a UTF-16 codecvt facet. My personal opinion: avoid wide characters and any "Unicode" character types, because they are the best way to fool yourself about "Unicode" support; in reality they do not provide any advantage over plain char and UTF-8 encoding. So, unless you are using the Win32 API, avoid wide characters. However, too many programmers would disagree with me, especially Windows programmers who grew up on the "Unicode" and "Wide" API. So Boost.Locale fully supports wide characters.
The non-iterator interface is a real pain in using codecvt, I admit.
I think the best interface would be something like a boost::iostreams filter, but I think this should rather be part of the iostreams library than of localization. Also, it should not pass through a wide encoding in the middle when converting UTF-8 to ISO-8859-8. But that is a different story. For simple string conversion boost::locale provides from_utf/to_utf, which work correctly with UTF-8/16/32. Artyom