On Sat, Nov 5, 2011 at 12:43, John M. Dlugosz wrote:
On Fri, Nov 4, 2011 at 11:28, Igor R wrote:
On Windows you should convert it to UTF-16.
I know that is how it stores it internally. My question is "how". Given that I have data that are file names and encoded in UTF-8, how do I make the Boost path class accept them, and operate conveniently enough to be worth using instead of plain strings?
On Fri, Nov 4, 2011 at 22:54, Andrey Moshbear wrote:
For my rewrite of UTF-8 to UTF-16/32, look at https://github.com/moshbear/fastcgipp/blob/master/src/utf8_cvt.cpp.
So this is a codecvt that I should use as the extra argument, that works better than the undocumented one that came with Boost?
And the Boost UTF-8 <-> UTF-32 one is indeed documented: http://www.boost.org/doc/libs/1_47_0/libs/serialization/doc/codecvt.html. It just won't work correctly with code points above U+FFFF if the wide character type is 16 bits, because it does not produce surrogate pairs. The code itself isn't very self-documenting, though, which makes hacking in the U+10FFFF limit and surrogate pair parsing more work than simply rewriting the codecvt.
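To show what "the U+10FFFF limit and surrogate pair parsing" amounts to, here is a minimal standalone sketch. It is not the code from either codecvt linked above. The names utf8_to_utf16 and append_utf16 are made up for illustration, it assumes a C++11 compiler (char16_t, std::u16string), and it converts a whole std::string at once rather than working on partial buffers the way a real codecvt facet has to.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Append one code point as UTF-16, using a surrogate pair above U+FFFF.
// (Hypothetical helper; the caller must pass a valid scalar value.)
inline void append_utf16(std::u16string& out, char32_t cp) {
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    if (cp < 0x10000) {
        out.push_back(static_cast<char16_t>(cp));
    } else {
        cp -= 0x10000;  // 20 bits remain: high 10 and low 10
        out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
        out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
    }
}

// Decode UTF-8 and re-encode as UTF-16; throws std::runtime_error on bad input.
inline std::u16string utf8_to_utf16(const std::string& in) {
    // Smallest code point that legitimately needs a sequence of each length.
    static const char32_t min_for_len[5] = {0, 0, 0x80, 0x800, 0x10000};
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        char32_t cp;
        std::size_t len;
        if (b0 < 0x80)                { cp = b0;        len = 1; }
        else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; len = 2; }
        else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; len = 3; }
        else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; len = 4; }
        else throw std::runtime_error("invalid UTF-8 lead byte");

        if (i + len > in.size())
            throw std::runtime_error("truncated UTF-8 sequence");
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(in[i + k]);
            if ((b & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (b & 0x3F);
        }
        if (cp < min_for_len[len])
            throw std::runtime_error("overlong UTF-8 encoding");
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::runtime_error("code point outside Unicode scalar range");

        append_utf16(out, cp);
        i += len;
    }
    return out;
}
```

A four-byte sequence such as F0 9F 98 80 (U+1F600) comes out as two 16-bit units, which is the case a UTF-8 to UCS-4 converter cannot express when the wide type is only 16 bits.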
And, the implicit answer is that this is indeed how I do it?
But:
1) When I write something like path p2= p1 / "Foo" / s1 / name; there is no place to pass the extra codecvt argument. I thought it might take strings and keep the existing encoding, but it actually uses the default code page. How can I use path in a simple and convenient manner given that in this program all the strings I will use with it are already in UTF-8?
Make a std::wstringstream. Imbue it with locale(locale::classic(), new Utf8_cvt). Use operator<< to build up a path. Call .str() to get the string. Pass that to the path constructor.
2) How can I write a line like: path p2 (somestring, codecvt()); in a portable manner? On the Mac the internal representation is char, so will it object to having the codecvt passed? Once I set things up, I want the bulk of the source code to be the same on all platforms, so writing the argument on Windows and leaving it out on Mac is not acceptable.
Mac assumes char, so wide UTF isn't going to work there: the libraries look for a char 0 as the terminator, not a wchar_t 0. The best solution is to wrap the UTF-8 to UTF-16 code in #ifdef _WIN32.
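A rough sketch of that #ifdef, so the call sites stay identical on every platform. The names native_string and to_native are hypothetical, not part of Boost. The Windows branch assumes the Win32 MultiByteToWideChar API is acceptable for the conversion (your own codecvt would slot in the same way), and the other branch assumes the platform's narrow encoding is already UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

#ifdef _WIN32
#include <windows.h>

// On Windows the native path representation is wide, so convert UTF-8 to UTF-16.
typedef std::wstring native_string;

inline native_string to_native(const std::string& utf8) {
    if (utf8.empty()) return native_string();
    const int in_len = static_cast<int>(utf8.size());
    // First call: ask how many wchar_t units are needed.
    const int n = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                      utf8.data(), in_len, NULL, 0);
    if (n <= 0) throw std::runtime_error("invalid UTF-8 input");
    native_string out(static_cast<std::size_t>(n), L'\0');
    // Second call: perform the conversion into the sized buffer.
    const int written = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                            utf8.data(), in_len, &out[0], n);
    assert(written == n);
    (void)written;  // silence unused-variable warnings when NDEBUG is defined
    return out;
}

#else

// Elsewhere the native representation is char; pass the UTF-8 bytes through.
typedef std::string native_string;

inline native_string to_native(const std::string& utf8) {
    return utf8;
}

#endif
```

With that in place, something like path p2(to_native(somestring)); should read the same on both platforms, since the path constructor accepts either a narrow or a wide string. For operator/ you would still need to wrap each UTF-8 operand, e.g. p1 / to_native(s1).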