boost::filesystem::path in UTF-8 on Windows
If I have a string that is in UTF-8, how do I tell the path constructor?
path p1 ("my utf8 data", SOME_CODECVT);
I think it is a matter of passing the right SOME_CODECVT. What is it? The path::value_type is wchar_t, according to the docs.
--John
On Windows you should convert it to utf16.
On Fri, Nov 4, 2011 at 11:28, Igor R wrote:
On Windows you should convert it to utf16.
Word of warning: the Boost utf8 codecvt will cause undefined behavior if you have any code points above U+FFFF and wchar_t is 16 bits wide. You'll have to hack do_in and do_out in order to emit/parse surrogate pairs. Also, hack do_length to increment the counter by 2 for cp > 0xFFFF.
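To make the surrogate-pair point concrete, here is a minimal sketch of the step a do_in implementation would need when it emits UTF-16. It is not the Boost facet or a patch to it, and the name append_utf16 is made up for illustration. The return value also shows why a length count has to go up by 2 for cp > 0xFFFF.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper (not part of Boost): append one Unicode code point
// to a UTF-16 string. Returns the number of 16-bit units written (1 or 2),
// or 0 if cp is not a valid Unicode scalar value.
inline int append_utf16(std::u16string& out, std::uint32_t cp) {
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;                       // out of range, or a surrogate value
    if (cp <= 0xFFFF) {
        out.push_back(static_cast<char16_t>(cp));   // fits in one unit
        return 1;
    }
    cp -= 0x10000;                      // 20 significant bits remain
    out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));    // high surrogate
    out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));  // low surrogate
    return 2;
}
```

A codecvt that simply truncates cp to 16 bits skips the last branch, which is where the trouble above U+FFFF comes from.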
For my rewrite of UTF-8 to UTF-16/32, look at https://github.com/moshbear/fastcgipp/blob/master/src/utf8_cvt.cpp. While it can still decode above U+10FFFF, it's still more RFC 3629 compliant than utf8_codecvt_facet. It also supports true UTF-16.
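For reference, the RFC 3629 restrictions mentioned here (nothing above U+10FFFF, no overlong forms, no encoded surrogates) can be sketched as a standalone decoder. This is only an illustration under those assumptions, not the code from the linked utf8_cvt.cpp or from Boost; decode_utf8 is a hypothetical name.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: decode UTF-8 into code points, appending to `out`.
// Returns false at the first sequence that RFC 3629 does not allow.
inline bool decode_utf8(const std::string& in, std::vector<std::uint32_t>& out) {
    // Smallest code point that legitimately needs a sequence of each length.
    static const std::uint32_t min_cp[5] = {0, 0, 0x80, 0x800, 0x10000};
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        std::uint32_t cp;
        std::size_t len;
        if (b0 < 0x80)                { cp = b0;        len = 1; }
        else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; len = 2; }
        else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; len = 3; }
        else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; len = 4; }
        else return false;              // stray continuation or 5/6-byte lead
        if (in.size() - i < len) return false;          // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(in[i + k]);
            if ((b & 0xC0) != 0x80) return false;       // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }
        if (cp < min_cp[len]) return false;             // overlong encoding
        if (cp > 0x10FFFF) return false;                // beyond Unicode range
        if (cp >= 0xD800 && cp <= 0xDFFF) return false; // encoded surrogate
        out.push_back(cp);
        i += len;
    }
    return true;
}
```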
On Fri, Nov 4, 2011 at 11:28, Igor R wrote:
On Windows you should convert it to utf16.
I know that is how it stores it internally.
My question is "how". Given that I have data that are file names and encoded in UTF-8,
how do I make the Boost path class accept them, and operate conveniently enough to be
worth using instead of plain strings?
On Fri, Nov 4, 2011 at 22:54, Andrey Moshbear wrote:
For my rewrite of UTF-8 to UTF-16/32, look at https://github.com/moshbear/fastcgipp/blob/master/src/utf8_cvt.cpp.
So this is a codecvt that I should use as the extra argument, one that works better than the undocumented one that came with Boost? And the implicit answer is that this is indeed how I do it?
But:
1) When I write something like
path p2 = p1 / "Foo" / s1 / name;
there is no place to pass the extra codecvt argument. I thought it might take strings and keep the existing encoding, but it actually uses the default code page. How can I use path in a simple and convenient manner, given that in this program all the strings I will use with it are already in UTF-8?
2) How can I write a line like
path p2 (somestring, codecvt());
in a portable manner? On the Mac the internal representation is char, so will it object to having the codecvt passed? Once I set things up, I want the bulk of the source code to be the same on all platforms, so writing the argument on Windows and leaving it out on Mac is not acceptable.
Thanks,
--John
I don't think the "path" object can do such a conversion automatically, so you should convert it yourself using the CRT, WinAPI, ATL macros, or any other facility.
On 11/5/2011 12:57 PM, Igor R wrote:
On Windows you should convert it to utf16.
I don't think the "path" object can do such a conversion automatically, so you should convert it yourself using the CRT, WinAPI, ATL macros, or any other facility.
Uh, I don't think you understood the point of the question at all, nor know about the class. From the docs:
"If the value type of [begin,end) or source arguments for member functions is not value_type, and no cvt argument is supplied, conversion to value_type occurs using an imbued locale."
"For Windows-like implementations, including Cygwin and MinGW, path::value_type is wchar_t. The default imbued locale provides a codecvt facet that invokes Windows MultiByteToWideChar or WideCharToMultiByte API's with a codepage of CP_THREAD_ACP if Windows AreFileApisANSI() is true, otherwise codepage CP_OEMCP."
It DOES CONVERT, and that is the starting point of my issue. See? It does convert, and not in the way I want (if I indeed wanted it to).
On Sat, Nov 5, 2011 at 12:43, John M. Dlugosz wrote:
So this is a codecvt that I should use as the extra argument, that works better than the undocumented one that came with Boost?
The Boost UTF-8 <-> UTF-32 codecvt is in fact documented: http://www.boost.org/doc/libs/1_47_0/libs/serialization/doc/codecvt.html. It just won't work correctly for code points above U+FFFF if you use a 16-bit character type, such as wchar_t on Windows. The code itself isn't very self-documenting, though, which makes hacking in the U+10FFFF limit and surrogate pair parsing more work than simply rewriting the codecvt.
And, the implicit answer is that this is indeed how I do it?
But:
1) When I write something like path p2= p1 / "Foo" / s1 / name; there is no place to pass the extra codecvt argument. I thought it might take strings and keep the existing encoding, but it actually uses the default code page. How can I use path in a simple and convenient manner given that in this program all the strings I will use with it are already in UTF-8?
Make a std::wstringstream. Imbue it with locale(locale::classic(), new Utf8_cvt). Use operator<< to build up a path. Call .str() to get the string. Pass that to the path constructor.
2) How can I write a line like: path p2 (somestring, codecvt()); in a portable manner? On the Mac the internal representation is char, so will it object to having the codecvt passed? Once I set things up, I want the bulk of the source code to be the same on all platforms, so writing the argument on Windows and leaving it out on Mac is not acceptable.
Because the Mac uses char, wide UTF encodings aren't going to work there: the libraries look for a char 0 as the terminator, not a wchar_t 0. The best solution is to wrap the UTF-8 to UTF-16 conversion code in #ifdef _WIN32.
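One way the #ifdef _WIN32 idea could look is sketched below. The names native_string and utf8_to_native are invented for this example, and returning an empty string on invalid input is just a simplification for the sketch. The Windows branch uses MultiByteToWideChar with CP_UTF8; everywhere else the bytes are passed through unchanged on the assumption that the platform takes UTF-8 in char strings.

```cpp
#include <cstddef>
#include <string>
#ifdef _WIN32
#  include <windows.h>
#endif

#ifdef _WIN32
typedef std::wstring native_string;   // UTF-16, matches wchar_t paths on Windows
#else
typedef std::string native_string;    // bytes passed through, assumed UTF-8
#endif

// Hypothetical helper: convert a UTF-8 string to the platform's native
// path string type, so call sites look the same on every platform.
inline native_string utf8_to_native(const std::string& utf8) {
#ifdef _WIN32
    if (utf8.empty()) return native_string();
    // First call asks for the required length in wchar_t units.
    const int n = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                      utf8.data(), static_cast<int>(utf8.size()),
                                      NULL, 0);
    if (n <= 0) return native_string();   // invalid UTF-8: sketch returns empty
    native_string wide(static_cast<std::size_t>(n), L'\0');
    // Second call performs the conversion into the sized buffer.
    MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                        utf8.data(), static_cast<int>(utf8.size()),
                        &wide[0], n);
    return wide;
#else
    return utf8;
#endif
}
```

With something like this, path p(utf8_to_native(s)) and p / utf8_to_native(name) read the same on every platform. On Windows the source is then already wchar_t, so going by the docs quoted earlier in the thread, the imbued code-page conversion should not come into play.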
participants (3)
- Andrey Moshbear
- Igor R
- John M. Dlugosz