boost filesystem path as utf-8?

Hello,
I understand that the return type of the path class's .native() member function differs depending on the platform (wstring on Windows, for example). Is there a way to get the path as a UTF-8 string regardless of the platform? Likewise, is there a way to construct a path object from a UTF-8 string regardless of the platform?
Emil Dotchevski Reverge Studios, Inc. http://www.revergestudios.com/reblog/index.php?n=ReCode

When you are using Boost.Filesystem v3, you can globally imbue a locale that carries a UTF-8 codecvt facet by calling boost::filesystem::path::imbue(). Note that path::imbue is a static member function.
Artyom Beilis -------------- CppCMS - C++ Web Framework: http://cppcms.sf.net/ CppDB - C++ SQL Connectivity: http://cppcms.sf.net/sql/cppdb/
----- Original Message -----
From: Emil Dotchevski <emildotchevski@gmail.com> To: boost@lists.boost.org Cc: Sent: Monday, January 23, 2012 10:15 AM Subject: [boost] boost filesystem path as utf-8?
[...]

As Artyom said, you can imbue whatever locale you want to specify the conversion from narrow to wide strings. It will make almost all the conversions transparent, except that the path will still be stored as UTF-16 on Windows. Unfortunately, it boils down to the interface whence you can get a c_str() to a UTF-16 string only.
You may want to revert to Boost.Filesystem v2 (AFAIK removed completely in 1.48, so you'll need to merge it from the old release). It is better designed in the sense that it has a templatized basic_path that allows you to store UTF-8 internally (once you imbue the correct locale) and convert to UTF-16 on demand.
On Mon, Jan 23, 2012 at 10:33, Artyom Beilis <artyomtnk@yahoo.com> wrote: [...]
-- Yakov

On Mon, Jan 23, 2012 at 4:46 AM, Yakov Galka <ybungalobill@gmail.com> wrote:
> As Artyom said you can imbue whatever locale you want to specify the conversion from narrow to wide strings. It will make almost all the conversions transparent, except that the path will still be stored as UTF-16 on Windows.
So far, so good.
> Unfortunately it boils down to the interface whence you can get a c_str() to a UTF-16 string only.
That's not correct. If you have a path p, and the imbued codecvt is UTF-8, you can always get a UTF-8 narrow string by writing p.string<std::string>(), so you can always write p.string<std::string>().c_str() if you want a const char* to a UTF-8 encoded narrow string. If your app mostly needs UTF-8 strings, use std::string and only convert to a path when a path is actually needed. If your app mostly needs paths, use boost::filesystem::path and only convert to std::string when a std::string or const char* is actually needed.
> You may want to revert to Boost.Filesystem.v2 (afaik removed completely in 1.48 so you'll need to merge from the old release), it is better designed in the sense that it has a templatized basic_path that allows you to store utf-8 encoding internally (once you imbue the correct locale) and convert to UTF-16 on demand.
V2 is no longer supported and bugs are not being fixed.
--Beman
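To make the conversion behind p.string<std::string>() more concrete, here is a minimal standard-library sketch of turning UTF-16 code units into a UTF-8 narrow string, assuming the path is stored as UTF-16 as described for Windows above. The function name utf16_to_utf8 is hypothetical; this is not Boost's codecvt implementation, only an illustration of the kind of work such a facet performs.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of Boost): encode UTF-16 code units as UTF-8.
// Throws std::invalid_argument on unpaired surrogates.
std::string utf16_to_utf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        std::uint32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {
            // High surrogate: must be followed by a low surrogate.
            if (i + 1 >= in.size())
                throw std::invalid_argument("unpaired high surrogate");
            const std::uint32_t lo = in[i + 1];
            if (lo < 0xDC00 || lo > 0xDFFF)
                throw std::invalid_argument("unpaired high surrogate");
            cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
            ++i;  // the low surrogate has been consumed
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            throw std::invalid_argument("unpaired low surrogate");
        }
        // Emit one to four bytes depending on the code point's magnitude.
        if (cp < 0x80) {
            out.push_back(static_cast<char>(cp));
        } else if (cp < 0x800) {
            out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else if (cp < 0x10000) {
            out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else {
            out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
    }
    return out;
}
```

The result is returned by value, so a caller who wants a const char* has to keep the returned std::string alive for as long as the pointer is used.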

On Mon, Jan 23, 2012 at 14:47, Beman Dawes <bdawes@acm.org> wrote:
> On Mon, Jan 23, 2012 at 4:46 AM, Yakov Galka <ybungalobill@gmail.com> wrote: [...]
>> Unfortunately it boils down to the interface whence you can get a c_str() to a UTF-16 string only.
> That's not correct.
It is correct. I state that path::c_str() returns UTF-16 on Windows. It's a fact. So the encoding isn't an implementation detail but a part of the interface. You can do a conversion, but it has different semantics because...
> If you have a path p, and the imbued codecvt is UTF-8, you can always get a UTF-8 narrow string by writing p.string<std::string>(), so you can always write p.string<std::string>().c_str() if you want a const char* to a UTF-8 encoded narrow string.
...it has a different lifetime. path::c_str() has the same lifetime as the path, and so would the c_str() of a path that stored UTF-8.
> If your app mostly needs UTF-8 strings, use std::string and only convert to a path when a path is actually needed.
> If your app mostly needs paths, use boost::filesystem::path and only convert to std::string when a std::string or const char* is actually needed.
My app needs UTF-8 paths. Don't use the term 'path' as a synonym for 'boost::filesystem::path'. There are other paths in the world (QDir, Poco::Path) and yours are neither special nor better.
-- Yakov
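The lifetime difference described above can be sketched with the standard library alone. The class wide_path_sketch below is a hypothetical stand-in for a path type that stores wide characters internally; it is not boost::filesystem::path, and its ASCII-only narrowing merely stands in for a real codecvt conversion.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <cwchar>
#include <string>
#include <utility>

// Hypothetical stand-in for a path that stores wide characters internally.
class wide_path_sketch {
public:
    explicit wide_path_sketch(std::wstring s) : native_(std::move(s)) {}

    // Points into the object's own storage: valid while the object is
    // alive and unmodified.
    const wchar_t* c_str() const { return native_.c_str(); }

    // Returns a freshly converted narrow string by value. The ASCII-only
    // narrowing is a placeholder for a real encoding conversion.
    std::string string() const {
        std::string out;
        for (wchar_t wc : native_)
            out.push_back(wc >= 0 && wc < 0x80 ? static_cast<char>(wc) : '?');
        return out;
    }

private:
    std::wstring native_;
};

// Safe pattern: keep the converted string in a named variable so the
// pointer obtained from it stays valid while it is used.
std::size_t narrow_length(const wide_path_sketch& p) {
    const std::string kept = p.string();
    const char* s = kept.c_str();  // valid while 'kept' is in scope
    return std::strlen(s);
    // By contrast, `const char* s = p.string().c_str();` would leave s
    // dangling as soon as that full expression ends.
}
```

Passing p.string().c_str() directly as a function argument is fine, because the temporary lives until the end of the full expression; storing the pointer for later use is what goes wrong.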

On Mon, Jan 23, 2012 at 9:28 AM, Yakov Galka <ybungalobill@gmail.com> wrote:
> On Mon, Jan 23, 2012 at 14:47, Beman Dawes <bdawes@acm.org> wrote:
>> On Mon, Jan 23, 2012 at 4:46 AM, Yakov Galka <ybungalobill@gmail.com> wrote: [...]
>>> Unfortunately it boils down to the interface whence you can get a c_str() to a UTF-16 string only.
>> That's not correct.
> It's correct. I state that path::c_str() returns UTF-16 on Windows. It's a fact. So the encoding isn't an implementation detail but a part of the interface.
As quoted above, you said only that "...the interface whence you can get a c_str() to a UTF-16 string only." The interface includes multiple observers, which return values with various encodings other than UTF-16. The return types from the observers allow c_str() to access those values.
During the design discussions, two other alternatives were discussed:
(1) Always hold the path internally in a char string encoded as UTF-8. The cost on Windows is that a conversion has to be done before every file system operation. The cost on POSIX is that a double conversion has to be done before every file system operation if the encoding is not UTF-8.
(2) Hold two strings internally, one in the native type and encoding, the other in UTF-8. The cost is trying to keep them in sync, with the conversions that implies, for some definition of "in sync".
If class std::basic_string itself had better support for string interoperability, class path would be able to sidestep at least some of the conversion headaches.
--Beman
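Alternative (1) means a UTF-8 to UTF-16 conversion before each Windows file system call. A minimal standard-library sketch of that conversion follows; the function name utf8_to_utf16 is hypothetical, and the decoder is deliberately simple (it rejects truncated or malformed sequences but, for brevity, does not reject overlong encodings or encoded surrogates).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of Boost): decode UTF-8 into UTF-16 code units.
std::u16string utf8_to_utf16(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b = static_cast<unsigned char>(in[i]);
        std::uint32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (b < 0x80)              { cp = b;        extra = 0; }
        else if ((b >> 5) == 0x06) { cp = b & 0x1F; extra = 1; }
        else if ((b >> 4) == 0x0E) { cp = b & 0x0F; extra = 2; }
        else if ((b >> 3) == 0x1E) { cp = b & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (in.size() - i - 1 < extra)
            throw std::invalid_argument("truncated UTF-8 sequence");
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        i += extra + 1;

        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            if (cp > 0x10FFFF)
                throw std::invalid_argument("code point out of range");
            // Split into a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

The work is linear in the length of the path, which is the per-operation cost being weighed in the design discussion.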

On Mon, Jan 23, 2012 at 21:52, Beman Dawes <bdawes@acm.org> wrote:
> On Mon, Jan 23, 2012 at 9:28 AM, Yakov Galka <ybungalobill@gmail.com> wrote:
>> On Mon, Jan 23, 2012 at 14:47, Beman Dawes <bdawes@acm.org> wrote:
>>> On Mon, Jan 23, 2012 at 4:46 AM, Yakov Galka <ybungalobill@gmail.com> wrote: [...]
>>>> Unfortunately it boils down to the interface whence you can get a c_str() to a UTF-16 string only.
>>> That's not correct.
>> It's correct. I state that path::c_str() returns UTF-16 on Windows. It's a fact. So the encoding isn't an implementation detail but a part of the interface.
> As quoted above, you said only that "...the interface whence you can get a c_str() to a UTF-16 string only."
Don't be picky about words. Yes, this sentence might be ambiguous. But I say that the correct resolution, using C++ name lookup rules, is "you can get a path::c_str() to a UTF-16 string only".
> The interface includes multiple observers, which return values with various encodings other than UTF-16. The return types from the observers allow c_str() to access those values.
Since you didn't read it, I'll repeat it: path::string().c_str() points into a *temporary*. path::c_str() does NOT. The two have different semantics, and your library, starting with version 3, doesn't let the user choose what string the path holds inside. As said above, it's not an implementation detail since it's observable from the interface.
> During the design discussions, two other alternatives were discussed. (1) Always hold the path internally in a char string encoded UTF-8. The cost on Windows is that a conversion has to be done before every file system operation.
Not an issue, because:
1) Last time I measured, with CreateFile and a naive implementation using MultiByteToWideChar, the overhead was less than 3%. Faster conversion routines exist, and you will have to do the conversions anyway when you communicate with the external world.
2) Let the user choose between narrow chars and wide chars. Why do you force me to use the latter? Why must getting the filename from a UTF-8 std::string involve two conversions (to and from UTF-16) even if I don't pass anything to the system?
> The cost on POSIX is that a double conversion has to be done before every file system operation if the encoding is not UTF-8.
1) Most POSIX systems use UTF-8 these days.
2) It's fine if it is the native encoding on POSIX, as long as the user can override it. On Windows she just can't do this because boost::filesystem::path uses a wide string.
> (2) Hold two strings internally, one in the native type and encoding, the other in UTF-8. The cost is trying to keep them in sync, with the conversions that implies, for some definition of "in sync".
I 100% agree (2) is not an option.
> If class std::basic_string itself had better support for string interoperability, class path would be able to side step at least some of the conversion headaches.
Maybe, but almost surely not. It would just shift the burden to another place: the user.
What you didn't say is that *during the original filesystem review* it had a templatized basic_path and the user *could choose* between narrow and wide strings. Add this option to the list above.
-- Yakov
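The "user chooses the storage" option argued for above can be illustrated with a small template. This is a hypothetical sketch named basic_path_sketch, not the Boost.Filesystem v2 basic_path API; it only shows that an operation such as extracting the filename can run on whatever string type the user picked, with no encoding conversion.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical sketch: a path parameterized on its storage string type.
template <class String>
class basic_path_sketch {
public:
    using value_type = typename String::value_type;

    explicit basic_path_sketch(String s) : s_(std::move(s)) {}

    // Last component, computed directly on the stored representation.
    // For UTF-8 storage a byte-wise search is safe because the byte 0x2F
    // ('/') never occurs inside a multi-byte UTF-8 sequence.
    String filename() const {
        const auto pos = s_.find_last_of(static_cast<value_type>('/'));
        return pos == String::npos ? s_ : s_.substr(pos + 1);
    }

    // Same lifetime as the path object, whichever storage was chosen.
    const value_type* c_str() const { return s_.c_str(); }

private:
    String s_;
};
```

With std::string storage the UTF-8 bytes pass through filename() untouched; with std::wstring storage the same code operates on wide characters. A real design would still need a conversion at the point where the path is handed to the operating system.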

On Mon, Jan 23, 2012 at 4:47 AM, Beman Dawes <bdawes@acm.org> wrote:
> On Mon, Jan 23, 2012 at 4:46 AM, Yakov Galka <ybungalobill@gmail.com> wrote:
>> As Artyom said you can imbue whatever locale you want to specify the conversion from narrow to wide strings. It will make almost all the conversions transparent, except that the path will still be stored as UTF-16 on Windows.
> So far, so good.
>> Unfortunately it boils down to the interface whence you can get a c_str() to a UTF-16 string only.
> That's not correct.
> If you have a path p, and the imbued codecvt is UTF-8
How exactly do I imbue a UTF-8 codecvt in a path? I Googled around and couldn't find anything.
Emil Dotchevski Reverge Studios, Inc. http://www.revergestudios.com/reblog/index.php?n=ReCode

On Mon, Jan 23, 2012 at 5:15 PM, Emil Dotchevski <emildotchevski@gmail.com> wrote:
> How exactly do I imbue UTF-8 codecvt in a path? I Googled around and couldn't find anything.
There are two approaches:
* If you always want all class path arguments and returned values with a value type of char to be treated as being UTF-8 encoded, and aren't worried about changing a potentially dangerous global, then do this:
    #include <locale>
    #include <boost/filesystem.hpp>
    #include <boost/filesystem/detail/utf8_codecvt_facet.hpp>
    ...
    // Copy the current global locale and add a UTF-8 codecvt facet to it.
    std::locale global_loc = std::locale();
    std::locale loc(global_loc, new boost::filesystem::detail::utf8_codecvt_facet);
    // path::imbue is static: this affects every path conversion from here on.
    boost::filesystem::path::imbue(loc);
* If you only want one specific path to treat its narrow character arguments and returns as UTF-8, do this:
    boost::filesystem::detail::utf8_codecvt_facet utf8;
    ...
    boost::filesystem::path p;
    ...
    p.assign(u8"...", utf8);
    // many other path functions can take a codecvt argument, too
By the way, you can use a UTF-8 codecvt facet from someone else if you prefer.
HTH,
--Beman

On Mon, Jan 23, 2012 at 3:41 PM, Beman Dawes <bdawes@acm.org> wrote:
> There are two approaches:
> * If you always want all class path arguments and returned values with a value type of char to be treated as being UTF-8 encoded, and aren't worried about changing a potentially dangerous global, then do this:
Beman, thanks for your detailed answer! I have one more question: is it possible to get a grammatically native string, encoded in UTF-8, regardless of the platform?
Emil Dotchevski Reverge Studios, Inc. http://www.revergestudios.com/reblog/index.php?n=ReCode
participants (4)
- Artyom Beilis
- Beman Dawes
- Emil Dotchevski
- Yakov Galka