File System - unicode support - Boost - lists.preview.boost.org


File System - unicode support

Enda Mannion

16 May 2008

8:37 a.m.

Hi,

I am trying to use boost filesystem:

    wchar_t* file = L"myfile";
    boost::filesystem::path path(file);

but Boost cannot seem to find wide-character versions of the filesystem functions and constructors I am using. Does boost filesystem support wide characters? I am compiling this on Linux and on Windows.

Thanks,
Enda


Jens Seidel

16 May

9:22 a.m.

On Fri, May 16, 2008 at 09:37:42AM +0100, Enda Mannion wrote:

...

I am trying to use boost filesystem.

Does boost filesystem support wide character.

I am compiling this on linux and on Windows.

Stupid question: Do you really use the UTF-16 Unicode encoding on Linux? I know of some classic Asian 16-bit encodings, but these days UTF-8 (which is compatible with char*) is used everywhere on Linux ...

Jens

Matus Chochlik

9:48 a.m.

On Fri, May 16, 2008 at 11:22 AM, Jens Seidel <jensseidel@users.sf.net> wrote:

...

On Fri, May 16, 2008 at 09:37:42AM +0100, Enda Mannion wrote:

...
I am trying to use boost filesystem.

Does boost filesystem support wide character.

I am compiling this on linux and on Windows.

Stupid question: Do you really use the UTF-16 Unicode encoding on Linux? I now about some classical Asian 16bit encodings but these days UTF-8 (which is compatible with char*) is used everywhere on Linux ...

I've already made a couple of posts concerning this issue, but I didn't get many answers, so sorry if I'm missing something really obvious, and for repeating myself :-P

... but couldn't this issue be solved by defining a portable equivalent of the TCHAR type, which is used consistently by the WinAPI and whose real character type is switched at compile time by means of the "UNICODE" PP symbol?

TCHAR is wchar_t or char depending on whether UNICODE is or isn't defined. Boost library functions would use this *boost-char-type* (whatever its name would be) instead of char or wchar_t, where applicable.

On Windows this allows the same WIN32 "functions" to be used with both character types, and allows an application (when coded properly) to be compiled with either character type without having to touch the code.

I'm sorely missing something like this in the C++ standard, or at least in Boost, and I think I'm not the only one. My Mirror library uses this kind of char-type switching, but the implementation is still rather sloppy. If there is general interest in adding this to Boost, I volunteer :-) to do it, and I will gladly accept any comments or help in making it compliant with the Boost quality standards.


-- ________________ ::matus_chochlik

Ulrich Eckhardt

12:22 p.m.

On Friday 16 May 2008 11:48:30 Matus Chochlik wrote:

...

On Fri, May 16, 2008 at 11:22 AM, Jens Seidel <jensseidel@users.sf.net> wrote:

...
On Fri, May 16, 2008 at 09:37:42AM +0100, Enda Mannion wrote:

...
I am trying to use boost filesystem.

Does boost filesystem support wide character.

I am compiling this on linux and on Windows.

Stupid question: Do you really use the UTF-16 Unicode encoding on Linux? I now about some classical Asian 16bit encodings but these days UTF-8 (which is compatible with char*) is used everywhere on Linux ...

Sorry to chime in here, but although UTF-8 is internally represented as a char string, that doesn't make it 'compatible' in any way. It's like saying SCSI and ATA disks are compatible because they both use 8 bits per byte. Rather, strings encoded in UTF-8 or, e.g., ISO8859-1 can(!) both be represented as char strings, though for both an unsigned char string is IMHO an even better idea.
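To make the point about bytes versus characters concrete, here is a minimal sketch. The function name `utf8_code_points` is hypothetical, and the code assumes its input is well-formed UTF-8. It shows that the same `char` storage yields different answers depending on the encoding you assume: "é" is one byte (0xE9) in ISO8859-1 but two bytes (0xC3 0xA9) in UTF-8, and `strlen` only ever counts bytes.

```cpp
#include <cstddef>
#include <cstring>

// Count Unicode code points in a UTF-8 string by skipping continuation
// bytes (bit pattern 10xxxxxx). Assumes well-formed UTF-8 input.
inline std::size_t utf8_code_points(const char* s) {
    std::size_t n = 0;
    for (; *s != '\0'; ++s) {
        // Cast to unsigned char so the bit test does not depend on
        // whether plain char is signed on this platform.
        if ((static_cast<unsigned char>(*s) & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}
```

For the UTF-8 bytes `"\xC3\xA9"`, `std::strlen` reports 2 while `utf8_code_points` reports 1. The cast to `unsigned char` is also one reason an unsigned representation is convenient for this kind of byte inspection.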

...

I've already made a couple of posts concerning this issue, but I didn't get too many answers, so sorry if I'm missing something really obvious and for repeating myself :-P ..

... but, couldn't be this issue solved by defining a portable equivalent of TCHAR type which is consistently used by WINAPI and the real char type is switched there at compile time by the means or the "UNICODE" PP symbol ?

Hmmm, I personally consider TCHAR just a hack to ease the transition from a char-based win32 API to a wchar_t-based (and thus Unicode-capable) one. The goal in any case is to have full Unicode support, be it via char and UTF-8 or wchar_t and UTF-16. However, the problem with wchar_t is that it is UTF-16 only on some platforms; the standard doesn't mandate its encoding at all. Further, the problem with char is that it can also hold strings in a totally different encoding, like one of the ISO8859 encodings.
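As a rough illustration of why wchar_t is not a portable Unicode type, the sketch below computes how many wchar_t units one code point would occupy. It rests on a loudly stated assumption that the standard does not guarantee: that a 16-bit wchar_t holds UTF-16 (as on Windows) and a wider one holds UTF-32 (as on typical Linux systems). The function name `wchar_units_for` is hypothetical.

```cpp
#include <cstddef>

// Number of wchar_t units needed for one code point.
// ASSUMPTION (not mandated by the C++ standard): a 16-bit wchar_t holds
// UTF-16 and a wider wchar_t holds UTF-32.
inline std::size_t wchar_units_for(char32_t code_point) {
    // Code points above U+FFFF need a surrogate pair in UTF-16.
    return (sizeof(wchar_t) == 2 && code_point > 0xFFFF) ? 2 : 1;
}
```

Under that assumption, a code point outside the Basic Multilingual Plane takes two units where wchar_t is 16 bits wide and one unit elsewhere, so code that equates "one wchar_t" with "one character" behaves differently across platforms.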

...

TCHAR is wchar_t or char depending on whether UNICODE is or isn't defined. Boost library functions would use this *boost-char-type* (whatever it's name would be) instead of char or wchar_t, where applicable.

On Windows this allows to use the same WIN32 "functions" with both character types and allows an application (when coded properly) to be compiled with both character types without the need of messing with the code.

I'm sorely missing something like this in the C++ standard or at least in Boost and I think I'm not the only one.

FYI, I don't. When I need Unicode for some string, I use wchar_t (which isn't the holy grail, though). Then, when I have to interface with the win32 API, I need to either convert it to TCHAR (whatever that currently is) or, preferably, use the function version that takes a wchar_t, like CreateFileW(). When I do logging, I typically restrict myself to ASCII, so I can also use simple char strings. My opinion is that it is better to actually define the encoding of a string on a case-by-case basis and do conscious conversions, instead of relying on a string type (TCHAR) whose meaning changes depending on a macro and, in the char case, even depending on the OS's locale.

Uli

Matus Chochlik

1:12 p.m.

On Fri, May 16, 2008 at 2:22 PM, Ulrich Eckhardt <doomster@knuut.de> wrote:

...

On Friday 16 May 2008 11:48:30 Matus Chochlik wrote:

...
On Fri, May 16, 2008 at 11:22 AM, Jens Seidel <jensseidel@users.sf.net> wrote:

...
On Fri, May 16, 2008 at 09:37:42AM +0100, Enda Mannion wrote:

...
I am trying to use boost filesystem.

Does boost filesystem support wide character.

I am compiling this on linux and on Windows.

Stupid question: Do you really use the UTF-16 Unicode encoding on Linux? I now about some classical Asian 16bit encodings but these days UTF-8 (which is compatible with char*) is used everywhere on Linux ...

Sorry to chime in here, but UTF-8 is internally represented as char string, but that doesn't make it 'compatible' in any way. It's like saying SCSI and ATA disks are compatible because they both use 8 bits per byte. Rather, strings encoded in UTF-8 or e.g. ISO8859-1 can(!) be represented as char strings both, though for both using an unsigned char string is IMHO an even better idea.

...
I've already made a couple of posts concerning this issue, but I didn't get too many answers, so sorry if I'm missing something really obvious and for repeating myself :-P ..

... but, couldn't be this issue solved by defining a portable equivalent of TCHAR type which is consistently used by WINAPI and the real char type is switched there at compile time by the means or the "UNICODE" PP symbol ?

Hmmm, I personally consider TCHAR just a hack to ease transition from a char-based win32 API to a wchar_t-based (and thus Unicode-capable) one. The goal is in any way to have full Unicode support, be it via char and UTF-8 or wchar_t and UTF-16. However, the problem with wchar_t is that it is only UTF-16 on some platforms, the standard doesn't mandate its encoding at all. Further, the problem with char is that it can also hold strings with a totally different encoding like one of the ISO8859 encodings.

Well, I consider TCHAR a hack myself, but it is a useful one. I used the approach when developing a group of quite large applications, and somewhere in the middle of the process it became clear that it is better to use wide chars and UTF than to mess with different encodings. I'm glad I didn't have to replace char->wchar_t, string->wstring, cout->wcout, not to mention things like strlen/wcslen, and modify all the WinAPI-specific wrappers ;-)

There are several other libraries in Boost that exist just to ease the pain of waiting for new things to become standard and widespread. Boost.Typeof and BCCL are IMHO good examples. No offense :-)

The implementation of switching between, for example, LoadLibraryA and LoadLibraryW by means of a preprocessor symbol "LoadLibrary" is really crazy, and I'm not suggesting doing that in Boost. Any implementation in Boost definitely needs to be better than this.

...

...
TCHAR is wchar_t or char depending on whether UNICODE is or isn't defined. Boost library functions would use this *boost-char-type* (whatever it's name would be) instead of char or wchar_t, where applicable.

On Windows this allows to use the same WIN32 "functions" with both character types and allows an application (when coded properly) to be compiled with both character types without the need of messing with the code.

I'm sorely missing something like this in the C++ standard or at least in Boost and I think I'm not the only one.

FYI, I don't. When I need Unicode for some string, I use wchar_t (which isn't the holy grail though). Then, when I have to interfere with the win32 API, I need to either convert it to TCHAR (whatever that currently is) or, preferable, use the function version that takes a wchar_t, like CreateFileW(). When I do logging, I typically restrict myself to ASCII, so I can also use simple char strings. My opinion is that it is better to actually define the encoding of a string on a case-by-case basis and do conscious conversions instead of relying on a string type (TCHAR) which changes meanings depending on a macro and in the char-case even depending on the OS' locale.

Well, that's something that I really like to avoid whenever possible. I know from experience that both approaches are problematic. Conversion slows things down terribly, and it is not easy to decide, when starting a large project, which character type (and all the related stuff) is best. There are many tradeoffs between chars and wchars, and I know that UTF-whatever and wchars have problems of their own; exactly because of that I like to have the freedom to make the choice at the time of deployment of the application.

-- ________________ ::matus_chochlik


Jens Seidel

12:36 p.m.

On Fri, May 16, 2008 at 11:48:30AM +0200, Matus Chochlik wrote:

...

On Fri, May 16, 2008 at 11:22 AM, Jens Seidel <jensseidel@users.sf.net> wrote:

...
Stupid question: Do you really use the UTF-16 Unicode encoding on Linux? I now about some classical Asian 16bit encodings but these days UTF-8 (which is compatible with char*) is used everywhere on Linux ...

...

... but, couldn't be this issue solved by defining a portable equivalent of TCHAR type which is consistently used by WINAPI and the real char type is switched there at compile time by the means or the "UNICODE" PP symbol ?

No, I don't think so. First, besides the type you also have to support initialisation of and access to the type. As far as I know (I have never really used wchar_t) it is:

    const char *text = "Hi world"

and

    const wchar_t *text = L"Hi world"

How do you want to know whether you need "L" if you just have a new type?

What about functions/methods which do not exist for both types? You would always have to write #ifdef ... #else ... #endif

Can UTF-8 data not even be stored in wchar_t (the first byte is always zero)? I think supporting both types together with different encodings in one program is just asking for trouble. Use a fixed encoding and one of char or wchar_t across your whole program and you simplify your code a lot. Together with wrappers which convert your data, e.g. from UTF-16 wchar_t to UTF-8 char after calling string functions on Win*, you may have a slowdown but also a compatible program.
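A conversion wrapper of the kind mentioned above could look roughly like the following. This is a minimal sketch, not code from any library discussed in the thread: the names `append_utf8` and `utf16_to_utf8` are hypothetical, and it takes `std::u16string` rather than wchar_t so that the input encoding is unambiguous on every platform.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Append one Unicode code point to `out`, encoded as UTF-8.
inline void append_utf8(std::string& out, char32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Convert UTF-16 to UTF-8; throws on unpaired surrogates.
inline std::string utf16_to_utf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {
            // High surrogate: it must be followed by a low surrogate.
            if (i + 1 >= in.size() || in[i + 1] < 0xDC00 || in[i + 1] > 0xDFFF) {
                throw std::invalid_argument("unpaired high surrogate");
            }
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            throw std::invalid_argument("unpaired low surrogate");
        }
        append_utf8(out, cp);
    }
    return out;
}
```

The cost is one pass over the data per conversion, which is the slowdown the message refers to. The benefit is that the rest of the program only ever sees one encoding.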

...

TCHAR is wchar_t or char depending on whether UNICODE is or

UNICODE is a very bad name! The size of the type (char, wchar_t) could depend on the encoding (UTF-8, UTF-16, ...), not the character set (Unicode)!

...

isn't defined. Boost library functions would use this *boost-char-type* (whatever it's name would be) instead of char or wchar_t, where applicable.

On Windows this allows to use the same WIN32 "functions" with both character types and allows an application (when coded properly) to be compiled with both character types without the need of messing with the code.

I'm sorely missing something like this in the C++ standard or at least in Boost and I think I'm not the only one.

Please use a proper string class instead, one which is aware of its encoding and converts only when needed. This really avoids any problems and is portable. See e.g. Qt's QString class: http://doc.trolltech.com/4.4/qstring.html

Jens
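The idea of an encoding-aware string class can be sketched in a few lines. The class below is purely illustrative: the name `utf8_text` and its interface are hypothetical and are not modelled on QString's actual API. It stores UTF-8 internally and forces callers to state the source encoding when constructing it.

```cpp
#include <string>

// A string that always holds UTF-8 internally. Callers must say which
// encoding their input uses, so no bytes are ever silently misread.
class utf8_text {
public:
    // Input is already UTF-8 (not validated in this sketch).
    static utf8_text from_utf8(const std::string& bytes) {
        return utf8_text(bytes);
    }

    // Input is ISO8859-1: every byte maps directly to U+0000..U+00FF.
    static utf8_text from_latin1(const std::string& bytes) {
        std::string out;
        for (std::string::size_type i = 0; i < bytes.size(); ++i) {
            unsigned char b = static_cast<unsigned char>(bytes[i]);
            if (b < 0x80) {
                out += static_cast<char>(b);
            } else {
                out += static_cast<char>(0xC0 | (b >> 6));
                out += static_cast<char>(0x80 | (b & 0x3F));
            }
        }
        return utf8_text(out);
    }

    const std::string& utf8() const { return bytes_; }

private:
    explicit utf8_text(const std::string& bytes) : bytes_(bytes) {}
    std::string bytes_;
};
```

Making the constructor private and exposing only named factory functions is one way to ensure the encoding decision is always explicit at the call site.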

Matus Chochlik

1:37 p.m.

On Fri, May 16, 2008 at 2:36 PM, Jens Seidel <jensseidel@users.sf.net> wrote:

...

On Fri, May 16, 2008 at 11:48:30AM +0200, Matus Chochlik wrote:

...
On Fri, May 16, 2008 at 11:22 AM, Jens Seidel <jensseidel@users.sf.net> wrote:

...
Stupid question: Do you really use the UTF-16 Unicode encoding on Linux? I now about some classical Asian 16bit encodings but these days UTF-8 (which is compatible with char*) is used everywhere on Linux ...

...
... but, couldn't be this issue solved by defining a portable equivalent of TCHAR type which is consistently used by WINAPI and the real char type is switched there at compile time by the means or the "UNICODE" PP symbol ?

No, I don't think so. First beside the type you also have to support initialisations and access to the type. As far as I know (really never used wchar_t) it is: const char *text = "Hi world" and const wchar_t *text = L"Hi world"

Yeah, this is done with the TEXT("literal") macro in WinAPI:

    if TCHAR = char, then TEXT() expands to "literal"
    if TCHAR = wchar_t, then it expands to L"literal"

Naming the macro TEXT was, however, not the best choice ;) and we can avoid repeating this mistake.

...

How do you want to know whether you need "L" if you just have a new type?

What about functions/methods which do not exist for both types? You would always have to write #ifdef ... #else ... #end

Correct ... and I'm suggesting wrapping these routines (mainly those from cstring) with inline functions that do this, so that one does not have to do it in the application code. Instead of having to choose between strlen/wcslen/mbcslen you would use, say, "bstrlen".
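The proposal in this message can be sketched as follows. Everything here is hypothetical and exists only for illustration: the macro `EXAMPLE_USE_WIDE_CHAR` and the names `bchar`, `BTEXT` and `bstring` are not part of Boost, and `bstrlen` is just the name floated in the message above. The character type, the literal prefix and the string-length wrapper are all selected by one preprocessor symbol at compile time.

```cpp
#include <cstddef>
#include <cstring>
#include <cwchar>
#include <string>

// Define EXAMPLE_USE_WIDE_CHAR (a hypothetical symbol) to build with wchar_t.
#ifdef EXAMPLE_USE_WIDE_CHAR
typedef wchar_t bchar;
// Paste the L prefix onto a string literal; only valid for literals.
#define BTEXT(s) L##s
inline std::size_t bstrlen(const bchar* s) { return std::wcslen(s); }
#else
typedef char bchar;
#define BTEXT(s) s
inline std::size_t bstrlen(const bchar* s) { return std::strlen(s); }
#endif

// A string type that follows the selected character type.
typedef std::basic_string<bchar> bstring;
```

Application code would then write `bstring name = BTEXT("myfile");` and `bstrlen(name.c_str())` and compile unchanged in either mode. Note that the length returned is in units of `bchar`, not in characters, so the encoding questions raised elsewhere in the thread still apply.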

...

Can not even UTF-8 data be stored in wchar_t (first byte is always zero)? I think support both types together with different encodings in one program is just asking for trouble. Use a fixed encoding and one of char or wchar_t accross your whole program and you simplify your code a lot. Together with wrappers which convert your data e.g. from UTF-16 wchar_t to UTF-8 char after calling string functions on Win* you may have a slowdown but also a compatible program.

I was not suggesting supporting both character types at once in the same compiled binary of the application. Instead I would like to have the opportunity to decide which character type to use at the time of deployment on a particular hardware platform, OS and depending on other circumstances.

...

...
TCHAR is wchar_t or char depending on whether UNICODE is or

UNICODE is a very bad name! The size of the type (char, wchar_t) could depend on the encoding (UTF-8, UTF-16, ...), not the character set (Unicode)!

I strongly agree that UNICODE is a bad name and it would be necessary to apply the Boost conventions for naming PP symbols.

...

...
isn't defined. Boost library functions would use this *boost-char-type* (whatever it's name would be) instead of char or wchar_t, where applicable.

On Windows this allows to use the same WIN32 "functions" with both character types and allows an application (when coded properly) to be compiled with both character types without the need of messing with the code.

I'm sorely missing something like this in the C++ standard or at least in Boost and I think I'm not the only one.

Please use instead a proper string class which is aware of it's encoding and just transfers it on need. This avoids really any problems and is portable. See e.g. Qt's QString class: http://doc.trolltech.com/4.4/qstring.html

Well, this is exactly what I'm suggesting to do in Boost. Qt has its uses and it has its problems and there are some applications where I certainly would like to avoid using Qt. Why not define a "bstring" instead ;) I'm a new guy here, but still, I've noticed several posts related to this mainly from people using libraries that are wrapping around the WINAPI calls like Boost.Filesystem, Extension, etc. and found more of them in the archives.


Cheers -- ________________ ::matus_chochlik

Ulrich Eckhardt

12:15 p.m.

On Friday 16 May 2008 10:37:42 Enda Mannion wrote:

...

I am trying to use boost filesystem.

wchar_t* file = L"myfile";

*meeep* Wrong! Use 'wchar_t const*' instead. [Yes, that's not the issue here, but I just wanted to point it out.]

...

boost::filesystem::path path(file);

Take a look at the 'wpath' class:

    typedef basic_path<std::string, path_traits> path;
    typedef basic_path<std::wstring, wpath_traits> wpath;

'wpath' uses a wchar_t representation for a path.

Uli
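For readers without the Boost release discussed here, the same idea can be tried with `std::filesystem` from C++17. That is a deliberate substitution and not the `wpath` typedef quoted above: `std::filesystem::path` accepts both narrow and wide strings directly. The helper name `make_path` is hypothetical, and a C++17 compiler is assumed.

```cpp
#include <filesystem>

// Build a path from a wide string. std::filesystem::path converts it to
// the platform's native path encoding internally.
inline std::filesystem::path make_path(const wchar_t* name) {
    return std::filesystem::path(name);
}
```

Note the parameter is `const wchar_t*`, matching the correction given earlier in this message: a wide string literal should not be bound to a non-const pointer.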

Mathias Gaunard

1:55 p.m.

Enda Mannion wrote:

...

Does boost filesystem support wide character.

boost filesystem does indeed support wide characters. But wide characters aren't Unicode. They're platform- and locale-dependent. Cf. any Unix manual. Personally, I'd rather filesystem dropped support for wide characters and added real support for Unicode instead. It could take strings it knows are in Unicode, and even if the target system does not have Unicode support it could convert between encodings.
