RE: [boost] Re: Re: New design proposal for boost::filesystem

25 Aug 2004

      Hi,

I am sorry for the length of this e-mail; I hope I make myself clearer
than last time.
...
[mailto:boost-bounces@lists.boost.org] On Behalf Of Bennett, Patrick
...
On Behalf Of Ferdinand Prantl
I have nothing against usage of UTF-8 if it suits the 
scenario well. I 
just say that it is not an encoding for all purposes. It is a
multibyte 
one and so extremely inefficient for getting size, searching, etc.
[Bennett, Patrick] I fail to see how this is the case.  Right 
now, today, filesystem supports *only* ascii.  If you
IMHO that is not true. boost::filesystem neither recodes nor modifies the
strings passed into its methods. It calls the underlying OS functions: the
XxxA functions on Windows and the POSIX functions on UN*Xes. These work in
the current locale on UN*Xes and in the current ANSI codepage on Windows.
They accept all characters valid in the current codepage. If you change the
codepage, you may see the filenames garbled, because the same bytes can map
to different characters, but that is all, and it is not a problem of
boost::filesystem.

The current boost::filesystem does not solve this; it leaves it to the
programmer to convert the filenames if needed. You can use UTF-8 even
today on non-Windows platforms. Unfortunately, there is no setting on
Windows to enforce a UTF-8 codepage, as there is on UN*Xes. You set the
regional settings and the codepage is assigned automatically. So the XxxA
functions continue to use the ANSI encoding.

Well, the speed of searching in filenames is debatable... But I like to use
only one string type for everything in my application model and convert
only in the rare places where contact with the OS is necessary. I like to
use wstring only.
...
If you 
continued to use ASCII, nothing would change.
Yes, if I used only ASCII, but I use all characters from the current locale.
I write localized applications, including localized file names. Simply
throwing UTF-8 in would break them.
...
There is zero 
speed penalty for calculating the # of *bytes* in an utf-8 
string.  If you want to determine the # of characters, then 
there is, but then only if you're actually working on a 
Unicode string.  There is no getting around this for an 
international application, no matter what encoding is used.
You cannot predict whether a string contains non-ASCII characters or not.
Use a different, more suitable Unicode encoding for that. If you have to
state the complexity of your algorithms and the application is
time-critical, you use the fixed-width (UCS) encodings rather than the
multibyte (UTF) ones. The UTF encodings are recommended for transport
because they do not waste space, the UCS ones for runtime because they
guarantee constant-time access to the n-th character. But it is not a
dogma; UTF encodings are not "forbidden" at runtime, it depends on the
requirements of the application.
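
To make the size argument concrete, here is a minimal sketch of the two
operations being discussed. The function names are made up for this
illustration, and the input is assumed to be well-formed UTF-8:

```cpp
#include <cstddef>
#include <string>

// Number of bytes: constant time, std::string already stores its size.
std::size_t byte_count(const std::string& utf8) {
    return utf8.size();
}

// Number of code points: linear time, every byte has to be inspected.
// Continuation bytes have the bit pattern 10xxxxxx; every other byte
// starts a new code point. Assumes well-formed UTF-8 (no validation).
std::size_t code_point_count(const std::string& utf8) {
    std::size_t n = 0;
    for (std::string::size_type i = 0; i < utf8.size(); ++i) {
        unsigned char c = static_cast<unsigned char>(utf8[i]);
        if ((c & 0xC0) != 0x80) ++n;
    }
    return n;
}
```

For a pure ASCII string both functions return the same number; they differ
as soon as one non-ASCII character is present, and only the second one has
to walk the whole string.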

Windows (XxxW) and Java, for example, use 16-bit characters in their APIs
(UCS-2, or strictly speaking UTF-16 once surrogate pairs come into play).
...
...
Why to prescribe it for all boost::filesystem users and 
force them to
put 
recoding into their sources,
[Bennett, Patrick] Absolutely no recoding would be necessary 
for current users of boost::filesystem.  boost::filesystem 
has no support for unicode today, so why would they have to 
recode anything?
As I said, absolutely no recoding is necessary now if you use
boost::filesystem with strings in the native codepage, for example an
Eastern European or Cyrillic one. Everything works well because you deliver
your application localized for these countries.

If boost::filesystem started accepting only UTF-8, such an application would
break. An application that does not use Unicode (on Windows it does not have
to, and it can still run fine on Japanese Windows) would have to recode from
the local codepage in which it works into UTF-8, which boost::filesystem
would then expect.

And I have not yet mentioned that Windows 9x does not implement all XxxW
functions (an additional library must be used to build an application), and
that the recoding functions on Windows 95 do not support UTF-8 as a codepage
the way Windows 98/NT do. That would mean a different solution for
Windows 9x within boost::filesystem.
...
...
UTF-8 is not identical with the complete iso-8859-1 
(latin1) codepage.
Some code could be broken by accepting UTF-8 in the new version.
[Bennett, Patrick] Hmmm, good point, but... would it break 
for any of the characters that are valid characters for a 
path or filename on an
8859-1 system?  No, not that I can think of.
Again, all characters in the current codepage are valid for the XxxA
functions, and all characters in the current locale for the POSIX functions,
except for wildcards and path delimiters, of course. So it would be an
interface change that breaks all such usage scenarios.
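
To illustrate what "recoding" would mean for an application that works in
iso-8859-1 today, here is a minimal sketch. The function name is
hypothetical; the conversion is simple for Latin-1 only because its byte
values coincide with the Unicode code points U+0000..U+00FF, and other
codepages would need a mapping table:

```cpp
#include <string>

// Recode ISO-8859-1 bytes to UTF-8. Bytes below 0x80 are ASCII and stay
// as they are; bytes 0x80..0xFF become a two-byte UTF-8 sequence.
std::string latin1_to_utf8(const std::string& latin1) {
    std::string out;
    for (std::string::size_type i = 0; i < latin1.size(); ++i) {
        unsigned char c = static_cast<unsigned char>(latin1[i]);
        if (c < 0x80) {
            out += static_cast<char>(c);
        } else {
            out += static_cast<char>(0xC0 | (c >> 6));    // lead byte
            out += static_cast<char>(0x80 | (c & 0x3F));  // continuation
        }
    }
    return out;
}
```

A Latin-1 name such as "caf\xE9" is therefore not the same byte sequence as
its UTF-8 form "caf\xC3\xA9", so passing the former to an interface that
expects UTF-8 would not name the same file.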
...
[Bennett, Patrick] If you can think of a good way of handling 
this that doesn't involve a mess of codepages, locales, and 
facets, then I'm all for it.  Frankly I think C++'s 
'built-in' internationalization support is a nightmare, but 
that's probably just me.  My (intentional) limited exposure 
to them probably hasn't helped.  It's hard to beat having a 
'single' encoding like UTF-8 that can handle all defined characters.
Unicode is definitely the way to go IMO.
Unicode is the ideal way; however, not all systems support it or are
configured to use it. If you forced it in boost::filesystem, you would
force Unicode on all applications using the library, including those that
do not use it now.

Currently boost::filesystem does not force it, but allows it with the
following setup:

A local encoding on UNIX: no problem
  OS: configure for example cs_CZ environment
  boost::filesystem: use cs_CZ strings

UTF-8 on UNIX: no problem
  OS: configure UTF-8 environment
  boost::filesystem: use UTF-8 strings

A local encoding on Windows: no problem
  OS: configure for example Czech environment
  boost::filesystem: use Czech strings

Windows has no predefined regional setting with UTF-8; it always uses an
ANSI codepage. Unicode support is additional, and preferred on NT. If you
wanted UTF-8 in the boost::filesystem API, you would have to:

1) either use XxxW methods in boost::filesystem and convert from UTF-8 to
UCS-2
2) or recode the input UTF-8 strings into the local ANSI codepage within
boost::filesystem to be able to use XxxA methods

ad 1) this needs a recoding engine (built into Windows, with UTF-8 supported
from 98/NT on)
ad 2) this works only for names within the current codepage (UTF-8 recoding
is available from 98/NT on)
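
As a rough idea of the recoding in option 1, here is a portable sketch that
decodes UTF-8 into 16-bit code units. The function name is made up, and
std::u16string is an assumption (it requires a C++11 compiler). On Windows
itself the built-in MultiByteToWideChar with CP_UTF8 would be used instead.
For characters in the Basic Multilingual Plane the result is identical to
UCS-2; characters beyond it need surrogate pairs:

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Decode UTF-8 and emit UTF-16 code units. Rejects bad lead bytes,
// truncated sequences and bad continuation bytes; it does not reject
// overlong forms, so it is a sketch rather than a full validator.
std::u16string utf8_to_utf16(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char b = static_cast<unsigned char>(in[i]);
        char32_t cp;
        std::size_t extra;  // number of continuation bytes expected
        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");
        if (in.size() - i <= extra)
            throw std::invalid_argument("truncated UTF-8 sequence");
        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        i += extra + 1;
        if (cp > 0x10FFFF)
            throw std::invalid_argument("code point out of range");
        if (cp < 0x10000) {
            out += static_cast<char16_t>(cp);
        } else {  // outside the BMP: encode as a surrogate pair
            cp -= 0x10000;
            out += static_cast<char16_t>(0xD800 + (cp >> 10));
            out += static_cast<char16_t>(0xDC00 + (cp & 0x3FF));
        }
    }
    return out;
}
```

Option 2 would add a second step on top of this, from wide characters to
the ANSI codepage, and that step loses every character the codepage cannot
represent.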
...
My real issue with boost::filesystem is that as currently 
defined, it's unusable in an application that will be used 
around the world.  My initial response to this whole thread 
was just to point out to David that there *are* issues 
preventing people from using the library.  He didn't think 
there were any, so I was compelled to point out what one of 
the issues was for me at least.
Could you specify exactly what prevents you from using it? I can use
boost::filesystem in Czech and Russian environments happily, and for
Japanese it is the same as long as you stay within the local codepage.

boost::filesystem is perfectly usable in any localized application whose
running instance works in a single language environment. The application
itself can be installed in any language environment, however. The only thing
you have to keep in mind is that the strings accepted by boost::filesystem
are in the current locale/ANSI codepage.

The problem is somewhere else and does not concern only boost::filesystem
but the whole locale support by OSes.

Just to clarify: a localized application is an application that can run in a
localized environment, has a localized interface and accepts all arguments
(here: filenames) from the localized environment. boost::filesystem fulfills
this for any existing locale, including Japanese, without recompiling
(single binary, localized resources).

Nowadays there is another requirement - sharing applications and their data
across several locales (locale switching, data distribution) without losing
information (here: garbled characters in filenames). To make myself clear -
using several locales together in a single application, for example Czech,
Russian and Japanese together.

Unfortunately, this is not possible to achieve using the POSIX functions
(libc, libstdc++/STL), because they use the current locale, which need not
be a Unicode (UTF-8) one; it can be any 8-bit ("ANSI") encoding. AFAIK only
Windows implements a C/C++ runtime with native Unicode support, using UCS-2.

It is an open question whether boost::filesystem should help with this.
Currently it behaves the same way as POSIX, libc and libstdc++/STL: it works
in the current locale. BTW, I think the ANSI C committee should long ago
have updated the library specification to include wchar_t variants of the
file functions... That could have had a positive influence on std::streams
(wchar_t* overloads) and on boost::filesystem as well. In such an ideal
world you would have no problems :-)

You are right in these points:

1) you cannot write an application that uses characters from several
codepages (locales) in one filename
2) if characters are mapped differently in two codepages, the filenames are
incompatible, although the characters are valid in both encodings

Currently you cannot use boost::filesystem to write Windows applications
that require the Unicode API (XxxW functions), which is what the two points
above come down to. That is not because boost::filesystem does not use
UTF-8. It is because: ANSI C does not use Unicode -> ANSI C++ (STL) does not
use Unicode -> boost::filesystem does not use Unicode.

My opinion: the way to solve it is not to change the encoding of the
std::string and char* parameters of the current boost::filesystem to UTF-8,
but to follow the STL way - to introduce methods taking std::wstring and
wchar_t or, better, to templatize it with basic_string<CharType>. It is
difficult because the file functions in STL and libc have no wchar_t
variants, I know.
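
A minimal sketch of what I mean by templatizing. This is a hypothetical
class, not the boost::filesystem interface; the names basic_path, path and
wpath are only assumptions for the illustration, and a real implementation
would also have to choose the XxxA or XxxW call depending on the character
type:

```cpp
#include <string>

// Hypothetical sketch: a path type parameterized on the character type,
// in the manner of std::basic_string.
template <typename CharT>
class basic_path {
public:
    typedef std::basic_string<CharT> string_type;

    basic_path() {}
    explicit basic_path(const string_type& s) : str_(s) {}

    // Append a component, inserting a '/' separator when needed.
    basic_path& operator/=(const string_type& component) {
        const CharT slash = static_cast<CharT>('/');
        if (!str_.empty() && str_[str_.size() - 1] != slash) str_ += slash;
        str_ += component;
        return *this;
    }

    const string_type& string() const { return str_; }

private:
    string_type str_;
};

typedef basic_path<char>    path;   // strings in the current locale/codepage
typedef basic_path<wchar_t> wpath;  // wide (Unicode) strings
```

Existing code that passes std::string in the current locale would keep
using path unchanged, while a Unicode application would use wpath, for
example wpath p(L"data"); p /= L"\u010Daj.txt";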
...
At the company where I work we're currently just pursuing our 
own wrappers for what filesystem provides.  I originally 
tried using filesystem, but once I saw that its handling of 
internationalization was absent, I had no choice but to dump 
it.  I certainly have an interest in it being improved, and I 
could see looking at it again, but someone will have to 
spearhead that initiative.  Considering that this hasn't 
really been brought up before  tells me that people either 
aren't using the library, or simply don't care about 
internationalization (probably the latter).  I, 
unfortunately, don't have that luxury.
I am from an Eastern European country and you bet we are concerned with
internationalization. However, no change comes without a discussion; you
are not alone in using the library, and your suggested change would cause
difficulties for others (at least for me ;-).

That nobody has raised the problem could also mean that there is a usage
style that works. I pointed this out above.

Besides, I don't have the luxury of supporting only platforms with full
support for the XxxW functions and UTF-8 support in WideCharToMultiByte.
That's why I would like to extend boost::filesystem so that it keeps the
current behavior (working in the current locale) but performs recoding for
Unicode applications, which need not happen if the OS supports Unicode
natively. That's why I pointed out the imbue principle used in std::streams
for a similar case - the application can work in Unicode (UCS-2 wchar_t) and
the content of the stream can be in any encoding (including UCS-2 wchar_t,
with no recoding).

Ferda
...
Cheers...
Patrick Bennett