Filesystem, serialization, character encoding and portable software
Hello,
I'm currently writing an open-source, portable (Win32 / OS X / Linux) application, and I'm using parts of the Boost libraries. The ones relevant to the issue described here are:
- Boost serialization
- Boost filesystem
At some point, I store a directory name (as a std::string) in an XML file through the Boost serialization library.
The encoding happens on a Windows platform and, as far as I can understand (I'm really new to all these character set issues), it seems to be done in the Windows-1252 character set.
One of the stored directories is named "français" (notice the cedilla on the c), and that is the one causing the issue.
When I load this file back on a Linux or OS X platform and try to access files in the "français" directory, I get a "no such directory" kind of error, because the encoding of the ç seems to be wrong. (Everything works fine if I use "francais" everywhere, without the cedilla.)
Is there a good starting document somewhere, a "character sets for dummies" with tutorials or the like?
How would you suggest I correct the issue (while keeping the "ç")?
Also, I understand this is not entirely a Boost issue, but Boost seems to have so many "hidden" functions that I'm wondering whether the libraries offer something for this that I missed.
Regards,
Mathieu
--
http://www.incub.net/
Mathieu Peyréga wrote:
Hello,
I'm currently writing an open-source, portable (Win32 / OS X / Linux) application, and I'm using parts of the Boost libraries. The ones relevant to the issue described here are:
- Boost serialization
- Boost filesystem
At some point, I store a directory name (as a std::string) in an XML file through the Boost serialization library.
The encoding happens on a Windows platform and, as far as I can understand (I'm really new to all these character set issues), it seems to be done in the Windows-1252 character set.
If it's an XML file there is no 'seems'. An XML file always has a well-defined encoding, or else it's tag soup (or whatever the term is) :)
The filename in the std::string, however, may well be Windows-1252, and I'd say this is already a problem: on Windows, path elements are Unicode encoded, and if you use a single-byte character set you will hit problems sooner or later. I recommend encoding the path elements as UTF-8 if you want to stick with std::string.
(...)
br, Martin
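Martin's suggestion of keeping path elements as UTF-8 inside a std::string can be sketched as follows. This is only an illustration, assuming the string really does hold Windows-1252 bytes as Mathieu suspects; the names `cp1252_to_utf8`, `append_utf8` and `kCp1252High` are hypothetical helpers and not part of Boost or of the original code.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Unicode code points for the Windows-1252 bytes 0x80..0x9F.
// The five bytes that Windows-1252 leaves undefined (0x81, 0x8D, 0x8F,
// 0x90, 0x9D) are mapped to the C1 control code of the same value.
// That is a choice made for this sketch, not something from the thread.
static const std::uint16_t kCp1252High[32] = {
    0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
    0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178
};

// Append one code point from the Basic Multilingual Plane as UTF-8.
static void append_utf8(std::string& out, std::uint32_t cp) {
    assert(cp <= 0xFFFF);  // Windows-1252 never maps outside the BMP
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Convert a string of Windows-1252 bytes to UTF-8.
std::string cp1252_to_utf8(const std::string& in) {
    std::string out;
    out.reserve(in.size());
    for (unsigned char byte : in) {
        std::uint32_t cp = byte;  // 0x00..0x7F and 0xA0..0xFF match Latin-1
        if (byte >= 0x80 && byte <= 0x9F) {
            cp = kCp1252High[byte - 0x80];
        }
        append_utf8(out, cp);
    }
    return out;
}
```

With this, the byte 0xE7 that Windows-1252 uses for "ç" becomes the two bytes 0xC3 0xA7. Calling such a helper on the directory name before it is serialized would give the Linux and OS X side a name in the encoding their file systems commonly use. Whether the Windows side can then open the path directly from the UTF-8 string depends on how the path is handed to the file API, so a conversion back may be needed there.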
The wide character xml archives use UTF8.
The narrow character xml archives use the currently set locale.
Robert Ramey
Robert Ramey wrote:
the wide character xml archives use UTF8.
the narrow character xml archives use the currently set locale.
Robert Ramey
Sorry for asking offhand:
What is the reasoning behind this different behaviour?
I would have expected that in both cases UTF-8 would have been used, assuming that the XML encoding is described as UTF-8. But I have probably overlooked something very basic?
Thanks for your patience and
Greetings from Bremen,
Daniel
Daniel Krügler wrote:
Robert Ramey wrote:
the wide character xml archives use UTF8.
the narrow character xml archives use the currently set locale.
Robert Ramey
Sorry for asking offhand:
What is the reasoning behind this different behaviour?
I assumed that most programs built with narrow characters used the locale concept to deal with this.
Wide character systems lend themselves to UTF coding, so I used that for wide char archives. In order to do this, I used Ron Garcia's UTF code conversion facet for streams.
It would be quite easy to generate UTF coding for narrow character archives. Just do the following:
a) Build the UTF code conversion facet for narrow character input (it's templated on character type).
b) When the stream is opened, attach this facet to the stream.
Note that the output char format is not really a property of the serialization library, but rather an artifact of the way it has been used. That is, the serialization library depends on the standard stream library for this property.
Robert Ramey
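The idea of attaching a conversion facet to the stream can be illustrated with the standard library alone. The sketch below is not the Boost code Robert refers to: it uses `std::codecvt_utf8` (deprecated since C++17, but still widely available) in place of Ron Garcia's facet, and it shows the wide stream case only, since that is the one the standard facet supports directly. The function names `write_utf8_file` and `read_raw_bytes` are hypothetical.

```cpp
#include <codecvt>
#include <fstream>
#include <iterator>
#include <locale>
#include <string>

// Write `text` to `path` through a wide stream whose locale carries a
// UTF-8 conversion facet. std::codecvt_utf8 stands in for the facet
// mentioned in the thread; that substitution is an assumption.
bool write_utf8_file(const std::string& path, const std::wstring& text) {
    std::wofstream out;
    // Attach the facet before the file is opened, so that every character
    // written goes through the conversion. The locale owns the facet.
    out.imbue(std::locale(std::locale::classic(),
                          new std::codecvt_utf8<wchar_t>));
    out.open(path.c_str(),
             std::ios_base::out | std::ios_base::binary | std::ios_base::trunc);
    if (!out) {
        return false;
    }
    out << text;
    out.close();
    return !out.fail();
}

// Read the file back as raw bytes, without any conversion, so the
// encoding that actually reached the disk can be inspected.
std::string read_raw_bytes(const std::string& path) {
    std::ifstream in(path.c_str(), std::ios_base::in | std::ios_base::binary);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}
```

Writing the wide string for "français" this way should put the bytes 0xC3 0xA7 on disk for the "ç", independent of the locale the program runs under. That is the property the narrow archives lack when they rely on the currently set locale.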
I would have expected that in both cases UTF-8 would have been used assuming that the xml-encoding is described as UTF-8. But I probably have overlooked something very basic?
Thanks for your patience and
Greetings from Bremen,
Daniel
Robert Ramey wrote:
Daniel Krügler wrote:
Robert Ramey wrote:
the wide character xml archives use UTF8.
the narrow character xml archives use the currently set locale.
Robert Ramey
Sorry for asking offhand:
What is the reasoning behind this different behaviour?
I assumed that most programs built with narrow characters used the locale concept to deal with this.
Wide character systems lend themselves to UTF coding so I used that for wide char archives. In order to do this, I used Ron Garcia's UTF code conversion facet for streams.
It would be quite easy to generate UTF coding for narrow character archives. Just do the following:
a) Build the UTF code conversion facet for narrow character input (it's templated on character type).
b) When the stream is opened, attach this facet to the stream.
Note that the output char format is not really a property of the serialization library, but rather an artifact of the way it has been used. That is, the serialization library depends on the standard stream library for this property.
Thanks for your thorough explanation, Robert. There remains a slightly bad feeling in my stomach (apologies for the possibly inappropriate metaphor):
Many programs are written to be compilable (and executable) in both narrow character and wide character mode. The difference in the serialization library described above unfortunately seems to have the effect that those two builds cannot interact with the same persisted serialization product, right?
To put it differently: if the programmer decides to switch from one character mode to the other (a typical use case, I think), (s)he has to take care of the extra steps that may be needed to keep the serialization I/O compatible.
This is especially cumbersome because the more typical switch is from narrow character to wide character mode. In that case the serialization has already caused harm: the old code created locale-dependent output, while the newer code is free of this locale dependency but now has the problem of interpreting the existing serialization output.
Have I understood this effect correctly?
Thanks,
- Daniel
participants (4)
- Daniel Krügler
- Martin Trappel
- Mathieu Peyréga
- Robert Ramey