
As some may know, I am working on a Unicode library that I plan to submit to Boost fairly soon. The codecs in that library are based around iterators and ranges, but since there was some demand for support for codecvt facets, I am working on adapting those into that form as well. Unfortunately, it seems it is only possible to subclass std::codecvt<char, char, mbstate_t> and std::codecvt<wchar_t, char, mbstate_t>.

I personally don't know that much about iostreams/locales, but I have looked quickly at libstdc++'s implementation, and it doesn't seem possible for std::locale to contain any other instance of codecvt. What I wonder, then, is whether there is really a point to facets. std::codecvt<wchar_t, char, mbstate_t> means that the in-memory charset would be UTF-16 or UTF-32 (depending on the size of wchar_t) while the file would be UTF-8. The problem is that wchar_t is platform-dependent and not really reliable, so it's not something I'd recommend using as the in-memory representation for Unicode.

Why do people even use utf8_codecvt_facet anyway? What's wrong with dealing with UTF-8 rather than UTF-16 or UTF-32?

As some may know, I am working on a Unicode library that I plan to submit to Boost fairly soon.
Take a look at the Boost.Locale proposal.
The codecs in that library are based around iterators and ranges, but since there was some demand for support for codecvt facets, I am working on adapting those into that form as well.
Unfortunately, it seems it is only possible to subclass std::codecvt<char, char, mbstate_t> and std::codecvt<wchar_t, char, mbstate_t>.
Yes, these are actually the only specialized classes. More than that, std::codecvt<char, char, mbstate_t> is supposed to be a "no conversion" facet.
I personally don't know that much about iostreams/locales, but I have looked quickly at libstdc++'s implementation, and it doesn't seem possible for std::locale to contain any other instance of codecvt.
You can derive from these two classes and re-implement them (like I did in Boost.Locale). Also, I strongly recommend taking a look at locales and iostreams in the standard library if you are working with Unicode in C++.
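For example, the skeleton of such a facet looks more or less like this. This is a minimal sketch, not Boost.Locale's actual code: the conversion bodies are stubbed to the ASCII subset of UTF-8 to keep it short, and a real facet would also override do_unshift() and do_length() consistently.

#include <locale>
#include <cwchar>
#include <cstddef>

// UTF-8 <-> wchar_t facet skeleton (ASCII subset only).
class utf8_codecvt : public std::codecvt<wchar_t, char, std::mbstate_t>
{
public:
    explicit utf8_codecvt(std::size_t refs = 0)
        : std::codecvt<wchar_t, char, std::mbstate_t>(refs) {}

protected:
    // in-memory (wide) -> external (narrow, UTF-8)
    virtual result do_out(std::mbstate_t&,
                          const wchar_t* from, const wchar_t* from_end,
                          const wchar_t*& from_next,
                          char* to, char* to_end, char*& to_next) const
    {
        for (; from != from_end && to != to_end; ++from, ++to) {
            if (static_cast<unsigned long>(*from) > 0x7F) {
                from_next = from; to_next = to; return error;
            }
            *to = static_cast<char>(*from);
        }
        from_next = from; to_next = to;
        return from == from_end ? ok : partial;
    }

    // external (narrow, UTF-8) -> in-memory (wide)
    virtual result do_in(std::mbstate_t&,
                         const char* from, const char* from_end,
                         const char*& from_next,
                         wchar_t* to, wchar_t* to_end, wchar_t*& to_next) const
    {
        for (; from != from_end && to != to_end; ++from, ++to) {
            unsigned char c = static_cast<unsigned char>(*from);
            if (c > 0x7F) { from_next = from; to_next = to; return error; }
            *to = static_cast<wchar_t>(c);
        }
        from_next = from; to_next = to;
        return from == from_end ? ok : partial;
    }

    virtual bool do_always_noconv() const throw() { return false; }
    virtual int do_encoding() const throw() { return 0; }  // variable-width
    virtual int do_max_length() const throw() { return 4; }
};

The facet is then installed into a stream's locale with imbue(), as shown further down the thread.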
What I wonder, then, is whether there is really a point to facets. std::codecvt<wchar_t, char, mbstate_t> means that the in-memory charset would be UTF-16 or UTF-32 (depending on the size of wchar_t) while the file would be UTF-8.
Not exactly: the narrow encoding may be any 8-bit encoding, even something like Latin-1 or Shift-JIS (and UTF-8 as well).
The problem is that wchar_t is platform-dependent and not really reliable, so it's not something I'd recommend using as the in-memory representation for Unicode.
Welcome to the broken Unicode world of C++. Yes, wchar_t is platform-dependent; if you want to use it you have to support both encodings, UTF-16 and UTF-32 (technically it may even be 8 bits wide, but no such implementation exists). C++0x provides char16_t and char32_t to fix this defect in the standard.
Why do people even use utf8_codecvt_facet anyway? What's wrong with dealing with UTF-8 rather than UTF-16 or UTF-32?
Ask Windows developers: they use wide strings because it is the only way to work correctly with their OS. Artyom

On 05/07/10 17:27, Artyom wrote:
As some may know, I am working on a Unicode library that I plan to submit to Boost fairly soon.
Take a look at the Boost.Locale proposal.
I know of it, yes. But my library purposely *doesn't* use the standard C++ locale subsystem because it's slow, broken, and inflexible. Nevertheless I want to provide the ability to bridge my library with that system.
The codecs in that library are based around iterators and ranges, but since there was some demand for support for codecvt facets, I am working on adapting those into that form as well.
Unfortunately, it seems it is only possible to subclass std::codecvt<char, char, mbstate_t> and std::codecvt<wchar_t, char, mbstate_t>.
Yes, these are actually the only specialized classes.
I was hoping I could specialize some more myself. Some implementations appear to support using arbitrary codecvt facets just fine, but not GCC's or MSVC's.
More than that, std::codecvt<char, char, mbstate_t> is supposed to be a "no conversion" facet.
I'm talking about types derived from these. There is no restriction requiring subclasses of std::codecvt<char, char, mbstate_t> to be non-converting; only std::codecvt<char, char, mbstate_t> itself is.
You can derive from these two classes and re-implement them (like I did in Boost.Locale).
That's indeed what I said I can do, but as I also said, I find that very limiting.
Also, I strongly recommend taking a look at locales and iostreams in the standard library if you are working with Unicode in C++.
The thing is, I'm not sure it's worth delving into it too much. On top of being a so-so design, the popular implementations all seem to do things differently and have different limitations.
What I wonder, then, is whether there is really a point to facets. std::codecvt<wchar_t, char, mbstate_t> means that the in-memory charset would be UTF-16 or UTF-32 (depending on the size of wchar_t) while the file would be UTF-8.
Not exactly: the narrow encoding may be any 8-bit encoding, even something like Latin-1 or Shift-JIS (and UTF-8 as well).
My library doesn't aim at providing code conversion from/to every character set ever invented, which is why I just put UTF-8 in there. Regardless, I intend to allow defining a codecvt facet from any pair of objects modeling the Converter concept; nothing would prevent someone from writing one, or chaining them, to do whatever they want, provided it converts between char and char or between wchar_t and char, since it seems there is no way around that. That way you can also do normalization, case conversion or whatnot with a codecvt facet.
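To give an idea of the shape of that bridge, here is a rough, purely hypothetical sketch; the Converter concept's convert() member as written here is invented for illustration and is not the library's actual interface:

#include <locale>
#include <cwchar>
#include <cstddef>

// Forwards the codecvt virtuals to a pair of converter objects. Each
// converter is assumed to consume what it can of [from, from_end),
// write into [to, to_end), update the next pointers, and return a
// codecvt_base::result -- an invented interface for this sketch.
template<typename InConverter, typename OutConverter>
class converter_codecvt : public std::codecvt<wchar_t, char, std::mbstate_t>
{
public:
    explicit converter_codecvt(std::size_t refs = 0)
        : std::codecvt<wchar_t, char, std::mbstate_t>(refs) {}

protected:
    virtual result do_in(std::mbstate_t&,
                         const char* from, const char* from_end,
                         const char*& from_next,
                         wchar_t* to, wchar_t* to_end, wchar_t*& to_next) const
    {
        return InConverter().convert(from, from_end, from_next,
                                     to, to_end, to_next);
    }

    virtual result do_out(std::mbstate_t&,
                          const wchar_t* from, const wchar_t* from_end,
                          const wchar_t*& from_next,
                          char* to, char* to_end, char*& to_next) const
    {
        return OutConverter().convert(from, from_end, from_next,
                                      to, to_end, to_next);
    }

    virtual bool do_always_noconv() const throw() { return false; }
    virtual int do_encoding() const throw() { return 0; }
    virtual int do_max_length() const throw() { return 4; }
};

A chained converter (UTF-8 decode, then normalize, then UTF-16 encode) would simply be another type plugged in as InConverter.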
C++0x provides char16_t and char32_t to fix this defect in the standard.
GCC has those types in C++0x mode, but doesn't support codecvt facets with them.
Why do people even use utf8_codecvt_facet anyway? What's wrong with dealing with UTF-8 rather than UTF-16 or UTF-32?
Ask Windows developers: they use wide strings because it is the only way to work correctly with their OS.
utf8_codecvt_facet is a utility provided by Boost in the detail namespace that some libraries not particularly tied to Windows appear to use.

Mathias Gaunard wrote:
On 05/07/10 17:27, Artyom wrote:
As some may know, I am working on a Unicode library that I plan to submit to Boost fairly soon.
Take a look at the Boost.Locale proposal.
I know of it, yes. But my library purposely *doesn't* use the standard C++ locale subsystem because it's slow, broken, and inflexible. Nevertheless I want to provide the ability to bridge my library with that system.
The codecs in that library are based around iterators and ranges, but since there was some demand for support for codecvt facets, I am working on adapting those into that form as well. Unfortunately, it seems it is only possible to subclass std::codecvt<char, char, mbstate_t> and std::codecvt<wchar_t, char, mbstate_t>.
Yes, these are actually the only specialized classes.
I was hoping I could specialize some more myself.
You can, but you should specialize on the third parameter if you want to treat chars differently than the default. A codecvt<char, char, my_state_t> could easily do any kind of conversion you need. You then only need to supply a non-default char_traits template parameter to basic_fstream, with my_state_t as its state_type, and imbue the file with a locale containing your specialized codecvt. Who said it should be easy? :-)
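Something along these lines; all names are invented for illustration, the bodies just pass bytes through where a real facet would transcode, and whether the primary codecvt template can be instantiated with a custom state type at all is implementation-dependent:

#include <locale>

// Invented tag state: its only job is to give the facet a type distinct
// from the standard codecvt<char, char, mbstate_t>.
struct my_state_t { };

// char -> char facet keyed on the custom state type. These bodies copy
// bytes through unchanged; a real facet would convert here.
class my_codecvt : public std::codecvt<char, char, my_state_t>
{
protected:
    virtual result do_in(my_state_t&,
                         const char* from, const char* from_end,
                         const char*& from_next,
                         char* to, char* to_end, char*& to_next) const
    {
        while (from != from_end && to != to_end) *to++ = *from++;
        from_next = from; to_next = to;
        return from == from_end ? ok : partial;
    }

    virtual result do_out(my_state_t& state,
                          const char* from, const char* from_end,
                          const char*& from_next,
                          char* to, char* to_end, char*& to_next) const
    {
        return do_in(state, from, from_end, from_next, to, to_end, to_next);
    }

    virtual bool do_always_noconv() const throw() { return false; }
};

Bo Persson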

On 05/07/2010 18:48, Bo Persson wrote:
You can, but you should specialize on the third parameter if you want to treat chars differently than the default.
A codecvt<char, char, my_state_t> could easily do any kind of conversion you need.
I don't particularly need state; what I would like is to be able to use the right character types rather than wide characters, which are implementation-specific. Also, it doesn't seem that states other than mbstate_t can be used by filebufs.

Mathias Gaunard wrote:
On 05/07/2010 18:48, Bo Persson wrote:
You can, but you should specialize on the third parameter if you want to treat chars differently than the default.
A codecvt<char, char, my_state_t> could easily do any kind of conversion you need.
I don't particularly need state; what I would like is to be able to use the right character types rather than wide characters, which are implementation-specific.
You mean the implementation doesn't provide the right character types? :-) The third parameter is there so you can specialize the use of chars, other than the default implementation's. The codecvt<char, char, mbstate_t> does what is "standard" on the platform. If you want something else, you need a codecvt<char, char, non_std_state_t>, even if you don't use the state variable passed to your codecvt<>. The standard codecvt<> doesn't use it either.
Also, it doesn't seem that states other than mbstate_t can be used by filebufs.
It should work, but it is perhaps not the most tested part of a standard library. The filebuf (basic_filebuf, that is) is templated on char and std::char_traits<char>. Again, if you want something other than the default, you need basic_filebuf<char, non_std_char_traits<char>>. With the current standard you only have char and wchar_t. You could perhaps use some other integer types as characters, but you will then run into problems like having no string literals available.
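For completeness, a sketch of the traits half, building on the invented my_state_t/my_codecvt above. Note that this relies on the implementation providing a usable primary codecvt template (including its static id member), which is exactly the uncertain part:

#include <fstream>
#include <locale>
#include <string>

// char_traits identical to the default except for state_type (and the
// matching pos_type); this is what makes basic_filebuf look up
// codecvt<char, char, my_state_t> instead of the standard facet.
struct my_char_traits : std::char_traits<char>
{
    typedef my_state_t            state_type;
    typedef std::fpos<my_state_t> pos_type;
};

typedef std::basic_ifstream<char, my_char_traits> my_ifstream;

int main()
{
    my_ifstream in;
    // The locale must contain the facet before the file is opened.
    in.imbue(std::locale(std::locale(), new my_codecvt));
    in.open("input.txt");
}

Bo Persson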

I know of it, yes. But my library purposely *doesn't* use the standard C++ locale subsystem because it's slow, broken, and inflexible. Nevertheless I want to provide the ability to bridge my library with that system.
I would disagree. It has a very interesting foundation and is very extensible, allowing you to carry all culture information in one class and copy it efficiently.
Regardless, I intend to allow defining a codecvt facet from any pair of objects modeling the Converter concept; nothing would prevent someone from writing one, or chaining them, to do whatever they want, provided it converts between char and char or between wchar_t and char, since it seems there is no way around that.
[snip]
That way you can also do normalization, case conversion or whatnot with a codecvt facet.
That is not the job of a codecvt facet; you'll also find many issues with it, as it is not suitable for that. In such a case I would suggest that it may be better to use an iostreams-like interface, or to create an additional facet that does normalization and case conversion. Don't use codecvt; it is designed for specific purposes. You need facets? Create your own. Also, are you aware of the fact that case conversion is locale-dependent? So you still need to connect a locale somehow.
utf8_codecvt_facet is a utility provided by Boost in the detail namespace that some libraries not particularly tied to Windows appear to use.
utf8_codecvt_facet is designed to be used by the many libraries that want to convert text between wide and narrow formats. For example, Boost.Program_options uses it for text conversion for those who need a wide API.
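The usage pattern is simply to imbue a wide stream with a locale holding the facet before opening it. Here, the utf8_codecvt sketch from earlier in the thread stands in for Boost's facet, whose exact class name and namespace vary with the library that instantiates the header:

#include <fstream>
#include <locale>

int main()
{
    std::wofstream file;
    // Install the facet before open(); wide strings written to the
    // stream come out as UTF-8 bytes in the file.
    file.imbue(std::locale(file.getloc(), new utf8_codecvt));
    file.open("out.txt");
    file << L"hello\n";
}

Artyom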

On 05/07/2010 20:35, Artyom wrote:
I would disagree. It has a very interesting foundation and is very extensible, allowing you to carry all culture information in one class and copy it efficiently.
I do it better: no class, just a big table, some converters, and some segmenters.
That is not the job of a codecvt facet.
I provide a generic bridge between my converters and codecvt facets. If you instantiate it with a converter that decodes UTF-8, does normalization, then re-encodes in UTF-16, it does just that. I actually think it's a good idea to put the normalization in there. A lot of things require the string to be normalized to work properly, so if you can automatically do that for unreliable data from files, it's one less worry.
you'll also find many issues with it, as it is not suitable for that.
What kind of issues? Are you expecting many-to-many conversion to not work?
or to create an additional facet that does normalization and case conversion. Don't use codecvt; it is designed for specific purposes. You need facets? Create your own.
I only need facets if they can be used by an fstream to convert their data. There is no point in creating other types of facets if they're not used by the iostreams subsystem. The whole point of the exercise is to allow the iostreams subsystem to make use of my converters.
Also, are you aware of the fact that case conversion is locale-dependent? So you still need to connect a locale somehow.
My library is locale-agnostic.

On 06/07/2010 13:04, Artyom wrote:
Also, are you aware of the fact that case conversion is locale-dependent? So you still need to connect a locale somehow.
My library is locale-agnostic.
Then you should not provide case conversion, as you can't do it right.
Sure you can, it's just not tailored to a particular language. See <http://unicode.org/reports/tr21/tr21-5.html>. My library doesn't do it at the moment anyway, and probably won't for some time.

From: Mathias Gaunard
On 05/07/2010 20:35, Artyom wrote:
you'll also find many issues with it, as it is not suitable for that.
What kind of issues? Are you expecting many-to-many conversion to not work?
There *may* be issues, yes. The question is still unclear, IMO, and is summarized at the end of this post by Alberto Barbati, who once proposed a "Boost UTF" library of codecvt facets: http://lists.boost.org/Archives/boost/2003/01/41969.php. The DR Alberto refers to in that post is issue 393: http://www.open-std.org/jtc1/sc22/wg21/docs/lwg-closed.html#393, whose proposed resolution is applied in n3092.

However, this seems to create a contradiction: the note coming from the resolution of issue 393 is there, but the footnote stating that this paragraph informally means that only 1-to-N encodings are supported with basic_filebuf is there too. This is unfortunate, considering that issue 393 was created precisely to allow the use of "state" to support N-to-M encodings without violating the "one internal character at a time" rule.

Summing up: it seems clear that codecvt in itself can support N-to-M conversions, but it is unclear (to me, at least) whether such a facet is usable with basic_filebuf.
I only need facets if they can be used by an fstream to convert their data. There is no point in creating other types of facets if they're not used by the iostreams subsystem. The whole point of the exercise is to allow the iostreams subsystem to make use of my converters.
Indeed. And, considering the above, there may be a problem. However, even if it may not work with the standard fstream, there is still a useful application area: Boost.IOStreams' code_converter (in fact, we currently use Alberto's proposed library that way).
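For illustration, that usage looks roughly like this; a sketch assuming Boost.IOStreams' documented code_converter interface, with the utf8_codecvt sketch from earlier in the thread standing in for a real UTF-8 facet:

#include <boost/iostreams/code_converter.hpp>
#include <boost/iostreams/device/file.hpp>
#include <boost/iostreams/stream.hpp>

namespace io = boost::iostreams;

// code_converter performs the conversion itself, wrapping a narrow-char
// device into a wide-char one, so basic_filebuf's limitations don't apply.
typedef io::code_converter<io::file_sink, utf8_codecvt> utf8_file_sink;

int main()
{
    io::stream<utf8_file_sink> out(io::file_sink("out.txt"));
    out << L"hello\n";
}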
participants (4)
- Artyom
- Bo Persson
- Eric MALENFANT
- Mathias Gaunard