[locale] Composing asymmetric locale for character encoding conversion


Andrey Semashev

2 Mar 2013

10:56 a.m.

Hi,

Suppose I have a logging application that writes log records in wide (wchar_t, UTF-16) and narrow (char, UTF-8) encodings, and I want these logs to be stored in a UTF-16LE encoded file. For simplicity, let's assume that I write log files with std::wofstream.

Now, the standard says that the file stream buffer is supposed to convert wide characters to byte sequences using the locale imbued into the buffer. However, it seems that this locale should be the same as the one imbued into the stream (basic_ostream::imbue makes sure of that). What this leads to is that, in order to achieve my goal, the locale should be able to convert narrow UTF-8 characters to wide UTF-16 characters, and wide UTF-16 characters to narrow characters representing a UTF-16LE byte sequence.

Is it possible to make such an asymmetric locale with Boost.Locale? Or maybe there is another way of doing this?

An additional question: is it possible to achieve my goal with std::ofstream (as opposed to std::wofstream)? I have a very strong suspicion that the answer is no, because the narrow characters will pass on unconverted to the file instead of being translated from UTF-8 to UTF-16LE, but maybe I'm missing something.

Thank you.


Artyom Beilis

2 Mar

8:30 p.m.


...

________________________________ From: Andrey Semashev <andrey.semashev@gmail.com> To: boost@lists.boost.org Sent: Saturday, March 2, 2013 12:56 PM Subject: [boost] [locale] Composing asymmetric locale for character encoding conversion

Hi,

Suppose I have a logging application that writes log records in wide (wchar_t, UTF-16) and narrow (char, UTF-8) encodings and I want these logs to be stored in a UTF-16LE encoded file. For simplicity, let's assume that I write log files with std::wofstream. Now, the standard says that the file stream buffer is supposed to convert wide characters to byte sequences using the locale imbued into the buffer.

In general it is done by the codecvt facet, but it is designed to convert wide characters to an 8-bit encoding and vice versa.

...

However, it seems that the locale should be the same as the one imbued into the stream (basic_ostream::imbue makes sure of that).

No, you can install your own codecvt facet into an existing locale object and then imbue it into the stream.
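To illustrate the suggestion above, here is a minimal sketch of composing a locale from an existing one plus a replacement codecvt facet. The helper names `with_utf8_codecvt` and `imbue_utf8` are hypothetical, and `std::codecvt_utf8` is used only as a convenient stand-in for "your own codecvt" (it requires C++11 and is deprecated since C++17).

```cpp
#include <codecvt>
#include <locale>
#include <ostream>

// Hypothetical helper: returns a copy of `base` in which the
// codecvt<wchar_t, char, mbstate_t> facet is replaced by a UTF-8 one.
// All other facets (ctype, numpunct, ...) are taken from `base` unchanged.
inline std::locale with_utf8_codecvt(const std::locale& base)
{
    // The new locale object takes ownership of the facet.
    return std::locale(base, new std::codecvt_utf8<wchar_t>);
}

// Hypothetical helper: imbues the composed locale into a wide stream.
// Note that imbuing the stream also imbues its stream buffer.
inline void imbue_utf8(std::wostream& strm)
{
    strm.imbue(with_utf8_codecvt(strm.getloc()));
}
```

Any class derived from `std::codecvt<wchar_t, char, std::mbstate_t>` could be installed the same way; the two-argument `std::locale` constructor replaces only the facet with the matching id.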

...

What this leads to is that in order to achieve my goal the locale should be able to convert narrow characters of UTF-8 to wide characters of UTF-16 and wide characters of UTF-16 to narrow characters representing byte sequence of UTF16LE. Is it possible to make such an asymmetric locale with Boost.Locale? Or maybe there is another way of doing this?

No, what you are probably looking for is an interface that provides both `std::basic_ostream<char>` and `std::basic_ostream<wchar_t>`, and then implement your own stream buffer that does the conversion.

...

An additional question. Is it possible to achieve my goal with std::ofstream (as opposed to std::wofstream)?

No, you will need:

1. Two different streams, wide and narrow.
2. Your own custom stream buffer that converts input characters to your arbitrary encoding.

You'd better start from Boost.Iostreams (boost::iostreams) and use the boost::locale::utf::* functions for character set manipulation.

...

I have a very strong suspicion that the answer is no because the narrow characters will pass on unconverted to the file instead of being translated from UTF-8 to UTF-16LE, but maybe I'm missing something.

Yes, you are correct: codecvt<char, char> is a no-op.
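This can be checked directly by asking the facet whether it ever converts anything. A small sketch follows; the helper name `narrow_codecvt_is_noop` is made up for illustration.

```cpp
#include <cwchar>
#include <locale>

// Hypothetical helper: returns true if the codecvt<char, char, mbstate_t>
// facet of `loc` reports that it never performs any conversion.
inline bool narrow_codecvt_is_noop(const std::locale& loc)
{
    typedef std::codecvt<char, char, std::mbstate_t> facet_type;
    return std::use_facet<facet_type>(loc).always_noconv();
}
```

With the standard facet in place, a narrow file stream therefore writes the bytes it is given unchanged, so UTF-8 input ends up as UTF-8 in the file.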

...

Thank you.

Artyom Beilis
--------------
CppCMS - C++ Web Framework: http://cppcms.com/
CppDB - C++ SQL Connectivity: http://cppcms.com/sql/cppdb/

...


Andrey Semashev

9:42 p.m.


On Sun, Mar 3, 2013 at 12:30 AM, Artyom Beilis <artyomtnk@yahoo.com> wrote:

...

...
________________________________ From: Andrey Semashev <andrey.semashev@gmail.com> To: boost@lists.boost.org Sent: Saturday, March 2, 2013 12:56 PM Subject: [boost] [locale] Composing asymmetric locale for character encoding conversion

Hi,

Suppose I have a logging application that writes log records in wide (wchar_t, UTF-16) and narrow (char, UTF-8) encodings and I want these logs to be stored in a UTF-16LE encoded file. For simplicity, let's assume that I write log files with std::wofstream. Now, the standard says that the file stream buffer is supposed to convert wide characters to byte sequences using the locale imbued into the buffer.

In general it is done by the codecvt facet, but it is designed to convert wide characters to an 8-bit encoding and vice versa.

...
However, it seems that the locale should be the same as the one imbued into the stream (basic_ostream::imbue makes sure of that).

No, you can install your own codecvt facet into an existing locale object and then imbue it into the stream.

I'm not sure you understood. I was pointing out that there are two locales involved: the one in the stream and the one in the stream buffer. And apparently, they should be the same.

...

...
What this leads to is that in order to achieve my goal the locale should be able to convert narrow characters of UTF-8 to wide characters of UTF-16 and wide characters of UTF-16 to narrow characters representing byte sequence of UTF16LE. Is it possible to make such an asymmetric locale with Boost.Locale? Or maybe there is another way of doing this?

No, what you are probably looking for is an interface that provides both `std::basic_ostream<char>` and `std::basic_ostream<wchar_t>`

Hmm, why two streams? This will add operator<< ambiguity, won't it? I can already output narrow strings to wide streams, which results in character conversion (to UTF-16 wchar_t). The problem left is to convert it to UTF-16LE byte sequence. I tried to create a locale that would perform this conversion (I tried boost::locale::generator()("en_US.UTF-16LE");) and it didn't work. Does Boost.Locale support this kind of conversion?

Jan Hudec

3 Mar

9:32 p.m.


On Sat, Mar 02, 2013 at 14:56:52 +0400, Andrey Semashev wrote:

...

Suppose I have a logging application that writes log records in wide (wchar_t, UTF-16)

wchar_t does not have to be UTF-16. On most non-Windows platforms it is UCS-4. The standard also seems to expect each wchar_t to contain a complete code point, which isn't the case with UTF-16, so UTF-16 isn't supported. That said, everybody uses it as UTF-16 on Windows, because Microsoft jumped on the Unicode bandwagon too early and baked a 2-byte wchar_t into the API, so using UTF-16 is now the only option there for supporting Unicode versions after 2.0.

...

and narrow (char, UTF-8) encodings and I want these logs to be stored in a UTF-16LE encoded file. For simplicity, let's assume that I write log files with std::wofstream. Now, the standard says that the file stream buffer is supposed to convert wide characters to byte sequences using the locale imbued into the buffer.

Yes, right. And the `operator<<(std::wostream &, const char *)` uses the locale imbued in the stream.
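As a small illustration of that overload, the sketch below inserts a narrow string into a wide stream. The function name `insert_narrow` is hypothetical. Each narrow character is widened through the ctype facet of the stream's locale (via `widen()`), so with the default "C" locale only plain ASCII round-trips predictably.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: writes a narrow string into a wide string stream
// and returns what the stream received.
inline std::wstring insert_narrow(const char* text)
{
    std::wostringstream strm;  // uses a copy of the global locale
    strm << text;              // operator<<(std::wostream&, const char*)
    return strm.str();
}
```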

...

However, it seems that the locale should be the same as the one imbued into the stream (basic_ostream::imbue makes sure of that).

Now why do you think so? basic_ios::imbue makes it the *default*, but I don't think it forbids overriding the buffer locale.

...

What this leads to is that in order to achieve my goal the locale should be able to convert narrow characters of UTF-8 to wide characters of UTF-16 and wide characters of UTF-16 to narrow characters representing byte sequence of UTF16LE. Is it possible to make such an asymmetric locale with Boost.Locale? Or maybe there is another way of doing this?

It's not needed. Just imbue two different locales. You only have to be careful about the order, because the stream overwrites the buffer's locale.

As I said above, wchar_t does not have to be UTF-16, so the buffer needs to use a locale with the codecvt_utf16 facet and the stream needs to use a locale with the codecvt_utf8 facet.

Alternatively, you can use boost::iostreams::file_sink wrapped in an explicit boost::iostreams::code_converter using codecvt_utf16, and imbue the outer stream with codecvt_utf8.
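The ordering issue can be sketched as follows, assuming a C++11 standard library that ships `<codecvt>` (deprecated since C++17). The function name `open_utf16le_log` and its parameters are hypothetical. The stream locale is passed in by the caller and is left as is; only the buffer's codecvt facet is swapped.

```cpp
#include <codecvt>
#include <fstream>
#include <locale>

// Hypothetical helper: opens `path` so that wide characters written to
// `file` are stored as UTF-16LE bytes, while the stream itself keeps
// `stream_loc` for formatting and for widening narrow characters.
inline void open_utf16le_log(std::wofstream& file, const char* path,
                             const std::locale& stream_loc)
{
    // Step 1: imbue the stream. This also overwrites the buffer's locale,
    // which is why it has to happen first.
    file.imbue(stream_loc);

    // Step 2: override the locale of the buffer only. The buffer now turns
    // wchar_t values into UTF-16LE bytes on their way to the file.
    typedef std::codecvt_utf16<wchar_t, 0x10ffff, std::little_endian> cvt_type;
    file.rdbuf()->pubimbue(std::locale(stream_loc, new cvt_type));

    // Step 3: open in binary mode so that the platform does not translate
    // the 0x0A byte of a UTF-16 line feed.
    file.open(path, std::ios_base::out | std::ios_base::trunc | std::ios_base::binary);
}
```

Calling `file.imbue()` again later would replace the buffer's locale once more, so any subsequent change of the stream locale has to be followed by another `pubimbue` on the buffer.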

...

An additional question. Is it possible to achieve my goal with std::ofstream (as opposed to std::wofstream)? I have a very strong suspicion that the answer is no because the narrow characters will pass on unconverted to the file instead of being translated from UTF-8 to UTF-16LE, but maybe I'm missing something.

All streams accept their own character type and plain char, but not other character types. So you can't write a wide string into a narrow stream at all.

--
Jan 'Bulb' Hudec <bulb@ucw.cz>

Andrey Semashev

5 Mar

8:13 a.m.


On Mon, Mar 4, 2013 at 1:32 AM, Jan Hudec <bulb@ucw.cz> wrote:

...

On Sat, Mar 02, 2013 at 14:56:52 +0400, Andrey Semashev wrote:

...
Suppose I have a logging application that writes log records in wide (wchar_t, UTF-16)

wchar_t does not have to be UTF-16. On most non-Windows platforms it is UCS-4.

The standard also seems to expect each wchar_t to contain a complete code point, which isn't the case with UTF-16, so UTF-16 isn't supported. That said, everybody uses it as UTF-16 on Windows, because Microsoft jumped on the Unicode bandwagon too early and baked a 2-byte wchar_t into the API, so using UTF-16 is now the only option there for supporting Unicode versions after 2.0.

Yes, I'm aware of that. I have Windows in mind.

...

...
However, it seems that the locale should be the same as the one imbued into the stream (basic_ostream::imbue makes sure of that).

Now why do you think so? basic_ios::imbue makes it the *default*, but I don't think it forbids overriding the buffer locale.

Come to think of it, you may be right. I cannot find any further indication that the same locale is expected.

...

...
What this leads to is that in order to achieve my goal the locale should be able to convert narrow characters of UTF-8 to wide characters of UTF-16 and wide characters of UTF-16 to narrow characters representing byte sequence of UTF16LE. Is it possible to make such an asymmetric locale with Boost.Locale? Or maybe there is another way of doing this?

It's not needed. Just imbue two different locales. You only have to be careful about the order, because the stream overwrites the buffer's locale.

As I said above, wchar_t does not have to be UTF-16, so the buffer needs to use a locale with the codecvt_utf16 facet and the stream needs to use a locale with the codecvt_utf8 facet.

Alternatively, you can use boost::iostreams::file_sink wrapped in an explicit boost::iostreams::code_converter using codecvt_utf16, and imbue the outer stream with codecvt_utf8.

All of these assume the availability of codecvt_utf16 from C++11 (codecvt_utf8 can be replaced with a Boost.Locale-generated facet, I guess). Also, there seems to be no codecvt_utf32 for some reason, in case I wanted to write UTF-32 encoded files.

As far as I can see, Boost.Locale does not provide C++11 codecvt facets. Is that right? Is this support planned?
