Boost's direction regarding UTF8 -> UTF32 and UTF32 -> UTF8


Rodrigo Madera

23 Jun 2010

7:40 p.m.

Dear boosters,

What is the consensus on best practice regarding the conversion of UTFx to/from UTFy? I see that Boost Regex uses ICU for some conversions, but is there any user-exported interface for conversions?

There are several points at which being able to convert between UTFs is interesting. While I could stick to one of the several solutions and façades on the Web (including some Boost Vault projects), I would rather know the current community opinion on the subject. Have any sandbox and/or vault projects caught any veterans' eyes?

Kind regards,
Rodrigo


Klaim

23 Jun 2010

10:55 p.m.

(noob here) Maybe boost::locale? http://cppcms.sourceforge.net/boost_locale/html/index.html

On Wed, Jun 23, 2010 at 21:40, Rodrigo Madera <rodrigo.madera@gmail.com> wrote:

...

John Maddock

24 Jun 2010

8:50 a.m.

...

> What is the consensus on best practice regarding the conversion of UTFx to/from UTFy?

Consensus? Hardly any yet I'm afraid.

...

> I see that Boost Regex uses ICU for some conversions, but is there any user-exported interface for conversions?

Actually, regex doesn't need ICU for that: it uses the iterators in boost/regex/pending/unicode_iterator.hpp. Most of the available conversions are supported there, I think.

HTH,
John.

Joel de Guzman

9:36 a.m.

On 6/24/10 4:50 PM, John Maddock wrote:

...

John, we use those iterators a lot now. Yet they've been "pending" for many years. Isn't it about time for them to graduate from the "pending" state? :-)

Regards,
--
Joel de Guzman
http://www.boostpro.com
http://spirit.sf.net

John Maddock

3:58 p.m.

...

Sigh. Well, I've been hoping that someone else would produce a "proper" Unicode library to take care of this...

John.

Mathias Gaunard

9:49 a.m.

Rodrigo Madera wrote:

...

There are John Maddock's iterator adapters, those of the Boost.Unicode library under development, which are similar to them, and there is also Boost.Locale, another library under development that is a frontend to ICU and provides functions to do conversions (but it works with memory buffers).

Rodrigo Madera

12:17 p.m.

...

I see that Maddock's adapters do the job, but do they provide the reliability of ICU? For example, when it comes to round-trip conversions? Not that I doubt the library, it's just that it looks like a subset of a bigger domain. If it's been tested and deemed appropriate, I think it suffices for me.

Any comments on the review schedule for Boost.Locale and Boost.Unicode? It seems as if they compete on some points.

Thank you for your kind input,
Rodrigo

Cory Nelson

1:58 p.m.

On Thu, Jun 24, 2010 at 5:17 AM, Rodrigo Madera <rodrigo.madera@gmail.com> wrote:

...

Each code point has only a single valid representation in any of the UTF encodings, so anything but perfect round-trip transcoding would be a bug. Overlong encodings are invalid, and normalization forms are a separate issue outside of UTF transcoding.

--
Cory Nelson
http://int64.org
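To make the round-trip point concrete, here is a minimal, self-contained sketch of UTF-32 <-> UTF-8 transcoding for a single code point. It is not the code of any library mentioned in this thread; the names encode_utf8, decode_utf8 and check_round_trip are hypothetical, and the decoder assumes its input holds exactly one encoded sequence. The decoder rejects overlong forms, which is what makes the mapping one-to-one.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: encode one Unicode scalar value as UTF-8.
std::string encode_utf8(char32_t cp) {
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        throw std::invalid_argument("not a Unicode scalar value");
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Hypothetical helper: decode a string holding exactly one UTF-8 sequence.
char32_t decode_utf8(const std::string& s) {
    if (s.empty()) throw std::invalid_argument("empty input");
    const unsigned char lead = static_cast<unsigned char>(s[0]);
    std::size_t len = 0;
    char32_t cp = 0;
    char32_t min = 0;  // smallest value that legitimately needs `len` bytes
    if (lead < 0x80)                { len = 1; cp = lead;        min = 0; }
    else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; min = 0x80; }
    else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; min = 0x800; }
    else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; min = 0x10000; }
    else throw std::invalid_argument("invalid lead byte");
    if (s.size() != len) throw std::invalid_argument("wrong sequence length");
    for (std::size_t i = 1; i < len; ++i) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        if ((c & 0xC0) != 0x80) throw std::invalid_argument("bad continuation byte");
        cp = (cp << 6) | (c & 0x3F);
    }
    if (cp < min) throw std::invalid_argument("overlong encoding");
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        throw std::invalid_argument("not a Unicode scalar value");
    return cp;
}

// Round-trip check: decoding the encoding must give back the input.
void check_round_trip(char32_t cp) {
    assert(decode_utf8(encode_utf8(cp)) == cp);
}
```

Under these assumptions, check_round_trip(0x20AC) passes, while decode_utf8("\xC0\x80") throws because C0 80 is an overlong form of U+0000.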

Mathias Gaunard

2:28 p.m.

Rodrigo Madera wrote:

...

They could result in undefined behaviour when given invalid UTF input at the end of the string. So not very reliable, no.

...

> Any comments on the review schedule for Boost.Locale and Boost.Unicode? It seemed as if they compete on some points.

Boost.Unicode will be added to the review queue in September (at least that's the plan). It will get reviewed whenever a willing review manager comes forward after that. I can't speak for Locale as I'm not the author, but it seems pretty advanced and could be competing at around the same time, if not before.

John Maddock

4:03 p.m.

...

Really? That would be a bug; the intention is that they should always throw an exception when given invalid input. Of course a more complete solution would always be welcome....

John.

Mathias Gaunard

7:16 p.m.

John Maddock wrote:

...

> Really? That would be a bug; the intention is that they should always throw an exception when given invalid input.

The iterator adapter has no way of knowing it has reached the end. Consider this in u16_to_u32_iterator:

    void increment()
    {
       // skip high surrogate first if there is one:
       if(detail::is_high_surrogate(*m_position)) ++m_position;
       ++m_position;
       m_value = pending_read;
    }

If the last character is a high surrogate, you increment the iterator twice, while it is only allowed to do it once. Fixing the bug means making the iterator adapter have knowledge of the beginning, the end, and the current position.
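As an illustration of the failure mode described here, the following is a minimal sketch of a UTF-16 to UTF-32 conversion that knows where its input ends and therefore can report a trailing high surrogate instead of stepping past the end. It is neither the Boost.Regex code nor the Boost.Unicode code; utf16_to_utf32_checked and the helper names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

inline bool is_high_surrogate(char16_t c) { return c >= 0xD800 && c <= 0xDBFF; }
inline bool is_low_surrogate(char16_t c)  { return c >= 0xDC00 && c <= 0xDFFF; }

// Hypothetical checked conversion: every read is bounded by in.size().
std::u32string utf16_to_utf32_checked(const std::u16string& in) {
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const char16_t c = in[i++];
        if (is_high_surrogate(c)) {
            // The end check comes first, so a truncated pair is detected
            // before any attempt to read the missing low surrogate.
            if (i == in.size() || !is_low_surrogate(in[i]))
                throw std::invalid_argument("unpaired high surrogate");
            const char16_t lo = in[i++];
            out.push_back(static_cast<char32_t>(
                0x10000 + ((char32_t(c) - 0xD800) << 10) + (char32_t(lo) - 0xDC00)));
        } else if (is_low_surrogate(c)) {
            throw std::invalid_argument("unpaired low surrogate");
        } else {
            out.push_back(static_cast<char32_t>(c));
        }
    }
    return out;
}

// Usage check: U+1F600 is the surrogate pair D83D DE00.
void check_pair() {
    const std::u16string s = {char16_t(0xD83D), char16_t(0xDE00)};
    assert(utf16_to_utf32_checked(s) == std::u32string(1, char32_t(0x1F600)));
}
```

With this shape, an input whose last code unit is 0xD83D produces an exception rather than a second, out-of-range increment.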

...

> Of course a more complete solution would always be welcome....

My library deals with this.

Joel de Guzman

10:46 p.m.

On 6/25/10 3:16 AM, Mathias Gaunard wrote:

...

How? By storing the beginning, the end, and the current position?

Regards,
--
Joel de Guzman
http://www.boostpro.com
http://spirit.sf.net

Mathias Gaunard

25 Jun 2010

12:36 a.m.

On 24/06/2010 23:46, Joel de Guzman wrote:

...

...
>> My library deals with this.
>
> How? By storing the beginning, the end, and the current position?

Yes. Storing the beginning should not be necessary for non-bidirectional iterators though.
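A rough sketch of what such a "thick" adapter might look like is given below. This is not the Boost.Unicode implementation; utf16_cursor and its members are hypothetical names, and the class is deliberately not a full standard iterator. It stores the beginning, the end and the current position, so both forward and backward steps can be bounds-checked. The beginning is only consulted by operator--, which matches the remark that a forward-only adapter would not need it.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical cursor over UTF-16 code units that yields code points.
class utf16_cursor {
public:
    using base_iter = std::u16string::const_iterator;

    utf16_cursor(base_iter begin, base_iter end, base_iter pos)
        : begin_(begin), end_(end), pos_(pos) {}

    bool at_end() const { return pos_ == end_; }

    char32_t operator*() const {
        if (pos_ == end_) throw std::out_of_range("dereference at end");
        const char16_t c = *pos_;
        if (is_low(c)) throw std::invalid_argument("unpaired low surrogate");
        if (!is_high(c)) return c;
        const base_iter next = pos_ + 1;
        if (next == end_ || !is_low(*next))
            throw std::invalid_argument("unpaired high surrogate");
        return 0x10000 + ((char32_t(c) - 0xD800) << 10) + (char32_t(*next) - 0xDC00);
    }

    utf16_cursor& operator++() {
        if (pos_ == end_) throw std::out_of_range("increment past end");
        base_iter next = pos_ + 1;
        if (is_high(*pos_)) {
            // Needs end_: never step over a low surrogate that is not there.
            if (next == end_ || !is_low(*next))
                throw std::invalid_argument("unpaired high surrogate");
            ++next;
        }
        pos_ = next;  // position only changes once the step is known to be valid
        return *this;
    }

    utf16_cursor& operator--() {
        if (pos_ == begin_) throw std::out_of_range("decrement before begin");
        base_iter prev = pos_ - 1;
        if (is_low(*prev)) {
            // Needs begin_: never step back past the start of the sequence.
            if (prev == begin_ || !is_high(*(prev - 1)))
                throw std::invalid_argument("unpaired low surrogate");
            --prev;
        }
        pos_ = prev;
        return *this;
    }

private:
    static bool is_high(char16_t c) { return c >= 0xD800 && c <= 0xDBFF; }
    static bool is_low(char16_t c)  { return c >= 0xDC00 && c <= 0xDFFF; }
    base_iter begin_, end_, pos_;
};

// Usage check: walk forward over "a" + U+1F600, then one step back.
void check_cursor() {
    const std::u16string s = {u'a', char16_t(0xD83D), char16_t(0xDE00)};
    utf16_cursor c(s.begin(), s.end(), s.begin());
    assert(*c == U'a');
    ++c;
    assert(*c == char32_t(0x1F600));
    --c;
    assert(*c == U'a');
}
```

The cost being debated in the thread is visible here: the cursor carries three underlying iterators instead of one.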

Joel de Guzman

1:28 a.m.

On 6/25/10 8:36 AM, Mathias Gaunard wrote:

...

Ouch. Just because of that one exceptional case (invalid UTF input at the end of the string)?

Regards,
--
Joel de Guzman
http://www.boostpro.com
http://spirit.sf.net

Mathias Gaunard

12:58 p.m.

Joel de Guzman wrote:

...

> Ouch. Just because of that one exceptional case (invalid UTF input at the end of the string)?

I think it should either be safe to use for any string, or only safe to use for a valid one. There is no need for an in-between such as "safe to use for any string, unless it is invalid at the end". That's why my library will provide two variants (safe and unsafe), although I haven't done that yet.

I wonder, however, if the "Ouch" is really justified. Surely the fact that the iterators are a bit "thick" shouldn't add that much overhead? I really need to benchmark this.
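One way to picture the safe/unsafe split is sketched below. This is only an assumption about how such a pair could be arranged, not the design of Boost.Unicode: the "unsafe" function does no checking and relies on a documented precondition, while the "safe" one validates first and then delegates. All three function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical validator: true if every surrogate belongs to a well-formed pair.
bool is_valid_utf16(const std::u16string& s) {
    for (std::size_t i = 0; i < s.size(); ++i) {
        const char16_t c = s[i];
        if (c >= 0xD800 && c <= 0xDBFF) {
            if (i + 1 == s.size() || s[i + 1] < 0xDC00 || s[i + 1] > 0xDFFF)
                return false;  // high surrogate without a following low one
            ++i;               // skip the low surrogate just checked
        } else if (c >= 0xDC00 && c <= 0xDFFF) {
            return false;      // low surrogate without a preceding high one
        }
    }
    return true;
}

// "Unsafe" variant. Precondition: is_valid_utf16(s). It performs no checks,
// so for ill-formed input the result is meaningless.
std::u32string to_utf32_unchecked(const std::u16string& s) {
    std::u32string out;
    for (std::size_t i = 0; i < s.size(); ++i) {
        char32_t cp = s[i];
        if (cp >= 0xD800 && cp <= 0xDBFF)
            cp = 0x10000 + ((cp - 0xD800) << 10) + (char32_t(s[++i]) - 0xDC00);
        out.push_back(cp);
    }
    return out;
}

// "Safe" variant: usable on any string, throws on ill-formed input.
std::u32string to_utf32_checked(const std::u16string& s) {
    if (!is_valid_utf16(s)) throw std::invalid_argument("ill-formed UTF-16");
    return to_utf32_unchecked(s);
}

// Usage check: both variants agree on valid input.
void check_variants() {
    const std::u16string s = {u'a', char16_t(0xD83D), char16_t(0xDE00)};
    assert(to_utf32_checked(s) == to_utf32_unchecked(s));
}
```

Whether validating up front is cheaper than checking inside every increment is exactly the kind of question the benchmark mentioned above would have to answer.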

John Maddock

8:11 a.m.

...

Ah, guilty as charged :-( The fix is horrible though :-(

...

...
>> Of course a more complete solution would always be welcome....
>
> My library deals with this.

Good! As I keep saying, this was only supposed to be an interim solution until someone did it properly... I'll look forward to seeing yours being reviewed!

Cheers,
John.
