Boost's direction regarding UTF8 -> UTF32 and UTF32 -> UTF8

Dear boosters,
What is the consensus on best practice for converting between UTF encodings (UTFx to/from UTFy)?
I see that Boost.Regex uses ICU for some conversions, but is there any user-facing interface for conversions?
Being able to convert between UTFs is useful in several places.
While I could stick to one of the several solutions and façades on the Web (including some Boost Vault projects), I would rather know the current community opinion on the subject.
Are there any sandbox and/or vault projects that have caught any veterans' eyes?
Kind regards, Rodrigo

(noob here)
Maybe boost::locale?
http://cppcms.sourceforge.net/boost_locale/html/index.html
On Wed, Jun 23, 2010 at 21:40, Rodrigo Madera wrote:
Dear boosters,
What is the consensus on best-practice regarding the conversion of UTFx to/from UTFy?
I see that Boost Regex uses ICU for some conversions, but is there any user exported interface for conversions?
There are several points in which being able to convert between UTFs is interesting.
While I could stick to one of the several solutions and façades on the Web (including some Boost Vault projects), I would rather know the current community opinion on the subject.
Any sandbox and/or vault projects that caught any veterans' eyes?
Kind regards, Rodrigo

What is the consensus on best-practice regarding the conversion of UTFx to/from UTFy?
Consensus? Hardly any yet, I'm afraid.
I see that Boost Regex uses ICU for some conversions, but is there any user exported interface for conversions?
Actually regex doesn't need ICU for that: it uses the iterators in boost/regex/pending/unicode_iterator.hpp. Most of the available conversions are supported there, I think. HTH, John.

On 6/24/10 4:50 PM, John Maddock wrote:
Actually regex doesn't need ICU for that: it uses the iterators in boost/regex/pending/unicode_iterator.hpp. Most of the available conversions are supported there, I think.
John, we use those iterators a lot now. Yet they have been "pending" for many years. Isn't it about time for them to graduate from the "pending" state? :-) Regards, -- Joel de Guzman http://www.boostpro.com http://spirit.sf.net

Actually regex doesn't need ICU for that: it uses the iterators in boost/regex/pending/unicode_iterator.hpp. Most of the available conversions are supported there, I think.
John, we use those iterators a lot now. Yet they have been "pending" for many years. Isn't it about time for them to graduate from the "pending" state? :-)
Sigh. Well I've been hoping that someone else would produce a "proper" Unicode library to take care of this... John.

Rodrigo Madera wrote:
Dear boosters,
What is the consensus on best-practice regarding the conversion of UTFx to/from UTFy?
There are John Maddock's iterator adapters and the similar ones in the Boost.Unicode library under development. There is also Boost.Locale, another library under development, which is a frontend to ICU and provides functions for conversions (though it works with memory buffers).

I see that Maddock's adapters do the job, but do they provide the reliability of ICU? For example, when it comes to round-trip conversions? Not that I doubt the library, it's just that it looks like a subset of a bigger domain. If it has been tested and deemed appropriate, I think it suffices for me. Any comments on the review schedule for Boost.Locale and Boost.Unicode? It seems as if they compete on some points. Thank you for your kind input, Rodrigo

On Thu, Jun 24, 2010 at 5:17 AM, Rodrigo Madera wrote:
I see that Maddock's adapters do the job, but do they provide the reliability of ICU? For example, when it comes to round-trip conversions?
Each code point only has a single valid representation in any of the UTF encodings, so anything but perfect round-trip transcoding would be a bug. Overlong encodings are invalid, and normalization forms are a separate issue outside of UTF transcoding. -- Cory Nelson http://int64.org
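To make the round-trip point concrete, here is a minimal standard-library-only sketch of UTF-32 to UTF-8 and back. The helper names (append_utf8, utf32_to_utf8, utf8_to_utf32) are made up for illustration and are not taken from Boost.Regex, Boost.Unicode, Boost.Locale or ICU. The decoder rejects overlong forms, so every code point has exactly one accepted encoding and decoding what was encoded should give back the input.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: append the shortest-form UTF-8 encoding of cp.
inline void append_utf8(std::string& out, char32_t cp) {
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        throw std::invalid_argument("not a Unicode scalar value");
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

inline std::string utf32_to_utf8(const std::u32string& in) {
    std::string out;
    for (char32_t cp : in) append_utf8(out, cp);
    return out;
}

// Hypothetical helper: strict decoder. Truncated, overlong, surrogate and
// out-of-range sequences all throw instead of being silently accepted.
inline std::u32string utf8_to_utf32(const std::string& in) {
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        std::size_t len = 0;
        char32_t cp = 0;
        char32_t min_cp = 0;  // smallest value that needs `len` bytes
        if (b0 < 0x80)                { len = 1; cp = b0;        min_cp = 0; }
        else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; min_cp = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; min_cp = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; min_cp = 0x10000; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");
        if (in.size() - i < len)
            throw std::invalid_argument("truncated UTF-8 sequence");
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(in[i + k]);
            if ((b & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (b & 0x3F);
        }
        if (cp < min_cp || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("overlong or out-of-range UTF-8 sequence");
        out += cp;
        i += len;
    }
    assert(i == in.size());  // each step consumed exactly one whole sequence
    return out;
}
```

Normalization is deliberately left out: as noted above, it is a separate issue from transcoding.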

Rodrigo Madera wrote:
I see that Maddock's adapters do the job, but do they provide the reliability of ICU?
They could result in undefined behaviour when given invalid UTF input at the end of the string. So not very reliable, no.
Any comments on the review schedule for Boost.Locale and Boost.Unicode? It seemed as if they compete on some points.
Boost.Unicode will be added to the review queue in September (at least that's the plan). When it gets reviewed depends on when a willing review manager steps forward after that. I can't speak for Locale as I'm not the author, but it seems pretty advanced and could be competing at around the same time, if not before.

Rodrigo Madera wrote:
I see that Maddock's adapters do the job, but do they provide the reliability of ICU?
They could result in undefined behaviour when given invalid UTF input at the end of the string. So not very reliable, no.
Really? That would be a bug; the intention is that they should always throw an exception when given invalid input. Of course a more complete solution would always be welcome.... John.

John Maddock wrote:
Really? That would be a bug; the intention is that they should always throw an exception when given invalid input.
The iterator adapter has no way of knowing it has reached the end. Consider this in u16_to_u32_iterator:
    void increment()
    {
        // skip high surrogate first if there is one:
        if(detail::is_high_surrogate(*m_position)) ++m_position;
        ++m_position;
        m_value = pending_read;
    }
If the last character is a high surrogate, you increment the iterator twice, while it is only allowed to do it once. Fixing the bug means making the iterator adapter have knowledge of the beginning, the end, and the current position.
Of course a more complete solution would always be welcome....
My library deals with this.
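For illustration only, here is a rough standard-library sketch of a decode step that is given the end of the range and therefore cannot step past it. This is not the Boost.Regex code and not the Boost.Unicode implementation; decode_utf16, utf16_to_utf32 and the surrogate helpers are hypothetical names.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

inline bool is_high_surrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
inline bool is_low_surrogate(char16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

// Hypothetical helper: decode one code point from [pos, end) and advance pos
// past it. Because `end` is known, a high surrogate in the last position is
// reported as an error instead of triggering a second, out-of-range increment.
template <typename It>
char32_t decode_utf16(It& pos, It end) {
    assert(pos != end);  // precondition: there is at least one unit to read
    const char16_t first = *pos;
    ++pos;
    if (is_low_surrogate(first))
        throw std::invalid_argument("unpaired low surrogate");
    if (!is_high_surrogate(first))
        return first;  // a BMP code point is a single 16-bit unit
    if (pos == end)
        throw std::invalid_argument("truncated surrogate pair at end of input");
    const char16_t second = *pos;
    if (!is_low_surrogate(second))
        throw std::invalid_argument("unpaired high surrogate");
    ++pos;
    return 0x10000 + ((char32_t(first) - 0xD800) << 10) + (char32_t(second) - 0xDC00);
}

// Hypothetical helper: whole-string conversion built on the step above.
inline std::u32string utf16_to_utf32(const std::u16string& in) {
    std::u32string out;
    auto pos = in.begin();
    while (pos != in.end()) out += decode_utf16(pos, in.end());
    return out;
}
```

The only difference from the increment() quoted above that matters here is the `pos == end` check before the second unit is read.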

On 6/25/10 3:16 AM, Mathias Gaunard wrote:
My library deals with this.
How? By storing the beginning, the end, and the current position? Regards, -- Joel de Guzman http://www.boostpro.com http://spirit.sf.net

On 24/06/2010 23:46, Joel de Guzman wrote:
My library deals with this.
How? By storing the beginning, the end, and the current position?
Yes. Storing the beginning should not be necessary for non-bidirectional iterators though.

On 6/25/10 8:36 AM, Mathias Gaunard wrote:
On 24/06/2010 23:46, Joel de Guzman wrote:
My library deals with this.
How? By storing the beginning, the end, and the current position?
Yes. Storing the beginning should not be necessary for non-bidirectional iterators though.
Ouch. Just because of that one exceptional case (invalid UTF input at the end of the string)? Regards, -- Joel de Guzman http://www.boostpro.com http://spirit.sf.net

Joel de Guzman wrote:
Ouch. Just because of that one exceptional case (invalid UTF input at the end of the string)?
I think it should either be safe to use for any string, or only safe to use for a valid one. There is no need for an in-between such as "safe for any string unless it is invalid at the end". That's why my library will provide two variants (safe and unsafe), although I haven't done that yet. I wonder, however, whether the "Ouch" is really justified. Surely the fact that the iterators are a bit "thick" shouldn't add that much overhead? I really need to benchmark this.
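As a rough picture of what the two variants and the "thickness" could look like, the sketch below contrasts a reader that trusts its input with one that also carries the end. Both types are hypothetical and deliberately simplified (plain pointers, no iterator interface); they are not the Boost.Unicode design.

```cpp
#include <cassert>
#include <stdexcept>

// "Unsafe" variant (hypothetical): one pointer, only correct for valid UTF-16.
struct unchecked_u16_reader {
    const char16_t* pos;

    char32_t next() {
        const char32_t hi = *pos++;
        if (hi < 0xD800 || hi > 0xDBFF) return hi;
        const char32_t lo = *pos++;  // reads past the end if the pair is truncated
        return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
    }
};

// "Safe" variant (hypothetical): also stores the end, so it is one pointer
// "thicker" but can throw on any invalid input, including at the very end.
struct checked_u16_reader {
    const char16_t* pos;
    const char16_t* end;

    bool done() const { return pos == end; }

    char32_t next() {
        assert(pos != end);  // precondition: not already at the end
        const char32_t u = *pos++;
        if (u >= 0xDC00 && u <= 0xDFFF)
            throw std::invalid_argument("unpaired low surrogate");
        if (u < 0xD800 || u > 0xDBFF) return u;
        if (pos == end)
            throw std::invalid_argument("truncated surrogate pair at end of input");
        const char32_t lo = *pos;
        if (lo < 0xDC00 || lo > 0xDFFF)
            throw std::invalid_argument("unpaired high surrogate");
        ++pos;
        return 0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00);
    }
};
```

The safe reader costs one extra pointer of state plus a few comparisons per code point. Whether that matters in practice is exactly what a benchmark would have to show; a bidirectional version would presumably also need the beginning.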

The iterator adapter has no way of knowing it has reached the end.
Consider this in u16_to_u32_iterator:
    void increment()
    {
        // skip high surrogate first if there is one:
        if(detail::is_high_surrogate(*m_position)) ++m_position;
        ++m_position;
        m_value = pending_read;
    }
If the last character is a high surrogate, you increment the iterator twice, while it is only allowed to do it once.
Fixing the bug means making the iterator adapter have knowledge of the beginning, the end, and the current position.
Ah, guilty as charged :-( The fix is horrible though :-(
Of course a more complete solution would always be welcome....
My library deals with this.
Good! As I keep saying, this was only supposed to be an interim solution until someone did it properly... I'll look forward to seeing yours being reviewed! Cheers, John.
participants (6): Cory Nelson, Joel de Guzman, John Maddock, Klaim, Mathias Gaunard, Rodrigo Madera