Re: [boost] [General] Always treat std::strings as UTF-8

15 Jan 2011

From: Patrick Horgan <phorgan1@gmail.com>
On 01/14/2011 02:05 PM, Peter Dimov wrote:
...
John B. Turpish wrote:
...
By the way, I disagree with Peter's assessment that "you rarely, if
ever, need to access the Nth character," but I will gladly concede
that this depends on your problem domain.

It obviously depends on the problem domain :-) but, when talking about
Unicode, you can't reliably access the Nth character, in general, even
with UCS-32. (As far as I know.)

I don't understand. UCS-32 (I assume you meant encoded as UTF-32) is a
fixed-width encoding, so the n-th character is just 4n bytes away from
the beginning of the string. Right?

No.

The Nth position holds the Nth Unicode code point, not the Nth character.

For example, the word "שָלוֹם" has 4 characters, "שָ", "ל", "וֹ", "ם", and 6
code points: ש (U+05E9), ָ (U+05B8), ל (U+05DC), ו (U+05D5), ֹ (U+05B9),
ם (U+05DD).
Two of these code points (U+05B8 and U+05B9) are diacritic marks.
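
To make the difference concrete, here is a rough sketch in plain C++11 that
stores the word above as UTF-32 and groups each base code point with the marks
that follow it. The names is_combining_mark, split_characters and shalom_demo
are made up for this illustration, and the mark test covers only a couple of
hard-coded ranges. Real character segmentation follows the Unicode rules and
needs the full character database, so treat this as a simplified assumption,
not as what Boost.Locale does internally.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Simplified assumption: only these ranges count as combining marks.
// A real implementation would consult the Unicode character database.
bool is_combining_mark(char32_t c) {
    return (c >= 0x0300 && c <= 0x036F)   // Combining Diacritical Marks
        || (c >= 0x0591 && c <= 0x05BD)   // Hebrew accents and points
        || c == 0x05BF || c == 0x05C1 || c == 0x05C2
        || c == 0x05C4 || c == 0x05C5 || c == 0x05C7;
}

// Split a UTF-32 string into "characters": one base code point
// followed by any combining marks attached to it.
std::vector<std::u32string> split_characters(const std::u32string& s) {
    std::vector<std::u32string> out;
    for (char32_t c : s) {
        if (!out.empty() && is_combining_mark(c))
            out.back() += c;                      // attach mark to previous base
        else
            out.push_back(std::u32string(1, c));  // start a new character
    }
    return out;
}

// Hypothetical demo using the word from the message above.
void shalom_demo() {
    // shin, qamats, lamed, vav, holam, final mem
    const std::u32string word = U"\u05E9\u05B8\u05DC\u05D5\u05B9\u05DD";

    assert(word.size() == 6);            // 6 code points, 24 bytes in UTF-32
    assert(is_combining_mark(word[1]));  // index 1 is a diacritic, not a character

    const std::vector<std::u32string> chars = split_characters(word);
    assert(chars.size() == 4);           // 4 user-perceived characters
    assert(chars[1] == U"\u05DC");       // the 2nd character is lamed
}
```

Indexing word[1] lands on the qamats mark, so even with a fixed-width
encoding the offset 4n gives the n-th code point only. Getting the n-th
character requires walking the string and grouping marks, as split_characters
does here in a very limited way.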

Boost.Locale has a special character iterator for this purpose; it
works on characters, not code points.

See: 
http://cppcms.sourceforge.net/boost_locale/html/tutorial.html#8e296a067a3756...

Artyom