Character Encoding (was: General C++: Class Method Signatures)

4 May 2009

      ...
Date: Fri, 1 May 2009 17:51:03 -0500
From: Dominique Devienne <ddevienne@gmail.com>
Subject: Re: [Boost-users] General C++: Class Method Signatures
  Miss-Match
To: boost-users@lists.boost.org
Message-ID:
  <255d8d690905011551v108b95c4sbfeaa21c443ac23e@mail.gmail.com>
Content-Type: text/plain; charset=ISO-8859-1
> On Fri, May 1, 2009 at 10:38 AM, Etienne Philip Pretorius
> <icewolfhunter@gmail.com> wrote:
> ...
> > Wide chars as far as I know are different from platform to platform.
> > Microsoft uses 16 bit wide chars while *nix uses 32 bit wide chars.
> > And for unicode I need at least 21 bits.
> It's not only a question of char size, the encoding matters. I don't
> know for sure, but I bet Windows can represent all code points by
> using pairs of 16-bit wide chars (surrogate pairs). As long as you
> have a way to convert the current wstring instances to a known encoding
> like UTF-8, UTF-16 (bom, le, be), or UTF-32 for wire transport or
> persistence, the actual representation of wstring doesn't matter.
The Windows API uses 16-bit characters and calls it "Unicode".
Originally it was UCS-2, meaning each code point was simply a 16-bit
unsigned integer.  Later, they added some support for surrogate pairs,
which allows code points above U+FFFF to be represented as a pair of
16-bit values.  This is now known as UTF-16.  This only works for some
of the text functions,
though.  Much of the API and any casual handling functions will simply
count each 16-bit cell as 1 character unit.
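
To make the surrogate-pair mechanism concrete, here is a minimal sketch of
how a code point above U+FFFF gets split into two 16-bit units.  The name
encode_utf16 is made up for illustration, and the sketch assumes C++11's
std::u16string rather than wchar_t so that the element width is the same on
every platform.

```cpp
#include <cassert>
#include <string>

// Encode one Unicode code point as UTF-16 code units.
// Code points up to U+FFFF take one unit; anything above takes a surrogate pair.
// (encode_utf16 is a hypothetical helper, not part of any library.)
std::u16string encode_utf16(char32_t cp) {
    // Precondition: a valid scalar value (at most U+10FFFF, not itself a surrogate).
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::u16string out;
    if (cp < 0x10000) {
        out.push_back(static_cast<char16_t>(cp));
    } else {
        const char32_t v = cp - 0x10000;                             // 20 bits remain
        out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));    // high (lead) surrogate
        out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));  // low (trail) surrogate
    }
    return out;
}
```

For example, U+1D11E comes out as the two units 0xD834 0xDD1E, which code
that counts 16-bit cells will see as two "characters".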

So yes, if you are using the font-related functions to draw text, you
can represent any code point using UTF-16.  But for opening a file, it
treats the name as a sequence of 16-bit values without any understanding
of surrogate pairs or duplicate meanings.

For manipulating wstrings in C++, neither the standard library wcs*
functions nor the wstring class deals with surrogate pairs.  So if you
call wcslen, for example, you get the number of elements in the wchar_t
array, not the number of Unicode code points, if some of them are
encoded pairs.
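
The difference between elements and code points can be sketched as follows.
This assumes well-formed UTF-16 held in a C++11 std::u16string (a stand-in
for a 16-bit wstring); count_code_points and check_example are names
invented for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Count code points in a UTF-16 string, assuming the input is well-formed.
// (count_code_points is a hypothetical helper, not a standard function.)
std::size_t count_code_points(const std::u16string& s) {
    std::size_t n = 0;
    for (char16_t u : s) {
        // A low (trail) surrogate is the second half of a pair whose first
        // half was already counted, so it does not start a new code point.
        if (u < 0xDC00 || u > 0xDFFF) ++n;
    }
    return n;
}

// "A" followed by U+1D11E, which UTF-16 encodes as the pair D834 DD1E.
void check_example() {
    const std::u16string s = u"A\U0001D11E";
    assert(s.size() == 3);              // what a length function reports
    assert(count_code_points(s) == 2);  // what a reader would call the length
}
```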
> ...
> Of course I don't know what the encoding conversion methods would be
> for Windows and Linux and *nix in general. Does Boost.IOStream provide
> those in a portable fashion? I'd be interested in pointers on this
> topic if you know about those methods. Thanks, --DD
My limited knowledge is that Linux can use UTF-8 as the code page.  Some
distros do that by default.

General pointer on this topic:  When specifying a file or interchange
format, (1) use Unicode, and (2) specify the encoding or options
thereof.

IMHO, manipulating strings as UTF-16 is neither here nor there.  You
bloat the size if you are dealing with mostly Western language text, but
*still* have to deal with multi-unit sequences.  So if you care about
characters, really, as opposed to just how much room do you need for the
representation, then use 32-bit characters (I call it "xstring") or
stick with UTF-8 if that's what's loaded/saved.
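
As a rough sketch of the 32-bit route, loading UTF-8 into one element per
code point might look like this.  It assumes C++11's std::u32string as the
32-bit string type, decode_utf8 is a name invented here, and the validation
is deliberately incomplete (overlong forms and encoded surrogates are not
rejected).

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Decode UTF-8 into one 32-bit element per code point.
// Sketch only: rejects bad lead bytes, bad continuation bytes and truncated
// input, but does not reject overlong forms or encoded surrogates.
// (decode_utf8 is a hypothetical helper, not a standard function.)
std::u32string decode_utf8(const std::string& in) {
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b = static_cast<unsigned char>(in[i]);
        char32_t cp;
        std::size_t extra;  // number of continuation bytes expected
        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");
        if (extra > in.size() - i - 1)
            throw std::invalid_argument("truncated UTF-8 sequence");
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);  // append 6 payload bits
        }
        out.push_back(cp);
        i += extra + 1;
    }
    assert(i == in.size());  // every byte was consumed exactly once
    return out;
}
```

After decoding, size() and indexing work per code point, at the cost of four
bytes per element.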

--John

John Dlugosz
