Character Encoding (was: General C++: Class Method Signatures)
Date: Fri, 1 May 2009 17:51:03 -0500
From: Dominique Devienne
Subject: Re: [Boost-users] General C++: Class Method Signatures Miss-Match
To: boost-users@lists.boost.org
Message-ID: <255d8d690905011551v108b95c4sbfeaa21c443ac23e@mail.gmail.com>
Content-Type: text/plain; charset=ISO-8859-1

On Fri, May 1, 2009 at 10:38 AM, Etienne Philip Pretorius wrote:
Wide chars, as far as I know, differ from platform to platform. Microsoft uses 16-bit wide chars while *nix uses 32-bit wide chars. And for Unicode I need at least 21 bits.
It's not only a question of char size; the encoding matters. I don't know for sure, but I bet Windows can represent all code points by using pairs of 16-bit wide chars (surrogate pairs). As long as you have a way to convert the current wstring instances to a known encoding like UTF-8, UTF-16 (with BOM, LE, or BE), or UTF-32 for wire transport or persistence, the actual representation of wstring doesn't matter.
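To make the "convert to a known encoding" step concrete, here is a minimal sketch that serializes a sequence of code points as UTF-8. It assumes a C++11 compiler (for char32_t and std::u32string); the names append_utf8 and to_utf8 are made up for this illustration and are not part of Boost or the standard library. Invalid values are replaced with U+FFFD, which is one possible policy, not the only one.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: append one Unicode code point to `out` as UTF-8.
void append_utf8(std::string& out, char32_t cp) {
    // Surrogates and values above U+10FFFF are not valid scalar values;
    // this sketch substitutes U+FFFD (the replacement character).
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) cp = 0xFFFD;
    assert(cp <= 0x10FFFF);

    if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes: 11110xxx + 3 continuation bytes
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical helper: encode a whole UTF-32 string as UTF-8.
std::string to_utf8(const std::u32string& text) {
    std::string out;
    for (char32_t cp : text) append_utf8(out, cp);
    return out;
}
```

Getting from a platform wstring to code points is the platform-specific part: where wchar_t is 32 bits the elements can be copied directly, while 16-bit wchar_t data needs its surrogate pairs combined first.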
The Windows API uses 16-bit characters and calls it "Unicode". Originally it was UCS-2, meaning each code point was simply a 16-bit unsigned integer. Later, they added some support for surrogate pairs, which allow code points above U+FFFF to be represented as a pair of 16-bit values. This is now known as UTF-16. This only works for some of the text functions, though. Much of the API and any casual handling functions will simply count each 16-bit cell as one character unit. So yes, if you are using the font-related functions to draw text, you can represent any code point using UTF-16. But for opening a file, it treats the name as a sequence of 16-bit values without any understanding of surrogate pairs or duplicate meanings. For manipulating wstrings in C++, neither the standard library wcs* functions nor the wstring class deals with surrogate pairs. So if you call wcslen, for example, you get the number of elements in the wchar_t array, not the number of Unicode code points, if some of them are encoded as pairs.
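As a rough illustration of the difference between array elements and code points, here is a minimal sketch that counts code points in UTF-16 text. It uses C++11 char16_t and std::u16string rather than wchar_t so that the code unit width is the same on every platform; that substitution, and the name count_code_points, are assumptions made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: count Unicode code points in a UTF-16 string.
// A high surrogate (0xD800-0xDBFF) immediately followed by a low
// surrogate (0xDC00-0xDFFF) is counted once. An unpaired surrogate
// is counted as one item on its own.
std::size_t count_code_points(const std::u16string& s) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        const char16_t unit = s[i];
        const bool is_high = unit >= 0xD800 && unit <= 0xDBFF;
        if (is_high && i + 1 < s.size()) {
            const char16_t next = s[i + 1];
            if (next >= 0xDC00 && next <= 0xDFFF) {
                ++i;  // skip the low half of the pair
            }
        }
        ++count;
    }
    // There can never be more code points than code units.
    assert(count <= s.size());
    return count;
}
```

For a string such as u"a\U0001F600b", size() reports four code units while count_code_points reports three code points, because U+1F600 lies outside the 16-bit range and is stored as a surrogate pair.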
Of course I don't know what the encoding conversion methods would be for Windows and Linux and *nix in general. Does Boost.Iostreams provide those in a portable fashion? I'd be interested in pointers on this topic if you know about those methods. Thanks, --DD
My limited knowledge is that Linux can use UTF-8 as the code page. Some distros do that by default. General pointer on this topic: when specifying a file or interchange format, (1) use Unicode, and (2) specify the encoding or options thereof. IMHO, manipulating strings as UTF-16 is neither here nor there. You bloat the size if you are dealing with mostly Western-language text, but *still* have to deal with multi-unit sequences. So if you really care about characters, as opposed to just how much room you need for the representation, then use 32-bit characters (I call it "xstring") or stick with UTF-8 if that's what's loaded/saved. --John
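To illustrate the 32-bit approach mentioned above, here is a minimal sketch that decodes UTF-8 into one 32-bit element per code point, so that indexing and length work per code point. It assumes C++11's std::u32string as a stand-in for the "xstring" idea; the name from_utf8 is hypothetical. Malformed input is replaced with U+FFFD one byte at a time, and the sketch deliberately does not reject overlong forms or encoded surrogates, which a production decoder should.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: decode UTF-8 bytes into UTF-32 code points.
std::u32string from_utf8(const std::string& in) {
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp = 0xFFFD;  // default for invalid lead bytes
        std::size_t len = 1;   // total bytes in this sequence

        if (lead < 0x80)                { cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; len = 2; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; len = 3; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; len = 4; }

        // Every following byte must look like 10xxxxxx.
        bool ok = (i + len <= in.size());
        for (std::size_t k = 1; ok && k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(in[i + k]);
            if ((b & 0xC0) != 0x80) ok = false;
            else cp = (cp << 6) | (b & 0x3F);
        }
        if (!ok) { cp = 0xFFFD; len = 1; }  // resynchronize at the next byte

        assert(i + len <= in.size());
        out += cp;
        i += len;
    }
    return out;
}
```

After decoding, size() is the number of code points and out[n] is the n-th code point, at the cost of four bytes per element regardless of the text.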
Participants (1): John Dlugosz