Re: [boost] RFC: interest in Unicode codecs?

14 Feb 2009


Esben,
...
I think you have gotten something mixed up. UTF-8 and UTF-32 (aka UCS-4)
are just two encodings of the same character set, including the
combining characters you mentioned (which are really not that uncommon,
e.g. m?l?e contains 2 characters which could be written by combining
glyphs). In practical terms, UTF-32 is somewhat useless. (A case might
be made for UTF-16, though.)
Kind regards, Esben
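
To make the point above concrete, here is a small Python sketch showing
that a precomposed character and its combining sequence are both just
code points, and that UTF-8 and UTF-32 differ only in how many bytes
they spend on them. The helper name describe() and the choice of
U+00E5 as the sample character are illustrative assumptions, not
anything from the thread.

```python
import unicodedata

# Illustrative sample: the same visible letter written two ways.
precomposed = "\u00e5"   # LATIN SMALL LETTER A WITH RING ABOVE
combining = "a\u030a"    # 'a' followed by COMBINING RING ABOVE


def describe(text):
    """Return (code points, UTF-8 bytes, UTF-32 bytes) for text."""
    return (
        len(text),
        len(text.encode("utf-8")),
        len(text.encode("utf-32-be")),  # big-endian, no byte order mark
    )


for label, text in (("precomposed", precomposed), ("combining", combining)):
    points, utf8_len, utf32_len = describe(text)
    print(f"{label}: {points} code point(s), "
          f"{utf8_len} UTF-8 byte(s), {utf32_len} UTF-32 byte(s)")

# Both encodings round-trip to the same code points; only the byte
# layout differs. Normalisation (NFC) is what folds the combining
# sequence into the precomposed form, independent of the encoding.
assert unicodedata.normalize("NFC", combining) == precomposed
```

Neither encoding makes the combining sequence go away: that is a
property of the character set, not of UTF-8 or UTF-32.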
Having written both basic text editors and Unicode text editors, I can
say that if you are only handling Western Hemisphere text then it may be
more efficient to go UTF-8. If you stick to Unicode Plane 0 [the Basic
Multilingual Plane] then UTF-16 might be appropriate if you have no
formatting bits, but by the time you want to do a full Unicode text
editor you end up with [from memory] 21 bits of the UTF-32 encoding for
the code point, and the remaining bits for your own formatting info if
you need it [font/colour etc]. With surrogates, you are still [very]
slightly encoded in a 32-bit width, but this is a very acceptable
trade-off for simplicity. In that sense UTF-32 is a misnomer as it does
not occupy a full 32 bits, but it is still an encoding!
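
A minimal sketch of the cell layout described above, in Python for
brevity. It assumes the code point sits in the low 21 bits (Unicode
stops at U+10FFFF, which fits in 21 bits) and that the spare 11 bits
carry an opaque formatting value. The names pack_cell and unpack_cell
and the bit layout itself are hypothetical, not taken from any real
editor.

```python
CODE_POINT_BITS = 21                          # U+10FFFF fits in 21 bits
CODE_POINT_MASK = (1 << CODE_POINT_BITS) - 1
FORMAT_BITS = 32 - CODE_POINT_BITS            # 11 spare bits per cell


def pack_cell(ch, fmt):
    """Pack one character and a small format value into a 32-bit int."""
    if not 0 <= fmt < (1 << FORMAT_BITS):
        raise ValueError("format value does not fit in the spare bits")
    # Assumed layout: format in the high bits, code point in the low bits.
    return (fmt << CODE_POINT_BITS) | ord(ch)


def unpack_cell(cell):
    """Split a packed cell back into (character, format value)."""
    return chr(cell & CODE_POINT_MASK), cell >> CODE_POINT_BITS


# A character outside Plane 0 still fits in a single fixed-width cell,
# so no surrogate handling is needed when indexing the buffer.
cell = pack_cell("\U0001D11E", 5)             # MUSICAL SYMBOL G CLEF
assert cell < 2 ** 32
assert unpack_cell(cell) == ("\U0001D11E", 5)
```

The cost is up to four bytes per character even for plain ASCII, which
is the space-for-simplicity trade-off mentioned above.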

Yours,

Graham