Re: [boost] RFC: interest in Unicode codecs?

Dear Phil,
Having said all that, I must say that I actually use the code that I wrote quite rarely. I now tend to use UTF8 everywhere and treat it as a sequence of bytes. Because of the properties of UTF8 I find it's rare to need to identify individual code points. For example, if I'm scanning for a matching " or ) I can just look for the next matching byte, without worrying about where the character boundaries are.
Using UTF-8 can work well if you are only targeting America and Western Europe for non-literary use. If you need to support the rest of the world you really need to move to UTF-32, due to the large number of characters and the grapheme and glyph handling [e.g. in Urdu you can type 3 characters and they are displayed as a single combined glyph, and the cursor should never be placed between them].
Even in UTF-8 things can get a bit tricky. For example, where do you break the line if you need to in the middle of:
  joe)jack -> joe) <br> jack
  joe(jack -> joe <br> (jack
  joe+jack -> guess which is the standard!
For programmers we don't mind too much, but when you are writing text editors this can be really important. Now think how many characters there are with special rules on whether they can be split before, after, or never split, and you start to touch on the reason for the Unicode standard and why you need character properties. Yours, Graham

On Saturday 14 February 2009 11:53:20 Graham wrote:
Using UTF-8 can work well if you are only targeting America and Western Europe for non-literary use.
If you need to support the rest of the world you really need to move to UTF-32 due to the large number of characters and the grapheme and glyph handling [e.g. in Urdu you can type 3 characters and they are displayed as a single combined glyph, and the cursor should never be placed between them].
I think you have gotten something mixed up. UTF-8 and UTF-32 (aka UCS-4) are just two encodings of the same character set, including the combining characters you mentioned (which are really not that uncommon; e.g. mêlée contains two characters which could be written with combining glyphs). In practical terms, UTF-32 is somewhat useless. (A case might be made for UTF-16, though.) -- Kind regards, Esben

Graham wrote:
Dear Phil,
Having said all that, I must say that I actually use the code that I wrote quite rarely. I now tend to use UTF8 everywhere and treat it as a sequence of bytes. Because of the properties of UTF8 I find it's rare to need to identify individual code points.
Note that I said "rare" not "never", and that in the part that you didn't quote I explained that I do have code to extract code points from UTF-8 byte sequences.
For example, if I'm scanning for a matching " or ) I can just look for the next matching byte, without worrying about where the character boundaries are.
Using UTF-8 can work well if you are only targeting America and Western Europe for non-literary use.
If you need to support the rest of the world you really need to move to UTF-32 due to the large number of characters and the grapheme and glyph handling
UTF-8 encodes the same characters as UTF-32. I wonder if you misread "UTF-8" as "ISO-8859-1"?
[e.g. in Urdu you can type 3 characters and they are displayed as a single combined glyph, and the cursor should never be placed between them].
Right. This is a very complex area. But I don't think the choice of UTF-8 or UTF-32 makes much difference. If you use UTF-32 you can have efficient random access which you can't with UTF-8. UTF-8 will be more compact than UTF-32 in all but the most contrived cases. Whether compactness or efficiency of random access matters to you will depend on your application. These are almost the only ways in which the choice of encoding matters.
Even in UTF-8 things can get a bit tricky. For example, where do you break the line if you need to in the middle of:
  joe)jack -> joe) <br> jack
  joe(jack -> joe <br> (jack
  joe+jack -> guess which is the standard!
I don't see how this influences your choice of UTF variant.
For programmers we don't mind too much, but when you are writing text editors this can be really important.
Now think how many characters there are with special rules on whether they can be split before, after, or never split, and you start to touch on the reason for the Unicode standard and why you need character properties.
Yes, a Unicode character properties library is important to those who are writing text editors and similar applications. Perhaps Boost should have one. I have personally used the Unicode properties tables for doing "approximate matching" of e.g. accented characters with their base characters when searching. But I can do that equally well in UTF-8 as in UTF-32. Regards, Phil.

On Sat, Feb 14, 2009 at 10:07 AM, Phil Endecott <spam_from_boost_dev@chezphil.org> wrote:
/* snip */ Yes, a Unicode character properties library is important to those who are writing text editors and similar applications. Perhaps Boost should have one. I have personally used the Unicode properties tables for doing "approximate matching" of e.g. accented characters with their base characters when searching. But I can do that equally well in UTF-8 as in UTF-32.
If you are at all interested in other opinions, I would love for Boost to have a UTF-8(16/32) helper library. I use ICU for many things, but it is too bulky for the little I use of it, and its syntax does not match anything else I use (my projects stay with a rather Boost-inspired design). A Boost library would be wonderful. And yes, UTF-8/16/32 all encode the same character set, just with different trade-offs in compactness and searching speed.

OvermindDL1 wrote:
On Sat, Feb 14, 2009 at 10:07 AM, Phil Endecott <spam_from_boost_dev@chezphil.org> wrote:
/* snip */ Yes, a Unicode character properties library is important to those who are writing text editors and similar applications. Perhaps Boost should have one. I have personally used the Unicode properties tables for doing "approximate matching" of e.g. accented characters with their base characters when searching. But I can do that equally well in UTF-8 as in UTF-32.
If you are at all interested in other opinions, I would love for Boost to have a UTF-8(16/32) helper library.
There is a Google Summer of Code project for a Unicode library which I'm working on. It allows handling of Unicode text in any of the UTF-8, UTF-16 or UTF-32 encodings, bundles a small-ish Unicode character database, and supports grapheme boundaries, composition/decomposition and normalization, but not "approximate matching", collation or case folding (at least it won't for the time being).
participants (5)
-
Esben Mose Hansen
-
Graham
-
Mathias Gaunard
-
OvermindDL1
-
Phil Endecott