
Rogier van Dalen wrote:
On Thu, 17 Mar 2005 17:52:25 +0100, Erik Wien <wien@start.no> wrote:
What exactly do you mean by the term "character"? Abstract characters?
I really need to remember the correct terminology - what I mean is the thing "a user thinks of as a character", a "grapheme cluster", of which the Unicode standard says:
"[T]here is a core concept of "characters that should be kept together" that can be defined for the Unicode Standard in a language-independent way. This core concept is known as a grapheme cluster, and it consists of any combining character sequence that contains only nonspacing combining marks, or any sequence of characters that constitutes a Hangul syllable (possibly followed by one or more nonspacing marks)."
I believe this is what a Unicode library should use as its basic unit.
Be careful about making a global assertion. Different users of a Unicode library will need to access the data at different levels. Some will need the raw encoding bytes or words, some will need code points, and some will need 'grapheme clusters'. The library should support working at the level that each particular user needs, and different parts of an application or library may need to work at multiple levels.

--
Jon Biggar
Levanta
jon@levanta.com
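
To make the three levels concrete, here is a rough sketch in Python (not the library under discussion) that views one short string as encoded bytes, as code points, and as approximate grapheme clusters. The helper simple_clusters is a hypothetical name and a deliberate simplification: it only attaches nonspacing marks (general category Mn) to the preceding code point, and it does not handle Hangul conjoining jamo or the rest of the full segmentation rules.

```python
import unicodedata


def simple_clusters(text):
    """Group each code point with the nonspacing marks (category Mn) after it.

    Simplified assumption: this is not full grapheme cluster segmentation.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch) == "Mn":
            # A nonspacing combining mark stays with the preceding cluster.
            clusters[-1] += ch
        else:
            # Anything else starts a new cluster.
            clusters.append(ch)
    return clusters


# 'e' followed by U+0301 COMBINING ACUTE ACCENT, then 'a'.
text = "e\u0301a"

utf8_bytes = text.encode("utf-8")        # level 1: raw encoding bytes
code_points = [ord(c) for c in text]     # level 2: code points
clusters = simple_clusters(text)         # level 3: approximate grapheme clusters

print(len(utf8_bytes), len(code_points), len(clusters))
```

The same three-code-point string has a different length at each level, which is why a single "basic unit" cannot serve every caller.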