
On Thu, 24 Mar 2011 11:14:08 +0800 Soares Chen <crf@hypershell.org> wrote:
[...] What should be the type for a single Unicode combining character and grapheme? Unicode combining character sequences and graphemes (aka the abstract characters) can consist of an arbitrary number of code points. This means that unlike basic types such as char that can be placed on the stack, the value for even a single abstract character must live on the heap due to its variable size. [...]
Maybe not. The "Stream-Safe Text Format" is designed specifically for this. From <http://www.unicode.org/reports/tr15/index.html#Stream_Safe_Text_Format>:

    A Unicode string is said to be in Stream-Safe Text Format if it would not contain any sequences of non-starters longer than 30 characters in length when normalized to NFKD. Such a string can be normalized in buffered serialization with a buffer size of 32 characters, which would require no more than 128 bytes in any Unicode Encoding Form.

It might be feasible to require graphemes to be in this format. I was planning to do so if I ever wrote a grapheme iterator.

Of course, it still might not be feasible to use a fixed-size structure for graphemes, depending on how many you need to store at once, but for an iterator it would be reasonable.

--
Chad Nelson
Oak Circle Software, Inc.

* * *
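
To make the fixed-size idea concrete, here is a minimal sketch of a grapheme value that lives entirely on the stack. The name FixedGrapheme and its members are hypothetical and not taken from any library. The 32-code-point capacity is borrowed from the buffer size quoted above; it only bounds a grapheme if the input is required to satisfy that limit, which is the assumption proposed in the message, not something Unicode guarantees for arbitrary text.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical fixed-capacity grapheme holder (illustration only).
// Assumes the caller rejects or splits any grapheme longer than 32 code points.
struct FixedGrapheme {
    static constexpr std::size_t kCapacity = 32;  // buffer size quoted from UAX #15

    // Appends one code point; returns false when the buffer is already full.
    bool push_back(char32_t cp) {
        if (length_ == kCapacity) return false;
        code_points_[length_++] = cp;
        return true;
    }

    std::size_t size() const { return length_; }

    char32_t operator[](std::size_t i) const {
        assert(i < length_);  // only code points stored so far are valid
        return code_points_[i];
    }

private:
    std::array<char32_t, kCapacity> code_points_{};
    std::size_t length_ = 0;
};

// On platforms where char32_t is 4 bytes, the payload is the 128 bytes
// mentioned in the quoted text.
static_assert(sizeof(char32_t) != 4 ||
                  sizeof(char32_t) * FixedGrapheme::kCapacity == 128,
              "32 four-byte code points occupy 128 bytes");
```

An iterator could fill one such object per step and hand it out by value, so no heap allocation is needed. Storing many of them at once costs the full capacity per grapheme, which is the trade-off noted above.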