
Hi Erik!

I'm glad to see you've made a lot of progress during these months of silence. I've got a few comments for now.

Of course there isn't much documentation yet, but now that the library is out in the open, writing a Unicode primer would be a good thing to do. Issues that I don't think many programmers are aware of include (off the top of my head): what code points are (21 bits), what Unicode characters are, why you need combining characters, and why UTF-32 is usually not optimal. The library will need these docs at some point anyway. I'd gladly help out with this, though I'm not sure it would fit your university's requirements.

Some speculation on the Unicode database: do you really need the character names? Maybe you should use multi_index, probably with hashing (sketch 1 at the end of this mail). Maybe you could use Boost.Serialization for loading the file.

I think that in general you will need to separate input/output from the rest of the Unicode processing. For example, endianness only matters when portably reading or writing files; IMO strings in memory should use the platform's endianness. (I second Thorsten's proposal of having utf8_string, utf16_string, utf32_string, utf_string.) For reading code points from files, a codecvt could be used (sketch 2 below). This can be fast because its virtual functions are called once per block of bytes rather than once per character. I think there's an implementation floating around in the Yahoo files section that can automatically figure out a file's encoding and convert to and from any endianness.

I also think you should separate code points and Unicode characters (sketch 3 below illustrates the difference). In normal situations the user should not have to deal with code points, and the discussion should not focus on them for now; they're an implementation detail. I strongly object to your

    typedef encoded_string<unicode_tag> unicode_string;

because I think a Unicode string should contain characters. For example, a regular expression on Unicode strings should support level 2 (see <http://www.unicode.org/reports/tr18/>). Why go for anything less?

Hoping this will be useful,
Rogier
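
P.S. A few rough sketches to make the above more concrete.

Sketch 1 - the database as a Boost.MultiIndex container. This is only meant to show the shape of the thing; the struct, its field names and the second index are placeholders of mine, not anything from your library:

    #include <boost/multi_index_container.hpp>
    #include <boost/multi_index/hashed_index.hpp>
    #include <boost/multi_index/member.hpp>
    #include <boost/cstdint.hpp>
    #include <string>

    struct code_point_properties
    {
        boost::uint32_t code_point;        // 21 bits needed, 32 stored
        std::string     name;              // droppable, if you don't need names
        int             general_category;  // ...plus whatever else of the UCD you keep
    };

    namespace mi = boost::multi_index;

    typedef mi::multi_index_container<
        code_point_properties,
        mi::indexed_by<
            // O(1) lookup by code point, which is what most algorithms need
            mi::hashed_unique<
                mi::member<code_point_properties, boost::uint32_t,
                           &code_point_properties::code_point> >,
            // lookup by name, only if the names are kept at all
            mi::hashed_non_unique<
                mi::member<code_point_properties, std::string,
                           &code_point_properties::name> >
        >
    > unicode_database;

Lookup by code point is then just db.find(0x00E9), since the container exposes the interface of its first index directly. As far as I know Boost.MultiIndex containers can be serialised directly, so loading the database could become a single Boost.Serialization archive read instead of reparsing the UCD text files at start-up.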
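
Sketch 2 - reading code points through a codecvt imbued into a file stream. This only shows the mechanism, assuming a UTF-8 file and using a ready-made facet (std::codecvt_utf8 here, but any codecvt<wchar_t, char, mbstate_t> implementation would do); the facet from the Yahoo files section would presumably detect the encoding and endianness itself, and "example.txt" is of course made up:

    #include <fstream>
    #include <locale>
    #include <codecvt>   // std::codecvt_utf8

    int main()
    {
        std::wifstream in("example.txt");

        // Plug the conversion facet into the stream's locale.  The file
        // buffer calls the facet once per block of bytes it reads, not
        // once per character, which is why this can be made fast.
        in.imbue(std::locale(in.getloc(), new std::codecvt_utf8<wchar_t>));

        wchar_t cp;
        while (in.get(cp))
        {
            // ... hand one code point at a time to the rest of the library ...
            // (wchar_t is only 16 bits on some platforms, so a real
            // implementation would want a 32-bit code point type here)
        }
    }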
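
Sketch 3 - why code points and characters are not the same thing. Just illustrative data, no library code involved:

    #include <string>
    #include <cassert>

    int main()
    {
        // Two spellings of the same user-perceived character "é":
        std::u32string composed   = U"\u00E9";    // LATIN SMALL LETTER E WITH ACUTE
        std::u32string decomposed = U"e\u0301";   // 'e' + COMBINING ACUTE ACCENT

        assert(composed.size()   == 1);  // one code point
        assert(decomposed.size() == 2);  // two code points...
        // ...but both are a single character (grapheme), which is the unit
        // a unicode_string (and a level 2 regular expression) should
        // present to the user.
    }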