
Erik Wien wrote:
Hi. I am in the process of planning a library for handling Unicode strings in C++, and would like to probe the interest in the Boost community for something like that. I read through the Unicode discussion that was up back in April, and from what I could gather there was some amount of interest, but no one felt comfortable taking on the task as of yet.
I am hoping to be able to run this project as my Bachelor's Thesis in Computer Engineering (not sure if that is the correct translation from Norwegian), and if it gets approved by my college, two other programmers and I will spend one semester working exclusively on this (in collaboration with the Boost community, of course). At the end of that semester I hope the library (or at least parts of it) will be in such a state that it can be submitted for review by Boost.
The library should ultimately have support for at least basic handling of Unicode strings (in all encodings), collation of strings, and other locale-specific operations. The library should also be integrated (to the extent that is possible) with the standard C++ library and Boost, to get as much functionality as possible "for free". I am thinking here of, among other things, the std::locale class and compatibility with iostreams. How these requirements are fulfilled will be determined as the project (hopefully) moves forward.
A few points you probably already know:

1) Wide characters and Unicode characters are not necessarily the same thing for any given implementation.
2) There are quite a few Unicode encodings.
3) The idea is to be able to plug a Unicode encoding into the same standard library templates and Boost templates which now support 'char' and 'wchar_t'.

In other words, ideally you want to treat your Unicode encoding as just another character type, with extra smarts depending on the encoding. The extra smarts would be used in specializations.

In the past in comp.std.c++ I attempted to promote the idea that all standard library functionality which deals generally in characters and strings should be parameterized on the character type, for the sake of orthogonality and the future. While most of it is, there is still some functionality which is not, namely exceptions, file names, and locale message files, and which assumes that only narrow characters exist. I am still amazed that programmers from countries which would normally use wide characters as Unicode encodings, such as the Japanese, have not made more of an issue of this, but perhaps they are so used to their far more difficult DBCS roots that pursuing wide characters everywhere, much less a real Unicode encoding, is a minor issue for them.
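To make point 3 a bit more concrete, here is a rough sketch of what "just another character type, with extra smarts in specializations" might look like. This is only an illustration, not a proposed design: the names encoding_traits and code_point_count are hypothetical, the sketch assumes a C++11 or later compiler (for char32_t), and it simply assumes that 'char' strings hold well-formed UTF-8 without validating them.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical traits template. The primary template is left undefined, so
// only character types with a known encoding can be used.
template <typename CharT>
struct encoding_traits;

// Assumption for this sketch: 'char' strings hold well-formed UTF-8.
template <>
struct encoding_traits<char> {
    static std::size_t code_points(const std::basic_string<char>& s) {
        std::size_t n = 0;
        for (unsigned char c : s) {
            // Continuation bytes have the form 10xxxxxx; every other byte
            // starts a new code point.
            if ((c & 0xC0) != 0x80) ++n;
        }
        return n;
    }
};

// char32_t strings hold UTF-32: one code unit per code point.
template <>
struct encoding_traits<char32_t> {
    static std::size_t code_points(const std::basic_string<char32_t>& s) {
        return s.size();
    }
};

// Generic code is written once against basic_string<CharT>; the
// encoding-specific behaviour comes from the specialization.
template <typename CharT>
std::size_t code_point_count(const std::basic_string<CharT>& s) {
    return encoding_traits<CharT>::code_points(s);
}
```

With this in place, a UTF-8 std::string holding "na\xC3\xAFve" has size() 6 (code units) while code_point_count reports 5, the same count it reports for the equivalent std::u32string. The same generic function serves both encodings, which is the kind of orthogonality argued for above.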