Re: [boost] Call for interest for native unicode character and string support in boost

Hi,

I enclose a basic set of headers [in a single file for simplicity at this stage] so that the set of functions to provide Unicode character data can be discussed.

I believe that the core function calls will probably end up using function pointers so that different DLLs can use the same functionality and still have the implementation be transparent, and that the function pointers will be retrieved during an initialise call. Does anybody have any strong opinions on this?

Once the Unicode.h contents are agreed then I hope - if it is agreed quickly - to implement the necessary data generators to implement it. Once that is done then a string implementation should be fairly quick to implement.

I welcome comments.

Yours,
Graham Barnett BEng, MCSD/ MCAD .Net, MCSE/ MCSA 2003, CompTIA Sec+
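To make the function-pointer idea concrete, here is a minimal sketch of a table of entry points handed out by an initialise call. Everything in it is an assumption for illustration: the namespace `unicode_sketch`, the `function_table` struct, and the `initialise()` signature do not come from the attached header, and the ASCII-only implementations merely stand in for the generated Unicode data.

```cpp
#include <cassert>
#include <cstdint>

namespace unicode_sketch {

typedef std::uint32_t codepoint;  // stand-in for a 32-bit code point type

// Hypothetical table of entry points. In a real build it would be filled
// by the shared library that owns the generated Unicode data.
struct function_table {
    codepoint (*to_lower)(codepoint);
    bool (*is_upper)(codepoint);
};

// Placeholder implementations that cover ASCII only.
inline codepoint ascii_to_lower(codepoint c) {
    return (c >= 0x41 && c <= 0x5A) ? c + 0x20 : c;
}

inline bool ascii_is_upper(codepoint c) {
    return c >= 0x41 && c <= 0x5A;
}

// Hypothetical initialise call: callers fetch the table once and then go
// through the pointers, so the implementation behind them can change.
inline const function_table& initialise() {
    static const function_table table = { &ascii_to_lower, &ascii_is_upper };
    return table;
}

}  // namespace unicode_sketch
```

A caller would keep the reference returned by `unicode_sketch::initialise()` and invoke, for example, `table.to_lower(0x41)`; the indirection is what lets several DLLs share one implementation.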

Hi, Great, this seems a good first step. Glad to see things moving. I'll give my comments, but I hope Erik will step in so we can see what he's got.
> Once the Unicode.h contents are agreed then I hope - if it is agreed quickly - to implement the necessary data generators to implement it.
> Once that is done then a string implementation should be fairly quick to implement.
But maybe not quick to specify... :-)
> I welcome comments.
I agree with the general idea.

First, http://www.boost.org/more/lib_guide.htm#Guidelines has coding guidelines. In general, your code looks slightly C-ish. The Boost habit is to use the ".hpp" extension for C++ headers. You attached a file "unicode.hpp" but talk about "Unicode.hpp": note that these are different names.

I suggest we make a namespace "unicode" rather than prepending everything with "uni". The enums had probably better be put in structs:

    namespace unicode {
        struct range {
            enum type {
                latin1_supplement,
                latin_extended_a,
                latin_extended_b,
                ipa_extensions,
                // ...
            };
        };
    }

The fact that I find "Hungarian notation" ugly and meaningless is probably irrelevant, but it's not the way it's generally done in Boost.

char32_t is not yet a part of the C++ standard, I believe. I'm not sure, maybe we'd better call it "codepoint" anyway, and use #ifdef'ed typedefs. BOOL is not C++; it is spelled "bool". DWORD doesn't exist either; I believe you mean uint16_t (sic) for the collation data, if I understand correctly what the methods are doing. But I think collation should not be in this header yet, but rather be inserted later, when the string classes are defined.

Case conversion should probably take output iterators. That'll get rid of the complex/simple division. The methods should probably be templated as well, and take ranges rather than counts:

    template <class InputIterator, class OutputIterator>
    OutputIterator lowercase (InputIterator first, InputIterator last, OutputIterator result);

    template <class InputIterator, class OutputIterator>
    OutputIterator uppercase (InputIterator first, InputIterator last, OutputIterator result);

The break functions: couldn't these take iterators as well? For all use cases I can think of, this would be a much easier version to use:

    template <class InputIterator>
    InputIterator advance_grapheme (InputIterator position, InputIterator last);

(etc.)
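The iterator-based signatures above can be sketched as follows, to show why an output iterator removes the simple/complex split: a one-to-many mapping just writes more than one code point. This is a toy under stated assumptions, not a proposed implementation. The namespace `unicode_sketch` and the helper `is_combining_mark` are hypothetical, `uppercase` handles only ASCII plus U+00DF, and `advance_grapheme` treats a grapheme as one code point followed by marks in U+0300..U+036F, which is far simpler than the real Unicode segmentation rules.

```cpp
#include <cassert>
#include <cstdint>
#include <iterator>
#include <vector>

namespace unicode_sketch {

typedef std::uint32_t codepoint;  // stand-in for a 32-bit code point type

// Upper-cases [first, last) into result and returns the end of the output.
// Toy data: ASCII letters, plus U+00DF (sharp s), which expands to "SS".
template <class InputIterator, class OutputIterator>
OutputIterator uppercase(InputIterator first, InputIterator last,
                         OutputIterator result) {
    for (; first != last; ++first) {
        codepoint c = *first;
        if (c == 0x00DF) {
            *result++ = codepoint(0x0053);  // one input, two outputs
            *result++ = codepoint(0x0053);
        } else if (c >= 0x61 && c <= 0x7A) {
            *result++ = c - 0x20;
        } else {
            *result++ = c;
        }
    }
    return result;
}

// Hypothetical helper: only the Combining Diacritical Marks block.
inline bool is_combining_mark(codepoint c) {
    return c >= 0x0300 && c <= 0x036F;
}

// Returns the position just past the grapheme starting at position,
// using the simplified "base plus combining marks" rule described above.
template <class InputIterator>
InputIterator advance_grapheme(InputIterator position, InputIterator last) {
    if (position == last) return last;
    ++position;
    while (position != last && is_combining_mark(*position)) ++position;
    return position;
}

}  // namespace unicode_sketch
```

Calling `uppercase` with `std::back_inserter` on a vector lets the output grow as needed, so the caller never has to guess how long the converted text will be.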
Finally, just thinking out loud: both the case mappings and collation have default (non-locale-specific) and tailored modes. Shouldn't those best be represented by classes rather than free functions, and shouldn't there thus be a global variable "default" that provides default operations, and other objects for locale-specific operations?

Regards,
Rogier
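One way to picture the class-based idea is a base class holding the default mapping and a derived class for a tailoring. The sketch below is only an illustration: the class names, the accessor functions, and the choice of Turkish as the tailored example are all assumptions, and the default mapping covers ASCII only. Since `default` is a reserved word in C++, the default object is reached through a function named `default_case_mapping()` here.

```cpp
#include <cassert>
#include <cstdint>

namespace unicode_sketch {

typedef std::uint32_t codepoint;  // stand-in for a 32-bit code point type

// Default (non-locale-specific) mapping. Toy data: ASCII letters only.
class case_mapping {
public:
    case_mapping() {}
    virtual ~case_mapping() {}
    virtual codepoint lower(codepoint c) const {
        return (c >= 0x41 && c <= 0x5A) ? c + 0x20 : c;
    }
};

// Hypothetical tailoring: in Turkish, capital I (U+0049) lower-cases to
// dotless i (U+0131); everything else falls back to the default mapping.
class turkish_case_mapping : public case_mapping {
public:
    turkish_case_mapping() {}
    virtual codepoint lower(codepoint c) const {
        if (c == 0x0049) return 0x0131;
        return case_mapping::lower(c);
    }
};

// Accessors standing in for the global objects discussed above.
inline const case_mapping& default_case_mapping() {
    static const case_mapping mapping;
    return mapping;
}

inline const case_mapping& turkish_mapping() {
    static const turkish_case_mapping mapping;
    return mapping;
}

}  // namespace unicode_sketch
```

Code that takes a `const case_mapping&` then works unchanged with either object, which is the main attraction over a pair of free functions per locale.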
participants (2)
- Graham
- Rogier van Dalen