Re: [boost] Call for interest for native unicode character and string support in boost

From: Rogier van Dalen <rogiervd@gmail.com>
Subject: Re: [boost] Call for interest for native unicode character and string support in boost
Great, this seems a good first step. Glad to see things moving. I'll give my comments, but I hope Erik will step in so we can see what he's got.
I welcome comments.
I agree with the general idea. First, http://www.boost.org/more/lib_guide.htm#Guidelines has coding guidelines. In general, your code looks slightly C-ish. The Boost habit is to use the ".hpp" extension for C++ headers. You attached a file "unicode.hpp" but talk about "Unicode.hpp": note that these are different names. I suggest we make a namespace "unicode" rather than prepending everything with "uni". The enums had probably better be put in structs.
    namespace unicode {
        struct range {
            enum type {
                latin1_supplement,
                latin_extended_a,
                latin_extended_b,
                ipa_extensions,
                // ...
            };
        };
    }
Yes - it should be namespaced - I had omitted it for clarity. I still think that the uni prefix might be useful to remind those programmers using 'using unicode' that these are Unicode functions - but I am happy to lose that argument.
The fact that I find "Hungarian notation" ugly and meaningless is probably irrelevant, but it's not the way it's generally done in Boost. char32_t is not yet a part of the C++ standard, I believe. I'm not sure, maybe we'd better call it "codepoint" anyway, and use #ifdef'ed typedef's. BOOL is not C++; it is spelled "bool". DWORD doesn't exist either; I believe you mean uint16_t (sic) for the collation data, if I understand correctly what the methods are doing. But I think collation should not be in this header yet, but rather be inserted later, when the string classes are defined.
Oops - caught - I was attempting to write it in such a way that it could be used from C as well as C++ - hence BOOL not bool. DWORD is actually uint32_t. I believe collation must be here, as there will probably be several containers with Unicode characteristics and this is a good level for them to work on.
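As an illustration of the "#ifdef'ed typedef" idea for a "codepoint" type raised above, here is a minimal sketch. It assumes a compiler that provides <cstdint> and the standard feature-test macro __cpp_unicode_characters; the names codepoint, is_scalar_value and make_codepoint are hypothetical and come from neither poster's header.

```cpp
#include <cassert>
#include <cstdint>

namespace unicode {

// Use the built-in 32-bit character type where the compiler advertises it,
// otherwise fall back to a plain 32-bit unsigned integer.
#if defined(__cpp_unicode_characters)
typedef char32_t codepoint;
#else
typedef std::uint32_t codepoint;
#endif

// True for Unicode scalar values: 0..0x10FFFF, excluding the
// surrogate range 0xD800..0xDFFF.
inline bool is_scalar_value(codepoint c)
{
    return c <= 0x10FFFF && !(c >= 0xD800 && c <= 0xDFFF);
}

// Hypothetical helper: convert a raw integer, checking the upper bound.
inline codepoint make_codepoint(unsigned long value)
{
    assert(value <= 0x10FFFFul);
    return static_cast<codepoint>(value);
}

} // namespace unicode
```

Callers then spell the type unicode::codepoint and never need to know which underlying type was selected.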
Case conversion should probably take output iterators. That'll get rid of the complex/simple division. The methods should probably be templated as well, and take ranges rather than counts.
    template <class InputIterator, class OutputIterator>
    OutputIterator lowercase (InputIterator first, InputIterator last, OutputIterator result);
    template <class InputIterator, class OutputIterator>
    OutputIterator uppercase (InputIterator first, InputIterator last, OutputIterator result);
I like this, but we will still need to have a complex/simple division. However, using iterators the complex version can do both, and the simple version then becomes GetSimpleLowercase, for case conversion without changing length; it can again take an output iterator.
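To make the output-iterator point concrete, here is a rough sketch of why the caller no longer needs to care whether a mapping changes the length of the text. It is not either poster's implementation: the mapping table is a toy (ASCII a-z plus U+00DF, whose full uppercase form is "SS"), and codepoint and uppercase_copy are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <iterator>
#include <vector>

namespace unicode {

typedef std::uint32_t codepoint;  // hypothetical name, as suggested in the thread

// Toy mapping, not the real Unicode tables: ASCII a-z, plus U+00DF
// (LATIN SMALL LETTER SHARP S), which expands to two code points.
template <class InputIterator, class OutputIterator>
OutputIterator uppercase(InputIterator first, InputIterator last,
                         OutputIterator result)
{
    for (; first != last; ++first) {
        const codepoint c = *first;
        if (c == 0x00DF) {
            *result++ = codepoint(0x0053);  // 'S'
            *result++ = codepoint(0x0053);  // 'S'
        } else if (c >= 0x0061 && c <= 0x007A) {
            *result++ = c - 0x20;           // a-z -> A-Z
        } else {
            *result++ = c;                  // everything else unchanged
        }
    }
    return result;
}

// Hypothetical convenience wrapper showing a call site.
inline std::vector<codepoint> uppercase_copy(const std::vector<codepoint>& in)
{
    std::vector<codepoint> out;
    uppercase(in.begin(), in.end(), std::back_inserter(out));
    assert(out.size() >= in.size());  // this toy mapping never shrinks text
    return out;
}

} // namespace unicode
```

Because the output goes through std::back_inserter, the one-to-two expansion needs no special handling at the call site. A "simple" variant that guarantees unchanged length could share the same signature.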
The break functions: Couldn't these take iterators as well? For all use cases I can think of, this would be a much easier version to use:
    template <class InputIterator>
    InputIterator advance_grapheme (InputIterator position, InputIterator last);
(etc.)
When I did my original coding I coded each of the following:
    GetStartOfGrapheme
    GetPreviousGrapheme
    GetNextGrapheme
I found that just by having IsStartOfGrapheme all these became really simple routines. I therefore believe extremely strongly that it is necessary to have StartOfGrapheme and that the others like GetNextGrapheme or advance_grapheme will then be simple/inline wrappers that use StartOfGrapheme. I also found that there was a coding hit if you have to test start and end iterator positions when processing the grapheme, hence I was passing in three DWORDs. Having said that, allowing inline versions that take iterators to call the core uint32_t/[DWORD] functions would be a good thing, and I would expect this to happen.
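The layering described here, an iterator-based wrapper on top of a core "is this the start of a grapheme" predicate, might look roughly as follows. This is only a sketch under loud assumptions: the predicate is a toy that treats just U+0300..U+036F (Combining Diacritical Marks) as continuing a grapheme, and it takes a single code point rather than the three DWORDs mentioned above. The real break rules are considerably more involved.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

namespace unicode {

typedef std::uint32_t codepoint;  // hypothetical name

// Toy stand-in for the core predicate: only the Combining Diacritical
// Marks block (U+0300..U+036F) is treated as continuing a grapheme.
inline bool is_start_of_grapheme(codepoint current)
{
    return !(current >= 0x0300 && current <= 0x036F);
}

// Thin wrapper: step past the current grapheme and return an iterator
// to the start of the next one, or last if there is none.
template <class InputIterator>
InputIterator advance_grapheme(InputIterator position, InputIterator last)
{
    assert(position != last);  // precondition: not already at the end
    ++position;
    while (position != last && !is_start_of_grapheme(*position))
        ++position;
    return position;
}

} // namespace unicode
```

For the sequence U+0065, U+0301, U+0078 ("e" with a combining acute accent, then "x"), the first call skips two code points and the second reaches the end. The end-of-range test lives in the wrapper, so the core predicate never needs a sentinel value.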
Finally, just thinking out loud: both the case mappings and collation have default (non-locale-specific) and tailored modes. Shouldn't those best be represented by classes rather than free functions, and shouldn't there thus be a global variable "default" that provides default operations, and other objects for locale-specific operations?
Unicode case mappings are locale-independent. I do not intend to handle any code page conversions at this stage - that can be added later and should be handled in a separate discussion. Those conversions would not be Unicode conversions, and I believe that discussion should be postponed to a later date.
Yours, Graham

Hi,
...
BOOL is not C++; it is spelled "bool". DWORD doesn't exist either; I believe you mean uint16_t (sic) for the collation data, if I understand correctly what the methods are doing. But I think collation should not be in this header yet, but rather be inserted later, when the string classes are defined.
Oops - caught - I was attempting to write it in such a way that it could be used from C as well as C++ - hence BOOL not bool. DWORD is actually uint32_t.
I know, but - correct me if I'm wrong - I think the sort keys consist of 16-bit integers.
I believe collation must be here, as there will probably be several containers with Unicode characteristics and this is a good level for them to work on.
I'm not sure - seeing as I'd expect us to come up with at least the option to keep strings in some Normalisation Form, the first step of the collation algorithm should be skipped in some cases. Furthermore, individual comparisons should probably be done without producing sort keys.
Case conversion should probably take output iterators. That'll get rid of the complex/simple division. The methods should probably be templated as well, and take ranges rather than counts.
    template <class InputIterator, class OutputIterator>
    OutputIterator lowercase (InputIterator first, InputIterator last, OutputIterator result);
    template <class InputIterator, class OutputIterator>
    OutputIterator uppercase (InputIterator first, InputIterator last, OutputIterator result);
I like this, but we will still need to have a complex/simple division. However, using iterators the complex version can do both, and the simple version then becomes GetSimpleLowercase, for case conversion without changing length; it can again take an output iterator.
The break functions: Couldn't these take iterators as well? For all use cases I can think of, this would be a much easier version to use:
    template <class InputIterator>
    InputIterator advance_grapheme (InputIterator position, InputIterator last);
(etc.)
When I did my original coding I coded each of the following:
    GetStartOfGrapheme
    GetPreviousGrapheme
    GetNextGrapheme
I found that just by having IsStartOfGrapheme all these became really simple routines. I therefore believe extremely strongly that it is necessary to have StartOfGrapheme and that the others like GetNextGrapheme or advance_grapheme will then be simple/inline wrappers that use StartOfGrapheme. I also found that there was a coding hit if you have to test start and end iterator positions when processing the grapheme, hence I was passing in three DWORDs.
Not being a native speaker of English I don't think I understand what a "coding hit" is. I'm assuming that the caller usually knows the start and end. (I'm obviously assuming a C++-style string and not a C-style zero-terminated one.) In this case, the caller must check whether the end has been reached and whether it's at the beginning, pass in 0 in those cases, and then start_of_grapheme has to check whether 0 was passed in. To me it seems this would be slower, both to code and to execute. Furthermore, one'd better abstract away from the implementation as much as possible, and I don't think it is relevant to the caller how far the functions want to look ahead. (I don't think "next" is needed for the grapheme_cluster case anyway.)
Having said that, allowing inline versions that take iterators to call the core uint32_t/[DWORD] functions would be a good thing, and I would expect this to happen.
Finally, just thinking out loud: both the case mappings and collation have default (non-locale-specific) and tailored modes. Shouldn't those best be represented by classes rather than free functions, and shouldn't there thus be a global variable "default" that provides default operations, and other objects for locale-specific operations?
Unicode case mappings are locale-independent.
Ermmm... what's this doing then on p. 137 of Unicode standard 4.0.0? "Characters may have case mappings that depend on the locale. The principal example is Turkish, ..."
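The Turkish case ties back to the earlier suggestion of representing default and tailored case mappings as objects rather than free functions. A minimal sketch of that idea, assuming hypothetical class names and a toy default table covering ASCII only, could be the following. Note that "default" is a reserved word in C++, so the sketch calls the object default_mapping.

```cpp
#include <cassert>
#include <cstdint>

namespace unicode {

typedef std::uint32_t codepoint;  // hypothetical name

// Default (untailored) mapping. Toy table: ASCII A-Z only.
struct case_mapping {
    virtual ~case_mapping() {}
    virtual codepoint to_lower(codepoint c) const
    {
        return (c >= 0x0041 && c <= 0x005A) ? c + 0x20 : c;
    }
};

// Turkish tailoring: I (U+0049) lowers to dotless i (U+0131), and
// capital I with dot above (U+0130) lowers to i (U+0069).
struct turkish_case_mapping : case_mapping {
    virtual codepoint to_lower(codepoint c) const
    {
        if (c == 0x0049) return 0x0131;
        if (c == 0x0130) return 0x0069;
        return case_mapping::to_lower(c);
    }
};

// Range-based conversion parameterised on the mapping object.
template <class InputIterator, class OutputIterator>
OutputIterator lowercase(InputIterator first, InputIterator last,
                         OutputIterator result, const case_mapping& mapping)
{
    for (; first != last; ++first)
        *result++ = mapping.to_lower(*first);
    return result;
}

// Usage: the same code point lowers differently under each mapping.
inline void demo()
{
    case_mapping default_mapping;
    turkish_case_mapping turkish;
    assert(default_mapping.to_lower(0x0049) == 0x0069);  // I -> i
    assert(turkish.to_lower(0x0049) == 0x0131);          // I -> dotless i
}

} // namespace unicode
```

Whether virtual dispatch, a policy template parameter, or something tied to std::locale is the better vehicle is left open here; the sketch only shows that the tailored behaviour can sit behind the same interface as the default.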
...
Regards, Rogier

Rogier van Dalen <rogiervd@gmail.com> writes:
namespace unicode { struct range { enum type { latin1_supplement, latin_extended_a, latin_extended_b, ipa_extensions, // ... }; }; } Yes - it should be namespaced - I had omitted it for clarity. I still think that the uni prefix might be useful to remind those programmers using 'using unicode' that these are Unicode functions - but I am happy to lose that argument.
...
Oops - caught - I was attempting to write it in such a way that it could be used from C as well as C++ - hence BOOL not bool. DWORD is actually uint32_t. I know, but - correct me if I'm wrong - I think the sort keys consist of 16-bit integers.
Rogier and Graham, Please try to leave a space between the text you quote and your reply, so that your very important discussion has the impact it deserves. http://www.boost.org/more/discussion_policy.htm#effective -- Dave Abrahams Boost Consulting www.boost-consulting.com
participants (3)
- David Abrahams
- Graham
- Rogier van Dalen