
Hi Phil,

On Aug 9, 2011, Phil Endecott wrote:
I think there are probably as many ways to implement a "better" string as there are potential users, and previous long discussions here have considered those possibilities at great length. In summary your proposal is for a string that is:
- Immutable.
- Reference counted.
- Iterated by default over unicode code points.
I think you misunderstood my point. Boost.Ustr does not attempt to redesign yet another string class. Instead, it wraps an existing string class supplied through a template parameter and relies on that string class for the actual container operations. The immutability of the string adapter is achieved by holding a smart pointer to the const version of the raw string.
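To make that concrete, here is a minimal sketch of the wrapping approach; the class and member names are simplified for illustration and are not the real Boost.Ustr interface:

    #include <boost/shared_ptr.hpp>
    #include <string>

    // Minimal illustrative sketch only; not the actual Boost.Ustr interface.
    template <typename StringT = std::string>
    class string_adapter_sketch
    {
    public:
        // Wrap a copy of an existing raw string; the adapter never mutates it.
        explicit string_adapter_sketch(const StringT& raw)
            : _str(new StringT(raw))
        { }

        // Delegate actual container operations to the wrapped string class.
        typename StringT::size_type size() const { return _str->size(); }

        // Expose the raw string for functionality the adapter does not wrap.
        const StringT& impl() const { return *_str; }

    private:
        // Immutability: a smart pointer to the const version of the raw string.
        boost::shared_ptr<const StringT> _str;
    };

The real adapter layers the Unicode-aware operations on top of this; the point is simply that the raw string class keeps doing the container work.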
- Provides access to the code units via operator* and operator->, i.e.
  s.begin()   // Returns a code point iterator.
  s->begin()  // Returns a code unit iterator.
I won't comment about the merits or otherwise of those points, apart from the last, where I'll note that it is not to my taste. It looks like it's "over clever". Imagine that I wrote some code using your library, and then a colleague who was not familiar with it had to look at it later. Would they have any idea about the difference between those two cases? No, not unless I added a comment every time I used it. Please let's have an obvious syntax like:
  s.begin()        // Code points.
  s.impl.begin()   // Code units.
or
  s.units_begin()  // Code units.
The actual intention of operator->() is not to provide access to a code unit iterator; it is meant to let programmers access raw string functionality that unicode_string_adapter cannot provide itself. After all, unicode_string_adapter is a wrapper/decorator around the raw string class, so it is supposed to add, not subtract, functionality relative to the original string class. The availability of str->begin() is really a side effect of enabling access to the methods of the raw string class. I'm sorry that the example in the documentation probably obscured the intention of operator->(). The documentation does say that calling str->begin() is strongly discouraged, but that counterexample was a poor choice. I'll change the usage example to str->c_str() to illustrate the usefulness of accessing raw string methods in some cases.

Back to your question about a method for accessing a code unit iterator: I did initially consider providing two distinct methods, str.codepoint_begin() and str.codeunit_begin(), for the two levels of access. But I eventually concluded that reading and comparing code units makes code far less portable, so there should be no official support for accessing code units. For example, if a developer writes a function that checks just the first UTF-8 code unit to determine whether the first code point belongs to the Basic Multilingual Plane, the same function would not work if he later decided to allow checking UTF-16 strings as well (see the sketch below).

The other problem with supporting read access to code units is that Boost.Ustr would then not be able to handle malformed encodings. Currently the default implementation of unicode_string_adapter returns the replacement character � on the fly when a malformed code unit is found during decoding, and it does not alter the original malformed raw string. Handling such malformed strings through code unit iterators would be troublesome unless Boost.Ustr left the error handling to the caller, or explicitly made a properly encoded copy of the raw string during construction, which would also slow things down.

That said, in the end I did add append_codeunit() and codeunit_begin() methods to the unicode_string_adapter_builder class, because I thought of the use case of reading encoded text from I/O and storing it directly into strings without decoding and re-encoding the text again. In that case it has to be carefully assumed that the encoding of the incoming text has already been determined, and even so I'm still worried that exposing this code unit output iterator will eventually introduce numerous bugs.
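To make the portability problem concrete, here is the hypothetical check described above, written twice. Neither function is part of Boost.Ustr; the names are made up for illustration:

    #include <cstddef>
    #include <boost/cstdint.hpp>

    // "Does the first code point lie inside the Basic Multilingual Plane?"
    // For well-formed UTF-8, only lead bytes 0xF0-0xF4 start 4-byte
    // sequences, i.e. code points above U+FFFF.
    inline bool first_cp_in_bmp_utf8(const unsigned char* units, std::size_t len)
    {
        return len != 0 && units[0] < 0xF0;
    }

    // The "same" check for UTF-16 must look for a high surrogate instead,
    // so the UTF-8 version above cannot be reused on UTF-16 code units.
    inline bool first_cp_in_bmp_utf16(const boost::uint16_t* units, std::size_t len)
    {
        return len != 0 && (units[0] < 0xD800 || units[0] > 0xDBFF);
    }

The UTF-8 version inspects lead bytes while the UTF-16 version has to look for surrogates, so code written against one set of code units silently breaks on the other. That is the kind of breakage that keeping code unit access unofficial is meant to discourage.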
Personally, I don't want a new clever string class. What I want is a few well-written building-blocks for Unicode. For example, I'd like to be able to iterate over the code points in a block of UTF-8 data in raw memory, so some sort of iterator adaptor is needed.
Boost.Ustr's design objective is not to provide a complete toolset for processing arbitrary Unicode data; if you are looking for features such as decoding Unicode from raw memory, I think Mathias' Boost.Unicode library already provides excellent support for that. Instead, Boost.Ustr's main objective is to let developers who don't care about encoding issues add encoding awareness to an existing string class. For example, the use case scenarios are:

- Here is a Unicode string with this given content. I don't care how it is encoded, but I want to pass this string to any Unicode-enabled function.
- I'd like to write a function that accepts a Unicode string. I don't care whether it is UTF-8 or UTF-16 encoded, but I want to know whether the decoded code point sequence of this string matches a certain pattern (see the sketch below).
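Here is a rough sketch of the second scenario. The header path, namespace and iterator typedef below are my shorthand and may not match the library exactly; the shape of the code is the point:

    #include <boost/ustr/unicode_string_adapter.hpp>   // assumed header path

    // The caller never needs to know whether StringT holds UTF-8 or UTF-16
    // code units; dereferencing the iterator yields decoded code points.
    template <typename StringT>
    bool contains_snowman(const boost::ustr::unicode_string_adapter<StringT>& str)
    {
        typedef typename boost::ustr::unicode_string_adapter<StringT>::const_iterator iterator;

        for (iterator it = str.begin(); it != str.end(); ++it) {
            if (*it == 0x2603) {        // U+2603 SNOWMAN
                return true;
            }
        }
        return false;
    }

The same function body works whether the wrapped string is UTF-8 or UTF-16 encoded, which is exactly the portability the adapter is meant to give.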
Your library does have this functionality, but it is hidden in an implementation detail. Please can you consider bringing out your core UTF encoding and decoding functions to the public interface?
My encoder/decoder functions are actually quite similar to Mathias' implementation (in fact I referred to his design before implementing my own). However, these function interfaces are specifically designed to fit the internal usage of Boost.Ustr, albeit I made them generic enough. The reasons I did not directly use/copy Mathias' implementation are that the interfaces are slightly different and I wanted to avoid obscure bugs, that the algorithm is simple enough to re-implement, and that I wanted to take this chance to learn the encoding algorithms (and I did learn something). :) But I'd agree that it shouldn't be hard to refactor the encoders and merge them with Mathias' implementation when the time comes. Currently I have no plans to build iterator adaptors on top of these encoding/decoding functions, and I think it would be a bit redundant, as Mathias has already gone through the mess of generating such functions with macros and template metaprogramming. ;)
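For reference, the kind of encoder function I am talking about is roughly along these lines. This is a simplified sketch that omits error handling for surrogates and out-of-range values, not a copy of either implementation:

    #include <boost/cstdint.hpp>

    // Simplified sketch of a UTF-8 encoding step: write the code units of
    // one code point to an output iterator. Error handling omitted.
    template <typename OutputIterator>
    void encode_utf8(boost::uint32_t cp, OutputIterator out)
    {
        if (cp < 0x80) {                    // 1 byte:  0xxxxxxx
            *out++ = static_cast<char>(cp);
        } else if (cp < 0x800) {            // 2 bytes: 110xxxxx 10xxxxxx
            *out++ = static_cast<char>(0xC0 | (cp >> 6));
            *out++ = static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {          // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
            *out++ = static_cast<char>(0xE0 | (cp >> 12));
            *out++ = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            *out++ = static_cast<char>(0x80 | (cp & 0x3F));
        } else {                            // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            *out++ = static_cast<char>(0xF0 | (cp >> 18));
            *out++ = static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            *out++ = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            *out++ = static_cast<char>(0x80 | (cp & 0x3F));
        }
    }

The decoder is the mirror image of this, and both are simple enough that merging with Mathias' versions later should be mostly a matter of adjusting the interfaces.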
I would also like to see some benchmarks for the core UTF conversion functions. If you post some benchmarks that decouple the UTF conversion from the rest of the string class, I will compare the performance with my own code.
At this time I am focusing on design issues rather than optimization, so I haven't thought much about benchmarks. I'd guess that the encoding/decoding speed is probably inferior to other encoder/decoder functions. You can see in my implementation that I did not use obscure hacks that shorten the code while remaining mathematically equivalent; instead I focused on readability first, so that even amateurs can read the code and easily learn how the encoding/decoding process works. So if you are writing a performance-critical application that encodes/decodes huge amounts of Unicode text, I'd say Boost.Ustr is probably not for you (yet).

Thanks for the feedback. Hope this answers your questions.

cheers,
Soares