Re: [boost] [gsoc] Request Feedback for Boost.Ustr Unicode String Adapter

10 Aug 2011

      Soares Chen Ruo Fei wrote:
...
Hi Phil,
On Aug 9, 2011, Phil Endecott wrote:
...
I think there are probably as many ways to implement a "better" string as
there are potential users, and previous long discussions here have
considered those possibilities at great length.  In summary your proposal is
for a string that is:
- Immutable.
- Reference counted.
- Iterated by default over unicode code points.
I think you misunderstood my point.
No, I believe I understand what you are doing.
...
Boost.Ustr does not attempt to
redesign another string class to begin with. Instead it wraps an existing
string class that is provided through the template parameter and relies
on that string class for actual container operations.
No, because:
...
The immutability
of the string adapter is actually achieved by holding a smart pointer
to the const version of the raw string.
If you were just wrapping an existing string class, you wouldn't do 
that; you'd just wrap the existing string class.  By adding this extra 
bit, you're making a string that is immutable, copy-on-write and 
reference counted - whether or not the underlying string is.
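
To make that concrete, here is a minimal sketch of what "a smart pointer
to the const version of the raw string" implies. The class name
immutable_adapter and its members are hypothetical and are not taken from
Boost.Ustr; std::shared_ptr merely stands in for whatever smart pointer
the library really uses. Copies share one buffer and any "modification"
has to build a new string, regardless of what StringT does internally.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Hypothetical sketch, not Boost.Ustr's actual code: every copy of the
// adapter shares one const instance of the wrapped string.
template <typename StringT>
class immutable_adapter {
public:
    explicit immutable_adapter(StringT s)
        : data_(std::make_shared<StringT>(std::move(s))) {}

    // Read-only access to the wrapped string.
    const StringT& base() const { return *data_; }

    // Number of adapter objects currently sharing the buffer.
    long use_count() const { return data_.use_count(); }

    // "Mutation" builds a new adapter; existing copies are unaffected.
    // Assumes StringT supports operator+.
    immutable_adapter append(const StringT& tail) const {
        return immutable_adapter(*data_ + tail);
    }

private:
    std::shared_ptr<const StringT> data_;  // shared, never modified in place
};

// Usage: copying shares the buffer, appending leaves the original alone.
inline void sharing_example() {
    immutable_adapter<std::string> a(std::string("abc"));
    immutable_adapter<std::string> b = a;
    assert(&a.base() == &b.base());
    assert(a.append("def").base() == "abcdef");
    assert(a.base() == "abc");
}
```

The sharing and the immutability come from the adapter alone, which is
why the result behaves like a new kind of string rather than a thin
wrapper.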
...
...
- Provides access to the code units via operator* and operator->, i.e.
    s.begin()  // Returns a code point iterator.
    s->begin() // Returns a code unit iterator.
I won't comment about the merits or otherwise of those points, apart from
the last, where I'll note that it is not to my taste.  It looks like it's
"over clever".  Imagine that I wrote some code using your library, and then
a colleague who was not familiar with it had to look at it later.  Would
they have any idea about the difference between those two cases?  No, not
unless I added a comment every time I used it.  Please let's have an obvious
syntax like:
    s.begin()        // Code points.
    s.impl.begin()   // Code units.
 or s.units_begin()  // Code units.
The actual intention of operator->() is not to provide access to a
code unit iterator; instead it lets programmers access raw string
functionality that unicode_string_adapter is not able to provide.
Whatever.  The point is that you have this operator* and operator-> 
overload whose purpose is non-obvious to someone looking at code that 
uses it.  What is your rationale for doing that, rather than providing 
e.g. an impl() or base() or similar accessor?  Can you give examples of 
any precedents for this usage?  What names or syntax do other 
wrapper/adaptor/facade implementations use?
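
For comparison, a minimal sketch of the named-accessor style. The class
named_access_adapter and the members base(), units_begin() and
units_end() are hypothetical names chosen for illustration; none of them
is taken from Boost.Ustr. The one standard-library fact used is that
std::reverse_iterator exposes its wrapped iterator through a member
called base().

```cpp
#include <cassert>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical adapter: the names below are illustrative only.
template <typename StringT>
class named_access_adapter {
public:
    typedef typename StringT::const_iterator unit_iterator;

    explicit named_access_adapter(const StringT& s) : raw_(s) {}

    // The wrapped string, spelled out by name at the call site.
    const StringT& base() const { return raw_; }

    // Code unit iteration, again by name.
    unit_iterator units_begin() const { return raw_.begin(); }
    unit_iterator units_end() const { return raw_.end(); }

private:
    StringT raw_;
};

// The standard library uses the same name: reverse_iterator::base()
// returns the underlying iterator, and rbegin().base() equals end().
inline bool reverse_base_is_end(const std::vector<int>& v) {
    return v.rbegin().base() == v.end();
}

// Usage: a reader can tell what is being iterated without a comment.
inline void named_access_example() {
    named_access_adapter<std::string> s(std::string("abc"));
    assert(std::distance(s.units_begin(), s.units_end()) == 3);
    assert(s.base().size() == 3);
}
```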
...
...
Your library does have [raw UTF encoding and decoding functions],
but it is hidden in an implementation detail.  Please can you
consider bringing out your core UTF encoding and decoding functions to the
public interface?
My encoder/decoder functions are actually quite similar to Mathias'
implementation. (in fact I referred to his design before implementing
my own) However these function interfaces are specifically designed to
fit the internal usage of Boost.Ustr, albeit I made them generic
enough. The reason I did not directly use/copy Mathias' implementation
is because the interfaces are slightly different and I wanted to avoid
obscure bugs, and because the algorithm is simple enough to
re-implement, and also because I wanted to take this chance to learn
the encoding algorithms (and I did learn something). :) But I'd agree
that it shouldn't be hard to refactor the encoders and merge with
Mathias' implementation when the time comes.
Currently I do not have plans to make iterator adapters on top of these
encoding/decoding functions, and I think it is also a bit redundant as
Mathias has already gone through the mess of generating these
functions using macros and template metaprogramming. ;)
Well I don't really care who does it, but I think we should have these 
UTF encoding and decoding functions somewhere in Boost that is not an 
implementation detail of some other library.
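
As a rough sketch of the kind of free functions meant here: the names
encode_utf8 and decode_utf8 and their signatures are hypothetical and are
not an existing Boost interface. The sketch assumes valid code points on
encoding and well-formed input on decoding, and does no validation.

```cpp
#include <cassert>
#include <cstdint>
#include <iterator>
#include <string>

// Hypothetical free function: writes the UTF-8 encoding of one code point
// to `out` and returns the advanced iterator. Assumes cp is a valid
// Unicode scalar value (at most 0x10FFFF and not a surrogate).
template <typename OutputIt>
OutputIt encode_utf8(std::uint32_t cp, OutputIt out) {
    if (cp < 0x80) {
        *out++ = static_cast<char>(cp);
    } else if (cp < 0x800) {
        *out++ = static_cast<char>(0xC0 | (cp >> 6));
        *out++ = static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        *out++ = static_cast<char>(0xE0 | (cp >> 12));
        *out++ = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        *out++ = static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        *out++ = static_cast<char>(0xF0 | (cp >> 18));
        *out++ = static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        *out++ = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        *out++ = static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Hypothetical counterpart: reads one code point from [it, end) and
// advances `it`. Requires it != end and well-formed UTF-8; no validation.
template <typename InputIt>
std::uint32_t decode_utf8(InputIt& it, InputIt end) {
    std::uint32_t lead = static_cast<unsigned char>(*it++);
    if (lead < 0x80) return lead;
    // Number of continuation bytes implied by the lead byte.
    int extra = (lead >= 0xF0) ? 3 : (lead >= 0xE0) ? 2 : 1;
    std::uint32_t cp = lead & (0x3Fu >> extra);  // payload bits of the lead
    for (; extra > 0 && it != end; --extra) {
        cp = (cp << 6) | (static_cast<unsigned char>(*it++) & 0x3F);
    }
    return cp;
}

// Usage: round-trip U+20AC through a std::string.
inline void round_trip_example() {
    std::string s;
    encode_utf8(0x20AC, std::back_inserter(s));
    std::string::const_iterator it = s.begin(), end = s.end();
    assert(decode_utf8(it, end) == 0x20AC);
    assert(it == end);
}
```

Functions of this shape work on any iterator pair, so they can be
benchmarked and reused without going through a string class.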
...
...
I would also like to see some benchmarks for the core UTF conversion
functions.  If you post some benchmarks that decouple the UTF conversion
from the rest of the string class, I will compare the performance with my
own code.
At this time I am focusing on design issues rather than optimizations,
so I didn't think much about benchmarks. I'd guess that the
encoding/decoding speed is probably inferior to other encoder/decoder
functions. You can see in my implementation that I did not use
obscure hacks that shorten the code while remaining mathematically
equivalent. Instead I focused on readability first so that even amateurs
can read the code and easily learn how the encoding/decoding process
works. So if you are writing a performance-critical application that
encodes/decodes huge amounts of Unicode text, I'd say that Boost.Ustr is
probably not for you (yet).
OK, it's not for me, that's a shame.  Maybe if you're lucky someone who 
DOES want this functionality will now post a reply to your request for comments...

Regards,  Phil.