
Hi Phil,

On Aug 9, 2011, Phil Endecott wrote:
I think there are probably as many ways to implement a "better" string as there are potential users, and previous long discussions here have considered those possibilities at great length. In summary your proposal is for a string that is:
- Immutable.
- Reference counted.
- Iterated by default over unicode code points.
I think you misunderstood my point. Boost.Ustr does not attempt to redesign yet another string class. Instead, it wraps an existing string class supplied through a template parameter and relies on that string class for the actual container operations. The immutability of the string adapter is achieved by holding a smart pointer to the const version of the raw string.
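To make that concrete, here is a minimal sketch of the wrapping approach; the class and member names are simplified for illustration and are not the real Boost.Ustr interface:

    #include <boost/shared_ptr.hpp>
    #include <string>

    // Minimal illustrative sketch only; not the actual Boost.Ustr interface.
    template <typename StringT = std::string>
    class string_adapter_sketch
    {
    public:
        // Wrap a copy of an existing raw string; the adapter never mutates it.
        explicit string_adapter_sketch(const StringT& raw)
            : _str(new StringT(raw))
        { }

        // Delegate actual container operations to the wrapped string class.
        typename StringT::size_type size() const { return _str->size(); }

        // Expose the raw string for functionality the adapter does not wrap.
        const StringT& impl() const { return *_str; }

    private:
        // Immutability: a smart pointer to the const version of the raw string.
        boost::shared_ptr<const StringT> _str;
    };

The real adapter layers the Unicode-aware operations on top of this; the point is simply that the raw string class keeps doing the container work.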
- Provides access to the code units via operator* and operator->, i.e.
  s.begin()   // Returns a code point iterator.
  s->begin()  // Returns a code unit iterator.
I won't comment about the merits or otherwise of those points, apart from the last, where I'll note that it is not to my taste. It looks like it's "over clever". Imagine that I wrote some code using your library, and then a colleague who was not familiar with it had to look at it later. Would they have any idea about the difference between those two cases? No, not unless I added a comment every time I used it. Please let's have an obvious syntax like:
  s.begin()        // Code points.
  s.impl.begin()   // Code units.
or
  s.units_begin()  // Code units.
The actual intention of operator->() is not to provide access to a code unit iterator; it is meant to let programmers access raw string functionality that unicode_string_adapter cannot provide itself. After all, unicode_string_adapter is a wrapper/decorator around the raw string class, so it is supposed to add, not subtract, functionality relative to the original string class. The availability of str->begin() is really a side effect of enabling access to the methods of the raw string class. I'm sorry that the example in the documentation probably obscured the intention of operator->(). The documentation does say that calling str->begin() is strongly discouraged, but that counterexample was a poor choice. I'll change the usage example to str->c_str() to illustrate the usefulness of accessing raw string methods in some cases.

Back to your question about a method for accessing a code unit iterator: I did initially consider providing two distinct methods, str.codepoint_begin() and str.codeunit_begin(), for the two levels of access. But I eventually concluded that reading and comparing code units makes code far less portable, so there should be no official support for accessing code units. For example, if a developer writes a function that checks just the first UTF-8 code unit to determine whether the first code point belongs to the Basic Multilingual Plane, the same function would not work if he later decided to allow checking UTF-16 strings as well (see the sketch below).

The other problem with supporting read access to code units is that Boost.Ustr would then not be able to handle malformed encodings. Currently the default implementation of unicode_string_adapter returns the replacement character � on the fly when a malformed code unit is found during decoding, and it does not alter the original malformed raw string. Handling such malformed strings through code unit iterators would be troublesome unless Boost.Ustr left the error handling to the caller, or explicitly made a properly encoded copy of the raw string during construction, which would also slow things down.

That said, in the end I did add append_codeunit() and codeunit_begin() methods to the unicode_string_adapter_builder class, because I thought of the use case of reading encoded text from I/O and storing it directly into strings without decoding and re-encoding the text again. In that case it has to be carefully assumed that the encoding of the incoming text has already been determined, and even so I'm still worried that exposing this code unit output iterator will eventually introduce numerous bugs.
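To make the portability problem concrete, here is the hypothetical check described above, written twice. Neither function is part of Boost.Ustr; the names are made up for illustration:

    #include <cstddef>
    #include <boost/cstdint.hpp>

    // "Does the first code point lie inside the Basic Multilingual Plane?"
    // For well-formed UTF-8, only lead bytes 0xF0-0xF4 start 4-byte
    // sequences, i.e. code points above U+FFFF.
    inline bool first_cp_in_bmp_utf8(const unsigned char* units, std::size_t len)
    {
        return len != 0 && units[0] < 0xF0;
    }

    // The "same" check for UTF-16 must look for a high surrogate instead,
    // so the UTF-8 version above cannot be reused on UTF-16 code units.
    inline bool first_cp_in_bmp_utf16(const boost::uint16_t* units, std::size_t len)
    {
        return len != 0 && (units[0] < 0xD800 || units[0] > 0xDBFF);
    }

The UTF-8 version inspects lead bytes while the UTF-16 version has to look for surrogates, so code written against one set of code units silently breaks on the other. That is the kind of breakage that keeping code unit access unofficial is meant to discourage.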
Personally, I don't want a new clever string class. What I want is a few well-written building-blocks for Unicode. For example, I'd like to be able to iterate over the code points in a block of UTF-8 data in raw memory, so some sort of iterator adaptor is needed.
Boost.Ustr's design objective is not to provide a complete toolset for processing arbitrary Unicode data; if you are looking for features such as decoding Unicode from raw memory, I think Mathias' Boost.Unicode library already provides excellent support for that. Instead, Boost.Ustr's main objective is to let developers who don't care about encoding issues add encoding awareness to an existing string class. For example, the use case scenarios are:

- Here is a Unicode string with this given content. I don't care how it is encoded, but I want to pass this string to any Unicode-enabled function.
- I'd like to write a function that accepts a Unicode string. I don't care whether it is UTF-8 or UTF-16 encoded, but I want to know whether the decoded code point sequence of this string matches a certain pattern (see the sketch below).
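Here is a rough sketch of the second scenario. The header path, namespace and iterator typedef below are my shorthand and may not match the library exactly; the shape of the code is the point:

    #include <boost/ustr/unicode_string_adapter.hpp>   // assumed header path

    // The caller never needs to know whether StringT holds UTF-8 or UTF-16
    // code units; dereferencing the iterator yields decoded code points.
    template <typename StringT>
    bool contains_snowman(const boost::ustr::unicode_string_adapter<StringT>& str)
    {
        typedef typename boost::ustr::unicode_string_adapter<StringT>::const_iterator iterator;

        for (iterator it = str.begin(); it != str.end(); ++it) {
            if (*it == 0x2603) {        // U+2603 SNOWMAN
                return true;
            }
        }
        return false;
    }

The same function body works whether the wrapped string is UTF-8 or UTF-16 encoded, which is exactly the portability the adapter is meant to give.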
Your library does have this functionality, but it is hidden in an implementation detail. Please can you consider bringing out your core UTF encoding and decoding functions to the public interface?
My encoder/decoder functions are actually quite similar to Mathias' implementation (in fact I referred to his design before implementing my own). However, these function interfaces are specifically designed to fit the internal usage of Boost.Ustr, albeit I made them generic enough. The reasons I did not directly use/copy Mathias' implementation are that the interfaces are slightly different and I wanted to avoid obscure bugs, that the algorithm is simple enough to re-implement, and that I wanted to take this chance to learn the encoding algorithms (and I did learn something). :) But I'd agree that it shouldn't be hard to refactor the encoders and merge them with Mathias' implementation when the time comes. Currently I have no plans to build iterator adaptors on top of these encoding/decoding functions, and I think it would be a bit redundant, as Mathias has already gone through the mess of generating such functions with macros and template metaprogramming. ;)
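For reference, the kind of encoder function I am talking about is roughly along these lines. This is a simplified sketch that omits error handling for surrogates and out-of-range values, not a copy of either implementation:

    #include <boost/cstdint.hpp>

    // Simplified sketch of a UTF-8 encoding step: write the code units of
    // one code point to an output iterator. Error handling omitted.
    template <typename OutputIterator>
    void encode_utf8(boost::uint32_t cp, OutputIterator out)
    {
        if (cp < 0x80) {                    // 1 byte:  0xxxxxxx
            *out++ = static_cast<char>(cp);
        } else if (cp < 0x800) {            // 2 bytes: 110xxxxx 10xxxxxx
            *out++ = static_cast<char>(0xC0 | (cp >> 6));
            *out++ = static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {          // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
            *out++ = static_cast<char>(0xE0 | (cp >> 12));
            *out++ = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            *out++ = static_cast<char>(0x80 | (cp & 0x3F));
        } else {                            // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            *out++ = static_cast<char>(0xF0 | (cp >> 18));
            *out++ = static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            *out++ = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            *out++ = static_cast<char>(0x80 | (cp & 0x3F));
        }
    }

The decoder is the mirror image of this, and both are simple enough that merging with Mathias' versions later should be mostly a matter of adjusting the interfaces.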
I would also like to see some benchmarks for the core UTF conversion functions. If you post some benchmarks that decouple the UTF conversion from the rest of the string class, I will compare the performance with my own code.
At this time I am focusing on design issues rather than optimization, so I haven't thought much about benchmarks. I'd guess that the encoding/decoding speed is probably inferior to other encoder/decoder functions. You can see in my implementation that I did not use obscure hacks that shorten the code while remaining mathematically equivalent; instead I focused on readability first, so that even amateurs can read the code and easily learn how the encoding/decoding process works. So if you are writing a performance-critical application that encodes/decodes huge amounts of Unicode text, I'd say Boost.Ustr is probably not for you (yet).

Thanks for the feedback. Hope this answers your questions.

cheers,
Soares