
Eric Niebler wrote:
Phil Endecott wrote:
Rogier van Dalen wrote:
On Sat, Jul 18, 2009 at 15:20, Phil Endecott wrote:
The idea of an "iterator that knows where its end is" is something that comes up fairly often; do the Range experts have any comments about it?
I think that in this case an iterator that can be incremented and dereferenced in some limited way beyond its end would be sufficient. For example, a std::string normally has a 0 byte beyond its end so that c_str() can work, so it is safe (for some value of safe!) to keep advancing a std::string::iterator until a 0 is seen, without looking for end(). A UTF-8 decoding algorithm that processes multi-byte characters by continuing until the top bit is not set would safely terminate in this case.
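For illustration only, here is a minimal sketch of that style of loop (the helper name is invented, not anything from the thread). It uses the usual 10xxxxxx test for continuation bytes; the terminating 0 byte fails that test as well, so the loop stops even on truncated or malformed input without ever comparing against end():

    // Advance past one UTF-8 encoded character, assuming a 0 byte
    // follows the data (as with c_str()-backed storage).
    inline const char* skip_one_utf8_char(const char* p)
    {
        ++p;  // past the lead byte
        while ((static_cast<unsigned char>(*p) & 0xC0) == 0x80)
            ++p;  // past each 10xxxxxx continuation byte
        return p; // at the next lead byte, or at the terminating 0
    }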
(I don't think I'm a Range expert.) I think there are problems with this example. Adding '\0' at the end is not mandated by the standard, right?
My recollection is that the standard makes it hard to implement a std::string that does not have a 0 after the last element, or that has non-contiguous storage.
I believe contiguous storage for std::string is only guaranteed as of C++0x; C++03 requires it for std::vector but not for std::string. The null-termination is another thing. The only guarantee is that the char* returned by c_str() is required to be null terminated. No such guarantee is made for the sequence traversed by std::string's iterators. Dereferencing the end iterator is verboten.
Here is an outline of a zero-overhead wrapper for std::string that I hope guarantees that *end() == 0:

    struct string_with_zero_beyond_end : std::string {
        typedef const char* const_iterator;
        const_iterator begin() const { return c_str(); }
        const_iterator end() const { return c_str() + length(); }
    };
These are assumptions that I would probably be happy to accept in my own internal-use code, but you are right to say that they are probably not appropriate for library code. For library code you would need to wrap the container with something that guarantees this behaviour in a more solid way. Out of interest, does anyone know if a std::vector that has been reserve()d guarantees anything about dereferencing beyond-the-end iterators? It would be great if they were allowed to be undefined yet certain not to segfault.
Undefined behavior. Don't do it.
Something similar is possible.
Also, '\0' could occur in the middle of the sequence.
That doesn't cause a problem for this application.
But not in general.
We're not concerned with "in general", we're concerned with this particular problem of traversing bytes of UTF-8 efficiently with the possibility of malformed (e.g. malicious) input, without having to constantly check for end-of-input.
For iterators that don't offer this sort of behaviour you can provide a wrapper that knows where end is and returns a sentinel 0 in that case.
Wouldn't this end up requiring two if-statements? In general, a sentinel which is in the valid range of the value type would be an awkward sentinel.
No, the point is that it doesn't need any extra if statements at all. The code is already looking for a top-bit-clear byte to indicate the end of the multibyte character, and the 0 byte does that.
However, I can see where you're coming from. Being able to tell from an iterator whether it's at the end of its range is often useful. Is operator*() the right place to implement this functionality? Wouldn't a free function is_at_end(Iterator) make more sense?
Then you do have the extra if statement.
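For concreteness, a rough sketch of the wrapper idea from above (names invented, and a char value type assumed): the single end check lives inside operator*(), which yields the 0 sentinel at the end position, so the decode loop itself stays free of explicit end-of-input tests:

    template <class Iterator>
    struct zero_sentinel_iterator
    {
        Iterator pos;
        Iterator end;

        // Yields the 0 sentinel once the underlying iterator reaches end,
        // instead of dereferencing past the valid range.
        char operator*() const { return pos == end ? '\0' : *pos; }

        zero_sentinel_iterator& operator++() { ++pos; return *this; }
    };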
If you write a generic algorithm that takes a ForwardRange or two ForwardIterators, you're not allowed to assume that the sequence is terminated with a sentinel. Why? Because that's not part of the ForwardRange/ForwardIterator concept.
You would be allowed to define a refinement, say a ContiguousNullTerminatedIterator concept, such that &*it is required to point to a contiguous block of memory that is terminated with a sentinel at the position specified by the end iterator. You could also define a trait to detect conformance to the concept. Then you could use this trait to dispatch to an algorithm optimized for that case.
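To make that concrete, here is a rough C++03-style sketch of what such a trait plus dispatch might look like (all names are invented for illustration, not a real Boost interface):

    // Trait: true when the iterator is known to walk contiguous storage
    // that carries a 0 sentinel at the position named by the end iterator.
    template <class Iterator>
    struct is_contiguous_null_terminated { static const bool value = false; };

    // e.g. the const char* iterators of the string wrapper sketched
    // earlier could opt in:
    template <>
    struct is_contiguous_null_terminated<const char*> { static const bool value = true; };

    // Tags for compile-time dispatch.
    struct sentinel_tag {};
    struct checked_tag {};

    template <bool B> struct select_tag       { typedef checked_tag type; };
    template <>       struct select_tag<true> { typedef sentinel_tag type; };

    template <class Iterator>
    void decode_utf8_impl(Iterator first, Iterator last, sentinel_tag)
    {
        // fast path: rely on the 0 sentinel, no per-byte end check
    }

    template <class Iterator>
    void decode_utf8_impl(Iterator first, Iterator last, checked_tag)
    {
        // general path: compare against 'last' for every byte
    }

    // Generic front end picks the implementation from the trait.
    template <class Iterator>
    void decode_utf8(Iterator first, Iterator last)
    {
        typedef typename select_tag<
            is_contiguous_null_terminated<Iterator>::value>::type tag;
        decode_utf8_impl(first, last, tag());
    }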
I had something that detects contiguous storage in order to do the trick of casting to an int and checking the top bits of 4 or 8 bytes simultaneously.
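For reference, the sort of thing being described might look like this (illustrative only; it uses memcpy rather than a raw pointer cast so as not to run into alignment or aliasing trouble):

    #include <cstring>
    #include <stdint.h>

    // On contiguous storage, test the top bits of four bytes at once:
    // if none is set, the whole chunk is plain ASCII and can be copied
    // or skipped without decoding byte by byte.
    inline bool next_four_bytes_are_ascii(const char* p)
    {
        uint32_t word;
        std::memcpy(&word, p, sizeof word);
        return (word & 0x80808080u) == 0;
    }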
This seems like a lot of work for little gain to me, though. I'd like to see performance numbers to justify a design like that.
Cory Nelson reported that Windows 7 can decode UTF-8 3 times faster than his code. I don't want Boost to end up with UTF-8 features that are all lovely and generic and full of concepts but that run 3 times slower on common types (strings, vector<char>, pointers) than they could if they were less generic.

Phil.