
Eric Niebler wrote:
Phil Endecott wrote:
Rogier van Dalen wrote:
On Sat, Jul 18, 2009 at 15:20, Phil Endecott wrote:
The idea of an "iterator that knows where its end is" is something that comes up fairly often; do the Range experts have any comments about it?
I think that in this case an iterator that can be incremented and dereferenced in some limited way beyond its end would be sufficient. For example, a std::string normally has a 0 byte beyond its end so that c_str() can work, so it is safe (for some value of safe!) to keep advancing a std::string::iterator until a 0 is seen, without looking for end(). A UTF-8 decoding algorithm that processes multi-byte characters by continuing until the top bit is not set would safely terminate in this case.
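For illustration only, here is a minimal sketch of that style of loop (the helper name is invented, not anything from the thread). It uses the usual 10xxxxxx test for continuation bytes; the terminating 0 byte fails that test as well, so the loop stops even on truncated or malformed input without ever comparing against end():

    // Advance past one UTF-8 encoded character, assuming a 0 byte
    // follows the data (as with c_str()-backed storage).
    inline const char* skip_one_utf8_char(const char* p)
    {
        ++p;  // past the lead byte
        while ((static_cast<unsigned char>(*p) & 0xC0) == 0x80)
            ++p;  // past each 10xxxxxx continuation byte
        return p; // at the next lead byte, or at the terminating 0
    }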
(I don't think I'm a Range expert.) I think there are problems with this example. Adding '\0' at the end is not mandated by the standard, right?
My recollection is that the standard makes it hard to implement a std::string that does not have a 0 after the last element, or that has non-contiguous storage.
I believe contiguous storage for std::string is only guaranteed as of C++0x; C++03 requires it for std::vector but not for std::string. The null-termination is another thing. The only guarantee is that the char* returned by c_str() is required to be null terminated. No such guarantee is made for the sequence traversed by std::string's iterators. Dereferencing the end iterator is verboten.
Here is an outline of a zero-overhead wrapper for std::string that I hope guarantees that *end() == 0:

    struct string_with_zero_beyond_end : std::string {
        typedef const char* const_iterator;
        const_iterator begin() const { return c_str(); }
        const_iterator end() const { return c_str() + length(); }
    };
These are assumptions that I would probably be happy to accept in my own internal-use code, but you are right to say that they are probably not appropriate for library code. For library code you would need to wrap the container with something that guarantees this behaviour in a more solid way. Out of interest, does anyone know if a std::vector that has been reserve()d guarantees anything about dereferencing beyond-the-end iterators? It would be great if they were allowed to be undefined yet certain not to segfault.
Undefined behavior. Don't do it.
Something similar is possible.
Also, '\0' could occur in the middle of the sequence.
That doesn't cause a problem for this application.
But not in general.
We're not concerned with "in general", we're concerned with this particular problem of traversing bytes of UTF-8 efficiently with the possibility of malformed (e.g. malicious) input, without having to constantly check for end-of-input.
For iterators that don't offer this sort of behaviour you can provide a wrapper that knows where end is and returns a sentinel 0 in that case.
Wouldn't this end up requiring two if-statements? In general, a sentinel which is in the valid range of the value type would be an awkward sentinel.
No, the point is that it doesn't need any extra if statements at all. The code is already looking for a top-bit-clear byte to indicate the end of the multibyte character, and the 0 byte does that.
However, I can see where you're coming from. Being able to tell from an iterator whether it's at the end of its range is often useful. Is operator*() the right place to implement this functionality? Wouldn't a free function is_at_end(Iterator) make more sense?
Then you do have the extra if statement.
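For concreteness, a rough sketch of the wrapper idea from above (names invented, and a char value type assumed): the single end check lives inside operator*(), which yields the 0 sentinel at the end position, so the decode loop itself stays free of explicit end-of-input tests:

    template <class Iterator>
    struct zero_sentinel_iterator
    {
        Iterator pos;
        Iterator end;

        // Yields the 0 sentinel once the underlying iterator reaches end,
        // instead of dereferencing past the valid range.
        char operator*() const { return pos == end ? '\0' : *pos; }

        zero_sentinel_iterator& operator++() { ++pos; return *this; }
    };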
If you write a generic algorithm that takes a ForwardRange or two ForwardIterators, you're not allowed to assume that the sequence is terminated with a sentinel. Why? Because that's not part of the ForwardRange/ForwardIterator concept.
You would be allowed to define a refinement, say a ContiguousNullTerminatedIterator concept, such that &*it is required to point to a contiguous block of memory that is terminated with a sentinel at the position specified by the end iterator. You could also define a trait to detect conformance to the concept. Then you could use this trait to dispatch to an algorithm optimized for that case.
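To make that concrete, here is a rough C++03-style sketch of what such a trait plus dispatch might look like (all names are invented for illustration, not a real Boost interface):

    // Trait: true when the iterator is known to walk contiguous storage
    // that carries a 0 sentinel at the position named by the end iterator.
    template <class Iterator>
    struct is_contiguous_null_terminated { static const bool value = false; };

    // e.g. the const char* iterators of the string wrapper sketched
    // earlier could opt in:
    template <>
    struct is_contiguous_null_terminated<const char*> { static const bool value = true; };

    // Tags for compile-time dispatch.
    struct sentinel_tag {};
    struct checked_tag {};

    template <bool B> struct select_tag       { typedef checked_tag type; };
    template <>       struct select_tag<true> { typedef sentinel_tag type; };

    template <class Iterator>
    void decode_utf8_impl(Iterator first, Iterator last, sentinel_tag)
    {
        // fast path: rely on the 0 sentinel, no per-byte end check
    }

    template <class Iterator>
    void decode_utf8_impl(Iterator first, Iterator last, checked_tag)
    {
        // general path: compare against 'last' for every byte
    }

    // Generic front end picks the implementation from the trait.
    template <class Iterator>
    void decode_utf8(Iterator first, Iterator last)
    {
        typedef typename select_tag<
            is_contiguous_null_terminated<Iterator>::value>::type tag;
        decode_utf8_impl(first, last, tag());
    }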
I had something that detects contiguous storage in order to do the trick of casting to an int and checking the top bits of 4 or 8 bytes simultaneously.
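For reference, the sort of thing being described might look like this (illustrative only; it uses memcpy rather than a raw pointer cast so as not to run into alignment or aliasing trouble):

    #include <cstring>
    #include <stdint.h>

    // On contiguous storage, test the top bits of four bytes at once:
    // if none is set, the whole chunk is plain ASCII and can be copied
    // or skipped without decoding byte by byte.
    inline bool next_four_bytes_are_ascii(const char* p)
    {
        uint32_t word;
        std::memcpy(&word, p, sizeof word);
        return (word & 0x80808080u) == 0;
    }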
This seems like a lot of work for little gain to me, though. I'd like to see performance numbers to justify a design like that.
Cory Nelson reported that Windows 7 can decode UTF-8 3 times faster than his code. I don't want Boost to end up with UTF-8 features that are all lovely and generic and full of concepts but that run 3 times slower on common types (strings, vector<char>, pointers) than they could if they were less generic.

Phil.