Re: [boost] [string] proposal

24 Jan 2011


On 01/23/2011 06:16 PM, Dean Michael Berris wrote:
...
...
On 21 January 2011 06:07, Dean Michael Berris <mikhailberis@gmail.com> wrote:
...
4. Looks like a real STL container except the iterator type is smarter
than your average iterator.
On Sat, Jan 22, 2011 at 5:01 AM, Nevin Liber <nevin@eviloverlord.com> wrote:
What does "smarter" mean?
The way I was thinking about it, "smarter" would mean something along
the lines of "knows more than your average <thing>" where <thing> is a
bare iterator.
In the context of strings, I was thinking it should be able to know
what string it came from, what encoding the string is supposed to be
interpreted in, and whether there are special computations that an
iterator for a string might need. One example that comes to mind is
a tokenizing iterator which returns a string when dereferenced
and knows what the delimiters of the string are. To do that
correctly, your iterator would need to know which string it came from
and where in the string its internal "counter" was "parked" after
the last dereference.
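A minimal sketch of such a tokenizing iterator follows. The name
token_iterator and its interface are invented here for illustration and
are not part of any proposal in this thread. The iterator remembers its
source string, its delimiters, and where it is parked; a
default-constructed instance acts as the end sentinel.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical type: an iterator that knows its source string and delimiters.
class token_iterator {
public:
    // End sentinel: refers to no string.
    token_iterator() = default;

    // The source string must outlive the iterator.
    token_iterator(const std::string& source, std::string delimiters)
        : source_(&source), delimiters_(std::move(delimiters)) {
        park(0);
    }

    // Dereferencing yields the current token as a string.
    std::string operator*() const {
        assert(source_ != nullptr && "dereferencing the end iterator");
        return source_->substr(begin_, end_ - begin_);
    }

    token_iterator& operator++() {
        assert(source_ != nullptr && "incrementing the end iterator");
        park(end_);
        return *this;
    }

    friend bool operator==(const token_iterator& a, const token_iterator& b) {
        return a.source_ == b.source_ && a.begin_ == b.begin_;
    }
    friend bool operator!=(const token_iterator& a, const token_iterator& b) {
        return !(a == b);
    }

private:
    // Park on the next token at or after `from`; turn into the end
    // sentinel when no token is left.
    void park(std::size_t from) {
        begin_ = source_->find_first_not_of(delimiters_, from);
        if (begin_ == std::string::npos) {
            source_ = nullptr;
            begin_ = end_ = 0;
            return;
        }
        end_ = source_->find_first_of(delimiters_, begin_);
        if (end_ == std::string::npos) end_ = source_->size();
    }

    const std::string* source_ = nullptr;
    std::string delimiters_;
    std::size_t begin_ = 0;  // start of the current token
    std::size_t end_ = 0;    // one past the current token
};
```

Iterating from token_iterator(s, ",;") to token_iterator() visits each
non-empty token of s in order, skipping runs of delimiters.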
This would require that iterators be built externally from the string,
something like:

    auto it  = encoded<utf8_encoding>(original_string),
         end = encoded<utf8_encoding>();
I like that idea but was toying with a different paradigm: a template
argument similar to a locale, in that it would contain the information
needed to compare elements and to iterate over elements. If it made sense
to change things, an imbue-style mechanism for comparisons and iterators
could work.
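As a rough illustration of the locale-like template argument, the sketch
below parameterizes a string wrapper on a rules type that says how to
step through elements and how to compare them. All names here (text,
byte_rules, ascii_caseless_rules) are hypothetical, and the rules are
deliberately trivial.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical rules: one byte per element, exact comparison.
struct byte_rules {
    static std::size_t next(const std::string&, std::size_t pos) { return pos + 1; }
    static bool equal(char a, char b) { return a == b; }
};

// Hypothetical rules: one byte per element, ASCII case ignored.
struct ascii_caseless_rules {
    static std::size_t next(const std::string&, std::size_t pos) { return pos + 1; }
    static bool equal(char a, char b) {
        return std::tolower(static_cast<unsigned char>(a)) ==
               std::tolower(static_cast<unsigned char>(b));
    }
};

// The Rules argument plays the locale-like role: it carries everything
// needed to iterate over elements and to compare them.
template <class Rules>
class text {
public:
    explicit text(std::string s) : data_(std::move(s)) {}

    friend bool operator==(const text& a, const text& b) {
        std::size_t i = 0, j = 0;
        while (i < a.data_.size() && j < b.data_.size()) {
            if (!Rules::equal(a.data_[i], b.data_[j])) return false;
            i = Rules::next(a.data_, i);
            j = Rules::next(b.data_, j);
        }
        // Equal only if both strings were consumed completely.
        return i >= a.data_.size() && j >= b.data_.size();
    }

private:
    std::string data_;
};
```

With these rules, text<ascii_caseless_rules>("Boost") compares equal to
text<ascii_caseless_rules>("bOOST"), while the byte_rules versions do
not. An imbue-style variant would store the rules as a replaceable
object inside the string instead of fixing them at compile time.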
...
Here `it` could interpret the original string as UTF-8, and you could
assume that dereferencing this iterator returns an
appropriate (possibly variant) type that is convertible to the
appropriate holder (char, wchar_t, or uint32_t for UTF-32). From here
you can build ranges appropriately and deal with ranges, knowing
that the encoding is explicitly defined in the iterator.
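One way the encoded<utf8_encoding> pair might be realized is sketched
below. It is only an illustration under stated assumptions: the input is
taken to be well-formed UTF-8, dereferencing returns char32_t rather
than a variant type, and the decode interface on utf8_encoding is
invented here.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical encoding policy. Assumes well-formed UTF-8; a real
// decoder must detect and handle malformed sequences.
struct utf8_encoding {
    // Decodes the code point starting at s[pos]; reports its byte length.
    static char32_t decode(const std::string& s, std::size_t pos,
                           std::size_t& length) {
        const unsigned char lead = static_cast<unsigned char>(s[pos]);
        length = lead < 0x80 ? 1 : lead < 0xE0 ? 2 : lead < 0xF0 ? 3 : 4;
        assert(pos + length <= s.size() && "truncated sequence");
        // Keep the payload bits of the lead byte, then append six bits
        // from each continuation byte.
        char32_t cp = length == 1 ? lead : lead & (0xFF >> (length + 1));
        for (std::size_t i = 1; i < length; ++i)
            cp = (cp << 6) | (static_cast<unsigned char>(s[pos + i]) & 0x3F);
        return cp;
    }
};

// Iterator built externally from the string; the encoding is part of
// its type. A default-constructed instance is the end sentinel.
template <class Encoding>
class encoded {
public:
    encoded() = default;

    // The source string must outlive the iterator.
    explicit encoded(const std::string& s) : source_(s.empty() ? nullptr : &s) {}

    char32_t operator*() const {
        assert(source_ != nullptr && "dereferencing the end iterator");
        std::size_t length = 0;
        return Encoding::decode(*source_, pos_, length);
    }

    encoded& operator++() {
        assert(source_ != nullptr && "incrementing the end iterator");
        std::size_t length = 0;
        Encoding::decode(*source_, pos_, length);
        pos_ += length;
        if (pos_ >= source_->size()) {  // ran off the end: become the sentinel
            source_ = nullptr;
            pos_ = 0;
        }
        return *this;
    }

    friend bool operator==(const encoded& a, const encoded& b) {
        return a.source_ == b.source_ && a.pos_ == b.pos_;
    }
    friend bool operator!=(const encoded& a, const encoded& b) {
        return !(a == b);
    }

private:
    const std::string* source_ = nullptr;
    std::size_t pos_ = 0;
};
```

With these definitions the declaration quoted earlier compiles as
written, and a loop from it to end yields one code point per step
regardless of how many bytes each one occupies.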
I like the idea of segmenting an encoded string into ranges, where a
"character" would be a range capturing one or more of the underlying
encoding's characters and combining characters. It solves the problem
of what to return when dereferencing the iterators. Of course, you'd
have to be able to compare two (for example) UTF-16 ranges meaningfully
based on some locale, just as if a human who knew the symbols were
comparing the glyphs that would be drawn for each range. Another idea I
like, though, is that dereferencing an iterator would return one UCS
code point, and it would be up to a higher level of abstraction to fetch
the combining characters and form the final glyph. That way, any string
that encoded UCS, whether it was UTF-32, UTF-16, or UTF-8, could return
char32_t from dereferencing an iterator. I suspect that either or both
of these, as well as other variations, would at times be the better idea,
because the interpretation of the underlying code varies so much. Lots
of places share the same scripts but have quite different rules about
what to do with them, and how to combine or compare them. Beware of a
naive solution if the intent is to make a completely general solution.
I'm not even sure it's possible without taking a layered approach.
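To make the layering concrete, here is a deliberately naive sketch of
the higher level: it takes code points that are already decoded and
groups each base code point with the combining marks that follow it.
The helper names are invented, and the only marks recognized are those
in the Combining Diacritical Marks block (U+0300 to U+036F). Real
segmentation needs the Unicode character database and the rules of
Unicode Standard Annex #29.

```cpp
#include <string>
#include <vector>

// Simplifying assumption: only U+0300..U+036F counts as combining.
inline bool is_combining(char32_t cp) {
    return cp >= 0x0300 && cp <= 0x036F;
}

// Hypothetical helper: splits a sequence of code points into ranges,
// each holding one base code point plus its trailing combining marks.
inline std::vector<std::u32string> clusters(const std::u32string& code_points) {
    std::vector<std::u32string> result;
    for (char32_t cp : code_points) {
        // A non-combining code point starts a new range; so does a
        // combining mark that has no base before it.
        if (result.empty() || !is_combining(cp))
            result.emplace_back();
        result.back().push_back(cp);
    }
    return result;
}
```

Given 'e', U+0301, 'a', the function returns two ranges: the first holds
'e' with its combining acute accent and the second holds 'a'. Comparing
such ranges in a locale-aware way would be yet another layer on top.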
Patrick

Patrick Horgan