
On 01/23/2011 06:16 PM, Dean Michael Berris wrote:
On Sat, Jan 22, 2011 at 5:01 AM, Nevin Liber <nevin@eviloverlord.com> wrote:
On 21 January 2011 06:07, Dean Michael Berris <mikhailberis@gmail.com> wrote:
4. Looks like a real STL container except the iterator type is smarter than your average iterator.
What does "smarter" mean?
The way I was thinking about it, "smarter" would mean something along the lines of "knows more than your average <thing>" where <thing> is a bare iterator.
In the context of strings, I was thinking it should be able to know what string it came from, what encoding the string is supposed to be interpreted in, or whether there are special computations that an iterator for a string might need. One example that comes to mind is a tokenizing iterator which returns a string when dereferenced and knows what the delimiters of the string are -- to do that correctly your iterator would need to know which string it came from and where in the string its internal "counter" was "parked" after the last dereference.
This would require that iterators be built externally from the string, something like:
    auto it = encoded<utf8_encoding>(original_string), end = encoded<utf8_encoding>();

I like that idea, but I was toying with a different paradigm: a template argument, similar to a locale, that would contain the information needed to compare elements and to iterate over them. If it made sense to change things, an imbue-style idea for comparisons and iterators could work.
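To make the locale-like template argument a bit more concrete, here is a rough sketch of what I have in mind. Every name in it (basic_text, byte_exact, ascii_case_insensitive, next, equal) is made up for illustration and is not taken from any existing library, and it only handles single-byte elements, so treat it as a shape rather than a design.

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical policy: compares elements byte for byte.
struct byte_exact {
    // Position of the element that follows the one starting at pos.
    static std::size_t next(const std::string&, std::size_t pos) { return pos + 1; }
    static bool equal(char a, char b) { return a == b; }
};

// Hypothetical policy: same iteration rule, but ASCII letters compare
// without regard to case.
struct ascii_case_insensitive {
    static std::size_t next(const std::string&, std::size_t pos) { return pos + 1; }
    static char fold(char c) {
        return (c >= 'A' && c <= 'Z') ? static_cast<char>(c - 'A' + 'a') : c;
    }
    static bool equal(char a, char b) { return fold(a) == fold(b); }
};

// Hypothetical string type: the Traits argument supplies both the
// iteration rule and the comparison rule, much as a locale would.
template <class Traits>
class basic_text {
public:
    explicit basic_text(std::string s) : data_(std::move(s)) {}

    bool equals(const basic_text& other) const {
        std::size_t i = 0, j = 0;
        while (i < data_.size() && j < other.data_.size()) {
            if (!Traits::equal(data_[i], other.data_[j])) return false;
            i = Traits::next(data_, i);
            j = Traits::next(other.data_, j);
        }
        // Equal only if both strings were consumed completely.
        return i == data_.size() && j == other.data_.size();
    }

private:
    std::string data_;
};
```

With this, basic_text<ascii_case_insensitive>("Hello").equals(basic_text<ascii_case_insensitive>("hELLO")) holds while the byte_exact version of the same comparison does not. A multi-byte encoding would need next() to step over whole elements, and the imbue variant would hold the policy as a runtime object instead of a template argument.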
Here `it` could interpret the original string as UTF-8, and you could assume that dereferencing this iterator returns an appropriate (possibly variant) type that is convertible to the appropriate holder (char, wchar_t, or uint32_t for UTF-32). From here you can build ranges appropriately, deal with ranges, and just know that the encoding is explicitly defined in the iterator.

I like the idea of segmenting an encoded string into ranges, where a "character" would be a range capturing one or more of the underlying encoding's characters and combining characters. It solves the problem of what to return when dereferencing the iterators. Of course, you'd have to be able to compare two (for example) UTF-16 ranges meaningfully based on some locale, just as if a human who knew the symbols were comparing the glyphs that would be drawn for each range.

Another idea I like, though, is that dereferencing an iterator would return one UCS code point, and it would be up to a higher level of abstraction to fetch the combining characters and form the final glyph. That way, any string that encoded UCS, whether UTF-32, UTF-16, or UTF-8, could return char32_t from dereferencing an iterator.

I suspect that either or both of these, as well as other variations, would at times be the better idea, because the interpretation of the underlying code varies so much. Lots of places share the same scripts but with quite different rules about what to do with them and how to combine or compare them. Beware of a naive solution if the intent is to make a completely general solution. I'm not even sure it's possible without a layered approach.
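For the one-code-point-per-dereference variant, a minimal sketch over UTF-8 might look like the following. The class name utf8_code_point_iterator is hypothetical, and the code assumes well-formed input: it does no validation and does nothing about combining characters, which would be the job of the layer above.

```cpp
#include <cstddef>
#include <string>

// Hypothetical iterator: walks a UTF-8 byte string and yields one code
// point (char32_t) per dereference. Assumes well-formed UTF-8.
class utf8_code_point_iterator {
public:
    utf8_code_point_iterator(const std::string& s, std::size_t pos)
        : s_(&s), pos_(pos) {}

    char32_t operator*() const {
        const unsigned char lead = byte(pos_);
        const std::size_t len = length(lead);
        if (len == 1) return lead;
        // Keep the payload bits of the lead byte, then append six bits
        // from each continuation byte.
        char32_t cp = lead & (0xFFu >> (len + 1));
        for (std::size_t i = 1; i < len; ++i)
            cp = (cp << 6) | (byte(pos_ + i) & 0x3Fu);
        return cp;
    }

    utf8_code_point_iterator& operator++() {
        pos_ += length(byte(pos_));  // step over the whole sequence
        return *this;
    }

    friend bool operator==(const utf8_code_point_iterator& a,
                           const utf8_code_point_iterator& b) {
        return a.s_ == b.s_ && a.pos_ == b.pos_;
    }
    friend bool operator!=(const utf8_code_point_iterator& a,
                           const utf8_code_point_iterator& b) {
        return !(a == b);
    }

private:
    unsigned char byte(std::size_t i) const {
        return static_cast<unsigned char>((*s_)[i]);
    }

    // Sequence length in bytes, read from the lead byte.
    static std::size_t length(unsigned char lead) {
        if (lead < 0x80) return 1;
        if ((lead >> 5) == 0x6) return 2;
        if ((lead >> 4) == 0xE) return 3;
        return 4;
    }

    const std::string* s_;  // the string this iterator came from
    std::size_t pos_;       // byte offset of the current sequence
};
```

An iterator built at offset 0 and one built at s.size() then form a range of code points. UTF-16 and UTF-32 versions would differ only in how they decode and advance, which is why the same char32_t interface could sit on top of all three.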
Patrick