Re: [boost] [string] proposal

On Jan 29, 2011, at 8:43 AM, David Bergman wrote:
There are four layers in play here: 1. The sequence of characters/symbols, as in CS string; totally abstract but precise... (one such CS string can be represented as Unicode or Latin-1 at #2, for instance.) 2. The sequence of code points, in a given character set. Yes, one CS string (as in #1) can have multiple distinct manifestations at this level. They could be identical in >integral sequences. 3. The sequence of code values, using an encoding form such as UTF-8 or UTF-16 for a Unicode code point. 4. The byte storage representing the code values; could be a contiguous sequence of bytes or chunks, etc.
Which, IMHO, is exactly why it's (past) time to radically rethink the standard string interface. Your four layers share the properties that: (a) They can all be physically represented within the same data structure (i.e. layer #4), and (b) For any layer, there already exists a lot of legacy code that conceptualizes passed-in strings at that layer and that makes corresponding assumptions. To me, these facts imply that a 'view'-based approach is the right way to go. Furthermore, views should not be concrete data structures themselves, but simply iterator pairs (ranges) that internally point to some arbitrary data structure, and externally present a model of a linear sequence of atomic characters/codepoints/etc. In software development Nirvana, APIs and algorithms will operate on iterator_ranges at the appropriate conceptual level, rather than intruding into the underlying data structures, and at the interfaces with reality we'll just copy data to and from more concrete representations using the chosen encoding. Some of those interfaces (with low-level device drivers or with memory-mapped hardware itself) might never go away, but many (e.g. a GUI API that accepts a string and returns a texture of its glyphs) would hopefully eventually migrate to the iterator model. For example, one might write std::string underlying_data = obtain_validated_data_from(somewhere); text<ucs4, codept_view, decltype(underlying_data)> codepoints(assume_utf8(underlying_data)); text<utf8, char_view, decltype(underlying_data)> characters(assume_utf8(underlying_data)); if (codepoints.length() > characters.length()) contains_multipoint_characters = true; // converse not necessarily true if there are codepoints for multi-character ligatures const char *byteptr = linear_representation(underlying_data); // may or may not accurately track later changes, especially if the underlying data is non-contiguous, but that's really a usage protocol question and not for the data structure to solve--- anyway, strdup() if you want a snapshot from a given point in time text<utf8, char_view, characters::internal_representation> substring = all_after(characters.begin(), characters.end(), "phrase to match"); std::vector<wide_char_t> char_array; std::copy(substring.begin(), substring.end(), char_array.begin()); etc. When you think your usage pattern favors a rope or a chain over a std::string, you only have to change the declaration of the underlying_data object and the rest works seamlessly. I think Dean favors immutability because that makes it easy for the iterator-adaptors to cache data in the representation they need (or in more local storage in a NUMA environment), but like specifying a particular byte encoding for the underlying data this would be more for convenience and performance optimization than a firm requirement. When the underlying representation is a mutable type, read-only view adapters are still free to cache chunks of data as long as they check for a fingerprint that writable view adapters leave on the underlying structure whenever they commit a change. Consider the underlying data structure as a sequence of arbitrary 'tokens' (bytes, characters, codepoints, variable-length compressed bitstrings... whatever makes sense); a read-only substring cache that depends on tokens i though j (inclusive) of version v of the underlying representation can be described by the tuple r(v, i, j), while a serialized write operation that changes the tokens in positions i though j from version v to version v+1 can be described by the tuple w(v+1, i, j). A write w(v, x, y) invalidates all or part of a read cache r(v', x', y') if and only if v' < v && x' <= y && y' >= x (and if partially invalidated, we know exactly which segment(s) are still OK.) The dereference operator of the read-only iterator then needs to check its cached v' against the underlying data structure's latest version, and if necessary skim through a small ring buffer of recent w(v, x, y) tuples to see if part of its cache must be invalidated (if v' + 1 is less than the earliest v in the ring, the reader can conservatively invalidate its entire cache.) This is a negligible amount of overhead (in the common case, a single fetch, compare, and predictable branch, and in the worst case a new transcoding that you would have had to perform anyway) per character/codepoint/byte access, and when even that is unacceptable you can simply cache a representation of the string in a buffer of known encoding and make everything else adapt to /your/ data layout. I'm intentionally not delving into the details of synchronizing reads and writes between different threads, as std::string and char[] don't provide that either, but if you really need some views to provide their own synchronization this is a (relatively) straightforward place to build it in.
1. Why can't this byte storage type not be used for all kinds of things; is not 'string' a quite bad name for it, since it is neither a string according to most programming >languages (see above) nor according to that CS definition that you are alluding to (unless you consider uninterpreted bytes to be symbols, but be quite aware that those >'symbols' would have nothing - or very little - to do with the symbols of the text represented through your construct.)
Wholeheartedly agree; the view should have a name like 'text' (or 'characters', 'codepoints', 'bytes', etc.) and the underlying data structure can be string, rope, vector<char>, or boost::not_invented_yet. Sure, views don't completely solve the problem of legacy APIs that require a persistent, writable reference to a block of text (e.g.a text entry widget that maps its input to a character buffer); if you're lucky to can map the widget to the underlying_representation structure and create the views you need for other parts of the program. If you're not lucky, you might have to make a separate copy for each legacy interface and check periodically or on-demand for changes, but I'd still much rather use a data structure that forces me to make explicit copies than one that encourages me to risk sharing a single buffer between APIs that may not make the same encoding assumptions.
participants (1)
-
Brent Spillner