
Eric Niebler wrote:
A single string class shall be used to store Unicode strings, i.e. logical sequences of Unicode abstract characters.
This string shall be stored in one chosen encoding, for example UTF-8. The user does not have direct access to the underlying storage, however, so it might be regarded as an implementation detail.
An invariant of the string is that it is always in one chosen normalized form. Iteration over the string gives back a sequence of char32_t abstract characters. Comparisons are defined in terms of these sequences.
Is this a fair summary?
Such a one-size-fits-all unicode_string is guaranteed to be inefficient for some applications. If it is always stored in a decomposed form, an XML library probably wouldn't want to use it, because it requires a composed form. And making the encoding an implementation detail makes it inefficient to use in situations where binary compatibility matters (serialization, for example).
This seems right, but there's a catch. A configurable encoding would help if all components of your application use the same encoding. Say an XML parser wants composed form, so you use unicode_string<utf16, composed>. Now another part of your application (a library written by somebody else) uses a different encoding, and you have to convert the data at the interface. If there's only one encoding, you need to convert only for code which really, really needs another encoding. If there are several encodings, then different libraries will pick different encodings based on educated guesses about the data, and you'll be converting everywhere.
Also, it is impossible to store an abstract unicode character in char32_t because there may be N zero-width combining characters associated with it.
Perhaps having a one-size-fits-all unicode_string might be a nice default, as long as users who care about encoding and canonical form have other types (template + policies?) with knobs they can twiddle.
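One possible shape for "default plus knobs", as a sketch only. All names here (basic_unicode_string, the utf8/utf16 tags, composed/decomposed) are hypothetical, and the choice of UTF-8 and composed form as defaults is an assumption made for illustration, not something settled in this thread.

```cpp
#include <string>
#include <type_traits>

// Hypothetical encoding policies: each names its code unit type.
struct utf8  { using code_unit = char; };
struct utf16 { using code_unit = char16_t; };

// Hypothetical normalization policies (tags only in this sketch).
struct composed {};
struct decomposed {};

// The knobs: encoding and normalization form are template parameters.
template <class Encoding = utf8, class Form = composed>
class basic_unicode_string {
public:
    using encoding      = Encoding;
    using normalization = Form;
    using storage_type  = std::basic_string<typename Encoding::code_unit>;

private:
    storage_type data_;  // kept private so the invariant can be enforced
};

// The one-size-fits-all default for users who do not want to pick.
using unicode_string = basic_unicode_string<>;
```

Users who care would write basic_unicode_string<utf16, composed>; everybody else gets unicode_string and never sees the parameters.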
Maybe. I just wish there were some efficient mechanism to keep users who have not read the entire Unicode standard 10 times, and so don't know what they're doing, from touching the knobs ;-)
- Volodya