
"Eric Niebler" <eric@boost-consulting.com> wrote in message
Such a one-size-fits-all unicode_string is guaranteed to be inefficient for some applications.
Yes. That's why I would like the encoding to be a template parameter, allowing the programmer to choose the encoding best suited to his/her needs.
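To make that concrete, here is a rough sketch of what a string templated on its encoding might look like. All the names (utf8_encoding, utf16_encoding, utf32_encoding, the members of unicode_string) are made up for illustration and are not a proposed interface; it only shows the code unit type and the storage following from the template parameter.

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical encoding tags. Each one only names its code unit type.
struct utf8_encoding  { using code_unit = char; };
struct utf16_encoding { using code_unit = char16_t; };
struct utf32_encoding { using code_unit = char32_t; };

// Illustrative only: a string whose storage is chosen by the Encoding parameter.
template <typename Encoding>
class unicode_string {
public:
    using code_unit    = typename Encoding::code_unit;
    using storage_type = std::basic_string<code_unit>;

    unicode_string() = default;
    explicit unicode_string(storage_type units) : units_(std::move(units)) {}

    // Raw access, for callers that care about the binary layout
    // (serialization, for example).
    const code_unit* data() const { return units_.data(); }
    std::size_t size_in_code_units() const { return units_.size(); }
    std::size_t size_in_bytes() const { return units_.size() * sizeof(code_unit); }

private:
    storage_type units_;
};
```

With something like this, unicode_string<utf8_encoding> and unicode_string<utf16_encoding> are distinct types, so the encoding is part of the interface rather than an implementation detail.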
If it is always stored in a decomposed form, an XML library probably wouldn't want to use it, because it requires a composed form. And making the encoding an implementation detail makes it inefficient to use in situations where binary compatibility matters (serialization, for example).
I think the best solution is to store the string in the form in which it was originally received (decomposed or not), and instead provide composition functions or even iterator wrappers that compose on the fly. That would allow composed strings to be used where needed (as in an XML library) without imposing that requirement on all other users.
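A rough sketch of the iterator-wrapper idea, under heavy assumptions: compose_pair below is a toy table with two hard-coded pairs, not the real Unicode composition data, and composing_iterator and compose are hypothetical names. Real canonical composition also has to deal with combining classes, blocked sequences and Hangul, none of which is attempted here.

```cpp
#include <cstddef>
#include <iterator>
#include <string>

// Toy composition table (NOT the real Unicode data): returns the composed
// code point for a (base, mark) pair, or 0 if the pair does not compose.
inline char32_t compose_pair(char32_t base, char32_t mark) {
    if (base == U'e' && mark == 0x0301) return 0x00E9; // e + combining acute
    if (base == U'a' && mark == 0x0300) return 0x00E0; // a + combining grave
    return 0;
}

// Input-iterator wrapper over a range of code points that yields composed
// code points on the fly. The underlying string is never modified.
template <typename Iter>
class composing_iterator {
public:
    using iterator_category = std::input_iterator_tag;
    using value_type        = char32_t;
    using difference_type   = std::ptrdiff_t;
    using pointer           = const char32_t*;
    using reference         = char32_t;

    composing_iterator(Iter pos, Iter end) : pos_(pos), next_(pos), end_(end) { load(); }

    char32_t operator*() const { return current_; }
    composing_iterator& operator++() { pos_ = next_; load(); return *this; }
    bool operator==(const composing_iterator& o) const { return pos_ == o.pos_; }
    bool operator!=(const composing_iterator& o) const { return pos_ != o.pos_; }

private:
    // Read one code point, then fold following code points into it for as
    // long as the toy table says the pair composes.
    void load() {
        next_ = pos_;
        if (pos_ == end_) return;
        current_ = *next_++;
        while (next_ != end_) {
            char32_t composed = compose_pair(current_, *next_);
            if (composed == 0) break;
            current_ = composed;
            ++next_;
        }
    }

    Iter pos_, next_, end_;
    char32_t current_ = 0;
};

// Convenience function: build a composed copy using the wrapper.
inline std::u32string compose(const std::u32string& s) {
    using It = composing_iterator<std::u32string::const_iterator>;
    std::u32string out;
    for (It it(s.begin(), s.end()), last(s.end(), s.end()); it != last; ++it)
        out.push_back(*it);
    return out;
}
```

The point is that the stored string stays exactly as it was received; only the code that needs the composed form pays for producing it.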
Also, it is impossible to store an abstract unicode character in char32_t because there may be N zero-width combining characters associated with it.
Quite true. Storing abstract characters would require some variable-width storage facility.
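For illustration, one possible shape for such a facility: each abstract character holds its base code point plus however many combining marks follow it. The names abstract_character, is_combining_mark and split_abstract_characters are hypothetical, and the combining test is deliberately simplified to the Combining Diacritical Marks block (U+0300 to U+036F); a real implementation would consult the Unicode character database.

```cpp
#include <string>
#include <vector>

// Simplified assumption: treat only U+0300..U+036F as combining marks.
inline bool is_combining_mark(char32_t c) {
    return c >= 0x0300 && c <= 0x036F;
}

// One abstract character: a base code point followed by N combining marks.
// The storage is variable width, so N is not limited.
struct abstract_character {
    std::u32string code_points;
};

// Group a code point sequence into abstract characters. A combining mark
// with nothing before it starts its own character.
inline std::vector<abstract_character>
split_abstract_characters(const std::u32string& s) {
    std::vector<abstract_character> out;
    for (char32_t c : s) {
        if (out.empty() || !is_combining_mark(c))
            out.push_back(abstract_character{});
        out.back().code_points.push_back(c);
    }
    return out;
}
```

Given 'e' followed by U+0301 and U+0323 and then 'x', this yields two abstract characters, the first holding three code points, which is exactly what a single char32_t cannot represent.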
Perhaps having a one-size-fits-all unicode_string might be a nice default, as long as users who care about encoding and canonical form have other types (template + policies?) with knobs they can twiddle.
I would really like to provide enough knobs to keep everyone happy! ;)