
> Such a one-size-fits-all unicode_string is guaranteed to be inefficient for some applications. If it is always stored in a decomposed form, an XML library probably wouldn't want to use it, because it requires a composed form. And making the encoding an implementation detail makes it inefficient to use in situations where binary compatibility matters (serialization, for example).
This is a good point. I should think, however, that a codecvt facet should be responsible for serialization rather than the Unicode string itself. Furthermore, IMO, the invariants of any Unicode string should be checked when reading from a file anyway. This should happen on two levels: first, the UTF-8 or UTF-16 encoding must be correct; second, no dangling combining characters, and no combining characters attached to control characters, should occur. The normalisation form should probably be checked as well. So I'm not sure whether using Normalisation Form C rather than D will give you any big performance gains - you may need less memory, though.
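To make the second of those checks a bit more concrete, here is a minimal sketch of my own (not part of any proposal). It assumes the encoding has already been verified at the first level, and it only approximates "is a combining mark" by a single Unicode block; a real implementation would have to consult the Unicode character database (general categories Mn, Mc and Me).

#include <vector>
#include <stdexcept>

// Crude approximation: only the Combining Diacritical Marks block
// (U+0300..U+036F). Real code needs the full Unicode character data.
bool is_combining_mark (char32_t cp)
{
    return cp >= 0x0300 && cp <= 0x036F;
}

// The Cc category: C0 and C1 control characters.
bool is_control (char32_t cp)
{
    return cp < 0x20 || (cp >= 0x7F && cp <= 0x9F);
}

// Throw if a combining mark starts the text or follows a control
// character; marks stacking on other marks are fine.
void check_combining_marks (const std::vector<char32_t> & codepoints)
{
    bool mark_allowed = false;
    for (char32_t cp : codepoints) {
        if (is_combining_mark (cp) && !mark_allowed)
            throw std::runtime_error ("dangling combining character");
        mark_allowed = !is_control (cp);
    }
}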
> Also, it is impossible to store an abstract Unicode character in char32_t because there may be N zero-width combining characters associated with it.
I'm not sure what you mean here, but if you mean that one abstract character would be one codepoint: that's not true, I'm sorry to say. In particular, languages for which there was no encoding before Unicode, and funny scientists like mathematicians or linguists (I count myself among the latter), use abstract characters that have not been encoded as precomposed characters in Unicode. Nor will they be; the precomposed forms are there mainly for backwards compatibility. Note that adding a combining mark to a precomposed character requires decomposing it and recomposing it, so that might be pretty slow.
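To give a concrete example of such a character: "x with a circumflex", which a mathematician or phonetician might well want, has no precomposed codepoint in Unicode. It can only be written as a base letter followed by a combining mark, so it cannot be stored in a single char32_t:

#include <string>

// U+0078 LATIN SMALL LETTER X followed by
// U+0302 COMBINING CIRCUMFLEX ACCENT; there is no precomposed form.
const std::u32string x_with_circumflex = { U'x', U'\u0302' };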
> Perhaps having a one-size-fits-all unicode_string might be a nice default, as long as users who care about encoding and canonical form have other types (template + policies?) with knobs they can twiddle.
I do agree with that; and I also seem to remember from the discussion back in April that some people felt they needed to iterate over codepoints too. So please allow me to propose an altered version of my earlier proposal, taking in various suggestions from this thread.

namespace unicode {

// ***** Level 1: code units *****

// The code unit sequence is not explicitly specified, but it
// could be std::string, or SGI rope<char16_t>, or whatever.
// I think it would be reasonable to require replace, find,
// find_first_of and similar.

// ***** Level 2: codepoints *****

// The codepoint sequence is templatised on the code unit
// sequence. Depending on CodeUnits::value_type the encoding
// will be UTF-8, UTF-16, or UTF-32.
template <class CodeUnits> class codepoint_string {
    CodeUnits _code_units;
public:
    // ...

    // A user is not allowed to change the code unit sequence,
    // but it may be copied, or serialised.
    const CodeUnits & code_units() const;

    // The iterator is a bidirectional iterator. This is cheap
    // to implement on any correctly Unicode-encoded string,
    // since the iterator is not stateful.
    typedef ... iterator;

    // A size() member function is not included;
    // count() may be nice though.
};

// ***** Level 3: characters *****

// Normalisation policies
struct normalisation_form_c {};
struct normalisation_form_d {};

// Input policies
struct as_utf8 {};
struct as_latin1 {};
struct as_utf16 {};
// etcetera

// Error checking policies
struct throw_on_encoding_error {};
struct workaround_encoding_error {};

// An abstract Unicode character.
// I have not given this guy's interface much thought yet.
template <class NormalisationForm> class character {
    char32_t _base;
    std::vector<char32_t> _marks;
public:
    character (char32_t base);
    character & operator = (char32_t base);

    const char32_t & base() const;

    void add_mark (char32_t mark);

    // An iterator to iterate over the combining marks.
    // It is a const_iterator because we wouldn't want to
    // allow introducing non-marks in the list of marks.
    typedef std::vector<char32_t>::const_iterator mark_iterator;
    mark_iterator mark_begin() const;
    mark_iterator mark_end() const;

    // ....
};

// The actual Unicode string
template <class CodeUnits, class NormalisationForm, class ErrorChecking>
class string {
    codepoint_string<CodeUnits> _codepoints;
public:
    // Initialise with a UTF-8 string; normalise and check for errors.
    string (const CodeUnits &, as_utf8);

    template <class CodeUnits2, class NormalisationForm2, class ErrorChecking2>
    string (const string<CodeUnits2, NormalisationForm2, ErrorChecking2> &);

    // ....

    const codepoint_string<CodeUnits> & codepoints() const;
    const CodeUnits & code_units() const;

    // Another bidirectional iterator; this one iterates
    // over abstract characters.
    class iterator {
    public:
        // Returns an object with an interface equal to
        // unicode::character, but it changes the string.
        character_ref operator *() const;
        // ...
    };
};

} // namespace unicode

// ***** That was all *****

Mutating operations on unicode::string may require O(n) time, where n is the length of the code unit sequence, depending on the properties of CodeUnits. That's why using an SGI rope would make sense. Some default template parameters for unicode::string should be thought of.

Regards,
Rogier
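P.S. To make the intent a bit more concrete, here is a purely hypothetical usage sketch of the interface above. None of this exists yet; in particular the begin()/end() members on unicode::string are my own assumption for the sake of the example.

#include <string>

typedef unicode::string<std::string,
                        unicode::normalisation_form_d,
                        unicode::throw_on_encoding_error> ustring;

void example ()
{
    // Construct from UTF-8 code units; the constructor normalises
    // and checks the encoding (throwing on error with this policy).
    // "e" followed by U+0301 COMBINING ACUTE ACCENT (0xCC 0x81 in UTF-8).
    ustring text (std::string ("He\xcc\x81llo"), unicode::as_utf8 ());

    // Iterate over abstract characters (base plus combining marks);
    // begin()/end() returning string::iterator are assumed here.
    for (ustring::iterator it = text.begin (); it != text.end (); ++it) {
        char32_t base = (*it).base ();
        (void) base; // ... do something with the base character
    }

    // Build an abstract character by hand: 'n' with a combining
    // tilde (U+0303).
    unicode::character<unicode::normalisation_form_d> n_tilde (U'n');
    n_tilde.add_mark (U'\u0303');
}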