
On 28/01/2011 11:41, Artyom wrote:
b) code_point_iterator - back inserter
You could simply define a push_back(char32_t) and have it naturally be called by std::back_inserter.
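To make that concrete, here is a minimal sketch, assuming a hypothetical wrapper class `utf8_string` (not the class proposed in this thread) that stores UTF-8 in a std::string. std::back_insert_iterator only needs a `value_type` typedef and a matching `push_back`, so no dedicated inserter iterator is required. The names `sketch`, `utf8_string`, `bytes()` and `encode` are made up for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <string>

namespace sketch {

// Hypothetical UTF-8 string wrapper, for illustration only.
class utf8_string {
public:
    // std::back_insert_iterator assigns through Container::value_type,
    // so the code point type has to be exposed under this name.
    typedef char32_t value_type;

    // Appends one code point, encoded as 1 to 4 UTF-8 code units.
    // Surrogates are not rejected here; this sketch does no validation
    // beyond the range precondition below.
    void push_back(char32_t cp) {
        assert(cp <= 0x10FFFF);  // precondition: within the Unicode range
        if (cp < 0x80) {
            bytes_ += static_cast<char>(cp);
        } else if (cp < 0x800) {
            bytes_ += static_cast<char>(0xC0 | (cp >> 6));
            bytes_ += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            bytes_ += static_cast<char>(0xE0 | (cp >> 12));
            bytes_ += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            bytes_ += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            bytes_ += static_cast<char>(0xF0 | (cp >> 18));
            bytes_ += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            bytes_ += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            bytes_ += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }

    // Read-only access to the underlying UTF-8 code units.
    const std::string& bytes() const { return bytes_; }

private:
    std::string bytes_;
};

// Copies a range of code points into a utf8_string via std::back_inserter.
template <typename CodePointIterator>
utf8_string encode(CodePointIterator first, CodePointIterator last) {
    utf8_string out;
    std::copy(first, last, std::back_inserter(out));
    return out;
}

} // namespace sketch
```

Any algorithm that writes through an output iterator (std::copy, std::transform, ...) can then append code points without knowing anything about the encoding.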
3. It allows using std::string under the hood as storage, which gives high efficiency when assigning boost::string to std::string if the implementation is COW (almost all implementations, with the exception of MSVC)
COW implementations of std::string are not allowed anymore starting with C++0x.
4. It is fully Unicode-aware
5. It pushes the "UTF-8" idea to standard C++
6. You don't pay for what you do not need.
What am I paying for? I don't see how I gain anything.
Proposed API:
-------------
namespace boost {
// Fully bidirectional iterator
template<typename UnitsIterator>
class const_code_point_iterator {
public:
    const_code_point_iterator(UnitsIterator begin,UnitsIterator end); // begin
    const_code_point_iterator(UnitsIterator begin,UnitsIterator end,UnitsIterator location); // current pos
    const_code_point_iterator(); // end

#ifdef C++0x
    typedef char32_t const_code_point_type;
#else
    typedef unsigned const_code_point_type;
#endif
Just define boost::char32 once (depending on BOOST_NO_CHAR32_T) and use that instead of putting ifdefs everywhere. That's what boost/cuchar.hpp does in my library.
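Roughly like this; a sketch only, not the actual contents of boost/cuchar.hpp. It assumes BOOST_NO_CHAR32_T would be supplied by Boost.Config in real code (it is not included here, so the char32_t branch is taken on a C++0x compiler). The `sketch` namespace and `is_scalar_value` are hypothetical names used for illustration.

```cpp
#include <cassert>
#include <limits>
#include <stdint.h>

// The single place where the conditional lives.
// Assumption: in real code the macro would come from <boost/config.hpp>.
#if defined(BOOST_NO_CHAR32_T)
namespace sketch { typedef uint_least32_t char32; }
#else
namespace sketch { typedef char32_t char32; }
#endif

namespace sketch {

// Everything else is written against sketch::char32, with no further ifdefs.
// Returns true for Unicode scalar values: at most U+10FFFF and not a surrogate.
inline bool is_scalar_value(char32 value) {
    return value <= 0x10FFFF && !(value >= 0xD800 && value <= 0xDFFF);
}

} // namespace sketch
```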
// UTF validation
bool is_valid_utf() const;
See, that's what makes the whole thing pointless. Your type doesn't add any semantic value on top of std::string; it's just an agglomeration of free functions into a class. That's a terrible design.

The only advantage that a specific type for Unicode strings would bring is that it could enforce certain useful invariants. But your proposal doesn't even enforce that the string is valid UTF-8. Enforcing that the string is in a valid UTF encoding and is normalized in a specific normalization form can make most Unicode algorithms several orders of magnitude faster.

Since people seem to want this, here is a simple proposal:

    template<typename T> struct ustring;

where T must be a Forward Sequence of char, char16, char32 or wchar_t. The type then acts as an adaptor over that sequence but enforces that the data is encoded in UTF-X in normalization form C, with X deduced from the value type of the inner Forward Sequence.

ustring would be an immutable range of code units, with whatever refinements (bidirectional or random access) the inner Forward Sequence allows. I thought it was accepted that strings should be immutable. Otherwise, insertions at the front/back could be added if the underlying Forward Sequence allows them.

Its operator+ would return a lazy join expression.

And that's all there is to it. Use free functions for the rest; ustring could provide some member helpers if that really makes life easier for some people. All of this is trivial to implement quickly with my Unicode library.
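A heavily reduced sketch of the invariant-enforcing adaptor, to show the shape of the idea. It is written from scratch and does not use my Unicode library. Assumptions: only sequences of char (UTF-8) are handled, the normalization form C check and the lazy operator+ are left out entirely, and the `sketch` namespace and `is_valid_utf8` helper are hypothetical names.

```cpp
#include <cassert>
#include <iterator>
#include <stdexcept>
#include <string>

namespace sketch {

// Returns true if [first, last) is well-formed UTF-8: no overlong forms,
// no surrogates, nothing above U+10FFFF, no truncated sequences.
template <typename It>
bool is_valid_utf8(It first, It last) {
    while (first != last) {
        unsigned char lead = static_cast<unsigned char>(*first++);
        if (lead < 0x80) continue;            // single-unit (ASCII) code point
        int trailing;
        unsigned long cp, min;
        if ((lead & 0xE0) == 0xC0)      { trailing = 1; cp = lead & 0x1F; min = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { trailing = 2; cp = lead & 0x0F; min = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { trailing = 3; cp = lead & 0x07; min = 0x10000; }
        else return false;                    // stray continuation or invalid lead unit
        for (; trailing > 0; --trailing) {
            if (first == last) return false;  // truncated sequence
            unsigned char c = static_cast<unsigned char>(*first++);
            if ((c & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return false;
    }
    return true;
}

// Immutable adaptor over a Forward Sequence of char.
// The invariant (well-formed UTF-8) is checked once, at construction;
// no mutating members are exposed, so it cannot be broken afterwards.
template <typename T>
class ustring {
public:
    typedef typename T::const_iterator const_iterator;

    explicit ustring(const T& units) : units_(units) {
        if (!is_valid_utf8(units_.begin(), units_.end()))
            throw std::invalid_argument("ustring: ill-formed UTF-8");
    }

    const_iterator begin() const { return units_.begin(); }
    const_iterator end() const { return units_.end(); }

private:
    T units_;
};

} // namespace sketch
```

Algorithms taking a ustring can then skip validation altogether, which is where the invariant pays for itself.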