Re: [boost] [string] Realistic API proposal

28 Jan 2011

      On 28/01/2011 11:41, Artyom wrote:
...
b) code_point_iterator - back inserter
You could simply define a push_back(char32_t) and have it naturally be 
called by std::back_inserter.
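
To illustrate the push_back suggestion, here is a minimal sketch. The class name utf8_buffer, its str() accessor and the helper encode_example are hypothetical, not part of the proposal. It assumes UTF-8 storage in a std::string and does no validation of the incoming code points. The value_type typedef is there because std::back_insert_iterator needs it.

```cpp
#include <algorithm>
#include <iterator>
#include <string>

// Hypothetical UTF-8 container; the name and layout are illustrative only.
class utf8_buffer {
public:
    typedef char32_t value_type;  // required by std::back_insert_iterator

    // Append one code point, encoded as 1 to 4 UTF-8 code units.
    // This sketch does not reject surrogates or values above U+10FFFF.
    void push_back(char32_t cp) {
        if (cp < 0x80) {
            data_ += static_cast<char>(cp);
        } else if (cp < 0x800) {
            data_ += static_cast<char>(0xC0 | (cp >> 6));
            data_ += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            data_ += static_cast<char>(0xE0 | (cp >> 12));
            data_ += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            data_ += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            data_ += static_cast<char>(0xF0 | (cp >> 18));
            data_ += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            data_ += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            data_ += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }

    const std::string& str() const { return data_; }

private:
    std::string data_;
};

// Usage: std::back_inserter calls push_back(char32_t) for each code point.
inline utf8_buffer encode_example() {
    const char32_t cps[] = {0x48, 0xE9, 0x20AC};  // 'H', U+00E9, U+20AC
    utf8_buffer out;
    std::copy(cps, cps + 3, std::back_inserter(out));
    return out;
}
```

With this in place no dedicated inserter iterator type is needed; any standard algorithm that writes code points can target the container directly.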
...
3. It allows to use std::string meanwhile under the hood as storage
    giving high efficiency when assigning boost::string to std::string
    when the implementation is COW (almost all implementations with
    exception of MSVC)
COW implementations of std::string are not allowed anymore starting with 
C++0x.
...
4. It is full unicode aware
5. It pushes "UTF-8" idea to standard C++
6. You don't pay for what you do not need.
What am I paying for? I don't see how I gain anything.
...
Proposed API:
-------------
namespace boost {
// Fully bidirectional iterator
     template<typename UnitsIterator>
     class const_code_point_iterator {
     public:
         const_code_point_iterator(UnitsIterator begin, UnitsIterator end); // begin
         const_code_point_iterator(UnitsIterator begin, UnitsIterator end, UnitsIterator location); // current pos
         const_code_point_iterator(); // end
#ifdef C++0x
         typedef char32_t const_code_point_type;
         #else
         typedef unsigned const_code_point_type;
         #endif
Just define boost::char32 once (depending on BOOST_NO_CHAR32_T) and use 
that instead of putting ifdefs everywhere.
(that's what boost/cuchar.hpp does in my library)
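
A single typedef along those lines could look like the sketch below. It is not the contents of boost/cuchar.hpp. The namespace sketch is a placeholder, and the fallback to std::uint_least32_t is an assumption. BOOST_NO_CHAR32_T is the macro named above and would normally come from <boost/config.hpp>, which is left out here to keep the example standalone.

```cpp
#include <cstdint>

// Placeholder namespace; a real library would put this in its own namespace.
namespace sketch {
#ifdef BOOST_NO_CHAR32_T
    // Assumed fallback when the compiler has no native char32_t.
    typedef std::uint_least32_t char32;
#else
    typedef char32_t char32;
#endif
}

// Every other header then uses sketch::char32 and needs no #ifdef of its own.
inline sketch::char32 replacement_character() {
    return 0xFFFD;  // U+FFFD REPLACEMENT CHARACTER
}
```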
...
// UTF validation
bool is_valid_utf() const;
See, that's what makes the whole thing pointless.
Your type doesn't add any semantic value on top of std::string, it's 
just an agglomeration of free functions into a class. That's a terrible 
design.
The only advantage that a specific type for unicode strings would bring 
is that it could enforce certain useful invariants.

But your proposal doesn't even enforce the string is valid UTF-8.

Enforcing that the string is in a valid UTF encoding and is normalized 
in a specific normalization form can make most Unicode algorithms 
several orders of magnitude faster.

Since people seem to want this, here is a simple proposal:

template<typename T>
struct ustring;

where T must be a Forward Sequence of char, char16, char32 or wchar_t.
The type then acts as an adaptor over that sequence but enforces that 
the data is encoded in UTF-X in normalization form C, with X deduced 
from the value type of the inner Forward Sequence.

ustring would be an immutable range of code units, with whatever 
refinements (bidirectional or random access) the inner Forward Sequence 
allows.
I thought it was accepted that strings should be immutable. If not, 
insertion at the front/back could be added where the underlying forward 
sequence allows it.
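
A rough sketch of such an adaptor is below, under heavy assumptions. It handles only sequences of char (UTF-8), it checks well-formedness at construction and throws on failure, and it does not check or enforce normalization form C, which needs Unicode data tables. The helper is_valid_utf8 and the choice of std::invalid_argument are mine, not part of the proposal.

```cpp
#include <stdexcept>
#include <string>

// Returns true if [first, last) is well-formed UTF-8: rejects stray or
// missing continuation bytes, overlong forms, surrogates and code points
// above U+10FFFF.
template <typename It>
bool is_valid_utf8(It first, It last) {
    while (first != last) {
        unsigned char b = static_cast<unsigned char>(*first++);
        if (b < 0x80) continue;  // single-unit (ASCII) sequence
        int extra = 0;
        unsigned long cp = 0, min = 0;
        if ((b & 0xE0) == 0xC0)      { extra = 1; cp = b & 0x1F; min = 0x80; }
        else if ((b & 0xF0) == 0xE0) { extra = 2; cp = b & 0x0F; min = 0x800; }
        else if ((b & 0xF8) == 0xF0) { extra = 3; cp = b & 0x07; min = 0x10000; }
        else return false;  // invalid lead byte
        for (; extra > 0; --extra) {
            if (first == last) return false;  // truncated sequence
            unsigned char c = static_cast<unsigned char>(*first++);
            if ((c & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;
    }
    return true;
}

// Immutable adaptor over a sequence of char. Once constructed, the
// contents are known to be well-formed UTF-8 and cannot be changed.
template <typename T>
class ustring {
public:
    typedef typename T::const_iterator const_iterator;

    explicit ustring(const T& units) : units_(units) {
        if (!is_valid_utf8(units_.begin(), units_.end()))
            throw std::invalid_argument("ustring: ill-formed UTF-8");
        // Normalization form C is NOT checked in this sketch.
    }

    const_iterator begin() const { return units_.begin(); }
    const_iterator end() const { return units_.end(); }

private:
    T units_;
};
```

Because only const iteration is exposed, algorithms taking a ustring can skip their own validity checks, which is where the invariant pays off.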

Its operator+ would return a lazy join expression.
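
One way to read "lazy join expression" is an expression template: operator+ copies nothing and only records its operands, and the code units are produced when somebody reads them. The sketch below uses a stand-in type called text rather than ustring, and the names join_expr and str() are hypothetical. Operands are held by reference, so an expression containing temporaries must be consumed within the same full expression.

```cpp
#include <cstddef>
#include <string>

// Stand-in for an immutable string type; illustrative only.
class text {
public:
    explicit text(const std::string& s) : units_(s) {}
    std::size_t size() const { return units_.size(); }
    char operator[](std::size_t i) const { return units_[i]; }
private:
    std::string units_;
};

// Records the two operands; no code units are copied until str() is called.
template <typename L, typename R>
class join_expr {
public:
    join_expr(const L& l, const R& r) : l_(l), r_(r) {}

    std::size_t size() const { return l_.size() + r_.size(); }

    char operator[](std::size_t i) const {
        return i < l_.size() ? l_[i] : r_[i - l_.size()];
    }

    // Materialize the joined sequence with a single allocation.
    std::string str() const {
        std::string out;
        out.reserve(size());
        for (std::size_t i = 0; i < size(); ++i) out += (*this)[i];
        return out;
    }

private:
    const L& l_;  // held by reference: see the lifetime caveat above
    const R& r_;
};

inline join_expr<text, text> operator+(const text& a, const text& b) {
    return join_expr<text, text>(a, b);
}

template <typename L, typename R>
join_expr<join_expr<L, R>, text>
operator+(const join_expr<L, R>& a, const text& b) {
    return join_expr<join_expr<L, R>, text>(a, b);
}
```

With this, (a + b + c).str() builds the result in one pass instead of creating an intermediate string for a + b.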

And that's all there is to it. Use free functions for the rest; ustring 
could provide some member helpers if that really makes life easier for 
some people.

All of this is trivial to implement quickly with my Unicode library.

Mathias Gaunard