Re: [boost] [string] Realistic API proposal

28 Jan 2011

Hi Artyom,

Artyom wrote:
...
I'd like to provide a realistic string API proposal:
I've been keeping out of this up to now, but since there is something 
concrete here I'll share my thoughts.
...
    // Fully bidirectional iterator
    template<typename UnitsIterator>
    class const_code_point_iterator {
    public:
        const_code_point_iterator(UnitsIterator begin,
                                  UnitsIterator end); // begin
        const_code_point_iterator(UnitsIterator begin,
                                  UnitsIterator end,
                                  UnitsIterator location); // current pos
        const_code_point_iterator(); // end
        #ifdef C++0x
        typedef char32_t const_code_point_type;
        #else
        typedef unsigned const_code_point_type;
        #endif
        const_code_point_type operator*() const;
        ...
    };
I have something broadly like this here:

http://svn.chezphil.org/libpbe/trunk/include/charset/const_character_iterato...

I attempted to do this with the character set as a template parameter 
and a "charset traits" class providing encoding and decoding 
functions.  That was probably over-complicated; making it utf-8 only 
would be fine - but in that case, it should have a name that says "utf8".

I do find it somewhat unsatisfactory that you need to store the begin 
and end of the underlying string.  This triples the size of what could 
otherwise be a single pointer.  I think these are only needed to detect 
invalid utf-8, aren't they?  In some of my code I had an error_policy 
template parameter that allowed you to specify whether the input should 
be trusted or not; if it's trusted you can avoid this overhead.  Even 
then, though, you can't avoid having begin and end in the interface, 
adding verbosity.
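
To make the trade-off concrete, here is a minimal sketch of the error-policy idea. It is not the libpbe code and not part of the proposal: the names trusted_input, checked_input, utf8_length and utf8_const_iterator are hypothetical, it assumes C++11 (char32_t) and a const char* underlying iterator, and it is forward-only, so the checked variant stores just the end pointer (backward iteration would need begin as well). The checked policy here reports errors by yielding U+FFFD, which is only one possible choice, and it does not reject every ill-formed sequence (overlong forms, for instance, pass through).

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical policy: input is trusted, so nothing is stored or checked.
struct trusted_input {
    explicit trusted_input(const char*) {}
    bool available(const char*, std::size_t) const { return true; }
    bool is_lead(unsigned char) const { return true; }
    bool is_continuation(unsigned char) const { return true; }
};

// Hypothetical policy: input is untrusted, so the end pointer is kept.
struct checked_input {
    const char* end;
    explicit checked_input(const char* e) : end(e) {}
    bool available(const char* p, std::size_t n) const {
        return static_cast<std::size_t>(end - p) >= n;
    }
    bool is_lead(unsigned char b) const {
        return b < 0x80 || (b >= 0xC2 && b <= 0xF4);
    }
    bool is_continuation(unsigned char b) const { return (b & 0xC0) == 0x80; }
};

// Sequence length implied by a lead byte (anything else is treated as 1).
inline int utf8_length(unsigned char lead) {
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    if ((lead & 0xF8) == 0xF0) return 4;
    return 1;
}

template <typename Policy>
class utf8_const_iterator : private Policy {  // an empty policy adds no state
    const char* p_;
public:
    utf8_const_iterator(const char* pos, const char* end)
        : Policy(end), p_(pos) {}

    // Decodes the code point at the current position. With checked_input,
    // malformed or truncated input yields U+FFFD instead.
    char32_t operator*() const {
        const char32_t replacement = 0xFFFD;
        if (!this->available(p_, 1)) return replacement;
        const unsigned char lead = static_cast<unsigned char>(*p_);
        const int n = utf8_length(lead);
        if (!this->is_lead(lead) || !this->available(p_, n)) return replacement;
        if (n == 1) return lead;
        char32_t cp = lead & (0x7F >> n);  // payload bits of the lead byte
        for (int i = 1; i < n; ++i) {
            const unsigned char b = static_cast<unsigned char>(p_[i]);
            if (!this->is_continuation(b)) return replacement;
            cp = (cp << 6) | (b & 0x3F);
        }
        return cp;
    }

    utf8_const_iterator& operator++() {
        const int n = utf8_length(static_cast<unsigned char>(*p_));
        p_ += this->available(p_, n) ? n : 1;  // never step past a short tail
        return *this;
    }

    const char* base() const { return p_; }
};

inline void error_policy_example() {
    const char text[] = "a\xC3\xA9";  // 'a' followed by U+00E9
    utf8_const_iterator<trusted_input> it(text, text + 3);
    ++it;
    assert(*it == 0xE9);
}
```

On the usual compilers the trusted instantiation is the size of one pointer thanks to the empty base optimisation, while the checked one carries the extra end pointer. Both constructors still take end, which is the interface verbosity mentioned above.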

Another way to avoid storing begin and end is to somehow make those 
iterators empty structs (and hence also default-constructible).  
Specifically, if your underlying string is guaranteed to be 
null-terminated, the end iterator can be stateless.  I guess you could 
avoid storing the begin iterator by prepending a null, but that doesn't 
work for std::string.
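
A minimal sketch of the stateless-end idea, assuming the underlying string is null-terminated and already valid UTF-8. The names utf8_end, utf8_cstr_iterator and count_code_points are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Empty, default-constructible end marker: it carries no pointer at all.
struct utf8_end {};

// Forward iterator over the code points of a null-terminated UTF-8 string.
// Assumption: the input is trusted, so no validation is done.
class utf8_cstr_iterator {
    const char* p_;
public:
    explicit utf8_cstr_iterator(const char* p) : p_(p) {}

    utf8_cstr_iterator& operator++() {
        ++p_;  // step over the lead byte
        // Then skip continuation bytes (10xxxxxx). The terminating null is
        // not a continuation byte, so this loop cannot run past it.
        while ((static_cast<unsigned char>(*p_) & 0xC0) == 0x80) ++p_;
        return *this;
    }

    const char* base() const { return p_; }

    // Reaching the end is detected from the data, not from a stored pointer.
    friend bool operator!=(const utf8_cstr_iterator& it, utf8_end) {
        return *it.p_ != '\0';
    }
};

inline std::size_t count_code_points(const char* s) {
    std::size_t n = 0;
    for (utf8_cstr_iterator it(s); it != utf8_end(); ++it) ++n;
    return n;
}

inline void sentinel_example() {
    assert(count_code_points("a\xC3\xA9") == 2);  // 'a' and U+00E9
}
```

The iterator is a single pointer and the end marker is an empty struct. The cost is that begin and end have different types, so this does not drop straight into algorithms that expect an iterator pair of one type.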
...
    /// Output iterator
    template<typename BackInserter>
    class code_point_iterator {
    public:
        code_point_iterator(BackInserter out); // begin
        code_point_iterator(); // end
        #ifdef C++0x
        typedef char32_t code_point_type;
        #else
        typedef unsigned code_point_type;
        #endif
        code_point_type operator*() const;
        ...
    };
So this only allows appending, right?  I have something like that here:

http://svn.chezphil.org/libpbe/trunk/include/charset/character_output_iterat...

Broadly, I would say that allowing bidirectional reading and 
append-only writing is the right thing to do for strings.  If anyone 
has an hour to spare, it's educational to try hacking your code to use 
std::list<char> instead of std::string, and see how much of it still compiles.
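
As an illustration of append-only writing, here is a minimal sketch of an output iterator that encodes each assigned code point as UTF-8 and appends the bytes to any container with push_back(char). The name utf8_append_iterator is hypothetical (it is neither the libpbe class nor the proposed code_point_iterator), C++11 char32_t is assumed, and no range checking is done on the code point.

```cpp
#include <cassert>
#include <iterator>
#include <list>
#include <string>

template <typename Container>
class utf8_append_iterator {
    Container* c_;

    void push(char32_t byte) {
        c_->push_back(static_cast<char>(static_cast<unsigned char>(byte)));
    }
public:
    typedef std::output_iterator_tag iterator_category;
    typedef void value_type;
    typedef void difference_type;
    typedef void pointer;
    typedef void reference;

    explicit utf8_append_iterator(Container& c) : c_(&c) {}

    // Assigning a code point appends its UTF-8 encoding (1 to 4 bytes).
    utf8_append_iterator& operator=(char32_t cp) {
        if (cp < 0x80) {
            push(cp);
        } else if (cp < 0x800) {
            push(0xC0 | (cp >> 6));
            push(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            push(0xE0 | (cp >> 12));
            push(0x80 | ((cp >> 6) & 0x3F));
            push(0x80 | (cp & 0x3F));
        } else {
            push(0xF0 | (cp >> 18));
            push(0x80 | ((cp >> 12) & 0x3F));
            push(0x80 | ((cp >> 6) & 0x3F));
            push(0x80 | (cp & 0x3F));
        }
        return *this;
    }

    // The usual output-iterator no-ops.
    utf8_append_iterator& operator*() { return *this; }
    utf8_append_iterator& operator++() { return *this; }
    utf8_append_iterator operator++(int) { return *this; }
};

inline void append_example() {
    std::list<char> bytes;  // works without random access
    utf8_append_iterator<std::list<char> > out(bytes);
    *out = 0xE9;            // U+00E9 encodes as two bytes
    ++out;
    assert(bytes.size() == 2);
}
```

Because only push_back is required, the same iterator works for std::string, std::vector<char> and std::list<char>, which is the property the std::list<char> experiment is meant to expose.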
...
template<typename Char,typename Traits=std::char_traits<Char>,
             typename Alloc=std::allocator<Char> >
    class basic_string {
    public:
        // { boost specific
        typedef std::basic_string<Char,Traits,Alloc> std_string_type;
        // } boost specific
// All std::string standard functions based
        // Deprecated interfaces that exist for backward compatibility,
        // as they are not Unicode aware
value_type &at(size_type indx);
        value_type &operator[](size_type indx);
        iterator begin();
        iterator end();
        // { boost specific compatibility functions with std::string; they would go
        //   away as std::string becomes extended with boost::string's new interfaces
        //
        basic_string(std_string_type const &other) : data_(other) {}
        basic_string(std_string_type const &other,size_type index,size_type len) 
: data_(other,index,len) {}
...
operator std_string_type() const
        {
            return data_;
        }
// } boost specific compatibility functions
//
        // Unicode Support
        //
        // ------------------------
        //
//
        // UTF Codepoint iteration
        //
#ifdef C++0x
        typedef char32_t code_point_type;
        #else
        typedef unsigned code_point_type;
        #endif
        typedef boost::const_code_point_iterator<const_iterator>
            const_code_point_iterator;
const_code_point_iterator code_point_begin() const
        {
            return const_code_point_iterator(begin(),end());
        }
        const_code_point_iterator code_point_end() const
        {
            return const_code_point_iterator(begin(),end(),end());
        }
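        // std::back_inserter is a function; the iterator type it returns is
        // std::back_insert_iterator
        typedef boost::code_point_iterator<std::back_insert_iterator<basic_string> >
            code_point_iterator;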
code_point_iterator back_inserter()
        {
            return code_point_iterator(std::back_inserter<basic_string>(*this));
        }
basic_string &operator+=(code_point_type code_point);
        basic_string operator+(code_point_type code_point) const;
        void append(code_point_type code_point);
The approach that I would prefer is more like:

template <typename impl_t>
class utf8_string_adaptor {
   impl_t impl;
..
};

typedef utf8_string_adaptor<std::string> utf8_string;

In this way:
- I can wrap other containers than std::string, e.g. sgi::rope, char*, 
std::vector etc.
- utf8_string::begin() can return a utf8_character_iterator.
- Accessing the underlying bytes is possible but requires something 
explicit e.g. foo.base().begin().
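
A minimal sketch of how such an adaptor could look, to show the three points above in code. Everything beyond the utf8_string_adaptor, impl_t, impl and utf8_string names is assumed for illustration: the nested const_iterator, base() and length() members are hypothetical, C++11 char32_t is assumed, and the decoder trusts its input (no validation).

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <string>
#include <vector>

template <typename impl_t>
class utf8_string_adaptor {
    impl_t impl;
public:
    typedef typename impl_t::const_iterator unit_iterator;

    // Bidirectional code point iterator over trusted UTF-8.
    class const_iterator {
        unit_iterator p_;
        static unsigned char unit(unit_iterator p) {
            return static_cast<unsigned char>(*p);
        }
        static int units(unsigned char lead) {  // bytes in this sequence
            if ((lead & 0xE0) == 0xC0) return 2;
            if ((lead & 0xF0) == 0xE0) return 3;
            if ((lead & 0xF8) == 0xF0) return 4;
            return 1;
        }
    public:
        explicit const_iterator(unit_iterator p) : p_(p) {}

        char32_t operator*() const {
            unit_iterator q = p_;
            const unsigned char lead = unit(q);
            int n = units(lead);
            if (n == 1) return lead;
            char32_t cp = lead & (0x7F >> n);
            while (--n) cp = (cp << 6) | (unit(++q) & 0x3F);
            return cp;
        }
        const_iterator& operator++() {
            std::advance(p_, units(unit(p_)));
            return *this;
        }
        const_iterator& operator--() {
            // Step back over continuation bytes to the previous lead byte.
            do { --p_; } while ((unit(p_) & 0xC0) == 0x80);
            return *this;
        }
        bool operator==(const const_iterator& o) const { return p_ == o.p_; }
        bool operator!=(const const_iterator& o) const { return p_ != o.p_; }
        unit_iterator base() const { return p_; }
    };

    utf8_string_adaptor() {}
    explicit utf8_string_adaptor(const impl_t& s) : impl(s) {}

    const_iterator begin() const { return const_iterator(impl.begin()); }
    const_iterator end() const { return const_iterator(impl.end()); }

    // Byte access has to be asked for explicitly.
    const impl_t& base() const { return impl; }

    // Number of code points (linear time).
    std::size_t length() const {
        std::size_t n = 0;
        for (const_iterator it = begin(); it != end(); ++it) ++n;
        return n;
    }
};

typedef utf8_string_adaptor<std::string> utf8_string;

inline void adaptor_example() {
    const std::string raw = "a\xC3\xA9\xE2\x82\xAC";  // 'a', U+00E9, U+20AC
    utf8_string s(raw);
    assert(s.length() == 3);       // code points
    assert(s.base().size() == 6);  // bytes
}
```

The same template instantiates for std::vector<char> or any other container whose const_iterator is at least bidirectional over bytes; a char* wrapper would need a small shim that provides begin() and end().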
...
//
        // Lexical operations on string
        //
// Case handling
basic_string upper_case(std::locale const &l=std::locale()) const;
        basic_string lower_case(std::locale const &l=std::locale()) const;
        basic_string title_case(std::locale const &l=std::locale()) const;
        basic_string fold_case() const; // locale independent
// Unicode normalization
typedef enum {
            nfc,
            nfkc,
            nfd,
            nfkd
        } normalization_mode;
basic_string normalize(normalization_mode mode = nfc) const;
// normalized string constructor
basic_string(basic_string const &,normalization_mode mode);
        basic_string(Char const *,normalization_mode mode);
        basic_string(Char const *,size_t n,normalization_mode mode);
        template<typename Iterator>
        basic_string(Iterator begin,Iterator end,normalization_mode mode);
        void append_normalized(basic_string const &other,
                               normalization_mode mode = nfc);
        void append_normalized(Char const *,normalization_mode mode = nfc);
        void append_normalized(Char const *,size_t n,
                               normalization_mode mode = nfc);
        basic_string concat_normalized(basic_string const &other,
                                       normalization_mode mode = nfc) const;
        basic_string concat_normalized(Char const *,
                                       normalization_mode mode = nfc) const;
        basic_string concat_normalized(Char const *,size_t n,
                                       normalization_mode mode = nfc) const;
// Unicode validation
bool valid_utf() const;
[snip]

Surely almost all of that should be in free functions and generic 
algorithms, no?  E.g. valid_utf8() could be an algorithm that takes a 
pair of iterators over bytes, and then it can be used on any sequence.
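
For example, a free valid_utf8() could look roughly like the sketch below. It assumes only that the iterators yield byte-sized values; the exact signature is an assumption rather than anything from the proposal. It rejects truncated sequences, stray continuation bytes, overlong encodings, surrogates and values above U+10FFFF.

```cpp
#include <cassert>
#include <list>
#include <string>

template <typename InputIterator>
bool valid_utf8(InputIterator first, InputIterator last) {
    while (first != last) {
        const unsigned char lead = static_cast<unsigned char>(*first);
        ++first;
        if (lead < 0x80) continue;  // ASCII

        int extra;                  // continuation bytes expected
        unsigned long cp;           // code point being assembled
        unsigned long min;          // smallest value this length may encode
        if ((lead & 0xE0) == 0xC0)      { extra = 1; cp = lead & 0x1F; min = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; min = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; min = 0x10000; }
        else return false;          // stray continuation or invalid lead byte

        for (; extra > 0; --extra) {
            if (first == last) return false;          // truncated sequence
            const unsigned char b = static_cast<unsigned char>(*first);
            if ((b & 0xC0) != 0x80) return false;     // not a continuation
            cp = (cp << 6) | (b & 0x3F);
            ++first;
        }
        if (cp < min) return false;                       // overlong form
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;   // surrogate
        if (cp > 0x10FFFF) return false;                  // out of range
    }
    return true;
}

inline void valid_utf8_example() {
    const std::string s = "a\xC3\xA9";
    assert(valid_utf8(s.begin(), s.end()));
    const std::list<char> l(s.begin(), s.end());  // any byte sequence works
    assert(valid_utf8(l.begin(), l.end()));
}
```

Since it needs nothing beyond input iterators, the same function serves std::string, std::list<char>, a char array, or a stream iterator, with no member function required on any string class.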

Regards,  Phil.

Phil Endecott