
Hi Artyom,

Artyom wrote:
I'd like to provide a realistic string API proposal:
I've been keeping out of this up to now, but since there is something concrete here I'll share my thoughts.
// Fully bidirectional iterator
template<typename UnitsIterator>
class const_code_point_iterator {
public:
    const_code_point_iterator(UnitsIterator begin, UnitsIterator end); // begin
    const_code_point_iterator(UnitsIterator begin, UnitsIterator end, UnitsIterator location); // current pos
    const_code_point_iterator(); // end

#ifdef C++0x
    typedef char32_t const_code_point_type;
#else
    typedef unsigned const_code_point_type;
#endif

    const_code_point_type operator*() const;
    ...
};
I have something broadly like this here:
http://svn.chezphil.org/libpbe/trunk/include/charset/const_character_iterato...

I attempted to do this with the character set as a template parameter and a "charset traits" class providing encoding and decoding functions. That was probably over-complicated; making it UTF-8 only would be fine, but in that case it should have a name that says "utf8".

I do find it somewhat unsatisfactory that you need to store the begin and end of the underlying string. This triples the size of what could otherwise be a single pointer. I think these are only needed to detect invalid UTF-8, aren't they? In some of my code I had an error_policy template parameter that allowed you to specify whether the input should be trusted or not; if it's trusted you can avoid this overhead. Even then, though, you can't avoid having begin and end in the interface, adding verbosity.

Another way to avoid storing begin and end is to somehow make those iterators empty structs (and hence also default-constructible). Specifically, if your underlying string is guaranteed to be null-terminated, the end iterator can be stateless. I guess you could avoid storing the begin iterator by prepending a null, but that doesn't work for std::string.
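To make the trusted/untrusted distinction concrete, here is a rough sketch of what I mean by an error policy. All the names (trusted_input, checked_input, utf8_reader, utf8_length) are invented for this mail and are not the libpbe interface or part of your proposal. It assumes a compiler that has char32_t, and it is only a forward reader rather than a full bidirectional iterator; the checked variant catches bad lead bytes, bad continuation bytes and truncation, but does not try to reject every overlong form.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical policy tags (names invented for this sketch).
struct trusted_input {};  // caller guarantees well-formed UTF-8
struct checked_input {};  // input may be malformed or truncated

// Sequence length implied by a lead byte; 0 if the byte cannot start a sequence.
inline int utf8_length(unsigned char lead) {
    if (lead < 0x80) return 1;
    if (lead >= 0xC2 && lead <= 0xDF) return 2;
    if (lead >= 0xE0 && lead <= 0xEF) return 3;
    if (lead >= 0xF0 && lead <= 0xF4) return 4;
    return 0;
}

template <typename ErrorPolicy> class utf8_reader;

// Trusted input: a single pointer is the whole state.
template <> class utf8_reader<trusted_input> {
    const unsigned char* p_;
public:
    explicit utf8_reader(const char* p)
        : p_(reinterpret_cast<const unsigned char*>(p)) {}

    // Decode the code point at the current position and step past it.
    char32_t next() {
        static const unsigned char lead_mask[] = {0, 0x7F, 0x1F, 0x0F, 0x07};
        int n = utf8_length(*p_);
        char32_t cp = *p_++ & lead_mask[n];
        for (int i = 1; i < n; ++i) cp = (cp << 6) | (*p_++ & 0x3F);
        return cp;
    }
};

// Untrusted input: the end of the data must be stored to detect truncation.
template <> class utf8_reader<checked_input> {
    const unsigned char* p_;
    const unsigned char* end_;
public:
    utf8_reader(const char* p, const char* end)
        : p_(reinterpret_cast<const unsigned char*>(p)),
          end_(reinterpret_cast<const unsigned char*>(end)) {}

    bool at_end() const { return p_ == end_; }

    char32_t next() {
        static const unsigned char lead_mask[] = {0, 0x7F, 0x1F, 0x0F, 0x07};
        if (p_ == end_) throw std::out_of_range("read past end of input");
        int n = utf8_length(*p_);
        if (n == 0 || end_ - p_ < n)
            throw std::runtime_error("invalid lead byte or truncated sequence");
        char32_t cp = p_[0] & lead_mask[n];
        for (int i = 1; i < n; ++i) {
            if ((p_[i] & 0xC0) != 0x80)
                throw std::runtime_error("bad continuation byte");
            cp = (cp << 6) | (p_[i] & 0x3F);
        }
        p_ += n;
        return cp;
    }
};
```

The trusted specialization is one pointer wide, while the checked one has to carry the end as well; a bidirectional checked iterator would need the begin too, which is where the factor of three comes from.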
/// Output iterator
template<typename BackInserter>
class code_point_iterator {
public:
    code_point_iterator(BackInserter out); // begin
    code_point_iterator(); // end

#ifdef C++0x
    typedef char32_t code_point_type;
#else
    typedef unsigned code_point_type;
#endif

    code_point_type operator*() const;
    ...
};
So this only allows appending, right? I have something like that here:
http://svn.chezphil.org/libpbe/trunk/include/charset/character_output_iterat...

Broadly, I would say that allowing bidirectional reading and append-only writing is the right thing to do for strings. If anyone has an hour to spare, it's educational to try hacking your code to use std::list<char> instead of std::string, and see how much of it still compiles.
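For what it's worth, append-only writing needs very little machinery. The following is a minimal sketch, not the libpbe code: append_utf8 is a name I made up, it assumes char32_t is available, and it assumes the caller passes a valid Unicode scalar value (no surrogates, nothing above U+10FFFF), so it does no checking of its own.

```cpp
#include <cassert>
#include <iterator>
#include <list>
#include <string>

// Hypothetical helper: encode one code point as UTF-8 and write the bytes
// through any output iterator. Assumes cp is a valid Unicode scalar value.
template <typename OutputIterator>
OutputIterator append_utf8(char32_t cp, OutputIterator out) {
    if (cp < 0x80) {
        *out++ = static_cast<char>(cp);
    } else if (cp < 0x800) {
        *out++ = static_cast<char>(0xC0 | (cp >> 6));
        *out++ = static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        *out++ = static_cast<char>(0xE0 | (cp >> 12));
        *out++ = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        *out++ = static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        *out++ = static_cast<char>(0xF0 | (cp >> 18));
        *out++ = static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        *out++ = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        *out++ = static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

Because it only needs an output iterator, the same function appends to a std::string, a std::list<char> or a std::vector<char> via std::back_inserter, which is the property the std::list<char> experiment is meant to test.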
template<typename Char,
         typename Traits=std::char_traits<Char>,
         typename Alloc=std::allocator<Char> >
class basic_string {
public:
    // { boost specific
    typedef std::basic_string<Char,Traits,Alloc> std_string_type;
    // } boost specific

    // All std::string standard functions based

    // Deprecated interfaces that exist for backward compatibility
    // as they are not Unicode aware

    value_type &at(size_type indx);
    value_type &operator[](size_type indx);
    iterator begin();
    iterator end();

    // { boost specific compatibility functions with std::string, they would go
    // as std::string becomes extended with boost::string new interfaces
    //
    basic_string(std_string_type const &other) : data_(other) {}
    basic_string(std_string_type const &other,size_type index,size_type len) : data_(other,index,len) {}

    ...

    operator std_string_type() const { return data_; }

    // } boost specific compatibility functions
    //
    // Unicode Support
    //
    // ------------------------
    //

    //
    // UTF Codepoint iteration
    //

#ifdef C++0x
    typedef char32_t code_point_type;
#else
    typedef unsigned code_point_type;
#endif

    typedef boost::const_code_point_iterator<const_iterator> const_code_point_iterator;

    const_code_point_iterator code_point_begin() const
    {
        return const_code_point_iterator(begin(),end());
    }
    const_code_point_iterator code_point_end() const
    {
        return const_code_point_iterator(begin(),end(),end());
    }

    typedef boost::code_point_iterator<std::back_insert_iterator<basic_string> > code_point_iterator;

    code_point_iterator back_inserter()
    {
        return code_point_iterator(std::back_inserter(*this));
    }

    basic_string &operator+=(code_point_type code_point);
    basic_string operator+(code_point_type code_point) const;
    void append(code_point_type code_point);
The approach that I would prefer is more like:

template <typename impl_t>
class utf8_string_adaptor {
    impl_t impl;
    ..
};

typedef utf8_string_adaptor<std::string> utf8_string;

In this way:
- I can wrap other containers than std::string, e.g. sgi::rope, char*, std::vector etc.
- utf8_string::begin() can return a utf8_character_iterator.
- Accessing the underlying bytes is possible but requires something explicit e.g. foo.base().begin().
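Filling in the adaptor a little, purely as an illustration: the sketch below assumes impl_t is a standard-like container of char with a const_iterator (so char* would need its own wrapper), assumes the stored bytes are already valid UTF-8, and assumes char32_t is available. The nested const_iterator stands in for the utf8_character_iterator mentioned above and is forward-only to keep it short; none of the member names beyond impl and base() come from existing code.

```cpp
#include <cassert>
#include <iterator>
#include <string>
#include <vector>

template <typename impl_t>
class utf8_string_adaptor {
    impl_t impl;  // the underlying byte container
public:
    typedef typename impl_t::const_iterator byte_iterator;

    // Minimal forward iterator over code points (trusts its input).
    class const_iterator {
        byte_iterator p_;
        static int length(unsigned char lead) {
            return lead < 0x80 ? 1 : lead < 0xE0 ? 2 : lead < 0xF0 ? 3 : 4;
        }
    public:
        explicit const_iterator(byte_iterator p) : p_(p) {}

        char32_t operator*() const {
            byte_iterator i = p_;
            unsigned char lead = static_cast<unsigned char>(*i);
            int n = length(lead);
            // Strip the length marker bits from the lead byte.
            char32_t cp = (n == 1) ? lead : (lead & (0xFF >> (n + 1)));
            while (--n > 0)
                cp = (cp << 6) | (static_cast<unsigned char>(*++i) & 0x3F);
            return cp;
        }
        const_iterator& operator++() {
            std::advance(p_, length(static_cast<unsigned char>(*p_)));
            return *this;
        }
        bool operator==(const const_iterator& o) const { return p_ == o.p_; }
        bool operator!=(const const_iterator& o) const { return p_ != o.p_; }
    };

    utf8_string_adaptor() {}
    explicit utf8_string_adaptor(const impl_t& bytes) : impl(bytes) {}

    // Iteration is over code points by default...
    const_iterator begin() const { return const_iterator(impl.begin()); }
    const_iterator end() const { return const_iterator(impl.end()); }

    // ...and byte access has to be asked for explicitly.
    const impl_t& base() const { return impl; }
};

typedef utf8_string_adaptor<std::string> utf8_string;
```

With this shape, a loop from foo.begin() to foo.end() visits code points, and foo.base().begin() is the only way to get at the raw bytes, so byte-level access cannot happen by accident.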
    //
    // Lexical operations on string
    //

    // Case handling

    basic_string upper_case(std::locale const &l=std::locale()) const;
    basic_string lower_case(std::locale const &l=std::locale()) const;
    basic_string title_case(std::locale const &l=std::locale()) const;
    basic_string fold_case() const; // locale independent

    // Unicode normalization

    typedef enum { nfc, nfkc, nfd, nfkd } normalization_mode;

    basic_string normalize(normalization_mode mode = nfc) const;

    // normalized string constructor

    basic_string(basic_string const &,normalization_mode mode);
    basic_string(Char const *,normalization_mode mode);
    basic_string(Char const *,size_t n,normalization_mode mode);
    template<typename Iterator>
    basic_string(Iterator begin,Iterator end,normalization_mode mode);

    void append_normalized(basic_string const &other,normalization_mode mode = nfc);
    void append_normalized(Char const *,normalization_mode mode = nfc);
    void append_normalized(Char const *,size_t n,normalization_mode mode = nfc);

    basic_string concat_normalized(basic_string const &other,normalization_mode mode = nfc) const;
    basic_string concat_normalized(Char const *,normalization_mode mode = nfc) const;
    basic_string concat_normalized(Char const *,size_t n,normalization_mode mode = nfc) const;

    // Unicode validation

    bool valid_utf() const;
[snip]

Surely almost all of that should be in free functions and generic algorithms, no? E.g. valid_utf8() could be an algorithm that takes a pair of iterators over bytes, and then it can be used on any sequence.

Regards, Phil.
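
P.S. To show what I mean by valid_utf8() as a generic algorithm, here is a rough sketch. It is my own illustration rather than existing library code; it only assumes that dereferencing the iterator yields something convertible to a byte, and it follows the usual well-formedness rules (no overlong forms, no surrogates, nothing above U+10FFFF).

```cpp
#include <cassert>
#include <list>
#include <string>

// Sketch of a free algorithm: true if [first, last) is well-formed UTF-8.
// Works with any input iterator over bytes.
template <typename InputIterator>
bool valid_utf8(InputIterator first, InputIterator last) {
    while (first != last) {
        unsigned char lead = static_cast<unsigned char>(*first++);
        int trailing = 0;                    // continuation bytes expected
        unsigned char lo = 0x80, hi = 0xBF;  // allowed range for the next byte

        if (lead < 0x80) {
            continue;                        // ASCII
        } else if (lead >= 0xC2 && lead <= 0xDF) {
            trailing = 1;                    // 0xC0 and 0xC1 would be overlong
        } else if (lead >= 0xE0 && lead <= 0xEF) {
            trailing = 2;
            if (lead == 0xE0) lo = 0xA0;     // reject overlong 3-byte forms
            if (lead == 0xED) hi = 0x9F;     // reject UTF-16 surrogates
        } else if (lead >= 0xF0 && lead <= 0xF4) {
            trailing = 3;
            if (lead == 0xF0) lo = 0x90;     // reject overlong 4-byte forms
            if (lead == 0xF4) hi = 0x8F;     // reject values above U+10FFFF
        } else {
            return false;                    // stray continuation or bad lead
        }

        for (int i = 0; i < trailing; ++i) {
            if (first == last) return false; // truncated sequence
            unsigned char c = static_cast<unsigned char>(*first++);
            if (c < lo || c > hi) return false;
            lo = 0x80;                       // later bytes use the plain range
            hi = 0xBF;
        }
    }
    return true;
}
```

The same function then works unchanged on a std::string, a std::list<char>, a std::vector<char> or a pair of char pointers, with no string class involved.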