[string] Yet another Unicode string class

I'm working on yet another Unicode string class/library from another set of features and requirements. * It is designed around the codepoint concept. * It uses (currently forward-) iterators for encoding and decoding. * It has a minimal interface, mostly constructors and iterator access. * Most other functions can (hopefully) be free functions. * It uses basic_string as backend. * It has fast access to underlying basic_string. * It is (currently) using some C++0X features (mainly decltype). * It is (currently) immutable and shares data, and thus fast to copy. Some of these features and requirements may be unacceptable to some of you, but I'm open to suggestions and comments. // Copyright (c) 2011 Anders Dalvander. // // Distributed under the Boost Software License, Version 1.0. // (See accompanying file LICENSE_1_0.txt or copy at // http://www.boost.org/LICENSE_1_0.txt) template <typename encoding> class basic_text { public: typedef encoding encoding_type; typedef typename encoding_type::codeunit_type codeunit_type; typedef typename encoding_type::codepoint_type codepoint_type; typedef std::basic_string<codeunit_type> string_type; typedef typename string_type::const_iterator codeunit_iterator; typedef typename encoding_type::decode_iterator<codeunit_iterator> codepoint_iterator; typedef codepoint_iterator const_iterator; typedef codepoint_iterator iterator; basic_text() : s(std::make_shared<string_type>()) { } template <typename other_encoding> basic_text(const basic_text<other_encoding>& text) : s(std::make_shared<string_type>( encoding_type::encode_iterator<decltype(std::begin(text))> (std::begin(text), std::begin(text), std::end(text)), encoding_type::encode_iterator<decltype(std::begin(text))> (std::end(text), std::begin(text), std::end(text)))) { } // TODO: Use some default_encoding traits type. template <typename container> explicit basic_text(const container& c) : s(std::make_shared<string_type>( encoding_type::encode_iterator<decltype(std::begin(c))> (std::begin(c), std::begin(c), std::end(c)), encoding_type::encode_iterator<decltype(std::begin(c))> (std::end(c), std::begin(c), std::end(c)))) { } template <typename codepoint_iterator> basic_text(codepoint_iterator first, codepoint_iterator last) : s(std::make_shared<string_type>( encoding_type::encode_iterator<codepoint_iterator> (first, first, last), encoding_type::encode_iterator<codepoint_iterator> (last, first, last))) { } codepoint_iterator begin() const { return codepoint_iterator (codeunit_begin(), codeunit_begin(), codeunit_end()); } codepoint_iterator end() const { return codepoint_iterator (codeunit_end(), codeunit_begin(), codeunit_end()); } codeunit_iterator codeunit_begin() const { return std::begin(*s); } codeunit_iterator codeunit_end() const { return std::end(*s); } const string_type& str() const { return *s; } const codeunit_type* c_str() const { return s->c_str(); } private: typedef std::shared_ptr<const string_type> pointer_type; pointer_type s; }; typedef undefined-type utf8_encoding; typedef basic_text<utf8_encoding> u8text; typedef undefined-type utf16_encoding; typedef basic_text<utf16_encoding> u16text; typedef undefined-type utf32_encoding; typedef basic_text<utf32_encoding> u32text; typedef undefined-type wchar_encoding; typedef basic_text<wchar_encoding> wtext; typedef undefined-type ascii_encoding; typedef basic_text<ascii_encoding> ascii_text; Usage: int main() { const uint32_t cps[] = {0x41,0x42,0x80,0x800,0x10000,0x10ffff}; // construct from codepoint range u8text u8txt(std::begin(cps), std::end(cps)); // construct from encoded container, // currently treats each element as a codepoint u8text u8txt2("test"); // sharing is caring u8text u8txt3 = u8txt; // construct from codepoint range u16text u16txt(std::begin(cps), std::end(cps)); // construct from text, transcodes range u16text u16txt2 = u8txt; // construct from text, transcodes range u32text u32txt = u8txt; // using policy (possible extension) ascii_text ascii(u8txt, replace_policy(0xff)); } void OpenFileWin32(const u8text& txt) { CloseHandle(CreateFileW(wtext(txt).c_str(), ...)) } typedef undefined-type posix_encoding; typedef basic_text<posix_encoding> posixtext; void OpenFilePosix(const u8text& txt) { close(open(posixtext(txt).c_str(), ...)) } Regards, Anders Dalvander -- WWFSMD?

On 10/02/2011 21:38, Anders Dalvander wrote:
* It is (currently) using some C++0X features (mainly decltype).
I'm fairly sure that's unnecessary, aren't there range_traits that expose the type directly? Also, for inclusion in Boost, it should probably use boost::begin, not std::begin. Otherwise, there is not much to say without a real implementation to look at.

On 20:59, Mathias Gaunard wrote:
On 10/02/2011 21:38, Anders Dalvander wrote:
* It is (currently) using some C++0X features (mainly decltype).
I'm fairly sure that's unnecessary, aren't there range_traits that expose the type directly? Also, for inclusion in Boost, it should probably use boost::begin, not std::begin.
Otherwise, there is not much to say without a real implementation to look at.
Yes, you are correct, the use of decltype is totally unnecessary. I fixed that and the other issue you mentioned. I've uploaded an implementation to http://www.dalvander.com/boost_text if anyone is interested. Some notes: * The iterator classes should probably derive from iterator_adaptor. * Currently there is no way to prevent exceptions from being thrown. * Sometime the validity checking will be a bit redundant. * The documentation is currently minimalistic. Regards, Anders Dalvander -- WWFSMD?

On Thu, Feb 10, 2011 at 3:38 PM, Anders Dalvander <boost@dalvander.com> wrote:
* It is (currently) immutable and shares data, and thus fast to copy.
Is there a reason you think this is a better approach than you could have by using the small string optimization and simply doing a deep copy when the internal buffer is too small? -- Dave Abrahams BoostPro Computing http://www.boostpro.com

On 20:59, Dave Abrahams wrote:
On Thu, Feb 10, 2011 at 3:38 PM, Anders Dalvander<boost@dalvander.com> wrote:
* It is (currently) immutable and shares data, and thus fast to copy.
Is there a reason you think this is a better approach than you could have by using the small string optimization and simply doing a deep copy when the internal buffer is too small?
Thanks for your comments. Actually, the only reason I have for this approach is that I listened, perhaps too much, to other voices on this list. Voices which said that immutability was very useful and could provide optimization possibilities. Currently these potential optimizations are unused, except when copying the data. It is very easy to change the design, by changing the single data member from type `boost::shared_ptr<const std::string>` to `std::string`. That way the small string optimization, if any, of `std::string` could be utilized directly. Come to think of it, the type could actually be specified as a typedef of the encoding template parameter. Regards, Anders Dalvander -- WWFSMD?

On 14/02/2011 00:52, Dave Abrahams wrote:
On Thu, Feb 10, 2011 at 3:38 PM, Anders Dalvander<boost@dalvander.com> wrote:
* It is (currently) immutable and shares data, and thus fast to copy.
Is there a reason you think this is a better approach than you could have by using the small string optimization and simply doing a deep copy when the internal buffer is too small?
SBO makes moving a string costly.

On Mon, Feb 14, 2011 at 01:53, Mathias Gaunard <mathias.gaunard@ens-lyon.org> wrote:
SBO makes moving a string costly.
How bug a buffer does SBO usually use? For UTF-8, I figured it'd be maybe 4 or 8 bytes, to usually avoid the heap when putting individual "characters" in strings, as is needed to avoid problems like the current toupper('ß').

On 14.02.2011, at 20:15, Stephan T. Lavavej wrote:
[Scott McMurray]
How bug a buffer does SBO usually use?
VC's std::string uses 16 bytes, enough to store 15 chars or 7 wchar_ts plus a null terminator.
I believe STLPort uses twice that for char, and grows with the character type (i.e. twice that again for 16-bit code units). Whereas libc++, which was developed with move semantics in mind, has a very small buffer. Sebastian

On 14/02/11 19:08, Scott McMurray wrote:
On Mon, Feb 14, 2011 at 01:53, Mathias Gaunard <mathias.gaunard@ens-lyon.org> wrote:
SBO makes moving a string costly.
How bug a buffer does SBO usually use?
The libc++ string uses 23 bytes (on a 64-bit architecture) by squishing the length into a single byte for short strings. I don't see how that makes moves costly, though. The move constructor for this string simply copies the bytes and zeroes out the source; it doesn't even need to branch based on whether it's a long or short string. Perhaps the concern is that such techniques are not defined behaviour? Or that the use of unions might confuse the optimizer? John Bytheway

On Feb 14, 2011, at 5:55 PM, John Bytheway wrote:
On 14/02/11 19:08, Scott McMurray wrote:
On Mon, Feb 14, 2011 at 01:53, Mathias Gaunard <mathias.gaunard@ens-lyon.org> wrote:
SBO makes moving a string costly.
How bug a buffer does SBO usually use?
The libc++ string uses 23 bytes (on a 64-bit architecture) by squishing the length into a single byte for short strings.
I don't see how that makes moves costly, though. The move constructor for this string simply copies the bytes and zeroes out the source; it doesn't even need to branch based on whether it's a long or short string. Perhaps the concern is that such techniques are not defined behaviour? Or that the use of unions might confuse the optimizer?
It isn't so much the size of the buffer that matters, but rather the total number of words that need to be copied or otherwise manipulated (e.g. zeroed). The libc++ string is 3 words (on 32 and 64 bit) and the cost of the move is proportional to 3. If the libc++ string needed to move (for example) 6 words, then its move constructor would be twice as expensive, though still cheap compared to a copy of a long string. -Howard

On 15/02/2011 00:12, Howard Hinnant wrote:
On Feb 14, 2011, at 5:55 PM, John Bytheway wrote:
On 14/02/11 19:08, Scott McMurray wrote:
On Mon, Feb 14, 2011 at 01:53, Mathias Gaunard <mathias.gaunard@ens-lyon.org> wrote:
SBO makes moving a string costly.
How bug a buffer does SBO usually use?
The libc++ string uses 23 bytes (on a 64-bit architecture) by squishing the length into a single byte for short strings.
I don't see how that makes moves costly, though. The move constructor for this string simply copies the bytes and zeroes out the source; it doesn't even need to branch based on whether it's a long or short string. Perhaps the concern is that such techniques are not defined behaviour? Or that the use of unions might confuse the optimizer?
It isn't so much the size of the buffer that matters, but rather the total number of words that need to be copied or otherwise manipulated (e.g. zeroed). The libc++ string is 3 words (on 32 and 64 bit) and the cost of the move is proportional to 3. If the libc++ string needed to move (for example) 6 words, then its move constructor would be twice as expensive, though still cheap compared to a copy of a long string.
I didn't know it was just 3 words, that's still very cheap. Using memcpy doesn't break the strict aliasing guarantee, so that's legal. It could be even cheaper if it could be copied using SIMD instructions, but then that would put restrictions on the alignment of string, which one might not want.

Mathias Gaunard wrote:
On 15/02/2011 00:12, Howard Hinnant wrote:
On Feb 14, 2011, at 5:55 PM, John Bytheway wrote:
On 14/02/11 19:08, Scott McMurray wrote:
On Mon, Feb 14, 2011 at 01:53, Mathias Gaunard <mathias.gaunard@ens-lyon.org> wrote:
SBO makes moving a string costly.
How bug a buffer does SBO usually use?
The libc++ string uses 23 bytes (on a 64-bit architecture) by squishing the length into a single byte for short strings.
I don't see how that makes moves costly, though. The move constructor for this string simply copies the bytes and zeroes out the source; it doesn't even need to branch based on whether it's a long or short string. Perhaps the concern is that such techniques are not defined behaviour? Or that the use of unions might confuse the optimizer?
It isn't so much the size of the buffer that matters, but rather the total number of words that need to be copied or otherwise manipulated (e.g. zeroed). The libc++ string is 3 words (on 32 and 64 bit) and the cost of the move is proportional to 3. If the libc++ string needed to move (for example) 6 words, then its move constructor would be twice as expensive, though still cheap compared to a copy of a long string.
I didn't know it was just 3 words, that's still very cheap. Using memcpy doesn't break the strict aliasing guarantee, so that's legal.
It could be even cheaper if it could be copied using SIMD instructions, but then that would put restrictions on the alignment of string, which one might not want.
This caught my eye. A couple of years ago I looked at MMX/SIMD - tested an example off of the codeguru website which out of the box showed MMX/SIMD was faster. I unrolled the loop of the non-MMX/SIMD code using duff's device and saw no difference in performance with or without MMX/SIMD. I'm just wondering in your experience is there a real benefit of MMX/SIMD even discounting the alignment issue. Thanks, Jeff

This caught my eye. A couple of years ago I looked at MMX/SIMD - tested an example off of the codeguru website which out of the box showed MMX/SIMD was faster. I unrolled the loop of the non-MMX/SIMD code using duff's device and saw no difference in performance with or without MMX/SIMD. I'm just wondering in your experience is there a real benefit of MMX/SIMD even discounting the alignment issue.
MMX is old tech. SSE2+ and Altivec has a real effect on such operations.

On 15/02/2011 13:35, Jeff Flinn wrote:
This caught my eye. A couple of years ago I looked at MMX/SIMD - tested an example off of the codeguru website which out of the box showed MMX/SIMD was faster. I unrolled the loop of the non-MMX/SIMD code using duff's device and saw no difference in performance with or without MMX/SIMD. I'm just wondering in your experience is there a real benefit of MMX/SIMD even discounting the alignment issue.
I don't think copying 3 words is really the area where SIMD shines the most. But it does shine (with linear speedups) in others.
participants (10)
-
Anders Dalvander
-
Dave Abrahams
-
Howard Hinnant
-
Jeff Flinn
-
Joel Falcou
-
John Bytheway
-
Mathias Gaunard
-
Scott McMurray
-
Sebastian Redl
-
Stephan T. Lavavej