[unicode] Interest Check / Proof of Concept

Over the past few months, I've been tinkering with a Unicode string library. It's still *far* from finished, but it's far enough along that the overall structure is visible. I've seen a bunch of Unicode proposals for Boost come and go, so hopefully this one will address the most common needs people have.

The library is based on two (immutable) string types: ct_string and rt_string. ct_strings are _C_ompile _T_ime tagged with a particular encoding, and rt_strings are _R_un _T_ime tagged with an encoding. This allows for faster conversion when the encoding is known at compile time, while still permitting conversion at run time (useful for reading XML!).

General usage would look something like this:

  ct_string<ct::utf8> foo("Hello, world!");
  ct_string<ct::utf16> bar;
  bar.encode(foo);

  rt_string baz;
  baz.encode(bar, rt::utf8);

Note the use of ct::utf8 and rt::utf8. As you might expect from the syntax, ct::utf8 is a type, and rt::utf8 is an object. Broadly speaking, to create an encoding, you create a class with read and write methods, and then you create an instance of an rt_encoding<MyEncoding>. Most of this is laid out in the comments of my code, so I won't go into too much detail here.

There's still a lot missing from the code (most notably, dynamically-sized strings and string concatenation), but here's a rundown of what *is* present:

* Compile-time and run-time tagged strings
* Re-encoding of strings based on compile-/run-time tags
* Simple memory copying when the source and destination encodings are the same
* Forward iterators to step through code points in strings

If you'd like to take a look at the code, it's available here: http://www.teamboxel.com/misc/unicode.tar.gz . I've tested it with gcc 4.3.2 and MSVC8, but most modern compilers should be able to handle it. Comments and criticisms are, of course, welcome.

- Jim
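To make the "class with read and write methods, wrapped by rt_encoding<MyEncoding>" idea concrete, here is a rough standalone sketch of how such a design could work. All names and signatures here are invented for illustration; they are not the library's actual API.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical compile-time codec: a class with static read/write methods.
struct ascii_codec {
    // Decode one code point from [it, end), advancing it.
    static uint32_t read(const char*& it, const char* end) {
        assert(it != end);
        return static_cast<unsigned char>(*it++);
    }
    // Encode one code point onto the end of out.
    static void write(uint32_t cp, std::string& out) {
        out.push_back(static_cast<char>(cp));
    }
};

// Run-time tag: type-erases a compile-time codec behind virtual dispatch.
struct rt_encoding_base {
    virtual ~rt_encoding_base() {}
    virtual uint32_t read(const char*& it, const char* end) const = 0;
    virtual void write(uint32_t cp, std::string& out) const = 0;
};

template <typename Codec>
struct rt_encoding : rt_encoding_base {
    virtual uint32_t read(const char*& it, const char* end) const {
        return Codec::read(it, end);
    }
    virtual void write(uint32_t cp, std::string& out) const {
        Codec::write(cp, out);
    }
};
```

Under this scheme a ct_string<ascii_codec> would call ascii_codec::read directly (and inline it), while an rt_string would hold an rt_encoding_base pointer and pay one virtual call per code point.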

There's still a lot missing from the code (most notably, dynamically-sized strings and string concatenation), but here's a rundown of what *is* present:
* Compile-time and run-time tagged strings
* Re-encoding of strings based on compile-/run-time tags
* Uses simple memory copying when source and dest encodings are the same
* Forward iterators to step through code points in strings
If you'd like to take a look at the code, it's available here: http://www.teamboxel.com/misc/unicode.tar.gz . I've tested it in gcc 4.3.2 and MSVC8, but most modern compilers should be able to handle it. Comments and criticisms are, of course, welcome.
I think it looks like a good start. I'm getting a warning about a string->wchar_t conversion.

Just a couple comments/questions...

- I don't think the global rt encoding objects are the way to go. I would just have each string object declare the encoding object either as a member variable or as needed inside a member function. Since they don't have any member variables, the cost is negligible.

- Would it be possible to merge the ct/rt classes into a single type?

- Maybe encode/decode should be free functions - algorithm-like. You might have something like:

  estring<> s = ...;  // Create an encodeable string with some default encoding (ascii?)
  encode(s, utf8());  // utf8 is a functor object that returns a utf8_encoder object.

I guess if you go this way, the estring class would just contain an encoded string associated with the encoder type. It might be an interesting approach. Still, a good start.

Andrew Sutton
andrew.n.sutton@gmail.com

Andrew Sutton wrote:
I think it looks like a good start. I'm getting a warning about a string->wchar_t conversion.
I think gcc is complaining because it defines wchar_t as 32 bits. Honestly, wchar_t is pretty awful since its size is platform-dependent, but I don't think any compiler supports the new Unicode strings yet. :) I suppose I could have said "int16_t raw[] = { 'T', 'e', 's', 't', ... };", but that's not very readable!
Just a couple comments/questions... - I don't think the global rt encoding objects are the way to go. I would just have each string object declare the encoding object either as a member variable or as needed inside a member function. Since they don't have any member variables, the cost is negligible.
This is probably workable. Do you envision something like the following?

  my_string.encode(source, utf8());

It would have the benefit of making the interface for ct_strings and rt_strings the same. For ct_strings, it would specialize on the type of the encoding parameter, and for rt_strings, it would wrap the encoding up in some object to give it virtual dispatch.
- Would it be possible to merge the ct/rt classes into a single type?
This would definitely be possible. Assuming I can make the interface identical, I could just make a special "encoding type" for ct_strings to make them behave like rt_strings do now.
- Maybe encode/decode should be free functions - algorithm like.
You might have something like:
  estring<> s = ...;  // Create an encodeable string with some default encoding (ascii?)
  encode(s, utf8());  // utf8 is a functor object that returns a utf8_encoder object.
I guess if you go this way, the estring class would just contain an encoded string associated with the encoder type. It might be an interesting approach. Still, a good start.
Do you envision the encode algorithm re-encoding the contents of s into a new encoding, or just tagging s with a "utf8" encoding? Perhaps a better verb for "encode" would have been "transcode", since it's responsible for decoding from a source and encoding to a target. "encode" sounds better, though. :)

- Jim

James Porter wrote:
This is probably workable. Do you envision something like the following?
my_string.encode(source,utf8());
It would have the benefit of making the interface for ct_strings and rt_strings the same. For ct_strings, it would specialize on the type of the encoding parameter, and for rt_strings, it would wrap the encoding up in some object to give it virtual dispatch.
I read through this again, and it doesn't actually make sense: ct_strings would never need an encoding specified. Never mind!

- Jim

Over the past few months, I've been tinkering with a Unicode string library. It's still *far* from finished, but it's far enough along that the overall structure is visible. I've seen a bunch of Unicode proposals for Boost come and go, so hopefully this one will address the most common needs people have.
I would love to see a Unicode support library added to Boost. However, I question the usefulness of another string class, or in this case another hierarchy of string classes. Interoperability with std::string (and QString, and CString, and a thousand other API-specific string classes) is always thorny. I'd much rather see an iterators- and algorithms-based approach, along the lines of your ct_string::iterator. Instead of doing this:
baz.encode(bar,rt::utf8);
I'd rather be able to do something like this:

  typedef std::basic_string<some_32bit_char_type> unicode_string;
  unicode_string u_string = /*...*/;
  std::string std_string = /*...*/;

  typedef boost::recoding_iterator<boost::ucs4, boost::utf8> ucs4_to_utf8_iter;
  std::copy(ucs4_to_utf8_iter(u_string.begin()),
            ucs4_to_utf8_iter(u_string.end()),
            std::back_inserter(std_string));

  // or

  typedef boost::recoding_iterator<boost::utf8, boost::ucs4> utf8_to_ucs4_iter;
  std::copy(utf8_to_ucs4_iter(std_string.begin()),
            utf8_to_ucs4_iter(std_string.end()),
            std::back_inserter(u_string));

Having iterators that do the right thing, in terms of stepping over code points or (possibly synthesized) characters as appropriate, in an efficient manner, would provide a toolkit with which anyone could write whatever custom Unicode-aware code they need.

Zach
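For illustration, here is a stripped-down, standalone sketch of the decoding half of such an adaptor: a forward iterator that walks a byte sequence and yields whole UTF-8 code points. This is invented demonstration code, not Boost's recoding_iterator; a real one would also handle the encoding direction, bidirectional traversal, and malformed input.

```cpp
#include <cassert>
#include <cstdint>

// Forward iterator over the code points of a (valid) UTF-8 byte sequence.
class u8_to_u32_iter {
    const char* p_;
public:
    explicit u8_to_u32_iter(const char* p) : p_(p) {}
    uint32_t operator*() const {
        unsigned char b = *p_;
        if (b < 0x80) return b;                               // 1-byte sequence (ASCII)
        if (b < 0xE0) return ((b & 0x1Fu) << 6)               // 2-byte sequence
                           |  (p_[1] & 0x3Fu);
        if (b < 0xF0) return ((b & 0x0Fu) << 12)              // 3-byte sequence
                           | ((p_[1] & 0x3Fu) << 6)
                           |  (p_[2] & 0x3Fu);
        return ((b & 0x07u) << 18)                            // 4-byte sequence
             | ((p_[1] & 0x3Fu) << 12)
             | ((p_[2] & 0x3Fu) << 6)
             |  (p_[3] & 0x3Fu);
    }
    u8_to_u32_iter& operator++() {
        unsigned char b = *p_;
        p_ += b < 0x80 ? 1 : b < 0xE0 ? 2 : b < 0xF0 ? 3 : 4; // skip the whole sequence
        return *this;
    }
    bool operator!=(const u8_to_u32_iter& o) const { return p_ != o.p_; }
};
```

The point is the shape: once something like this exists, std::find, std::copy, and friends operate on code points rather than bytes, with no string class in sight.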

Zach Laine wrote:
I would love to see a Unicode support library added to Boost. However, I question the usefulness of another string class, or in this case another hierarchy of string classes. Interoperability with std::string (and QString, and CString, and a thousand other API-specific string classes) is always thorny. I'd much rather see an iterators- and algorithms-based approach, along the lines of your ct_string::iterator.
It might get equally thorny just trying to get the algorithms to recognize all the strange varieties of strings out there without writing iterator facades for the lot of them! It's probably possible, but I'm not sure I'd want it to be the primary interface for encoding.

Most custom string types (both QString and CString, for instance) are designed to work with only one encoding (UTF-16 seems popular), so if you had some reason that you needed to store your strings in UTF-8, or - god forbid - Shift-JIS, you'd be out of luck. This is especially important when you're reading in arbitrary data whose encoding you don't know at compile-time. If someone sends me a message encoded in Shift-JIS and I want to forward it on, I don't want to have to decode it into UTF-8 and then re-encode it into Shift-JIS before I send it; I just want to store it in Shift-JIS.
Instead of doing this:
baz.encode(bar,rt::utf8);
I'd rather be able to do something like this:
typedef std::basic_string<some_32bit_char_type> unicode_string;
unicode_string u_string = /*...*/; std::string std_string = /*...*/;
  typedef boost::recoding_iterator<boost::ucs4, boost::utf8> ucs4_to_utf8_iter;
  std::copy(ucs4_to_utf8_iter(u_string.begin()),
            ucs4_to_utf8_iter(u_string.end()),
            std::back_inserter(std_string));
std::strings aren't really appropriate for this purpose, at least not without a lot of changes to their interface, since they're designed for compile-time-tagged, fixed-width-encoding strings. In your examples, you have to remember what the source encoding is. This is easy enough if you know that "all my strings are in UTF-8", but if you start working with runtime-tagged strings (see my Shift-JIS example above), you'd need to keep track of every encoding in use.

- Jim

Zach Laine wrote:
Over the past few months, I've been tinkering with a Unicode string library. It's still *far* from finished, but it's far enough along that the overall structure is visible. I've seen a bunch of Unicode proposals for Boost come and go, so hopefully this one will address the most common needs people have.
I would love to see a Unicode support library added to Boost. However, I question the usefulness of another string class, or in this case another hierarchy of string classes. Interoperability with std::string (and QString, and CString, and a thousand other API-specific string classes) is always thorny. I'd much rather see an iterators- and algorithms-based approach <snip>
Agree. Thanks Zach. I'm discouraged that every time the issue of a Unicode library comes up, the discussion immediately descends into a debate about how to design yet another string class. Such a high level wrapper *might* be useful (strong emphasis on "might"), but the core must be the Unicode algorithms, and the design for a Unicode library must start there.

-- Eric Niebler
BoostPro Computing
http://www.boostpro.com

Eric Niebler wrote:
Agree. Thanks Zach. I'm discouraged that every time the issue of a Unicode library comes up, the discussion immediately descends into a debate about how to design yet another string class. Such a high level wrapper *might* be useful (strong emphasis on "might"), but the core must be the Unicode algorithms, and the design for a Unicode library must start there.
Since it seems like there's a lot of concern with making a new string type, how about the following (off-the-cuff):

* Iterator filters a la Zach's message:

  typedef std::basic_string<char16_t> utf16_string;
  utf16_string u_string = /*...*/;
  std::string std_string = /*...*/;

  typedef boost::recoding_iterator<boost::utf16, boost::utf8> utf16_to_utf8_iter;
  std::copy(utf16_to_utf8_iter(u_string.begin()),
            utf16_to_utf8_iter(u_string.end()),
            std::back_inserter(std_string));

* Runtime-defined filters:

  typedef boost::recoding_iterator<boost::utf16, boost::runtime> utf16_to_any_iter;
  boost::runtime *my_codec = /*...*/;
  std::copy(utf16_to_any_iter(u_string.begin(), my_codec),
            utf16_to_any_iter(u_string.end(), my_codec),
            std::back_inserter(std_string));

* Shorthand for the above two points:

  boost::transcode(u_string, boost::utf16(), std_string, boost::utf8());

* String views that can wrap up the encoding type and the data (a container of some kind: strings, vector<char>s, ropes, etc.):

  boost::estring_view<utf8> my_utf8_string(std_string);
  boost::estring_view<> my_rt_string(str, my_codec);

  boost::transcode(my_utf8_string, my_rt_string);

Luckily, most of the work I've done is in making the encoding facets extensible and choosable at runtime, so I wouldn't mourn the loss of my (frankly none-too-zazzy) string class.

- Jim
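To make the transcode() shorthand concrete, here is a rough standalone sketch of how it could be put together, with assignment (not append) semantics. The codec names and their decode/encode interfaces are invented stand-ins, not anyone's proposed API; the UTF-8 encoder handles the BMP only.

```cpp
#include <cstdint>
#include <string>

// Invented single-byte codec: each byte is one code point.
struct latin1_codec {
    static uint32_t decode(const char*& it) { return static_cast<unsigned char>(*it++); }
    static void encode(uint32_t cp, std::string& out) { out.push_back(static_cast<char>(cp)); }
};

// Invented UTF-8 codec (encode side only; BMP code points only in this sketch).
struct utf8_codec {
    static uint32_t decode(const char*& it); // decode direction omitted for brevity
    static void encode(uint32_t cp, std::string& out) {
        if (cp < 0x80) {
            out.push_back(static_cast<char>(cp));
        } else if (cp < 0x800) {
            out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else {
            out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
    }
};

// transcode(src, From(), dst, To()): decode each code point from src with
// From, re-encode it with To, and *assign* the result to dst.
template <typename From, typename To>
void transcode(const std::string& src, From, std::string& dst, To) {
    std::string result;
    const char* it = src.data();
    const char* end = src.data() + src.size();
    while (it != end)
        To::encode(From::decode(it), result);
    dst.swap(result); // assignment semantics: previous contents replaced
}
```

Whether the real thing should assign or append is an open question in this thread; the sketch picks assignment to match std::string's operator=.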

Eric Niebler wrote:
Agree. Thanks Zach. I'm discouraged that every time the issue of a Unicode library comes up, the discussion immediately descends into a debate about how to design yet another string class. Such a high level wrapper *might* be useful (strong emphasis on "might"), but the core must be the Unicode algorithms, and the design for a Unicode library must start there.
Since it seems like there's a lot of concern with making a new string type, how about the following (off-the-cuff):
* Iterator filters a la Zach's message:
[snip]
* Runtime-defined filters:
  typedef boost::recoding_iterator<boost::utf16, boost::runtime> utf16_to_any_iter;
  boost::runtime *my_codec = /*...*/;
  std::copy(utf16_to_any_iter(u_string.begin(), my_codec),
            utf16_to_any_iter(u_string.end(), my_codec),
            std::back_inserter(std_string));
Yes, that's what I was thinking as well. In fact, if you look at the Boost.GIL any_image<> and any_image_view<> templates, you'll see that they allow the user to specify a limited number of variants (a la Boost.Variant). So it's more restrictive than a Boost.Any, but that might be an advantage if it allows you to detect more errors at runtime. I think that in most use cases, one will have knowledge of the maximum number of encodings that are possible. Just something to consider.
* Shorthand for the above two points:
boost::transcode(u_string, boost::utf16(), std_string, boost::utf8());
Looks good, but is this function an assignment, or an append?
* String views that can wrap up the encoding type and the data (a container of some kind: strings, vector<char>s, ropes, etc):
  boost::estring_view<utf8> my_utf8_string(std_string);
  boost::estring_view<> my_rt_string(str, my_codec);
boost::transcode(my_utf8_string, my_rt_string);
Yes. Views are notably absent in my original post. I think views are essential for encodings that are variable in length (e.g. UTF-8). Getting the character-location of code point N, or vice versa, and doing it efficiently, is a must-have.
Luckily, most of the work I've done is in making the encoding facets extensible and chooseable at runtime, so I wouldn't mourn the loss of my (frankly none-too-zazzy) string class.
This is just what I was hoping. The bulk of the work you'll do in any case will probably be with the algorithms and number of supported encodings. Zach

Zach Laine wrote:
Yes, that's what I was thinking as well. In fact, if you look at the Boost.GIL any_image<> and any_image_view<> templates, you'll see that they allow the user to specify a limited number of variants (a la Boost.Variant).
I'll have to take a look at how GIL handles this; it would make sense to use a similar interface, since I assume GIL's any_image<> allows you to do things with an image represented in any file format that GIL knows about.
Looks good, but is this function an assignment, or an append?
In my mind, it's an assignment, so I guess it's not exactly the same as std::copy, but all that can be adjusted as we develop the library.
Yes. Views are notably absent in my original post. I think views are essential for encodings that are variable in length (e.g. UTF-8). Getting the character-location of code point N, or vice versa, and doing it efficiently, is a must-have.
Views have the additional benefit of providing a unified interface no matter what base string type you use (provided the base type behaves somewhat like std::string, i.e. has begin() and end() and maybe some other requirements). And it lets us neatly sidestep the issue of "what's the best data structure for storing arbitrarily-encoded text?". :)

There are obviously a whole host of specific issues that we should address, especially regarding optimizations (e.g. ASCII -> UTF-8 can be a memcpy instead of a transcoding) and validation, but it seems like we have some consensus on what the interface should look like for day-to-day use, which is farther than a lot of Boost.Unicode attempts have gotten.

- Jim

Eric Niebler wrote:
Zach Laine wrote:
Over the past few months, I've been tinkering with a Unicode string library. It's still *far* from finished, but it's far enough along that the overall structure is visible. I've seen a bunch of Unicode proposals for Boost come and go, so hopefully this one will address the most common needs people have.
I would love to see a Unicode support library added to Boost. However, I question the usefulness of another string class, or in this case another hierarchy of string classes. Interoperability with std::string (and QString, and CString, and a thousand other API-specific string classes) is always thorny. I'd much rather see an iterators- and algorithms-based approach <snip>
Agree. Thanks Zach. I'm discouraged that every time the issue of a Unicode library comes up, the discussion immediately descends into a debate about how to design yet another string class. Such a high level wrapper *might* be useful (strong emphasis on "might"), but the core must be the Unicode algorithms, and the design for a Unicode library must start there.
I mostly agree. If people want UTF-8 and UTF-16 iterator-adaptors that will efficiently convert byte-sequence iterators into Unicode character iterators, then I probably already have exactly that. Should I package it up for review?

There are, however, a few points to consider. Most importantly, if you operate on a UTF-8 string only using an iterator-adaptor then you'll miss out on most of the clever features of the encoding. Specifically:

- If you need to search for an ASCII character in a UTF-8 string then you can do so just by scanning the bytes.
- Similarly, searching for substrings (including substrings with non-ASCII characters) can be done just by scanning for a bytewise match.
- Sorting can be done using strcmp()-like comparisons on the byte sequences.

An implementation that doesn't somehow exploit these optimisations will perform sub-optimally, and I don't think that would be acceptable.

I don't really have a complete solution to offer. What I do have is the beginnings of a character-set traits class with booleans indicating things like "is an ASCII superset", "is variable-length", etc. The idea is that algorithms could be specialised based on these traits. I'm not sure how it all joins together yet, though.

Cheers, Phil.
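The byte-level properties Phil lists can be demonstrated directly with std::string and strcmp. A small standalone check (this is illustrative code, not from Phil's library): valid UTF-8 never embeds an ASCII byte inside a multi-byte sequence, and lead bytes and continuation bytes occupy disjoint ranges, so byte-wise operations give code-point-correct answers.

```cpp
#include <cassert>
#include <cstring>
#include <string>

inline void utf8_byte_tricks() {
    // "Grüße" as UTF-8 bytes: 'G' 'r' <C3 BC> <C3 9F> 'e'
    const std::string s = "Gr\xC3\xBC\xC3\x9F" "e";

    // 1) Searching for an ASCII character is a plain byte scan.
    assert(s.find('e') == 6);

    // 2) Substring search, including non-ASCII substrings, is a byte-wise match.
    assert(s.find("\xC3\x9F") == 4); // locates the 'ß'

    // 3) Byte-wise comparison sorts in code-point order, because UTF-8
    //    preserves lexicographic ordering (U+00E9 sorts before U+20AC).
    assert(std::strcmp("\xC3\xA9", "\xE2\x82\xAC") < 0);
}
```

None of these shortcuts hold for UTF-16, which is one reason traits like "is an ASCII superset" matter for specialising the algorithms.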

Eric Niebler wrote:
Zach Laine wrote:
Over the past few months, I've been tinkering with a Unicode string library. It's still *far* from finished, but it's far enough along that the overall structure is visible. I've seen a bunch of Unicode proposals for Boost come and go, so hopefully this one will address the most common needs people have.
I would love to see a Unicode support library added to Boost. However, I question the usefulness of another string class, or in this case another hierarchy of string classes. Interoperability with std::string (and QString, and CString, and a thousand other API-specific string classes) is always thorny. I'd much rather see an iterators- and algorithms-based approach
<snip>
Agree. Thanks Zach. I'm discouraged that every time the issue of a Unicode library comes up, the discussion immediately descends into a debate about how to design yet another string class. Such a high level wrapper *might* be useful (strong emphasis on "might"), but the core must be the Unicode algorithms, and the design for a Unicode library must start there.
I mostly agree. If people want UTF-8 and UTF-16 iterator-adaptors that will efficiently convert byte-sequence iterators into unicode character iterators, then I probably already have exactly that. Should I package it up for review?
Perhaps joining forces with Jim Porter would produce more interesting results than either of you would produce in isolation.
There are, however, a few points to consider. Most importantly, if you operate on a UTF-8 string only using an iterator-adaptor then you'll miss out on most of the clever features of the encoding. Specifically:
- If you need to search for an ASCII character in a UTF-8 string then you can do so just by scanning the bytes.
- Similarly, searching for substrings (including substrings with non-ASCII characters) can be done just by scanning for a bytewise match.
- Sorting can be done using strcmp()-like comparisons on the byte sequences.
An implementation that doesn't somehow exploit these optimisations will perform sub-optimally, and I don't think that would be acceptable.
These are all good and worthy things to have in the UTF-8 portion of a Unicode library. Note that they all describe algorithms. My suggestion is for a library based on iterators and algorithms instead of based on a hierarchy of string classes. Zach

on Thu Nov 20 2008, "Phil Endecott" <spam_from_boost_dev-AT-chezphil.org> wrote:
I mostly agree. If people want UTF-8 and UTF-16 iterator-adaptors that will efficiently convert byte-sequence iterators into unicode character iterators, then I probably already have exactly that. Should I package it up for review?
Did you look at the stuff in boost/regex/pending/unicode_iterator.hpp ?

-- Dave Abrahams
BoostPro Computing
http://www.boostpro.com

David Abrahams wrote:
on Thu Nov 20 2008, "Phil Endecott" <spam_from_boost_dev-AT-chezphil.org> wrote:
I mostly agree. If people want UTF-8 and UTF-16 iterator-adaptors that will efficiently convert byte-sequence iterators into unicode character iterators, then I probably already have exactly that. Should I package it up for review?
Did you look at the stuff in boost/regex/pending/unicode_iterator.hpp ?
I did look at that a while ago. I've had another look now and it seems very complicated; mine is simpler and, I believe, consequently faster. Perhaps it has some features that I'm missing, resulting in the complexity.

Anyway, it is basically trying to do the same thing.

Phil.

Phil Endecott wrote:
David Abrahams wrote:
on Thu Nov 20 2008, "Phil Endecott" <spam_from_boost_dev-AT-chezphil.org> wrote:
I mostly agree. If people want UTF-8 and UTF-16 iterator-adaptors that will efficiently convert byte-sequence iterators into unicode character iterators, then I probably already have exactly that. Should I package it up for review?
Did you look at the stuff in boost/regex/pending/unicode_iterator.hpp ?
I did look at that a while ago. I've had another look now and it seems very complicated; mine is simpler and I believe consequently faster. Perhaps it has some features that I'm missing, resulting in the complexity.
Anyway, it is basically trying to do the same thing.
I'd love to see some benchmarks.

Regards,

-- Joel de Guzman
http://www.boostpro.com
http://spirit.sf.net

James Porter wrote:
Over the past few months, I've been tinkering with a Unicode string library. It's still *far* from finished, but it's far enough along that the overall structure is visible. I've seen a bunch of Unicode proposals for Boost come and go, so hopefully this one will address the most common needs people have.
Hi Jim,

Mine was probably one of those proposals that you looked at; for the record, the code is all available at

  http://svn.chezphil.org/libpbe/trunk/include/charset/

and nearby directories.

I was reasonably happy with my implementations of the most common character sets (i.e. Unicode, ASCII, ISO-8859), but I wanted to explore some of the more esoteric ones to understand the implications that they would have on how a general-purpose framework should work. For example, I wanted to explore how error handling policies could be specified and what conditions they would need to handle.

The last work that I did with this code was a general-purpose command-line conversion utility that could be used to benchmark the conversions. Input and output character sets and error policies could be set from the command line, but the problem that I hit was that making these things template parameters led to a code-size and compilation-time explosion. That means that I'll need to rethink a few things, but it has been low on my to-do list.
The library is based on two (immutable) string types: ct_string and rt_string. ct_strings are _C_ompile _T_ime tagged with a particular encoding, and rt_strings are _R_un _T_ime tagged with an encoding.
Mutable vs. immutable strings is something that has been briefly discussed before. My personal preference has been for mutable strings, but without the O(1) random access guarantee of a std::string. I also considered strings where the only mutation allowed is appending, i.e. there's a back_insert_iterator. Why do you prefer immutable strings? One argument for mutable strings is simply that std::string is mutable, and that a proposal is more likely to prove popular if it changes less w.r.t. existing practice.

I also have run-time and compile-time tagging. My feeling now is that compile-time tagging is the more important case. Data whose encoding is known only at run-time can be handled using a more ad-hoc method if necessary.

I also struggled to find good names for these things; I don't find ct_string and rt_string great. Do any readers have suggestions?
This is to allow for faster conversion when the encoding is known at compile-time, but to allow for conversion at run-time (useful for reading XML!).
General usage would look something like this:
ct_string<ct::utf8> foo("Hello, world!");
typedef ct_string<ct::utf8> utf8string;
ct_string<ct::utf16> bar; bar.encode(foo);
Well, it's actually decoding the utf8 and encoding the utf16. Maybe "transcode", and preferably as a free function:

  transcode(bar, foo);

equivalent to:

  std::copy(foo.begin(), foo.end(), std::back_inserter(bar));
rt_string baz; baz.encode(bar,rt::utf8);
So the encoding of the rt_string is not stored in the string?
Note the use of ct::utf8 and rt::utf8. As you might expect from the syntax, ct::utf8 is a type, and rt::utf8 is an object. Broadly speaking, to create an encoding, you create a class with read and write methods, and then you create an instance of an rt_encoding<MyEncoding>. Most of this is laid out in the comments of my code, so I won't go into too much detail here.
I'll try to find time to have a look, but I do encourage you to post more details to the list. That tends to generate more discussion than "please look at the code" proposals do.
There's still a lot missing from the code (most notably, dynamically-sized strings and string concatenation),
So what is your underlying implementation? Not std::string?
but here's a rundown of what *is* present:
* Compile-time and run-time tagged strings * Re-encoding of strings based on compile-/run-time tags * Uses simple memory copying when source and dest encodings are the same * Forward iterators to step through code points in strings
If you'd like to take a look at the code, it's available here: http://www.teamboxel.com/misc/unicode.tar.gz . I've tested it in gcc 4.3.2 and MSVC8, but most modern compilers should be able to handle it. Comments and criticisms are, of course, welcome.
One of my priorities has been performance; it would be good to compare e.g. utf8-to/from-utf16 conversion speed.

My feeling about the way forward is as follows:

- A complete character set library is a lot of work.
- A library that only understands Unicode is less work, but is it what people need?
- Is there a consensus about mutable vs. immutable strings? Perhaps we should start by defining a new string concept, removing the character-set-unfriendly aspects of std::string like indexing using integers, and see what people think of it. I have been trying to use only std::algorithms and iterators with strings in new code, but it can often be simpler to use indexes and the std::string members that use or return them.
- It would be useful to factor out the actual Unicode bit-bashing operations. I have implementations of them that I have carefully tuned, and they are ready for wider use even though the rest of my code isn't.

Regards, Phil.
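As one example of the "bit-bashing" primitives worth factoring out, here is a minimal sketch (illustrative code, not Phil's tuned implementation) of the UTF-16 encode step: a code point becomes one code unit, or a surrogate pair when it lies above the BMP.

```cpp
#include <cstdint>
#include <vector>

// Encode one code point as UTF-16 code units appended to out.
// Assumes cp is a valid scalar value (no error handling in this sketch).
inline void utf16_encode(uint32_t cp, std::vector<uint16_t>& out) {
    if (cp < 0x10000) {
        out.push_back(static_cast<uint16_t>(cp));                    // BMP: one code unit
    } else {
        cp -= 0x10000;                                               // 20 bits remain
        out.push_back(static_cast<uint16_t>(0xD800 | (cp >> 10)));   // high surrogate
        out.push_back(static_cast<uint16_t>(0xDC00 | (cp & 0x3FF))); // low surrogate
    }
}
```

Primitives at this granularity are exactly what both an iterator adaptor and a bulk-conversion routine would share, so tuning them once benefits every interface layered on top.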

Phil Endecott wrote:
James Porter wrote:
Over the past few months, I've been tinkering with a Unicode string library. It's still *far* from finished, but it's far enough along that the overall structure is visible. I've seen a bunch of Unicode proposals for Boost come and go, so hopefully this one will address the most common needs people have.
Hi Jim,
Mine was probably one of those proposals that you looked at; for the record the code is all available at
http://svn.chezphil.org/libpbe/trunk/include/charset/
and nearby directories.

While we're throwing out Unicode libraries, I have my own tagged strings. They're quite tied up with other code around them, so I won't post them at this time, but the basic concepts work like this:

1) There are two string templates, templated on the character set. One is a replace-only, reference-counted string (i.e. you can do s = other, but it's otherwise immutable) with shared storage even for substrings (no zero-termination guarantee). The other is a non-shared, mutable string with the guarantee (and small-string optimization, but that's a detail).

2) They implement the same basic concept for non-mutating access: bidirectional iterators, a data() function, a unit_count() (the length in encoding base units) and a char_count() (the number of codepoints), substr() (with iterator arguments), and a free compare() with strcmp interface, as well as relational operators. Assignment and explicit construction from a string with another encoding is possible; transcoding is done automatically. The mutable string additionally has a replace() with an iterator range to replace and various ways of specifying the string to insert, as well as a lot of functions that can be built upon this (insert(), append(), erase()). Because the positions are always specified with iterators, it's not possible to split a multi-unit codepoint.

3) The error handling policy is a runtime parameter. Three policies are defined, not extensible: skip bad characters, replace bad characters with an encoding-specific replacement character, or throw an exception. These are enforced on storing the data - it's not possible to have a string object holding invalid data.

The low-level machinery is based on overloading of free functions that get a tag parameter for the character set passed, to define the characteristics of the character set. Built on that is an iterator adapter that uses this interface to adapt any bidirectional range whose value type is convertible to the encoding base type to a character sequence.
These string classes are actually part of an I/O framework, and the intended usage is that character sets are converted upon I/O, so that the internally used character set is always statically known. Thus, there is no way to get a string with a runtime-determined encoding.

Some examples from the test cases:

  BOOST_AUTO_TEST_CASE( roundtrip )
  {
    fast_substring<native_narrow_encoding> start = string_literal("Grüße, W€lt!");
    xstring<utf8> as_utf8(start);
    fast_substring<iso_8859_15> as_iso_8859_15(as_utf8);
    fast_substring<utf16le> as_utf16le(as_iso_8859_15);
    xstring<utf16> as_utf16(as_utf16le);
    xstring<windows_1252> as_windows_1252(as_utf16);
    fast_substring<utf16be> as_utf16be(as_windows_1252);
    fast_substring<native_narrow_encoding> finish(as_utf16be);
    BOOST_CHECK(start == finish);
  }

  BOOST_AUTO_TEST_CASE( string_literal )
  {
    str_t s(str::string_literal("Gr\u00FC\u00DFe, Welt!"));
    BOOST_CHECK_EQUAL(s.unit_count(), 14u);
    BOOST_CHECK_EQUAL(s.char_count(), 12u);
  }

I've implemented UTF-8, UTF-16 based on uint16_t, UTF-16 big and little endian based on bytes, and UTF-32 based on uint32_t. I've also implemented ISO-8859-1, ISO-8859-15 and Windows-1252.

If you're interested, I can try extracting the code, or post the whole thing.

Sebastian
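For what it's worth, a char_count() like the one in Sebastian's string_literal test can be computed for UTF-8 without fully decoding: every code point contributes exactly one byte that is not a continuation byte (continuation bytes look like 10xxxxxx). A standalone sketch, not Sebastian's code:

```cpp
#include <cstddef>
#include <string>

// Count UTF-8 code points by counting lead/ASCII bytes, i.e. every byte
// whose top two bits are not 10.
inline std::size_t utf8_char_count(const std::string& s) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++i)
        if ((static_cast<unsigned char>(s[i]) & 0xC0) != 0x80)
            ++n; // this byte starts a new code point
    return n;
}
```

On "Grüße, Welt!" this gives 12 code points in 14 bytes, matching the unit_count()/char_count() figures in the test case above.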

Sebastian Redl wrote:
James Porter wrote:
Over the past few months, I've been tinkering with a Unicode string library. It's still *far* from finished, but it's far enough along that the overall structure is visible. I've seen a bunch of Unicode proposals for Boost come and go, so hopefully this one will address the most common needs people have.
Hi Jim,
Mine was probably one of those proposals that you looked at; for the record the code is all available at
http://svn.chezphil.org/libpbe/trunk/include/charset/
and nearby directories. While we're throwing out Unicode libraries, I have my own tagged strings. They're quite tied up with other code around them, so I won't post them at this time, but the basic concepts work like this:
My attempt at Unicode string handling can be seen at:

  svn://svn.felspar.com/public/fost-base/stable/

(Username is 'guest', password is blank.)

What I did was a bit different again. I'm not concerned about non-Unicode encodings, as I don't deal with random text files -- they're either created by our code, or they're small config files where we can say "just use UTF-8". Generally we're reading from databases which are Unicode, or writing web pages where we can use Unicode without problem.

What I did was to wrap the std::string/wstring class, using std::wstring on Windows and std::string on Linux -- so on Linux everything stays as UTF-8 and on Windows it is UTF-16. In the wrapper I throw away the mutable iterators and the mutating operator[]. Iterators and operator[] dereference to a UTF-32 code point. For the most part the interface follows that of std::basic_string<>, and we've added std_str() to get at the underlying std::string if you need it.

In the source code, on all platforms, we assume L"wide literal"s are UTF-16 encoded, even on platforms where wchar_t is 32 bits. This seems to be a bit more convenient, and means that when the new Unicode character literals get supported we can easily change.

I've been using this style of implementation for more than 5 years (the code linked is new, as the Linux handling is new), since the MSVC 6 days when we had to wrap Microsoft's std::wstring because it used a non-thread-safe COW implementation. It seems to work pretty well. I've been using it for so long that I can't tell whether it's intuitive or not -- I'm just used to it.

The linear scans from the start of the string to calculate string boundaries (given that offsets are always in UTF-32 code points) do introduce some performance penalty. On the previous Windows implementation we cached the UTF-16 size and the UTF-32 size -- if they're the same (as they are nearly all the time) then you know you can safely use an offset without decoding everything.
We don't have the experience on Linux, where the underlying string is UTF-8, to know how this would work out there. What I'm now thinking is that it would probably be worth using a "rope" implementation built on top of this underlying string, which would also let us address some other use cases and could be faster.

Kirit
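On the Linux side of the scheme Kirit describes, the bytes stay UTF-8 and reads are exposed as UTF-32 code points, which is also where the linear-scan cost he mentions comes from. A minimal sketch of that decoding (function names are illustrative, not from his library, and well-formed UTF-8 is assumed, with no error handling):

```cpp
// Keep the storage as UTF-8 bytes in a std::string; decode to UTF-32
// code points on read. char_count() is the O(n) scan that addressing
// by code point offset requires.
#include <cassert>
#include <cstdint>
#include <cstddef>
#include <string>

// Decode one code point starting at byte position i; advances i past it.
inline std::uint32_t next_codepoint(const std::string& s, std::size_t& i) {
    unsigned char b = s[i++];
    if (b < 0x80) return b;                          // ASCII fast path
    int extra = (b >= 0xF0) ? 3 : (b >= 0xE0) ? 2 : 1;
    std::uint32_t cp = b & (0x3F >> extra);          // lead-byte payload bits
    while (extra--) cp = (cp << 6) | (s[i++] & 0x3F);
    return cp;
}

// Number of code points in the string: a full linear scan.
inline std::size_t char_count(const std::string& s) {
    std::size_t i = 0, n = 0;
    while (i < s.size()) { next_codepoint(s, i); ++n; }
    return n;
}
```

The UTF-16-size/UTF-32-size caching trick mentioned above short-circuits exactly this scan: when the two sizes match, every code point is one unit and offsets can be used directly.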

I am pretty interested in a C++ Unicode library, since it could be used as a foundation for other useful libraries: XML, Spirit, and every text-related tool. I agree with you that there should be a compatibility layer over other string classes (e.g. QString): iterators, algorithms, or an adapter class. However, I think that Boost should provide either a string class or a guideline about internationalization, since it is not always obvious how to manage Unicode. For example, Boost.XML (in the sandbox) is parameterized on the String type, but in the README the author states that this mechanism needs more work. I would also like to know how you plan to manage I/O.

Manuel Fiorelli

Phil Endecott wrote:
Mutable vs. immutable strings is something that has been briefly discussed before. My personal preference has been for mutable strings, but without the O(1) random access guarantee of a std::string. I also considered strings where the only mutation allowed is appending, i.e. there's a back_insert_iterator. Why do you prefer immutable strings?
I don't have any problem with appendable strings, but mutating mid-string obviously has a potentially heavy performance hit. You could allow it and just let the user decide whether to pay the performance cost, but I'm not sure I'm satisfied with that, so I left the strings immutable for now. Mutability also raises questions of whether to allow mutation through code point iterators, raw iterators, or both, and whether you should go even further in mimicking std::string and allow random access (whether that would be raw access or pseudo-"random access" of code points, I don't know).
I also have run-time and compile-time tagging. My feeling now is that compile-time-tagging is the more important case. Data whose encoding is known only at run-time can be handled using a more ad-hoc method if necessary. I also struggled to find good names for these things; I don't find ct_string and rt_string great. Do any readers have suggestions?
In this thread, Andrew Sutton suggested using a single string type, like this:

  template<typename EncodingT = runtime_tag>
  class estring { /* ... */ };

  estring<> my_runtime_string;
  estring<utf8> my_compiletime_string;

I'm still not sure about "estring", but at least it halves the number of names we need to come up with!
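One way Andrew Sutton's single-template suggestion could be fleshed out is a specialization on runtime_tag that stores the encoding as a data member rather than in the type. The estring and runtime_tag names come from the thread; the encoding enum and the rest of the interface are invented for the sketch:

```cpp
// Sketch: one string template covering both compile-time and run-time
// encoding tags. Everything beyond the names estring/runtime_tag/utf8 is
// illustrative, not part of any proposal in this thread.
#include <cassert>
#include <string>
#include <utility>

enum class encoding { utf8, utf16 };  // hypothetical runtime encoding ids

struct utf8  { static constexpr encoding id = encoding::utf8;  };
struct utf16 { static constexpr encoding id = encoding::utf16; };
struct runtime_tag {};

// Primary template: the encoding is part of the type, known at compile time.
template <typename EncodingT = runtime_tag>
class estring {
public:
    explicit estring(std::string bytes) : bytes_(std::move(bytes)) {}
    encoding enc() const { return EncodingT::id; }  // resolved statically
private:
    std::string bytes_;
};

// Specialization: the encoding travels with the object at run time.
template <>
class estring<runtime_tag> {
public:
    estring(std::string bytes, encoding e)
        : bytes_(std::move(bytes)), enc_(e) {}
    encoding enc() const { return enc_; }
private:
    std::string bytes_;
    encoding enc_;
};
```

The attraction is that generic code can be written once against estring<E>, while only the runtime_tag specialization pays for storing and checking the tag.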
Well it's actually decoding the utf16 and encoding the utf8. Maybe "transcode", and preferably as a free function:
transcode(bar,foo);
Fair point. I'm not quite sure of how I'd want it to work as a free function yet, especially with regards to runtime-tagged strings.
rt_string baz;
baz.encode(bar, rt::utf8);
So the encoding of the rt_string is not stored in the string?
It does store the encoding, but the call to "encode" provides it with a *new* encoding type (utf8 in this case).
I'll try to find time to have a look, but I do encourage you to post more details to the list. That tends to generate more discussion than "please look at the code" proposals do.
Fair enough. I didn't want to inundate people with pages of text about implementation details, so I tried to stay fairly high-level. I'll provide more details as soon as possible, but I thought it best to start with a brief overview to make sure I didn't make everyone's eyes glaze over! :)
So what is your underlying implementation? Not std::string?
Right now, it's just a static char array, for ease of implementation. Obviously this will change, but I was more focused on designing an interface that allowed compile-time and runtime determined transcoding of strings.
- A complete character set library is a lot of work.
- A library that only understands Unicode is less work, but is it what people need?
I tried to address both of these issues by making it easy to extend character encodings with whatever obscure encodings you need. I probably wouldn't write an EBCDIC facet, but I'd certainly want people to be able to roll their own if they need it.
- Is there a consensus about mutable vs. immutable strings? Perhaps we should start by defining a new string concept, removing the character-set-unfriendly aspects of std::string like indexing using integers, and see what people think of it. I have been trying to use only std::algorithms and iterators with strings in new code, but it can often be simpler to use indexes and the std::string members that use or return them.
We should definitely take a look at std::string and try to extract the essentially string-y components of it. std::string makes an awful lot of assumptions about what's *in* a string, so it would be good to remove all the unnecessary bits. In an alternate universe, it may have been better to have std::string and encoded_string (or whatever it should be called) act as views onto a collection of bytes/words. Of course, encoded_string could act as a view onto std::string (and/or QString, CString, MySpecialString), though I'm not convinced that's a good solution at all!
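The "view onto existing storage" idea floated above could be sketched as a non-owning, read-only, encoding-tagged view. The name encoded_view and its entire interface are hypothetical, just one possible reading of the suggestion:

```cpp
// Hypothetical sketch: an encoding-aware, immutable view over bytes owned
// elsewhere (a std::string here). Nothing in the thread defines this type;
// it only illustrates the "string class as a view" design direction.
#include <cassert>
#include <cstddef>
#include <string>

struct utf8 {};  // compile-time encoding tag, for illustration only

template <typename Encoding>
class encoded_view {
public:
    // Non-owning: the std::string must outlive the view.
    explicit encoded_view(const std::string& s)
        : data_(s.data()), size_(s.size()) {}

    // Length in encoding base units; code point iteration would be
    // layered on top using the encoding's decoding machinery.
    std::size_t unit_count() const { return size_; }
    const char* data() const { return data_; }

private:
    const char* data_;
    std::size_t size_;
};
```

The same template could in principle wrap QString or CString storage via extra constructors, which is exactly the part I'm (rightly, I think) suspicious of above.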
- It would be useful to factor out the actual Unicode bit-bashing operations. I have implementations of them that I have carefully tuned, and they are ready for wider use even though the rest of my code isn't.
My code is organized such that each encoding is a class with a read and a write method. My encoding classes don't feature much in the way of error handling (yet), but they do work well with compliant strings. I'll take some time to look at your code and see how it differs from mine. I think one of the problems we'll run into is that everyone has their own very particular ideas of what a Unicode library means, so it'll be extremely hard to please everyone, no matter what the interface ends up looking like.

- Jim
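The "encoding = class with a read and a write method" shape described above, combined with the transcode-as-a-free-function idea from earlier in the thread, might look like the following. The signatures and class names are guesses at the spirit of the design, not the actual API:

```cpp
// Sketch: each encoding is a class with static read/write methods, and a
// generic free-function transcode() is built on nothing but those two.
#include <cassert>
#include <cstdint>
#include <cstddef>
#include <vector>

struct utf32_encoding {
    // read: decode one code point from the unit sequence, advancing pos.
    static std::uint32_t read(const std::vector<std::uint32_t>& units,
                              std::size_t& pos) {
        return units[pos++];  // trivial: one unit per code point
    }
};

struct utf8_encoding {
    // write: encode one code point by appending 1-4 bytes.
    static void write(std::vector<unsigned char>& units, std::uint32_t cp) {
        if (cp < 0x80) {
            units.push_back(static_cast<unsigned char>(cp));
        } else if (cp < 0x800) {
            units.push_back(static_cast<unsigned char>(0xC0 | (cp >> 6)));
            units.push_back(static_cast<unsigned char>(0x80 | (cp & 0x3F)));
        } else if (cp < 0x10000) {
            units.push_back(static_cast<unsigned char>(0xE0 | (cp >> 12)));
            units.push_back(static_cast<unsigned char>(0x80 | ((cp >> 6) & 0x3F)));
            units.push_back(static_cast<unsigned char>(0x80 | (cp & 0x3F)));
        } else {
            units.push_back(static_cast<unsigned char>(0xF0 | (cp >> 18)));
            units.push_back(static_cast<unsigned char>(0x80 | ((cp >> 12) & 0x3F)));
            units.push_back(static_cast<unsigned char>(0x80 | ((cp >> 6) & 0x3F)));
            units.push_back(static_cast<unsigned char>(0x80 | (cp & 0x3F)));
        }
    }
};

// The generic transcoding loop needs only read and write.
template <typename From, typename To, typename Src, typename Dst>
void transcode(const Src& src, Dst& dst) {
    std::size_t pos = 0;
    while (pos < src.size()) To::write(dst, From::read(src, pos));
}
```

This is where the "simple memory copy when source and destination encodings are the same" optimization from the original post would slot in, as a specialization of transcode() for From == To.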

I think it would be short-sighted to expect people to welcome a string class that has different usage semantics than std::string. Users generally pick up a library and expect it to behave like the ones they've used before that solve the same problem, or like those from other languages. C++ is already tough for newcomers because of things like the differences between char* and std::string and between arrays and vector; why add more to the complexity?
participants (11)
- Andrew Sutton
- David Abrahams
- Eric Niebler
- James Porter
- Joel de Guzman
- Kirit Sælensminde
- Manuel Fiorelli
- Phil Endecott
- raindog@macrohmasheen.com
- Sebastian Redl
- Zach Laine