Re: [boost] Strings tagged with their character set

27 Sep 2007

I see what you mean. Still, fixed-width-encoded strings are a lot easier to
code, and I think we should focus on them first just to get something
working and to have a platform to test code conversion on, which in my
opinion is the most important part. Without code conversion, it would be
difficult to read in non-ASCII strings in the first place, since
std::wfstream just converts ASCII to UTF-16.

Variable-width-encoded (VWE) strings should be fairly straightforward when
they are immutable, but will probably get hairy when they can be modified.
Converting a VWE string would probably be no harder than converting a
fixed-width-encoded (FWE) one.
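
To make the "hairy when mutable" point concrete, here is a minimal sketch
using UTF-8 held in a plain std::string. The helper names (utf8_seq_len,
utf8_offset, utf8_replace) are made up for illustration and are not part of
any proposed Boost interface; the code assumes well-formed UTF-8 and does no
validation.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Number of bytes in the UTF-8 sequence that starts with lead byte `b`.
// Assumes `b` is a valid lead byte (not a continuation byte).
inline std::size_t utf8_seq_len(unsigned char b) {
    if (b < 0x80) return 1;            // 0xxxxxxx
    if ((b & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((b & 0xF0) == 0xE0) return 3;  // 1110xxxx
    return 4;                          // 11110xxx
}

// Byte offset of the n-th code point. This is a linear scan, unlike the
// constant-time indexing a fixed-width string provides.
inline std::size_t utf8_offset(const std::string& s, std::size_t n) {
    std::size_t pos = 0;
    while (n-- > 0 && pos < s.size())
        pos += utf8_seq_len(static_cast<unsigned char>(s[pos]));
    return pos;
}

// Replace the n-th code point with `repl` (itself one UTF-8 sequence).
// The string may grow or shrink, so every later byte offset can shift.
inline void utf8_replace(std::string& s, std::size_t n, const std::string& repl) {
    std::size_t pos = utf8_offset(s, n);
    assert(pos < s.size());  // n must refer to an existing code point
    s.replace(pos, utf8_seq_len(static_cast<unsigned char>(s[pos])), repl);
}
```

For example, replacing the 'b' in "abc" with U+00E9 (two bytes in UTF-8)
grows the string from 3 to 4 bytes and moves 'c' from byte offset 2 to 3, so
any iterator or index held past the edit point would need adjusting.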

That said, I think a good (general) roadmap for this project would be:
1) Extend std::basic_string to store UCS-2 / UCS-4 (should be easy, though
string constants may pose a problem)
2) Add code conversion to move between encodings, especially for I/O
3) Create VWE string class (fairly easy if immutable, hard if mutable)
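
As a rough illustration of steps 1 and 2, the sketch below decodes UTF-8
bytes into a fixed-width string with one 32-bit unit per code point. It
assumes std::u32string and char32_t, which only arrived with C++11 and so
postdate this message; the function name utf8_to_ucs4 is hypothetical. It is
not a complete validator: overlong forms, surrogates and values above
U+10FFFF are not rejected.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Decode UTF-8 bytes into one 32-bit unit per code point.
// Sketch only: rejects bad lead bytes, truncated input and bad continuation
// bytes, but does not check for overlong forms or out-of-range values.
inline std::u32string utf8_to_ucs4(const std::string& in) {
    std::u32string out;
    for (std::size_t i = 0; i < in.size();) {
        unsigned char b = static_cast<unsigned char>(in[i]);
        std::size_t len;
        char32_t cp;
        if (b < 0x80)                { len = 1; cp = b; }
        else if ((b & 0xE0) == 0xC0) { len = 2; cp = b & 0x1F; }
        else if ((b & 0xF0) == 0xE0) { len = 3; cp = b & 0x0F; }
        else if ((b & 0xF8) == 0xF0) { len = 4; cp = b & 0x07; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (i + len > in.size())
            throw std::invalid_argument("truncated UTF-8 sequence");

        // Each continuation byte contributes its low 6 bits.
        for (std::size_t k = 1; k < len; ++k) {
            unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        out.push_back(cp);
        i += len;
    }
    return out;
}
```

Once decoded, indexing and in-place modification are as simple as with any
other std::basic_string, which is the appeal of starting with fixed-width
storage; the cost is the conversion at the I/O boundary.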

- James

On 9/27/07, Sebastian Redl <sebastian.redl@getdesigned.at> wrote:
James Porter wrote:
For certain special purposes (like the one above), a variable-width
string class would be useful, but I think we should focus on storing
strings in fixed-width encodings and then converting them appropriately
during I/O.
Actually, I disagree with this. The only general-purpose fixed-width
encoding available is UTF-32, and hardly anyone actually uses it. For
good reason: for English text, it wastes 75% of the used space. In
general, it wastes about 10 bits (30%) in everything, because Unicode
only has about, what, 2^21 code points?
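
The space figures above can be checked with a little arithmetic. The sketch
below is an illustration only (the name wasted_percent is made up); it
assumes ASCII-range text needs 8 bits per character, and uses the fact that
Unicode code points end at U+10FFFF, which fits in 21 bits. By that count
the general case wastes 11 of 32 bits, slightly more than the rough
"10 bits (30%)" quoted.

```cpp
#include <cstddef>

// Share of storage (in percent) carrying no information when text that
// needs `needed_bits` per character is stored in `stored_bits` per character.
constexpr double wasted_percent(std::size_t needed_bits, std::size_t stored_bits) {
    return 100.0 * static_cast<double>(stored_bits - needed_bits)
                 / static_cast<double>(stored_bits);
}

// ASCII-range text: 8 bits suffice (one UTF-8 byte), UTF-32 stores 32.
constexpr double english_waste = wasted_percent(8, 32);   // 75.0

// Any text: 21 bits cover every code point, UTF-32 stores 32.
constexpr double general_waste = wasted_percent(21, 32);  // 34.375
```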
[snip]

I think the problem of UTF-8 and UTF-16 strings is important and must be
addressed.
Sebastian Redl