Strings tagged with their character set

Dear All,

Something that I have been thinking about for a while is storing strings tagged with their character set. Since I now have a practical need for this, I plan to try to implement something. Your feedback would be appreciated.

The starting point is the idea that the character set of a string may be known at compile time or at run time, so two types of tagging are possible. First, compile-time tagging:

  template <character_set> class tagged_string { ... };
  tagged_string<utf8> s1;
  tagged_string<latin1> s2;

Some typedefs would be appropriate:

  typedef tagged_string<utf8> utf8string;

Now run-time tagging:

  class rt_tagged_string {
  private:
    character_set cs;
  public:
    rt_tagged_string(character_set cs_): cs(cs_) ... ...
  };
  rt_tagged_string s3(utf8);

(Concise-yet-clear names for any of these classes would be great.)

I propose to implement conversion between the strings using iconv and/or GNU recode. It would be easy to allow this conversion to happen invisibly, but it might be wiser to make conversion explicit.

I'm not sure what the 'character_set' that I've used above should be. It needs to be some sort of user-extensible enum or type-tag.

We need character types of 8, 16 and 32 bits. wchar_t is not useful here because it's not defined whether it's 16 or 32 bits. So I propose the following, modelled after cstdint:

  typedef char char8_t;
  typedef <implementation-defined> char16_t;
  typedef <implementation-defined> char32_t;

I then propose a character_set_traits class:

  template <character_set> class character_set_traits;
  template <> class character_set_traits<utf8> {
    typedef char8_t char_t;
    static const bool variable_width = true;
    ...
  };

For the fixed-width, compile-time-tagged strings I think it makes sense to inherit from std::basic_string< character_set_traits<charset>::char_t >.
The only problem I can see with this is that

  latin1string s1 = "hello world";
  s1.substr(1,5)   <--- this returns a std::string, not a latin1string

If latin1string has a constructor from std::string (which is its own base type) that's fine, i.e. we can still write:

  latin1string s2 = s1.substr(1,5);

but unfortunately we can also write

  latin2string s3 = s1.substr(1,5);

which is not so good. So a different approach is to define a set of character-set-specific character types, and build string types from them:

  typedef char8_t latin1char;
  typedef char8_t latin2char;

For variable-width character sets, the methods of std::string are less useful (though far from useless). I understand that there's already a UTF-8 iterator somewhere in Boost; can it help? For run-time character sets, is there any way to provide e.g. run-time iterators?

I imagine these strings being used as follows:
- Input to the program is either run-time or compile-time tagged with any character set.
- Data that is not manipulated in any way is just passed through.
- Data that will be processed is first converted to a suitable, compile-time-tagged character set, and if appropriate converted back afterwards.

So the absence of (useful) string operations on run-time-tagged or variable-width character set data is not a problem.

For conversions, there is the question of partial characters in variable-width character sets. If a program is processing data in chunks, it may be legitimate for a chunk boundary to fall in the middle of a UTF-8 character. IIRC, iconv has a mechanism to deal with this which we could expose in a stateful converter:

  charset_converter utf8_to_ucs4(utf8,ucs4);
  while (!eof) {
    utf8string s = get_chunk();
    ucs4string t = utf8_to_ucs4(s);
    send_chunk(t);
  }
  utf8_to_ucs4.flush();

- but many applications may only need a stateless converter.

I will be working on this over the next couple of weeks, so any feedback would be much appreciated.

Regards, Phil.

Instead of defining character types per character set, you could use a specialized char_traits class. It contains state_type, which is used with codecvt from the I/O streams library. The default typedef for char and wchar_t is mbstate_t, which appears in the standard specializations of codecvt. (codecvt is used to perform code conversion between character types; it's used in wfstream to convert a stream of chars on disk to wchar_ts in memory.)

If you change state_type in the char_traits, you'd be able to differentiate the various basic_string types and include information about the character encoding without writing a whole lot of new code.

To be honest, I'm only just beginning to look into this myself, so I'm afraid I don't have a whole lot of information to give you, but I do think this would be the simplest way to handle this part of your project.

- James

Phil Endecott wrote: [snip]
If latin1string has a constructor from std::string (which is its own base type) that's fine, i.e. we can still write:
latin1string s2 = s1.substr(1,5);
but unfortunately we can also write
latin2string s3 = s1.substr(1,5);
which is not so good.
So a different approach is to define a set of character-set-specific character types, and build string types from them:
typedef char8_t latin1char; typedef char8_t latin2char; [/snip]

James Porter wrote:
Instead of defining character types per character set, you could use a specialized char_traits class. It contains state_type, which is used with codecvt from the I/O stream library. The default typedef for char and wchar_t is mbstate_t, which appears in the standard specializations for codecvt. (codecvt is used to perform code conversion between character types; it's used in wfstream to convert a stream of chars on disk to wchar_ts in memory.)
If you change state_type in the char_traits, you'd be able to differentiate the various basic_string types and include information about the character encoding without writing a whole lot of new code.
Thanks for the suggestion. I need to learn some more about this corner of "namespace std", clearly, before I go and re-invent something. Phil.

If you change state_type in the char_traits, you'd be able to differentiate the various basic_string types and include information about the character encoding without writing a whole lot of new code.
Thanks for the suggestion. I need to learn some more about this corner of "namespace std", clearly, before I go and re-invent something.

IIRC, some of the non-const std::basic_string methods aren't suitable for handling variable-width encodings like UTF-8 and UTF-16 - non-const operator[] in particular returns a reference to the character type - a big problem if you want to assign a value > 0x7F (i.e. a character that uses 2 or more bytes).
I've noticed that there are frequent requests/proposals for some sort of boost unicode/string encoding library. I've thought about the problem and it seems too big for one person to handle in their spare time - perhaps a group of us should get together to discuss working on one? I'd be happy to participate.

I would be interested in helping as well. I've been looking for a hobby project to help out with.

-----Original Message----- From: Joseph Gauterin; Sent: Tuesday, September 25, 2007 6:26 AM; To: boost@lists.boost.org; Subject: Re: [boost] Strings tagged with their character set

If you change state_type in the char_traits, you'd be able to differentiate the various basic_string types and include information about the character encoding without writing a whole lot of new code.

Thanks for the suggestion. I need to learn some more about this corner of "namespace std", clearly, before I go and re-invent something. IIRC, some of the non-const std::basic_string methods aren't suitable for handling variable-width encodings like UTF-8 and UTF-16 - non-const operator[] in particular returns a reference to the character type - a big problem if you want to assign a value > 0x7F (i.e. a character that uses 2 or more bytes).

I've noticed that there are frequent requests/proposals for some sort of boost unicode/string encoding library. I've thought about the problem and it seems too big for one person to handle in their spare time - perhaps a group of us should get together to discuss working on one? I'd be happy to participate.

_______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost

Joseph Gauterin wrote:
IIRC, some of the non-const std::basic_string methods aren't suitable for handling variable-width encodings like UTF-8 and UTF-16 - non-const operator[] in particular returns a reference to the character type - a big problem if you want to assign a value > 0x7F (i.e. a character that uses 2 or more bytes).
Yes, very true. One option is to convert to a fixed-size character set before doing anything like operator[], and to not allow strings of variable-width character sets. If you do want to apply operator[] to a UTF8 string, what type should it return? A reference to a range of bytes, somehow? A proxy that encodes/decodes to a UCS4 character? Or, you could say that the iterator is a byte iterator, not a character iterator. Lots of possibilities.
I've noticed that there are frequent requests/proposals for some sort of boost unicode/string encoding library. I've thought about the problem and it seems too big for one person to handle in their spare time
Let me say "part time" rather than "spare time"...
- perhaps a group of us should get together to discuss working on one? I'd be happy to participate.
I would definitely encourage breaking the work up into smaller chunks. IMHO "smaller is better" for Boost libraries; there have been a number of occasions when I've discovered that a feature I want is hidden as an internal component of a Boost library, and I've felt that it should have been a stand-alone public entity. So let's think about how this work can be split up:

- A charset_traits class. I have started on this. The missing piece is a way to look up traits of character sets that are known at run time; input would be appreciated.
- Compile-time and run-time tagged strings. The basics of this are straightforward and done.
- Conversions. My approach at present is to use iconv via a functor that I wrote a while ago. I believe iconv is widely available; however, some implementations may support only a small set of character sets. Alternatives would be interesting.
- Variable-width iterators, including the issue that you raised above.
- Interaction with locales, internationalisation, and system APIs.

and no doubt more. Thinking about the interfaces between these areas and the user would be a good place to start.

Regards, Phil.

Yes, very true. One option is to convert to a fixed-size character set before doing anything like operator[], and to not allow strings of variable-width character sets. If you do want to apply operator[] to a UTF8 string, what type should it return? A reference to a range of bytes, somehow? A proxy that encodes/decodes to a UCS4 character? Or, you could say that the iterator is a byte iterator, not a character iterator. Lots of possibilities.

I'd add making the string classes immutable to the list. That way dereferencing an iterator (by which I mean calling unary op*) of any type could then return a Unicode code point by value. Mutable sequences that pretend to hold a different type than they actually do don't work well with C++ idioms (e.g. vector<bool>). Strings could be built using a stringstream-like approach or by using concatenation (with possible expression-template optimizations).
Making the iterator a byte iterator, not a code point iterator, pushes the responsibility for knowing how to handle the variable-width nature of the different encodings back onto the user. There are certainly a lot of possibilities, and we should try to get some sort of consensus before we go further with this.
I would definitely encourage breaking the work up into smaller chunks.

Agreed.
Conversions. My approach at present is to use iconv via a functor that I wrote a while ago. I believe iconv is widely available; however, some implementations may support only a small set of character sets. Alternatives would be interesting.

IIRC, iconv is licensed under the GPL, which would prevent it from being integrated into Boost. We should make whatever interface we come up with easily extensible, so that people could add support for whatever encoding they require, possibly using iconv if using GPL software isn't a problem for them.
- Interaction with locales, internationalisation, and system APIs.

We'll definitely need a way to convert to a raw pointer representation (like std::string::c_str()) for interaction with some APIs.
Lots to think about.

On 9/26/07, Joseph Gauterin <joseph.gauterin@googlemail.com> wrote:
I'd add making the string classes immutable to the list.
You mean having two classes then, right? If not, how would I replace sub-strings? It is a very common use.
That way dereferencing an iterator (by which I mean calling unary op*) of any type could then return a Unicode code point by value. Mutable sequences that pretend to hold a different type than they actually do don't work well with C++ idioms (e.g. vector<bool>).
The real problem with vector<bool> is that it isn't a Container. But std::string never was a container either, so that is not a real problem. Besides, we could have different "views" of a string: mutable and non-mutable ones, each with a different iterator class.
Strings could be built using a stringstream like approach
I don't understand what you mean.
or by using concatenation (with possible expression template optimizations).
Making the iterator a byte iterator, not a code point iterator, pushes the responsibility for knowing how to handle the variable-width nature of the different encodings back onto the user.
So what is your opinion about it? I find a byte iterator pretty useless, except for copying.
There are certainly a lot of possibilities, and we should try to get some sort of consensus before we go further with this.
Agreed. [snip]
Conversions. My approach at present is to use iconv via a functor that I wrote a while ago. I believe iconv is widely available; however, some implementations may support only a small set of character sets. Alternatives would be interesting.
IIRC, iconv is licensed under the GPL, which would prevent it from being integrated into Boost. We should make whatever interface we come up with easily extensible, so that people could add support for whatever encoding they require, possibly using iconv if using GPL software isn't a problem for them.
I don't find it a problem; iconv wouldn't ship with Boost, so only those who can use iconv would build the library - just like Regex does with ICU and Iostreams does with zlib.
- Interaction with locales, internationalisation, and system APIs.
We'll definitely need a way to convert to a raw pointer representation (like std::string.c_str()) for interaction with some APIs.
Surely.
Lots to think about.
So let's start ;) Regards, -- Felipe Magno de Almeida

If not, how would I replace sub-strings? It is a very common use.

The substring method could return a new string object rather than modify the old one - C# handles its immutable strings in this way. I'm not so sure having
Strings could be built using a stringstream-like approach

I don't understand what you mean.

Sorry, I wasn't very clear. I meant we would provide a class like std::stringstream that can have strings (of any encoding) added to it using <<. It would convert its contents to a string when the str() method is called (we could pass the required encoding as a parameter/template parameter).
You mean having two classes then, right?

It would mean that, yes. One class for the strings (immutable), one class for constructing the strings.
Making the iterator a byte iterator, not a code point iterator, pushes the responsibility for knowing how to handle the variable-width nature of the different encodings back onto the user.
So what is your opinion about it? I find a byte iterator pretty useless, except for copying.

My opinion is also that byte iterators aren't a good idea - if people want to iterate over bytes then a std::vector<byte> would be more appropriate.
We'll definitely need a way to convert to a raw pointer representation (like std::string::c_str()) for interaction with some APIs.

Surely.

Applications often pass string data to APIs as raw pointers - I think that the best way to handle this would be to store strings as contiguous memory within the string class.
These are just my thoughts at the moment as I type. I'll give the matter some more extensive thought before I reply again.

Joseph Gauterin wrote:
Making the iterator a byte iterator, not a code point iterator, pushes the responsibility for knowing how to handle the variable-width nature of the different encodings back onto the user.
Indeed, and smart users might prefer to take that responsibility sometimes. For example, if I want to break up a lump of UTF-8 text into lines at each \n then I can just treat it as bytes and look for \n, since \n never occurs in a multi-byte character in UTF-8. As another example, an XML parser can exploit this when looking for its various punctuation characters.

Because a UTF-8 character-iterator has the overhead of determining the character width, and also as variable-width iterator operations like operator- are not O(1), having the option to use a byte iterator could be a significant performance help. Of course you could just use a vector<char> or similar when you want to do this sort of thing, but that's not great if you want to mix-and-match byte and character operations without copying the whole string.

I'm wondering about offering distinct "unit" (e.g. byte) and "character" types in the charset_traits class, and providing separate unit_iterator and character_iterator types and operations. Or maybe the character_iterators are best provided by some sort of "adapter" layer?
IIRC, iconv is licensed under the GPL
The iconv API is a POSIX and SUS standard. There is an implementation in glibc, which is LGPLed; I believe that other OSes have their own implementations (including BSD-licensed ones). I thought that it was included in Windows since NT, but Google tells me I'm wrong.

We would certainly want a conversion interface that could be adapted to std::codecvt, iconv, recode (which is a GNU-only thing), ICU, etc. I have already written functor wrappers for iconv and recode which work like this:

  Iconver latin1_to_utf8("latin1","utf8");
  utf8string s = latin1_to_utf8(x);

The functor can store any state for variable-width charsets. iconv takes charset names as char*s; I have put a char* name in my charset_traits class to support this. Something is needed to indicate policy for conversion problems, e.g. throw or insert '?' when there is no corresponding character in the target charset. How compatible could this be made with codecvt and ICU?

Thanks for the many replies. Do keep posting. I'm not going to try to keep up with replies to everything, though; I'm going to try and write some code! Regards, Phil.

"Phil Endecott" <spam_from_boost_dev@chezphil.org> writes: [snip]
I'm wondering about offering distinct "unit" (e.g. byte) and "character" types in the charset_traits class, and providing separate unit_iterator and character_iterator types and operations. Or maybe the character_iterators are best provided by some sort of "adapter" layer?
I think providing the code point iterators in an adapter layer is better. The reason is that iterating over code points is just one of several higher-level-than-byte iterations that might be useful. In particular, it seems that for many string manipulation tasks even iterating over code points is not sufficient to handle international text; rather, it may be necessary to iterate over grapheme clusters. [snip] -- Jeremy Maitin-Shepard

On 9/26/07, Phil Endecott <spam_from_boost_dev@chezphil.org> wrote:
Joseph Gauterin wrote:
[snip]
I've noticed that there are frequent requests/proposals for some sort of boost unicode/string encoding library. I've thought about the problem and it seems too big for one person to handle in their spare time
Let me say "part time" rather than "spare time"...
Sorry to jump into the discussion, but I've been watching it since the start, and I'm interested in this project too. Though I'm a little swamped with work right now, I do work with exactly one of the use cases exposed in this thread: e-mail parsing (and related tasks). And tagging is exactly the best approach. The worst part is: how do we compare external strings and e-mail text? Until now, the safest approach I've found is to convert everything to Unicode and then compare (if they don't have the same encoding).
- perhaps a group of us should get together to discuss working on one? I'd be happy to participate.
I would definitely encourage breaking the work up into smaller chunks. IMHO "smaller is better" for Boost libraries; there have been a number of occasions when I've discovered that a feature I want is hidden as an internal component of a Boost library, and I've felt that it should have been a stand-alone public entity. So let's think about how this work can be split up:
This seems like a very good approach.
- A charset_trait class. I have started on this. The missing piece is a way to look up traits of character sets that are known at run-time; input would be appreciated.
- Compile-time and run-time tagged strings. The basics of this are straightforward and done.
Not as easy if a "universal string" class is to be achieved, but we can probably leave that out for now.
- Conversions. My approach at present is to use iconv via a functor that I wrote a while ago. I believe iconv is widely available; however, some implementations may support only a small set of character sets. Alternatives would be interesting.
I use ICU extensively; I've never used iconv.
- Variable width iterators, including the issue that you raised above.
boost.iterator makes this job quite easy.
- Interaction with locales, internationalisation, and system APIs.
I'm not an IOStreams expert, but I'm very used to working with the Windows API.
and no doubt more. Thinking about the interfaces between these areas and the user would be a good place to start.
There were also some interesting discussions about Unicode in the past, though they didn't seem to reach any conclusion. But very important concerns were raised w.r.t. internationalization.
Regards,
Phil.
Thanks Phil, -- Felipe Magno de Almeida

I think we could use the locale/code conversion functionality available in the standard I/O streams library to minimize the amount of new code needed and to make it more, well, standard. In general, I'd expect most code conversions to be occurring during I/O anyway (exceptions to this could probably be handled using stringstreams). Appendix D of "The C++ Programming Language" has a fair amount of information on the topic (online here: http://www.research.att.com/~bs/3rd_loc0.html ).

The I/O streams' code conversion (through std::codecvt) can potentially convert between any two encodings/character sets, assuming code is written for that particular conversion. std::codecvt takes 3 template parameters: internal character encoding, external encoding, and conversion scheme (called "state"). We could specialize this to take 4 parameters, replacing the single conversion scheme with a pair: one from the internal encoding to the character set itself, and one from the character set to the external encoding. So something like this:

  std::codecvt< utf16, utf8, pair<utf16_to_ucs4, ucs4_to_utf8> >

would convert an internal UTF-16 encoding of a string to an external UTF-8 encoding. However, an I/O stream can only have one codecvt instance at a time (via imbuing a locale), so this raises the question of how we should handle streaming out two Unicode strings with different encodings.

On a different note, does anyone see a practical use in having (mutable) strings with variable-width character encodings? I can't think of any practical use for them that wouldn't be equally well-served with an array of bytes (like the email MIME-type example). As for run-time tagging of strings, I doubt it would work very well, since it would be difficult to extend a run-time tagged string class to handle new encodings/character sets.

- James

Phil Endecott wrote:
I would definitely encourage breaking the work up into smaller chunks. IMHO "smaller is better" for Boost libraries; there have been a number of occasions when I've discovered that a feature I want is hidden as an internal component of a Boost library, and I've felt that it should have been a stand-alone public entity. So let's think about how this work can be split up:
- A charset_trait class. I have started on this. The missing piece is a way to look up traits of character sets that are known at run-time; input would be appreciated.
- Compile-time and run-time tagged strings. The basics of this are straightforward and done.
- Conversions. My approach at present is to use iconv via a functor that I wrote a while ago. I believe iconv is widely available; however, some implementations may support only a small set of character sets. Alternatives would be interesting.
- Variable width iterators, including the issue that you raised above.
- Interaction with locales, internationalisation, and system APIs.
and no doubt more. Thinking about the interfaces between these areas and the user would be a good place to start.
Regards,
Phil.

James Porter wrote:
On a different note, does anyone see a practical use in having (mutable) strings with variable-width character encodings? I can't think of any practical use for them that wouldn't be equally well-served with an array of bytes (like the email MIME-type example).
What encoding would you propose we use that is not variable length? UTF-8, UTF-16, and UTF-32 certainly are all variable length encodings. - Michael Marcin

Actually, UTF-32 (equivalently UCS-4) *is* fixed-width (as of the Unicode 5.0.0 standard). Page 31 of the standard (chapter 2) says:

"UTF-32 is the simplest Unicode encoding form. Each Unicode code point is represented directly by a single 32-bit code unit. Because of this, UTF-32 has a one-to-one relationship between encoded character and code unit; it is a fixed-width character encoding form."

- James

Michael Marcin wrote:
James Porter wrote:
On a different note, does anyone see a practical use in having (mutable) strings with variable-width character encodings? I can't think of any practical use for them that wouldn't be equally well-served with an array of bytes (like the email MIME-type example).
What encoding would you propose we use that is not variable length?
UTF-8, UTF-16, and UTF-32 certainly are all variable length encodings.
- Michael Marcin

James Porter wrote:
Actually, UTF-32 (equivalently UCS-4) *is* fixed-width (as of the Unicode 5.0.0 standard). Page 31 of the standard (chapter 2) says:
"UTF-32 is the simplest Unicode encoding form. Each Unicode code point is represented directly by a single 32-bit code unit. Because of this, UTF-32 has a one-to-one relationship between encoded character and code unit; it is a fixed-width character encoding form."
UTF-32 is a fixed-width encoding of Unicode, but Unicode itself is a "variable-width character set", what with combining characters. Whether this is the business of a core string layer in C++ is a different question. Sebastian Redl

I'd argue that that's more of a typesetting issue, since the actual code points are fixed-width. That said, someone mentioned the Windows API's use of UTF-16, which is one place where you wouldn't be using I/O streams to convert to a variable-width encoding. For certain special purposes (like the one above), a variable-width string class would be useful, but I think we should focus on storing strings in fixed-width encodings and then converting them appropriately during I/O. This stays closer to the "namespace std" way, and should solve most (but obviously not all) of the problems with character encodings. - James Sebastian Redl wrote:
UTF-32 is a fixed-width encoding of Unicode, but Unicode itself is a "variable-width character set", what with combining characters.
Whether this is the business of a core string layer in C++ is a different question.
Sebastian Redl

James Porter wrote:
For certain special purposes (like the one above), a variable-width string class would be useful, but I think we should focus on storing strings in fixed-width encodings and then converting them appropriately during I/O.

Actually, I disagree with this. The only general-purpose fixed-width encoding available is UTF-32, and hardly anyone actually uses it. For good reason: for English text it wastes 75% of the used space. In general it wastes about 10 bits (30%) in everything, because Unicode only has about 2^21 code points. Under no circumstances can a UTF-8 string be larger, nor a UTF-16 string.

You may say that linear traversal is faster, because you don't have to inspect the bytes to find out where the next one is. I'm tempted to disagree there. Random access may be faster (a lot faster), but you rarely need it. Mostly, you want to access data linearly anyway. And because UTF-8 can squeeze up to 4 characters into the space where UTF-32 puts one, cache coherency is much better.

Then there's the "practical use" issue. The Linux kernel uses UTF-8 internally (when compiled appropriately). The Windows kernel uses UTF-16. No kernel I know of uses UTF-32. This means that for every system call, UTF-32 strings have to be converted. Another performance hit there. (Not to mention a complexity hit, for calling APIs that return text.) Hmm... I think Qt and wxWidgets (with appropriate configuration) use UTF-32 on Linux.

I think the problem of UTF-8 and UTF-16 strings is important and must be addressed.

Sebastian Redl

I see what you mean. Still, fixed-width-encoded strings are a lot easier to code, and I think we should focus on them first, just to get something working and to have a platform to test code conversion on, which in my opinion is the most important part. Without code conversion, it would be difficult to read in non-ASCII strings in the first place, since std::wfstream just converts ASCII to UTF-16. Variable-width-encoded strings should be fairly straightforward when they are immutable, but will probably get hairy when they can be modified. Converting a VWE string would probably be no harder than an FWE string.

That said, I think a good (general) roadmap for this project would be:
1) Extend std::basic_string to store UCS-2 / UCS-4 (should be easy, though string constants may pose a problem)
2) Add code conversion to move between encodings, especially for I/O
3) Create a VWE string class (fairly easy if immutable, hard if mutable)

- James

On 9/27/07, Sebastian Redl <sebastian.redl@getdesigned.at> wrote:
James Porter wrote:
For certain special purposes (like the one above), a variable-width string class would be useful, but I think we should focus on storing strings in fixed-width encodings and then converting them appropriately during I/O. Actually, I disagree with this. The only general-purpose fixed-width encoding available is UTF-32, and hardly anyone actually uses it. For good reason: for English text, it wastes 75% of the used space. In general, it wastes about 10 bits (30%) in everything, because Unicode only has about, what, 2^21 code points?
[snip] I think the problem of UTF-8 and UTF-16 strings is important and must be
addressed.
Sebastian Redl

James Porter wrote:
since std::wfstream just converts ASCII to UTF-16.
Just nit-picking here: it converts to wchar_t, which may or may not be UTF-16. On Win32 platforms, it is, but on Linux, for example, it's UTF-32.
Variable-width-encoded strings should be fairly straightforward when they are immutable, but will probably get hairy when they can be modified.
True. I think the strings should be immutable. I think experience with Java and C# compared to C++ shows that an immutable string class is superior in most use cases.
That said, I think a good (general) roadmap for this project would be: 1) Extend std::basic_string to store UCS-2 / UCS-4 (should be easy, though string constants may pose a problem)
Doesn't basic_string<wchar_t> do just that already? Sebastian Redl

On 9/27/07, Sebastian Redl <sebastian.redl@getdesigned.at> wrote:
Just nit-picking here: it converts to wchar_t, which may or may not be UTF-16. On Win32 platforms, it is, but on Linux, for example, it's UTF-32.
Yeah, I realized that after I clicked "send". I guess I should eat breakfast before sending email. :)
True. I think the strings should be immutable. I think experience with Java and C# compared to C++ shows that an immutable string class is superior in most use cases.
There should be some means to (possibly indirectly) modify a variable-width-encoded string, though it doesn't necessarily have to be through the class itself. A stringstream may be more appropriate.
That said, I think a good (general) roadmap for this project would be: 1) Extend std::basic_string to store UCS-2 / UCS-4 (should be easy, though string constants may pose a problem)
Doesn't basic_string<wchar_t> do just that already?
It doesn't do it in a portable manner. In Windows, basic_string<wchar_t> is, ostensibly, UTF-16, but in Linux, it's UTF-32. There should be a portable solution that guarantees a particular fixed-width encoding. I'd argue that basic_string<wchar_t> isn't exactly Unicode at all, though I'm being nit-picky. char_traits<wchar_t>::state_type is mbstate_t, which is the state type used by codecvt to convert a narrow (ASCII) stream into a wide stream. In short, the stream (and ultimately the string) isn't Unicode, it's just ASCII stored with 2 (or 4) bytes per character. This goes back to the problems with using wfstream. I think, to have a truly distinct basic_string specialization, we'd need portable 16- and 32-bit char types, and a way to unambiguously specify its encoding. My hope is that we can use char_traits<...>::state_type as a way to make code conversion simpler. Ideally, I'd like something that examines the state_type of the source and the target, and builds a converter based on those two pieces of information. It would be great if I could say something like: ofstream<utf8> file("out.txt"); file << ucs4string << utf16string << jisstring << asciistring << endl; and have it work automatically. - James

James Porter wrote:
It doesn't do it in a portable manner.
OK, so that's what you meant. Yeah, that's a problem.
I'd argue that basic_string<wchar_t> isn't exactly Unicode at all, though I'm being nit-picky. char_traits<wchar_t>::state_type is mbstate_t,
That has nothing to do with what basic_string<wchar_t> is, though, because that state is to be used when converting the string to an external encoding. It just shows how screwed-up the current system is: char_traits<Ch>::state_type is a type that should be able to hold the shift state when converting to a (runtime-specified through the locale) external encoding - but the shift state is specific to the _external encoding_, not the internal one. It doesn't make any sense whatsoever that char_traits for the wide character type (and thus the indicator of the _internal encoding_) should hold this type.
Sebastian Redl

On 9/27/07, Sebastian Redl <sebastian.redl@getdesigned.at> wrote:
That has nothing to do with what basic_string<wchar_t> is, though, because that state is to be used when converting the string to an external encoding.
Well, clearly that state needs to know what the internal encoding is in the first place, so they are related in some way. Ideally, I'd like to be able to use the state_type for basic_string as one half of the shift state, with the other half being the state_type for the target (say, an output stream). Put those two together, and we could build a codecvt facet of the form: codecvt<internal_char, external_char, conversion_pair<internal_state_type, external_state_type> > Doing it this way obviously wouldn't work with any of the I/O streams now, since they require the char_traits to be the same, but perhaps we could define a converting_stream type that creates the above codecvt facet automatically and handles conversion. In some sense, I suppose this is an abuse of char_traits<Ch>::state_type, but I think it's the most backwards-compatible way. That is, if the internal (string) and external (stream) encodings were the same, I/O would behave like it does now (only hopefully in a more predictable/useful fashion when dealing with Unicode). - James

James Porter wrote:
On 9/27/07, Sebastian Redl <sebastian.redl@getdesigned.at> wrote:
That has nothing to do with what basic_string<wchar_t> is, though, because that state is to be used when converting the string to an external encoding.
Well, clearly that state needs to know what the internal encoding is in the first place,
No, why? What difference does it make to the shift state of Shift-JIS whether you convert to this encoding from UTF-8 or UTF-16?
Sebastian Redl

Perhaps I'm misunderstanding the purpose of the state_type typedef in char_traits. It seems that it's used for two things: to specify the type that will hold the actual shift state for encodings that require it, and to specify a codecvt facet for the encoding in question (to read/write it from/to a stream of bytes). The latter part is what I'm focusing on. Appendix D of "The C++ Programming Language" said of codecvt: "The State template argument is the type used to hold the shift state of the stream being converted. State can also be used to identify different conversions by specifying a specialization." I probably should have been clearer that I was referring to the state type and not the shift state itself. What I meant was that, if you defined a shift state as class JISstate { ... };, you would need to specialize codecvt to convert a Shift JIS encoding on disk to a *particular* encoding in memory (say UTF-16). You'd need a different specialization of codecvt to convert to UTF-8. Hopefully this explains my position better, and I apologize if I caused needless confusion. This may not even be the best way, but with a converting_stream class, we could do the following:
- create a converting_ifstream with char_traits<Ch>::state_type of JISstate
- create a string with char_traits<Ch>::state_type of UTF8
- (automatically) build a codecvt facet with a state_type of conversion_pair<JISstate,UTF8>
- the conversion_pair would take bytes encoded as Shift JIS, convert them to a Unicode code point, and convert that to UTF-8 byte(s)
- read data from the converting_ifstream to the string
- the codecvt facet would then run the conversion from conversion_pair, resulting in a UTF-8 encoded string from Shift JIS data on disk
This could then be extended to UTF-16 simply by creating a state_type class for it and specifying a conversion between Unicode code points and UTF-16. Like I said, this may not be the best way, but hopefully it at least explains my idea better.
- James On 9/27/07, Sebastian Redl <sebastian.redl@getdesigned.at> wrote:
On 9/27/07, Sebastian Redl <sebastian.redl@getdesigned.at> wrote:
That has nothing to do with what basic_string<wchar_t> is, though, because that state is to be used when converting the string to an external encoding.
James Porter wrote:
Well, clearly that state needs to know what the internal encoding is in the first place,
No, why? What difference does it make to the shift state of Shift-JIS whether you convert to this encoding from UTF-8 or UTF-16?
Sebastian Redl

James Porter wrote:
Perhaps I'm misunderstanding the purpose of the state_type typedef in char_traits.
From my point of view, there's nothing *to* understand. The whole thing is a misfeature. Here's what the standard says about char_traits<Ch>::state_type: For a streambuf or its derived classes whose underlying stream is a multibyte character stream and specialized for CHAR_T and TRAIT_T, a type or class STATE_T is used to define state_type in the traits and represents the conversion state type or class which is applied to the codecvt<> facet defined in _lib.localization <http://www.open-std.org/jtc1/sc22/open/n2356/lib-locales.html#lib.localization>_. So the state_type is to be passed as the third template parameter of codecvt. Here's what the standard has to say about that: The stateT argument selects the pair of codesets being mapped between. Ah, now this is interesting. It matches what you say below:
It seems that it's used for two things: to specify the type that will hold the actual shift state for encodings that require it, and to specify a codecvt facet for the encoding in question (to read/write it from/to a stream of bytes).
However, this is positively useless: it means that you have to specify the external encoding at compile time! This is not just highly unrealistic, it's positively absurd. Small wonder that there is absolutely nothing beyond a synopsis for codecvt_byname. Locales like "en_GB.UTF-8"? Forget them, they do not match the standard's view of how conversion works. For that matter, why is codecvt a facet at all? If the stateT argument selects the encodings to convert between, what does the locale have to do with it? Don't forget that the locale stores facets by type, i.e. you can't put two facets with different stateT parameters in the same locale and expect them to occupy the same space. No: each gets its own place. Bah! The external encoding must be specifiable as a runtime selection, preferably as a string. Everything else is academic playground. Sebastian Redl

Variable-width-encoded strings should be fairly straightforward when they are immutable, but will probably get hairy when they can be modified. True. I think the strings should be immutable. I think experience with Java and C# compared to C++ shows that an immutable string class is superior in most use cases
I also agree that immutable strings are the way forward for VWEs. If we had mutable strings consider how badly the following would perform: std::replace(utfString.begin(),utfString.end(),SingleByteChar,MultiByteChar); Although this looks O(n) at first glance, it's actually O(n^2), as the container has to expand itself for every replacement. I don't think a library should make writing worst case scenario type code that easy.

AMDG Joseph Gauterin <joseph.gauterin <at> googlemail.com> writes:
Variable-width-encoded strings should be fairly straightforward when they are immutable, but will probably get hairy when they can be modified. True. I think the strings should be immutable. I think experience with Java and C# compared to C++ shows that an immutable string class is superior in most use cases
I also agree that immutable strings are the way forward for VWEs.
If we had mutable strings consider how badly the following would perform: std::replace(utfString.begin(),utfString.end(),SingleByteChar,MultiByteChar);
Although this looks O(n) at first glance, it's actually O(n^2), as the container has to expand itself for every replacement. I don't think a library should make writing worst case scenario type code that easy.
I agree that mutable iterators are not a good idea, since they make an operation that is normally O(1) linear. In addition, *iter = MultiByteChar would invalidate all the other iterators to the string, causing your example to crash. However, I see no reason to prevent usage like std::replace_copy(utfString.begin(), utfString.end(), std::back_inserter(newutfString), SingleByteChar, MultiByteChar); In Christ, Steven Watanabe

To whom it may concern, When compiling Boost code using the Sun compiler, the Boost documentation makes this statement. Quote: When using this compiler on complex C++ code, such as the Boost C++ library, it is recommended to specify the following options when initializing the sun module: -library=stlport4 -features=tmplife -features=tmplrefstatic End-Quote: I don't understand the effect of the 'tmplrefstatic' argument on the compiled code. Why is that option recommended, i.e. what does it do? I searched the web for a couple of hours for an explanation and couldn't find an answer to my question. I'm hoping somebody on this list knows because Boost is recommending it. Thanks in advance, -Sid Sacek

The short answer is very simple. By default the Sun C++ compiler does not allow referring to static symbols from templates; -features=tmplrefstatic turns this ability on. 2007/9/28, Sid Sacek <ssacek@appsecinc.com>:
To whom it may concern,
When compiling Boost code using the Sun compiler, the Boost documentation makes this statement.
Quote: When using this compiler on complex C++ code, such as the Boost C++ library, it is recommended to specify the following options when initializing the sun module: -library=stlport4 -features=tmplife -features=tmplrefstatic End-Quote:
I don't understand the effect of the 'tmplrefstatic' argument on the compiled code. Why is that option recommended, i.e. what does it do?
I searched the web for a couple of hours for an explanation and couldn't find an answer to my question. I'm hoping somebody on this list knows because Boost is recommending it.
Thanks in advance, -Sid Sacek
-- Simon Atanasyan

Hi, Thanks for that answer, but I still don't think I understand. This sample test code builds with and without the -features=tmplrefstatic argument, so I guess I need a code example to clarify it, if you don't mind. Thank you, -Sid Sacek

template< typename Ty >
struct XYZ
{
    static int si;
    static void foo( void );
    XYZ( void ) { si = 555; foo(); }
};

int XYZ< int >::si = 333;
template<> void XYZ< int >::foo( void ) { return; }

int main( void ) { XYZ< int > xyz; }

-----Original Message----- From: boost-bounces@lists.boost.org [mailto:boost-bounces@lists.boost.org] On Behalf Of Simon Atanasyan Sent: Friday, September 28, 2007 1:24 PM To: boost@lists.boost.org Subject: Re: [boost] SUN Compiler option -features=tmplrefstatic The short answer is very simple. By default Sun C++ compiler does not allow to refer static symbols from templates. -features=tmplrefstatic turns on this ability. 2007/9/28, Sid Sacek <ssacek@appsecinc.com>:
To whom it may concern,
When compiling Boost code using the Sun compiler, the Boost documentation makes this statement.
Quote: When using this compiler on complex C++ code, such as the Boost C++
library,
it is recommended to specify the following options when initializing the sun module: -library=stlport4 -features=tmplife -features=tmplrefstatic End-Quote:
I don't understand the effect of the 'tmplrefstatic' argument on the compiled code. Why is that option recommended, i.e. what does it do?
I searched the web for a couple of hours for an explanation and couldn't find an answer to my question. I'm hoping somebody on this list knows because Boost is recommending it.
Thanks in advance, -Sid Sacek
-- Simon Atanasyan

% cat test.cc
static int bar(int) { return 0; }

template <typename T>
int foo(T arg) { return bar(arg); }

int main() { return foo(1); }

% CC -V test.cc
CC: Sun C++ 5.9 SunOS_sparc Patch 124863-01 2007/07/25
ccfe: Sun C++ 5.9 SunOS_sparc Patch 124863-01 2007/07/25
"test.cc", line 9: Error: Reference to static bar(int) not allowed in template foo<int>(int), try using -features=tmplrefstatic.
"test.cc", line 14: Where: While instantiating "foo<int>(int)".
"test.cc", line 14: Where: Instantiated from non-template code.
1 Error(s) detected.
2007/9/29, Sid Sacek <ssacek@appsecinc.com>:
Hi,
Thanks for that answer, but I still don't think I understand. This sample test code builds with and without the -features=tmplrefstatic argument, so I guess I need a code example to clarify it, if you don't mind.
Thank you, -Sid Sacek
template< typename Ty > struct XYZ { static int si; static void foo( void ); XYZ( void ) { si = 555; foo(); } };
int XYZ< int >::si = 333; template<> void XYZ< int >::foo( void ) { return; }
int main( void ) { XYZ< int > xyz; }
-----Original Message----- From: boost-bounces@lists.boost.org [mailto:boost-bounces@lists.boost.org] On Behalf Of Simon Atanasyan Sent: Friday, September 28, 2007 1:24 PM To: boost@lists.boost.org Subject: Re: [boost] SUN Compiler option -features=tmplrefstatic
The short answer is very simple. By default the Sun C++ compiler does not allow referring to static symbols from templates; -features=tmplrefstatic turns this ability on.
2007/9/28, Sid Sacek <ssacek@appsecinc.com>:
To whom it may concern,
When compiling Boost code using the Sun compiler, the Boost documentation makes this statement.
Quote: When using this compiler on complex C++ code, such as the Boost C++
library,
it is recommended to specify the following options when initializing the sun module: -library=stlport4 -features=tmplife -features=tmplrefstatic End-Quote:
I don't understand the effect of the 'tmplrefstatic' argument on the compiled code. Why is that option recommended, i.e. what does it do?
I searched the web for a couple of hours for an explanation and couldn't find an answer to my question. I'm hoping somebody on this list knows because Boost is recommending it.
Thanks in advance, -Sid Sacek
-- Simon Atanasyan
-- Simon Atanasyan

Ok, that works. Thank you very much for taking the time and explaining it. Regards, -Sid Sacek -----Original Message----- From: boost-bounces@lists.boost.org [mailto:boost-bounces@lists.boost.org] On Behalf Of Simon Atanasyan Sent: Saturday, September 29, 2007 1:50 AM To: boost@lists.boost.org Subject: Re: [boost] SUN Compiler option -features=tmplrefstatic % cat test.cc static int bar(int) { return 0; } template <typename T> int foo(T arg) { return bar(arg); } int main() { return foo(1); } % CC -V test.cc CC: Sun C++ 5.9 SunOS_sparc Patch 124863-01 2007/07/25 ccfe: Sun C++ 5.9 SunOS_sparc Patch 124863-01 2007/07/25 "test.cc", line 9: Error: Reference to static bar(int) not allowed in template foo<int>(int), try using -features=tmplrefstatic. "test.cc", line 14: Where: While instantiating "foo<int>(int)". "test.cc", line 14: Where: Instantiated from non-template code. 1 Error(s) detected. 2007/9/29, Sid Sacek <ssacek@appsecinc.com>:
Hi,
Thanks for that answer, but I still don't think I understand. This sample test code builds with and without the -features=tmplrefstatic argument, so
I
guess I need a code example to clarify it, if you don't mind.
Thank you, -Sid Sacek
template< typename Ty > struct XYZ { static int si; static void foo( void ); XYZ( void ) { si = 555; foo(); } };
int XYZ< int >::si = 333; template<> void XYZ< int >::foo( void ) { return; }
int main( void ) { XYZ< int > xyz; }
-----Original Message----- From: boost-bounces@lists.boost.org [mailto:boost-bounces@lists.boost.org] On Behalf Of Simon Atanasyan Sent: Friday, September 28, 2007 1:24 PM To: boost@lists.boost.org Subject: Re: [boost] SUN Compiler option -features=tmplrefstatic
The short answer is very simple. By default the Sun C++ compiler does not allow referring to static symbols from templates; -features=tmplrefstatic turns this ability on.
2007/9/28, Sid Sacek <ssacek@appsecinc.com>:
To whom it may concern,
When compiling Boost code using the Sun compiler, the Boost
makes this statement.
Quote: When using this compiler on complex C++ code, such as the Boost C++
documentation library,
it is recommended to specify the following options when initializing the sun module: -library=stlport4 -features=tmplife -features=tmplrefstatic End-Quote:
I don't understand the effect of the 'tmplrefstatic' argument on the compiled code. Why is that option recommended, i.e. what does it do?
I searched the web for a couple of hours for an explanation and couldn't find an answer to my question. I'm hoping somebody on this list knows because Boost is recommending it.
Thanks in advance, -Sid Sacek
-- Simon Atanasyan
-- Simon Atanasyan

Simon Atanasyan wrote:
% cat test.cc static int bar(int) { return 0; }
template <typename T> int foo(T arg) { return bar(arg); }
int main() { return foo(1); }
% CC -V test.cc CC: Sun C++ 5.9 SunOS_sparc Patch 124863-01 2007/07/25 ccfe: Sun C++ 5.9 SunOS_sparc Patch 124863-01 2007/07/25 "test.cc", line 9: Error: Reference to static bar(int) not allowed in template foo<int>(int), try using -features=tmplrefstatic. "test.cc", line 14: Where: While instantiating "foo<int>(int)". "test.cc", line 14: Where: Instantiated from non-template code. 1 Error(s) detected.
I'd like to note that the code is valid right until foo is instantiated in more than one translation unit. So the compiler restriction might not be a bad idea after all.

Please don't start a new topic by replying to existing posts. It messes up the thread view for various mail/news clients. Sebastian Redl

On Sep 27, 2007 5:31 PM, Joseph Gauterin <joseph.gauterin@googlemail.com> wrote:
Variable-width-encoded strings should be fairly straightforward when they are immutable, but will probably get hairy when they can be modified. True. I think the strings should be immutable. I think experience with Java and C# compared to C++ shows that an immutable string class is superior in most use cases
I also agree that immutable strings are the way forward for VWEs.
I believe that some thinking must be spent on this issue. Considering Java (no experience with C#), memory allocation is really fast due to the way the garbage collector is implemented. Creation of new strings for each change is quite cheap, only the data copying part, and even then there are other ways (StringBuffer/StringBuilder) that offer a modifiable version of a string. Then there are other things in the JVM that help speed things up, but with some associated _cost_. Strings can be 'interned' by the JVM, keeping just one copy of each different string in memory and sharing references. This is a side effect of immutability: there is no problem sharing the same copy as it cannot change. Then problems arise if you offer access to the raw data, which is a feature in Java: suddenly immutable strings can be changed and System.out.println( "Say Hi!" ) can print 'Goodbye' if some other code anywhere else changes the raw data. This cannot be avoided in Java as final is a property of the reference, not the associated data. On the other hand, I believe it can be correctly implemented in C++.
If we had mutable strings consider how badly the following would perform: std::replace(utfString.begin(),utfString.end (),SingleByteChar,MultiByteChar);
Although this looks O(n) at first glance, it's actually O(n^2), as the container has to expand itself for every replacement. I don't think a library should make writing worst case scenario type code that easy.
While I don't know whether this problem has a solution, an alternative replace can be implemented in the library that performs in linear time by constructing a new string, copying values and replacing on the same iteration. Could std::replace() be disabled somehow?? (SFINAE??)

David Rodríguez Ibeas wrote:
On Sep 27, 2007 5:31 PM, Joseph Gauterin <joseph.gauterin@googlemail.com> wrote:
While I don't know whether this problem has a solution, an alternative replace can be implemented in the library that performs in linear time by constructing a new string, copying values and replacing on the same iteration. Could std::replace() be disabled somehow?? (SFINAE??)
It ought to be possible to overload it and, if the string is not part of std, have the overloaded version be picked up with ADL. Only if replace() isn't explicitly qualified, of course, which is a problem. But I think immutable strings are the way forward anyway. Sebastian Redl

It's nice to see this thread from September picked up again, as I was a bit disappointed by the volume of response at the time to my proposal. I may be plugging this code in to something real quite soon, and will try to drum up some interest again here if I do. With XML being mentioned again, I think that character sets are something that need attention. [Be warned that some readers will not see new messages on this old thread.] Sebastian Redl wrote:
David Rodríguez Ibeas wrote:
On Sep 27, 2007 5:31 PM, Joseph Gauterin <joseph.gauterin@googlemail.com> wrote:
[putting back the context]
If we had mutable strings consider how badly the following would perform: std::replace(utfString.begin(),utfString.end(),SingleByteChar,MultiByteChar); Although this looks O(n) at first glance, it's actually O(n^2), as the container has to expand itself for every replacement. I don't think a library should make writing worst case scenario type code that easy.
While I don't know whether this problem has a solution, an alternative replace can be implemented in the library that performs in linear time by constructing a new string, copying values and replacing on the same iteration. Could std::replace() be disabled somehow?? (SFINAE??)
It ought to be possible to overload it and, if the string is not part of std, have the overloaded version be picked up with ADL. Only if replace() isn't explicitly qualified, of course, which is a problem. But I think immutable strings are the way forward anyway.
For a UTF-8 string, my proposal offered
a mutable random-access byte iterator
a const bidirectional character iterator
a mutable output character iterator
std::replace needs a mutable forward iterator, so you wouldn't be able to apply it to the character iterator. The library wouldn't "let you write worst case code". There is, however, the replace_copy algorithm, which I think does exactly what you need; it takes a pair of input iterators and an output iterator, i.e. something like utf8_string s1 = "......"; utf8_string s2; std::replace_copy(s1.begin(),s1.end(), utf8_string::character_output_iterator(s2), L'x',L'y'); Concerning mutable vs. immutable strings: which is best in any particular case clearly depends on the size of the string, the operation being performed, and whether it has a variable-length encoding. The programmer should be allowed to choose which to use. (An interesting case is where the size or character set changes at run-time, and a run-time choice of algorithm is appropriate.) Regards, Phil.

Having run into string-related problems myself in my iochain project, I'm thinking about character encodings again. (I don't like the term character set. It ought to mean something different from its common usage. UTF-8 and UTF-16 aren't character sets.) Phil Endecott wrote:
For a UTF-8 string, my proposal offered
a mutable random-access byte iterator
What is the use case for this?
Concerning mutable vs. immutable strings: which is best in any particular case clearly depends on the size of the string, the operation being performed, and whether it has a variable-length encoding. The programmer should be allowed to choose which to use. (An interesting case is where the size or character set changes at run-time, and a run-time choice of algorithm is appropriate.)
Why on earth would you change the character set of a string at runtime? Robert O'Callahan (roc of Mozilla) recently blogged a bit about strings. Mozilla is a project that has a lot of experience with strings, so I put quite some weight on his opinion. http://weblogs.mozillazine.org/roc/archives/2008/01/string_theory.html Sebastian Redl

On Jan 9, 2008 6:59 PM, Sebastian Redl <sebastian.redl@getdesigned.at> wrote:
[snip]
Why on earth would you change the character set of a string at runtime?
Because sometimes you have to interface with other APIs which use different character sets? [snip]
Sebastian Redl
-- Felipe Magno de Almeida

Felipe Magno de Almeida wrote:
On Jan 9, 2008 6:59 PM, Sebastian Redl <sebastian.redl@getdesigned.at> wrote:
Why on earth would you change the character set of a string at runtime?
Because sometimes you have to interface with other APIs which use different character sets?
Yes, but change the character set *of a string object*? In such cases, you would create a copy. Sebastian Redl

On Jan 9, 2008 7:16 PM, Sebastian Redl <sebastian.redl@getdesigned.at> wrote:
Felipe Magno de Almeida wrote:
On Jan 9, 2008 6:59 PM, Sebastian Redl <sebastian.redl@getdesigned.at> wrote:
Why on earth would you change the character set of a string at runtime?
Because sometimes you have to interface with other APIs which use different character sets?
Yes, but change the character set *of a string object*? In such cases, you would create a copy.
Now I see, sorry. You're right. It seems like a useless feature.
Sebastian Redl
Regards, -- Felipe Magno de Almeida

Sebastian Redl wrote:
Phil Endecott wrote:
For a UTF-8 string, my proposal offered
a mutable random-access byte iterator
What is the use case for this?
It's for when you want to treat the data as a sequence of bytes. For example, another thread at the moment is discussing base64 encoding. The input to a base64 encoder could be a byte stream iterator. There are also cases where you can exploit knowledge about the encoding to use a byte iterator in place of a character iterator. Specifically, in UTF-8 all bytes after the first of a multi-byte character are >=128. So in a parser, I might want to skip forward to the next '"', or '<' or whatever; since those are both <128, I can do this significantly more efficiently using the byte iterator.
Concerning mutable vs. immutable strings: which is best in any particular case clearly depends on the size of the string, the operation being performed, and whether it has a variable-length encoding. The programmer should be allowed to choose which to use. (An interesting case is where the size or character set changes at run-time, and a run-time choice of algorithm is appropriate.)
Why on earth would you change the character set of a string at runtime?
I should have written "where the size or character set _varies_ at run-time". Phil.

"James Porter" <porterj@alum.rit.edu> writes:
I see what you mean. Still, fixed-width-encoded strings are a lot easier to code, and I think we should focus on them first just to get something working and to have a platform to test code conversion on, which in my opinion is the most important part.
I think as others have said, in practice a fixed-width encoding really gains you very little or nothing at all. Needing random access to code points is, I think, an extremely rare operation. Replacing one code point with another code point is also likewise a rare operation; in general you would replace one substring (perhaps a grapheme cluster) with another substring (which may also be a grapheme cluster). [snip]
That said, I think a good (general) roadmap for this project would be: 1) Extend std::basic_string to store UCS-2 / UCS-4 (should be easy, though string constants may pose a problem)
UCS-2 is bogus and should not be used at all. Conceivably UCS-4 is legitimate but in practice not likely to be used by anyone. Still, it is probably important to support it. The primary encodings of Unicode to be supported should be UTF-8 and UTF-16.
2) Add code conversion to move between encodings, especially for I/O 3) Create a variable-width-encoded (VWE) string class (fairly easy if immutable, hard if mutable)
I don't think the issues of a mutable UTF-8/UTF-16 representation are very different from the issues of a mutable UTF-32 representation. In practice, in handling non-ASCII text, all searching and replacement will be in terms of substrings (likely single or sequences of grapheme clusters). -- Jeremy Maitin-Shepard

On 9/27/07, Jeremy Maitin-Shepard <jbms@cmu.edu> wrote:
I think as others have said, in practice a fixed-width encoding really gains you very little or nothing at all. Needing random access to code points is, I think, an extremely rare operation.
I know, but it'd be easy to put together a fixed-width encoded basic_string, and we could use that as a basis for building a code conversion framework, at least as a proof-of-concept. Of course, that assumes that we'd be using basic_string for fixed-width strings, which isn't necessarily the case.

UCS-2 is bogus and should not be used at all. Conceivably UCS-4 is legitimate but in practice not likely to be used by anyone. Still, it is probably important to support it.

Are there any situations where UCS-2 is actually needed (deprecated libraries, for instance)? If not, then I agree that we can eliminate it.

I don't think the issues of a mutable UTF-8/UTF-16 representation are very different from the issues of a mutable UTF-32 representation. In practice, in handling non-ASCII text, all searching and replacement will be in terms of substrings (likely single or sequences of grapheme clusters).

I suppose it depends on how we allow UTF-8/UTF-16 strings to be modified. Direct (mutable) character access through operator[] would be bad, but substrings would be better. Depending on the situation, it may be better to use a stringstream to compose a new string from the old. I'd have to think about it some more. - James

James Porter wrote:
On 9/27/07, Jeremy Maitin-Shepard <jbms@cmu.edu> wrote:
UCS-2 is bogus and should not be used at all. Conceivably UCS-4 is legitimate but in practice not likely to be used by anyone. Still, it is probably important to support it.
Are there any situations where UCS-2 is actually needed (deprecated libraries, for instance)? If not, then I agree that we can eliminate it.
From http://en.wikipedia.org/wiki/UCS-2: Symbian OS, used in Nokia S60 handsets and Sony Ericsson UIQ handsets, uses UCS-2. Older Windows NT systems (prior to Windows 2000) only support UCS-2. The Python language environment has used UCS-2 internally since version 2.1, although newer versions can use UCS-4 to store supplementary characters (instead of UTF-16). - Michael Marcin

Symbian OS used in Nokia S60 handsets and Sony Ericsson UIQ handsets uses UCS-2.

UCS-2 would be very easy to support, as it's a strict subset of UTF-16. I think we should focus on the library design first before deciding which encodings to include, though.
The external encoding must be specifiable as a runtime selection.

Agreed. One of the most important use cases I see for this library would be reading an external file that's encoded in a manner not known until runtime - e.g. an XML file. Having to specify source encodings at compile time wouldn't be useful. Specifying destination encodings at compile time does seem useful though - i.e. read whatever is in this file and encode it into UTF-16. The destination encoding would probably have to be known anyway to pass the data to APIs.
I would definitely encourage breaking the work up into smaller chunks. I think one chunk should be routines that, given a sequence of bytes, determine the possible encodings used to encode those bytes (looking at the bytes used and any BOM that might exist). This could be overloaded/extended by users for XML and other formats that can explicitly define their encoding (e.g. the encoding attribute on XML declarations).
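As a sketch of the "determine possible encodings" chunk, BOM sniffing is the easy part (a hedged illustration only; real detection would also need heuristics on the byte distribution when no BOM is present):

```cpp
#include <cstddef>
#include <string>

// Guess an encoding from a byte-order mark, if one is present.
// Note the UTF-32 checks must precede the UTF-16 ones, because a
// UTF-32LE BOM (FF FE 00 00) begins with the UTF-16LE BOM (FF FE).
std::string detect_bom(const unsigned char* p, std::size_t n) {
    if (n >= 3 && p[0] == 0xEF && p[1] == 0xBB && p[2] == 0xBF) return "UTF-8";
    if (n >= 4 && p[0] == 0xFF && p[1] == 0xFE && p[2] == 0x00 && p[3] == 0x00)
        return "UTF-32LE";
    if (n >= 4 && p[0] == 0x00 && p[1] == 0x00 && p[2] == 0xFE && p[3] == 0xFF)
        return "UTF-32BE";
    if (n >= 2 && p[0] == 0xFF && p[1] == 0xFE) return "UTF-16LE";
    if (n >= 2 && p[0] == 0xFE && p[1] == 0xFF) return "UTF-16BE";
    return "unknown";
}
```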

Jeremy Maitin-Shepard wrote:
"James Porter" <porterj@alum.rit.edu> writes:
I see what you mean. Still, fixed-width-encoded strings are a lot easier to code, and I think we should focus on them first just to get something working and to have a platform to test code conversion on, which in my opinion is the most important part.
I think as others have said, in practice a fixed-width encoding really gains you very little or nothing at all. Needing random access to code points is, I think, an extremely rare operation. Replacing one code point with another code point is also likewise a rare operation; in general you would replace one substring (perhaps a grapheme cluster) with another substring (which may also be a grapheme cluster).
In our implementation we store both the UTF-32 and UTF-16 lengths (we use UTF-16 internally) in the string object. For the vast majority of strings these lengths are the same. This optimisation tripled the speed of our RSS feed generation, which does a lot of replacing as it needs to HTML-encode <, > and &. For long strings it is almost always a win to go through and calculate the final length you will get after the substitutions, so that you can do a single pass with a single allocation to generate the target string. K
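The precompute-then-single-allocation idea can be sketched like this (not Kirit's actual code; it works at the byte level, which is safe because '<', '>' and '&' are ASCII and therefore unambiguous in UTF-8):

```cpp
#include <string>

// HTML-escape in two passes: first compute the final length, then
// build the result after a single reserve() so only one allocation
// is needed for the output buffer.
std::string html_escape(const std::string& in) {
    std::size_t out_len = 0;
    for (char c : in)
        out_len += (c == '<' || c == '>') ? 4 : (c == '&') ? 5 : 1;
    std::string out;
    out.reserve(out_len);
    for (char c : in) {
        switch (c) {
            case '<': out += "&lt;"; break;
            case '>': out += "&gt;"; break;
            case '&': out += "&amp;"; break;
            default:  out += c;
        }
    }
    return out;
}
```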

(Sorry if this is a double post - I hadn't subscribed to the list the first time.) Joseph Gauterin wrote:
If you change state_type in the char_traits, you'd be able to differentiate the various basic_string types and include information about the character encoding without writing a whole lot of new code.

Thanks for the suggestion. I need to learn some more about this corner of "namespace std", clearly, before I go and re-invent something. IIRC, some of the non-const std::basic_string methods aren't suitable for handling variable-width encodings like UTF-8 and UTF-16 - non-const operator[] in particular returns a reference to the character type - a big problem if you want to assign a value > 0x7F (i.e. a character that uses 2 or more bytes).
I've noticed that there are frequent requests/proposals for some sort of Boost unicode/string encoding library. I've thought about the problem and it seems too big for one person to handle in their spare time - perhaps a group of us should get together to discuss working on one? I'd be happy to participate.
I'm going to chime in here to say that I've been using a string implementation similar to this for a few years now. Our systems are on Windows, so we want UTF-16 where we interface with Windows APIs and other Windows software, but we wanted to put all of the surrogate-pairs stuff in one place. Our FSLib::wstring uses UTF-32 characters for character interfaces (i.e. at() and operator[]), but UTF-16 internally. We throw out the non-const operator[] and the non-const iterator. They haven't really been missed. We also have to offer a std_str() which returns a std::wstring, and buffer_begin() and buffer_end() which return wchar_t*, so we can use Boost.Regex etc.

I've also started looking at tagged types for many of the same sorts of things already mentioned. I also want to use them to describe other types of encodings such as HTTP query string and file specification encodings, HTML attribute encoding, SQL statement string encoding etc. The idea being that it would be impossible to concatenate a query-string-encoded string to an HTML-attribute-encoded one without using the correct conversion function. The idea here is to improve security to defeat things like XSS attacks on web servers and SQL injection attacks. I've been looking at making the conversions happen through explicit constructors in order to make it easier to use.

A final thing I've just started to look at is to get the compiler to choose the best internal representation out of UTF-8, UTF-16 and UTF-32 for general use, but it's not something I've gotten very far with. K
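The "impossible to concatenate differently-encoded strings" idea might look like this in miniature (all names invented for illustration; real code would also supply the conversion functions between tags):

```cpp
#include <string>
#include <utility>

// A string tagged with an encoding at compile time. Concatenation is
// only defined for two strings carrying the same tag, so mixing e.g.
// query-string data into an HTML attribute fails to compile.
template <typename EncodingTag>
class tagged_string {
public:
    explicit tagged_string(std::string s) : data_(std::move(s)) {}
    const std::string& raw() const { return data_; }

    friend tagged_string operator+(const tagged_string& a,
                                   const tagged_string& b) {
        return tagged_string(a.data_ + b.data_);
    }
private:
    std::string data_;
};

struct html_attribute_tag {};
struct query_string_tag {};
using html_attribute_string = tagged_string<html_attribute_tag>;
using query_string = tagged_string<query_string_tag>;
```

With these definitions, `html_attribute_string("a") + query_string("b")` is a compile error, which is exactly the point.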

Phil Endecott wrote:
Dear All,
Something that I have been thinking about for a while is storing strings tagged with their character set. Since I now have a practical need for this I plan to try to implement something. Your feedback would be appreciated.
Hi,

I've played around with this concept a lot already. I basically think that encoding-bound strings are a MUST for proper, safe, internationalized string handling. Everything else, in particular the current situation, is a mess. If you want, I can package up what I've done so far (not really much, but a lot of comments containing concepts) and put it somewhere.

One thing: I think runtime-tagged strings are useless. Programming should happen with one or at most two fixed encodings, known at compile time. Because of the differences in behaviour in encodings (base unit 8, 16 or 32 bits, or 8 with various endians, fixed-length encodings vs variable-length encodings, ...), it is not good to write a type handling them all at runtime. I think that runtime-specified string conversion should be an I/O question. In other words, when character data enters your program, you convert it to the encoding you use internally; when it leaves the program, you convert it to an external encoding. In between, you use whatever your program uses, and you specify it at compile time.

I'd be willing to cooperate on this project, too. I'm mostly busy with my new I/O stuff, but the tagged strings form the foundation of the text I/O part, so I need the character library sooner or later anyway. Sebastian Redl
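Converting at the I/O boundary, as described above, might be built on POSIX iconv. A minimal sketch, assuming a glibc-style iconv (error handling and partial-sequence handling omitted):

```cpp
#include <iconv.h>
#include <cstddef>
#include <string>

// Convert a Latin-1 byte string to UTF-8 using POSIX iconv.
// Latin-1 -> UTF-8 expands each byte to at most two bytes, so a
// buffer of twice the input size is always sufficient.
std::string latin1_to_utf8(const std::string& in) {
    iconv_t cd = iconv_open("UTF-8", "ISO-8859-1");
    std::string out(in.size() * 2 + 4, '\0');
    char* inp = const_cast<char*>(in.data());  // glibc iconv takes char**
    std::size_t inleft = in.size();
    char* outp = &out[0];
    std::size_t outleft = out.size();
    iconv(cd, &inp, &inleft, &outp, &outleft);
    iconv_close(cd);
    out.resize(out.size() - outleft);
    return out;
}
```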

Sebastian Redl wrote:
Phil Endecott wrote:
Dear All,
Something that I have been thinking about for a while is storing strings tagged with their character set. Since I now have a practical need for this I plan to try to implement something. Your feedback would be appreciated.
Hi,
I've played around with this concept a lot already. I basically think that encoding-bound strings are a MUST for proper, safe, internationalized string handling. Everything else, in particular the current situation, is a mess.
If you want, I can package up what I've done so far (not really much, but a lot of comments containing concepts) and put it somewhere.
Yes please.
One thing: I think runtime-tagged strings are useless. Programming should happen with one or at most two fixed encodings, known at compile time. Because of the differences in behaviour in encodings (base unit 8, 16 or 32 bits, or 8 with various endians, fixed-length encodings vs variable-length encodings, ...), it is not good to write a type handling them all at runtime. I think that runtime-specified string conversion should be an I/O question. In other words, when character data enters your program, you convert it to the encoding you use internally, when it leaves the program, you convert it to an external encoding. In-between, you use whatever your program uses, and you specify it at compile time.
Consider processing a MIME email. It may have several parts, each with a different character set. I would imagine a flow something like this:

read in message as a sequence-of-bytes
for each message part {
  find the character set
  put the body in a run-time-tagged string
  do something with the body
}

Now, "do something with the body" might be "save it in a file", i.e.

f << "content-type: text/plain; charset=\"" << body.charset << "\"\n"
  << "\n"
  << body.data;

In this case, it would be wasteful to convert to and from a compile-time-fixed character set. On the other hand, "do something with the body" might be "search for <string>". In this case, converting to a compile-time-fixed character set, preferably a universal one, would be best:

ucs4string body_ucs4 = body.data; // if we have implicit conversion...
body_ucs4.find("hello");

What I'm saying is: yes, good practice is very often to convert to a fixed character set before doing anything to the data; but no, I don't think that can happen exclusively inside an I/O layer. So some method of representing run-time-tagged data - if only temporarily, before conversion - is needed.
I'd be willing to cooperate on this project, too. I'm mostly busy with my new I/O stuff, but the tagged strings form the foundation of the text I/O part, so I need the character library sooner or later anyway.
I have a small project in progress which needs a subset of this functionality, and I'm planning to use it as a testbed for these ideas. I'll post again when I have something more concrete. The area where I would most appreciate some input is in how to provide a "user-extensible enum or type tag" for character sets. Regards, Phil.

Phil Endecott wrote:
Sebastian Redl wrote:
If you want, I can package up what I've done so far (not really much, but a lot of comments containing concepts) and put it somewhere.
Yes please.
Here you are: http://windmuehlgasse.getdesigned.at/characters.zip

Note two things about this archive:

1) The converters have a terrible interface. It's unfriendly and still not powerful enough to do what I want. That part has to be completely redesigned. That is not to say that there aren't some worthwhile ideas there, though.

2) I make a very strict distinction between the terms "character set" and "encoding". A character set is a mapping of abstract characters to code points, which are integral values. ISO 10646 and Unicode define such a mapping. US-ASCII is such a mapping, too. Early versions of the ISO-8859 family of standards defined such mappings. An encoding is a way to map these code points to sequences of bytes. UTF-8, UTF-16 and UTF-32 are all encodings of Unicode. US-ASCII is its own encoding. New revisions of the ISO-8859 family are defined in terms of Unicode; they are encodings of that character set, though incomplete ones.

This distinction goes quite against common (mis)usage: from MIME content types (Content-Type: text/html; charset=...) over Java (String.getBytes(..., String charsetName), the entire java.nio.charset package), to GCC's compiler flags (-fexec-charset=...) - they all use "charset" or "character set" for what is really the encoding. Argue about "common usage" all you want - it still doesn't make sense to call UTF-8 a character set, because it isn't. XML, for example, gets it right: <?xml version="1.0" encoding="UTF-8"?> Ah, well. Rant over.
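The character-set/encoding distinction could be captured directly in the type tags (a sketch with invented names, not code from the archive above):

```cpp
#include <type_traits>

// Character sets are tag types; each encoding tag names the character
// set whose code points it encodes, plus whether it is variable-width.
struct unicode {};    // ISO 10646 / Unicode code points
struct ascii_set {};  // US-ASCII abstract characters

struct utf8     { using character_set = unicode;   static constexpr bool variable_width = true;  };
struct utf16    { using character_set = unicode;   static constexpr bool variable_width = true;  };
struct utf32    { using character_set = unicode;   static constexpr bool variable_width = false; };
struct us_ascii { using character_set = ascii_set; static constexpr bool variable_width = false; };

// Conversions between encodings of the same character set are lossless
// by construction; anything else may need a fallback policy.
template <typename From, typename To>
constexpr bool same_character_set =
    std::is_same<typename From::character_set,
                 typename To::character_set>::value;
```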
Consider processing a MIME email. It may have several parts each with a different character set. I would imagine a flow something like this:
read in message as a sequence-of-bytes
for each message part {
  find the character set
  put the body in a run-time-tagged string
  do something with the body
}
I disagree with this flow. My flow is:

read in message as a sequence-of-bytes
for each message part {
  find the type
  do I want to do something with its string form?
  yes {
    find the character set
    put the body in a compile-time-tagged string, converting from the found character set
    do something with the body
  }
  no {
    do something with the body as a byte sequence
  }
}
Now, "do something with the body" might be "save it in a file", i.e.
That would be something I do with the byte sequence.
f << "content-type: text/plain; charset=\"" << body.charset << "\"\n"
  << "\n"
  << body.data;
That actually doesn't make any sense, sorry. You can't just write a runtime-tagged string to a text stream, not with C++ iostreams being what they are. If they're open in binary mode, you should just push the bytes through (and you shouldn't use the formatted I/O operators) - all the bytes, including the MIME headers. If they're open in text mode, then it all gets really weird. Either you actually convert the string to some output encoding (in which case, why do you write the original encoding into the file?), or you don't, in which case you risk corruption. Oh, and did I mention that if the thing were a wide stream, the output operator would have to convert the runtime-tagged string to a wide string anyway? And then the file buffer would convert it back. Meh. Just stay with the raw bytes.
In this case, it would be wasteful to convert to and from a compile-time-fixed character set.
Yes. It would also be wasteful to construct a runtime-tagged string, when you could just access a section of the raw byte stream.
So some method of representing run-time-tagged data - if only temporarily, before conversion - is needed.
This representation is my converting input stream.
I have a small project in progress which needs a subset of this functionality, and I'm planning to use it as a testbed for these ideas. I'll post again when I have something more concrete. The area where I would most appreciate some input is in how to provide a "user-extensible enum or type tag" for character sets.
Maybe the archive I uploaded will help. I'm thinking of type tags with some metafunctions to specialize. Sebastian Redl

Sebastian Redl wrote:
Phil Endecott wrote:
Dear All,
Something that I have been thinking about for a while is storing strings tagged with their character set. Since I now have a practical need for this I plan to try to implement something. Your feedback would be appreciated.
Hi,
I've played around with this concept a lot already. I basically think that encoding-bound strings are a MUST for proper, safe, internationalized string handling. Everything else, in particular the current situation, is a mess.
If you want, I can package up what I've done so far (not really much, but a lot of comments containing concepts) and put it somewhere.
One thing: I think runtime-tagged strings are useless. Programming should happen with one or at most two fixed encodings, known at compile time. Because of the differences in behaviour in encodings (base unit 8, 16 or 32 bits, or 8 with various endians, fixed-length encodings vs variable-length encodings, ...), it is not good to write a type handling them all at runtime. I think that runtime-specified string conversion should be an I/O question. In other words, when character data enters your program, you convert it to the encoding you use internally, when it leaves the program, you convert it to an external encoding. In-between, you use whatever your program uses, and you specify it at compile time.
Well, having I/O facilities provide the only means for converting strings of different encodings would make using compiled libraries that use a different string encoding than my program pretty awkward, wouldn't it?

I agree that the "runtime tagging" suggestion seems overkill. Maybe providing lazily evaluated, possibly cached, compile- and runtime "string views" is a good idea, however (and might probably give a nice framework to implement encoding conversions as well). Examples:

// ...given some strings a, b, and c
string<utf8> s = a + b + ":" + c;
// can get away with exactly one allocation since operator+ can
// return a compile-time string view

string<utf8> s = "world";
string_view<utf8> v = "Hello " + s + "!";
std::cout << v << std::endl;
s = "you";
std::cout << v << std::endl;
// Output:
// Hello world!
// Hello you!

// For a more "real-world" use case of runtime string_views
// consider a lexer taking apart an in-memory file, with SBO
// applied to the string_view template...

Regards, Tobias
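The view behaviour sketched above can be demonstrated with a much-simplified view that holds a reference to the variable part (the name view3 is invented; a real implementation would use expression templates so arbitrary concatenations work):

```cpp
#include <string>

// A lazily evaluated three-part "view": the middle part is held by
// reference, so later changes to the referenced string are visible,
// and str() materializes the result with a single reserve().
struct view3 {
    std::string prefix;
    const std::string& var;
    std::string suffix;

    std::string str() const {
        std::string out;
        out.reserve(prefix.size() + var.size() + suffix.size());
        out += prefix;
        out += var;
        out += suffix;
        return out;
    }
};
```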

I have some work for it, see http://lists.boost.org/Archives/boost/2007/08/125945.php 2007/9/24, Phil Endecott <spam_from_boost_dev@chezphil.org>:
Dear All,
Something that I have been thinking about for a while is storing strings tagged with their character set. Since I now have a practical need for this I plan to try to implement something. Your feedback would be appreciated.
The starting point is the idea that the character set of a string may be known at compile time or at run time, and so two types of tagging are possible. First compile-time tagging:
template <character_set> class tagged_string { ... };
tagged_string<utf8> s1; tagged_string<latin1> s2;
Some typedefs would be appropriate:
typedef tagged_string<utf8> utf8string;
Now run-time tagging:
class rt_tagged_string { private: character_set cs; public: rt_tagged_string(character_set cs_): cs(cs_) ... ... };
rt_tagged_string(utf8) s3;
(Concise-yet-clear names for any of these classes would be great.)
I propose to implement conversion between the strings using iconv and/or GNU recode. It would be easy to allow this conversion to happen invisibly, but it might be wiser to make conversion explicit.
I'm not sure what the 'character_set' that I've used above should be. It needs to be some sort of user-extensible enum or type-tag.
We need character types of 8, 16 and 32 bits. wchar_t is not useful here because it's not defined whether it's 16 or 32 bits. So I propose the following, modelled after cstdint:
typedef char char8_t; typedef <implementation-defined> char16_t; typedef <implementation-defined> char32_t;
I then propose a character_set_traits class:
template <character_set> class character_set_traits;
template <> class character_set_traits<utf8> { typedef char8_t char_t; const bool variable_width = true; ... };
For the fixed-width, compile-time-tagged strings I think it makes sense to inherit from std::basic_string< character_set_traits<charset>::char_t >. The only problem I can see with this is that
latin1string s1 = "hello world"; s1.substr(1,5) <--- this returns a std::string, not a latin1string
If latin1string has a constructor from std::string (which is its own base type) that's fine, i.e. we can still write:
latin1string s2 = s1.substr(1,5);
but unfortunately we can also write
latin2string s3 = s1.substr(1,5);
which is not so good.
So a different approach is to define a set of character-set-specific character types, and build string types from them:
typedef char8_t latin1char; typedef char8_t latin2char;
For variable-width character sets, the methods of std::string are less useful (though far from useless). I understand that there's already a utf8 iterator somewhere in Boost, can it help?
For run-time character sets, is there any way to provide e.g. run-time iterators?
I imagine these strings being used as follows:
- Input to the program is either run-time or compile-time tagged with any character set.
- Data that is not manipulated in any way is just passed through.
- Data that will be processed will first be converted to a suitable, compile-time-tagged character set, and if appropriate converted back afterwards.
So the absence of (useful) string operations on run-time-tagged or variable-width character set data is not a problem.
For conversions, there is the question of partial characters in variable-width character sets. If a program is processing data in chunks it may be legitimate for a chunk boundary to fall in the middle of a UTF-8 character. IIRC, iconv has a method to deal with this which we could expose in a stateful converter:
charset_converter utf8_to_ucs4(utf8, ucs4);
while (!eof) {
  utf8string s = get_chunk();
  ucs4string t = utf8_to_ucs4(s);
  send_chunk(t);
}
utf8_to_ucs4.flush();
- but many applications may only need a stateless converter.
I will be working on this over the next couple of weeks, so any feedback would be much appreciated.
Regards,
Phil.
_______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost

I think such a library should be based on a range algorithm like oven( http://p-stade.sourceforge.net/oven/doc/html/index.html). 2007/9/29, Atry <pop.atry@gmail.com>:
I have some work for it, see http://lists.boost.org/Archives/boost/2007/08/125945.php
2007/9/24, Phil Endecott < spam_from_boost_dev@chezphil.org>:
[snip]
participants (16)
- Andrey Semashev
- Atry
- David Rodríguez Ibeas
- Felipe Magno de Almeida
- Grubb, Jared
- James Porter
- Jeremy Maitin-Shepard
- Joseph Gauterin
- Kirit Sælensminde
- Michael Marcin
- Phil Endecott
- Sebastian Redl
- Sid Sacek
- Simon Atanasyan
- Steven Watanabe
- Tobias Schwinger