Strings with character sets

Dear All,

After a rather longer delay than I had planned, I have some proof-of-concept code for strings tagged with character sets. You might like to first look at the example usage, here:

http://svn.chezphil.org/libpbe/trunk/examples/charsets.cc

Note that this file is written using UTF8, but the web server seems to be declaring it to be latin1....

The actual implementation is here:

http://svn.chezphil.org/libpbe/trunk/include/charset.hh

This is far from complete, but it does have some useful functionality; mainly I have been using it to work out what is possible.

Your comments would be very much appreciated.

Regards,
Phil.

I've been thinking about this off and on as well, though have been a little too busy to give it the write-up it deserves. That said, I think your code is a pretty good start.

While I agree that tagged strings shouldn't automatically convert on assignment, I think recode() isn't the most useful way to go about it. In practice, I expect that most code conversion would occur during I/O, so I'd prefer to see the conversion done by the stream itself. recode() could still exist as a convenience function, though.

On the subject of converting between different encodings of strings, I noticed that you had some concerns about assignment between two different encodings using the same underlying type (latin1_string s = utf8_string("foo") for example). This could be resolved by using a nominally different char_traits class when inheriting from basic_string. However, this would cause problems with I/O streams, since they expect a particular character type and char_traits. This goes back to my point above: the I/O streams should be aware of string tagging (if not directly responsible for it).

I'll need to think about how to specify character sets so that they're usable at compile time and run time, though my instinct would be to use subclasses that can be stored in a map of some sort. The subclassing would handle compile-time tagging, and the map would handle run-time tagging:

    class utf8 : public charset_base { ... };
    charset_map["utf8"] = new utf8();
    ...
    tagged_string<utf8> foo;
    rt_tagged_string bar;
    bar.set_encoding("utf8");

This should combine the benefits of your first and third choices (type tags and objects), though I haven't thought about this enough to be confident that it's the right way to go. If I get the chance, I'll try to come up with a proof of concept for my ideas, though I'm in the middle of some other things right now.

- James

Phil Endecott wrote:
Dear All,
After a rather longer delay than I had planned, I have some proof-of-concept code for strings tagged with character sets. You might like to first look at the example usage, here:
http://svn.chezphil.org/libpbe/trunk/examples/charsets.cc
Note that this file is written using UTF8, but the web server seems to be declaring it to be latin1....
The actual implementation is here:
http://svn.chezphil.org/libpbe/trunk/include/charset.hh
This is far from complete, but it does have some useful functionality; mainly I have been using it to work out what is possible.
Your comments would be very much appreciated.
Regards,
Phil.

Hi James, thanks for replying.

James Porter wrote:
I've been thinking about this off and on as well, though have been a little too busy to give it the write-up it deserves. That said, I think your code is a pretty good start. While I agree that tagged strings shouldn't automatically convert on assignment, I think recode() isn't the most useful way to go about it.
In practice, I expect that most code conversion would occur during I/O, so I'd prefer to see the conversion done by the stream itself. recode() could still exist as a convenience function, though.
Yes, other people have suggested similar things. Even if it were true that most charset conversion occurred during I/O - and that's not been my experience in my own work - then I would still argue that charset conversion should be available for use in other contexts. I see my recode() member function (i.e. utf8_string s2 = s1.recode<utf8>()) ultimately being a convenience around some sort of free function or functor. The need to track shift-states and partial characters makes this a bit complex, though.
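[Editor's note: a minimal sketch of the free-conversion-function idea, using invented tag and string types rather than the classes in charset.hh. Latin-1 to UTF-8 is the easy direction, since every Latin-1 byte is a code point below U+0100 and no shift state or partial-character tracking is needed:]

```cpp
#include <cassert>
#include <string>

// Hypothetical per-charset tag types.
struct latin1 {};
struct utf8 {};

// A trivial tagged string: raw bytes plus a compile-time encoding tag.
template <typename Charset>
struct tagged_string {
    std::string data;
};

// Free conversion function; a member recode() could simply forward here.
tagged_string<utf8> convert(const tagged_string<latin1>& s)
{
    tagged_string<utf8> out;
    for (unsigned char c : s.data) {
        if (c < 0x80) {
            out.data += static_cast<char>(c);          // ASCII passes through
        } else {
            out.data += static_cast<char>(0xC0 | (c >> 6));    // 2-byte sequence
            out.data += static_cast<char>(0x80 | (c & 0x3F));
        }
    }
    return out;
}
```

The reverse direction (UTF-8 to Latin-1) is where the partial-character tracking Phil mentions becomes unavoidable.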
On the subject of converting between different encodings of strings, I noticed that you had some concerns about assignment between two different encodings using the same underlying type (latin1_string s = utf8_string("foo") for example). This could be resolved by using a nominally different char_traits class when inheriting from basic_string.
Yes; it has been suggested that they differ in their state_type. I plan to investigate this, but if someone more knowledgeable would like to do so, please go ahead.
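[Editor's note: as an illustration of the nominally-different-traits idea (all names here are invented), an empty traits template parameterised on a tag is enough to make the two string types unrelated, so cross-encoding assignment fails at compile time:]

```cpp
#include <cassert>
#include <string>
#include <type_traits>

struct latin1_tag {};
struct utf8_tag {};

// Behaves exactly like std::char_traits<char>, but is a distinct type
// per encoding, so basic_strings of different encodings are unrelated.
template <typename CharsetTag>
struct charset_traits : std::char_traits<char> {};

using latin1_string = std::basic_string<char, charset_traits<latin1_tag>>;
using utf8_string   = std::basic_string<char, charset_traits<utf8_tag>>;

static_assert(!std::is_assignable<latin1_string&, utf8_string>::value,
              "cross-encoding assignment is rejected at compile time");
```

As James notes below, the cost is that these strings no longer match the char_traits the standard I/O streams expect.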
However, this would cause problems with I/O streams, since they expect a particular character type and char_traits. This goes back to my point above: the I/O streams should be aware of string tagging (if not directly responsible for it).
I imagine that an I/O streams library or some sort of adapter layer compatible with these strings would be necessary.
I'll need to think about how to specify character sets so that they're usable at compile time and run time, though my instinct would be to use subclasses that can be stored in a map of some sort. The subclassing would handle compile-time tagging, and the map would handle run-time tagging:
    class utf8 : public charset_base { ... };
    charset_map["utf8"] = new utf8();
    ...
    tagged_string<utf8> foo;
    rt_tagged_string bar;
    bar.set_encoding("utf8");
This should combine the benefits of your first and third choices (type tags and objects), though I haven't thought about this enough to be confident that it's the right way to go.
Yes, this has some advantages. But using a map has the disadvantage that lookups are more expensive, compared to the array indexed by enum that I have; in my code, getting the char* name of a charset is a compile-time-constant operation. I'm not sure how much that matters in practice.

Thanks for your feedback. Does anyone else have any comments? Do please have a look at my example code (http://svn.chezphil.org/libpbe/trunk/examples/charsets.cc) and tell me how well it would fit in with your approaches to charset conversion.

Regards,
Phil.
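[Editor's note: the array-indexed-by-enum scheme Phil describes can be sketched in a few lines; the names are illustrative, not the ones in charset.hh:]

```cpp
#include <cassert>
#include <cstring>

// The enum of known charsets doubles as the index into a name table, so
// mapping a compile-time charset id to its char* name is a constant
// expression, with no run-time map lookup keyed by string.
enum charset_id { cs_latin1, cs_utf8, cs_count };

constexpr const char* charset_names[cs_count] = { "ISO-8859-1", "UTF-8" };

constexpr const char* charset_name(charset_id id) { return charset_names[id]; }

static_assert(charset_name(cs_utf8) != nullptr,
              "usable in constant expressions");
```

The trade-off James identifies is real: the enum is closed, whereas a map keyed by name can register new encodings at run time.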

Phil Endecott wrote:
Yes, other people have suggested similar things. Even if it were true that most charset conversion occurred during I/O - and that's not been my experience in my own work - then I would still argue that charset conversion should be available for use in other contexts.
I don't mean to say that the recode function shouldn't exist, but that it should exist only as a convenience function. The actual conversion should be directly usable by I/O operations, so a string doesn't need to be fully converted (and allocated) before output. For all their problems (mostly runtime-specified conversion), the std::codecvt facets make it fairly easy to handle partial conversion and shift states.
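[Editor's note: a sketch of how a stream-side converter drives a codecvt facet, here the classic locale's codecvt<wchar_t, char, mbstate_t>, which is only a degenerate (widening) converter but exercises the real interface. The mbstate_t argument is what lets a caller resume after a character split across buffer boundaries, and in() reports std::codecvt_base::partial in that case instead of failing:]

```cpp
#include <cwchar>
#include <locale>
#include <string>

// Decode a byte range to wide characters through the classic locale's
// codecvt facet. A streaming converter would keep `state` alive between
// calls and feed in() successive buffer chunks.
std::wstring decode(const char* first, const char* last)
{
    const auto& cvt =
        std::use_facet<std::codecvt<wchar_t, char, std::mbstate_t>>(
            std::locale::classic());
    std::mbstate_t state{};
    std::wstring out(last - first, L'\0');
    const char* from_next = nullptr;
    wchar_t* to_next = nullptr;
    cvt.in(state, first, last, from_next,
           &out[0], &out[0] + out.size(), to_next);
    out.resize(to_next - &out[0]);
    return out;
}
```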
I imagine that an I/O streams library or some sort of adapter layer compatible with these strings would be necessary.
I think this is key, and goes back to my argument that string conversion should be seen as an I/O operation, and separate from the strings themselves. Unless you merely want a raw byte array, you're (conceptually) converting bytes into code points and then into some internal storage container. For ASCII this is trivial, since each byte is equivalent to a code point and the storage container is just a char/byte, so you're back to where you started. For Unicode, this is considerably more complicated (or we wouldn't be discussing it!). The stream must at least be aware of the external (file) encoding, in order to keep track of shift states. I don't think we'd be able to delegate that responsibility to the strings we'd be filling with data.
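[Editor's note: to make the "bytes into code points" step concrete, here is a toy UTF-8 decoder (no validation of overlong forms, surrogates, or truncated input; all names are invented). The cp/pending pair is precisely the state a charset-aware stream would have to carry across buffer boundaries:]

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Decode UTF-8 bytes into Unicode code points. `pending` counts the
// continuation bytes still expected for the code point being built.
std::vector<std::uint32_t> utf8_to_codepoints(const std::string& bytes)
{
    std::vector<std::uint32_t> cps;
    std::uint32_t cp = 0;
    int pending = 0;
    for (unsigned char b : bytes) {
        if (pending == 0) {
            if      (b < 0x80) { cps.push_back(b); }            // 1-byte
            else if (b < 0xE0) { cp = b & 0x1F; pending = 1; }  // 2-byte lead
            else if (b < 0xF0) { cp = b & 0x0F; pending = 2; }  // 3-byte lead
            else               { cp = b & 0x07; pending = 3; }  // 4-byte lead
        } else {
            cp = (cp << 6) | (b & 0x3F);                        // continuation
            if (--pending == 0) cps.push_back(cp);
        }
    }
    return cps;
}
```

If the input ends mid-sequence, a streaming version would stash cp and pending rather than discard them, which is the decoder-side analogue of a shift state.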
Yes, this has some advantages. But using a map has the disadvantage that lookups are more expensive, compared to the array indexed by enum that I have; in my code, getting the char* name of a charset is a compile-time-constant operation. I'm not sure how much that matters in practice.
Another option would be, for every encoding, to create a class (for compile-time tagging) and a global instance of that class (for run-time tagging).

- James
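[Editor's note: James's class-plus-global-instance idea might look roughly like this; every name below is invented for illustration:]

```cpp
#include <cassert>
#include <string>

// One class per encoding serves as the compile-time tag...
struct charset_base {
    virtual const char* name() const = 0;
    virtual ~charset_base() {}
};

struct utf8_cs : charset_base {
    const char* name() const override { return "UTF-8"; }
};

// ...and one global instance of it serves as the run-time tag, so both
// forms of tagging share a single definition of the encoding.
const utf8_cs utf8_instance;

template <typename Charset>
struct tagged_string {                    // compile-time tagged
    std::string data;
    const char* encoding() const { static const Charset cs; return cs.name(); }
};

struct rt_tagged_string {                 // run-time tagged
    std::string data;
    const charset_base* cs = &utf8_instance;
    const char* encoding() const { return cs->name(); }
};
```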
participants (2)
- James Porter
- Phil Endecott