
Sebastian Redl wrote:
Phil Endecott wrote:
Dear All,
Something that I have been thinking about for a while is storing strings tagged with their character set. Since I now have a practical need for this I plan to try to implement something. Your feedback would be appreciated.
Hi,
I've played around with this concept a lot already. I basically think that encoding-bound strings are a MUST for proper, safe, internationalized string handling. Everything else, in particular the current situation, is a mess.
If you want, I can package up what I've done so far (not really much, but a lot of comments containing concepts) and put it somewhere.
Yes please.
One thing: I think runtime-tagged strings are useless. Programming should happen with one or at most two fixed encodings, known at compile time. Because of the differences in behaviour between encodings (base units of 8, 16 or 32 bits, multi-byte units in either byte order, fixed-length vs. variable-length encodings, ...), it is not good to write a type handling them all at runtime. I think that runtime-specified string conversion should be an I/O question. In other words, when character data enters your program, you convert it to the encoding you use internally; when it leaves the program, you convert it to an external encoding. In between, you use whatever your program uses, and you specify it at compile time.
Consider processing a MIME email. It may have several parts, each with a different character set. I would imagine a flow something like this:

    read in message as a sequence-of-bytes
    for each message part {
        find the character set
        put the body in a run-time-tagged string
        do something with the body
    }

Now, "do something with the body" might be "save it in a file", i.e.

    f << "content-type: text/plain; charset=\"" << body.charset << "\"\n"
      << "\n"
      << body.data;

In this case, it would be wasteful to convert to and from a compile-time-fixed character set.

On the other hand, "do something with the body" might be "search for <string>". In this case, converting to a compile-time-fixed character set, preferably a universal one, would be best:

    ucs4string body_ucs4 = body.data;  // if we have implicit conversion...
    body_ucs4.find("hello");

What I'm saying is: yes, good practice is very often to convert to a fixed character set before doing anything to the data; but no, I don't think that that can happen exclusively inside an I/O layer. So some method of representing run-time-tagged data, if only temporarily before conversion, is needed.
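To make the two cases concrete, here is a minimal sketch of what such a run-time-tagged string might look like. Everything in it is an assumption for illustration: the names `rt_tagged_string`, `save_part` and `to_ucs4` are hypothetical, `std::u32string` stands in for the `ucs4string` mentioned above, and only two character sets are handled. It is not meant as a proposed interface.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical run-time-tagged string: the raw bytes of a message part plus
// the charset name found in its header. Nothing is decoded up front.
struct rt_tagged_string {
    std::string charset;  // e.g. "iso-8859-1" or "us-ascii"
    std::string data;     // bytes exactly as they arrived
};

// "Save it in a file" case: the bytes pass through with no conversion.
std::string save_part(const rt_tagged_string& body) {
    return "content-type: text/plain; charset=\"" + body.charset + "\"\n"
           "\n" + body.data;
}

// "Search" case: decode once into a fixed internal encoding (UCS-4 here).
// Only ISO-8859-1 and US-ASCII are supported in this sketch; both map each
// byte directly to the code point with the same value.
std::u32string to_ucs4(const rt_tagged_string& body) {
    const bool ascii = (body.charset == "us-ascii");
    if (!ascii && body.charset != "iso-8859-1")
        throw std::invalid_argument("unsupported charset: " + body.charset);

    std::u32string out;
    out.reserve(body.data.size());
    for (unsigned char c : body.data) {
        if (ascii && c > 0x7F)
            throw std::invalid_argument("byte outside US-ASCII range");
        out.push_back(static_cast<char32_t>(c));
    }
    return out;
}

// Usage: one tagged body, handled both ways.
void example() {
    rt_tagged_string body{"iso-8859-1", "caf\xE9"};
    assert(save_part(body).size() > body.data.size());  // header added, bytes untouched
    assert(to_ucs4(body).find(U"af") == 1);             // search on the decoded text
}
```

The point of the sketch is only that the tag travels with the bytes until the program decides whether a conversion is needed at all.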
I'd be willing to cooperate on this project, too. I'm mostly busy with my new I/O stuff, but the tagged strings form the foundation of the text I/O part, so I need the character library sooner or later anyway.
I have a small project in progress which needs a subset of this functionality, and I'm planning to use it as a testbed for these ideas. I'll post again when I have something more concrete.

The area where I would most appreciate some input is in how to provide a "user-extensible enum or type tag" for character sets.

Regards,

Phil.
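As a starting point for the "user-extensible type tag" question, one possible shape is an empty tag type per character set plus a traits template that user code can specialize. This is only a sketch under stated assumptions: `charset_traits`, `tagged_string`, and the tag names `utf8`, `ucs4` and `koi8_r` are all hypothetical, and no conversion machinery is shown.

```cpp
#include <cassert>
#include <string>
#include <type_traits>
#include <utility>

// Primary template is left undefined, so using an unknown tag fails to compile.
template <typename Charset> struct charset_traits;

// Tags a library might ship with (hypothetical).
struct utf8 {};
struct ucs4 {};

template <> struct charset_traits<utf8> {
    using unit_type = char;
    static constexpr bool fixed_width = false;
    static const char* name() { return "utf-8"; }
};

template <> struct charset_traits<ucs4> {
    using unit_type = char32_t;
    static constexpr bool fixed_width = true;
    static const char* name() { return "ucs-4"; }
};

// Compile-time-tagged string: the storage unit is chosen by the tag.
template <typename Charset>
class tagged_string {
public:
    using unit_type = typename charset_traits<Charset>::unit_type;

    tagged_string() = default;
    explicit tagged_string(std::basic_string<unit_type> s) : data_(std::move(s)) {}

    const std::basic_string<unit_type>& units() const { return data_; }
    static const char* charset_name() { return charset_traits<Charset>::name(); }

private:
    std::basic_string<unit_type> data_;
};

// User code adds a character set without touching the library:
struct koi8_r {};

template <> struct charset_traits<koi8_r> {
    using unit_type = char;
    static constexpr bool fixed_width = true;
    static const char* name() { return "koi8-r"; }
};

// Strings in different character sets are distinct, non-interchangeable types.
static_assert(!std::is_same<tagged_string<utf8>, tagged_string<ucs4>>::value,
              "different charsets give different types");

// Usage: the user-defined tag works like the built-in ones.
void example() {
    tagged_string<koi8_r> s(std::string("abc"));
    assert(s.units().size() == 3);
    assert(std::string(tagged_string<koi8_r>::charset_name()) == "koi8-r");
}
```

A plain `enum` cannot be extended by users after the fact, which is why the sketch uses types; a run-time tag would still need some mapping from charset names to these types.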