Re: [boost] Strings tagged with their character set

24 Sep 2007

Sebastian Redl wrote:
...
Phil Endecott wrote:
...
Dear All,
Something that I have been thinking about for a while is storing 
strings tagged with their character set.  Since I now have a practical 
need for this I plan to try to implement something.  Your feedback 
would be appreciated.
Hi,
I've played around with this concept a lot already. I basically think
that encoding-bound strings are a MUST for proper, safe,
internationalized string handling. Everything else, in particular the
current situation, is a mess.
If you want, I can package up what I've done so far (not really much,
but a lot of comments containing concepts) and put it somewhere.
Yes please.
...
One thing: I think runtime-tagged strings are useless. Programming
should happen with one or at most two fixed encodings, known at compile
time. Because of the differences in behaviour in encodings (base unit 8,
16 or 32 bits, or 8 with various endians, fixed-length encodings vs
variable-length encodings, ...), it is not good to write a type handling
them all at runtime. I think that runtime-specified string conversion
should be an I/O question. In other words, when character data enters
your program, you convert it to the encoding you use internally, when it
leaves the program, you convert it to an external encoding. In-between,
you use whatever your program uses, and you specify it at compile time.
Consider processing a MIME email.  It may have several parts each with 
a different character set.  I would imagine a flow something like this:

read in message as a sequence-of-bytes
for each message part {
   find the character set
   put the body in a run-time-tagged string
   do something with the body
}

Now, "do something with the body" might be "save it in a file", i.e.

f << "content-type: text/plain; charset=\"" << body.charset << "\"\n"
   << "\n"
   << body.data;

In this case, it would be wasteful to convert to and from a 
compile-time-fixed character set.
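
A minimal sketch of this pass-through case, assuming a hypothetical `tagged_string` struct and `save_part` function (neither is taken from an existing library). The body bytes are never decoded; only the charset name travels with them:

```cpp
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical run-time-tagged string: undecoded bytes plus the charset
// name found in the message part's headers.
struct tagged_string {
    std::string charset;  // e.g. "iso-8859-1", as declared by the part
    std::string data;     // raw bytes, still in that charset
};

// Write a message part back out without transcoding the body.
void save_part(std::ostream& out, const tagged_string& body) {
    out << "content-type: text/plain; charset=\"" << body.charset << "\"\n"
        << "\n"
        << body.data;
}
```

Because `data` is only copied, this works for any charset, including ones the program has no converter for.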

On the other hand, "do something with the body" might be "search for 
<string>".  In this case, converting to a compile-time-fixed character 
set, preferably a universal one, would be best:

ucs4string body_ucs4 = body.data;  // if we have implicit conversion...
body_ucs4.find("hello");

What I'm saying is: yes, good practice is very often to convert to a 
fixed character set before doing anything to the data; but no, I don't 
think that conversion can happen exclusively inside an I/O layer.  So some 
method of representing run-time-tagged data - if only temporarily, 
before conversion - is needed.
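
The "tag temporarily, then convert" step might look roughly like the sketch below. The names `tagged_string`, `to_ucs4` and `contains` are hypothetical, `std::u32string` stands in for a UCS-4 string type, and only ISO-8859-1 is handled (each byte maps directly to the code point of the same value); a real converter would dispatch over many more charsets:

```cpp
#include <stdexcept>
#include <string>

// Hypothetical run-time-tagged string, held only until conversion.
struct tagged_string {
    std::string charset;
    std::string data;
};

// Convert to a fixed internal encoding (UCS-4). This sketch supports
// ISO-8859-1 only and throws for every other charset name.
std::u32string to_ucs4(const tagged_string& s) {
    if (s.charset == "iso-8859-1") {
        std::u32string out;
        out.reserve(s.data.size());
        for (unsigned char byte : s.data) {
            out.push_back(static_cast<char32_t>(byte));
        }
        return out;
    }
    throw std::runtime_error("unsupported charset: " + s.charset);
}

// Search on the converted text, so the needle is charset-independent.
bool contains(const tagged_string& body, const std::u32string& needle) {
    return to_ucs4(body).find(needle) != std::u32string::npos;
}
```

For example, `contains(body, U"hello")` gives the same answer whatever charset the part arrived in, as long as a converter for that charset exists.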
...
I'd be willing to cooperate on this project, too. I'm mostly busy with
my new I/O stuff, but the tagged strings form the foundation of the text
I/O part, so I need the character library sooner or later anyway.
I have a small project in progress which needs a subset of this 
functionality, and I'm planning to use it as a testbed for these 
ideas.  I'll post again when I have something more concrete.  The area 
where I would most appreciate some input is in how to provide a 
"user-extensible enum or type tag" for character sets.
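
One possible shape for the type-tag variant, offered only as a starting point for discussion: each character set is a small tag struct, and a string template is parameterized on it. All names here (`utf8`, `ucs4`, `koi8_r`, `cs_string`) are hypothetical. A user extends the set by defining a new tag type, so no central enum has to be edited:

```cpp
#include <string>
#include <utility>

// Tags a hypothetical library might ship.
struct utf8 {
    using unit = char;
    static const char* name() { return "utf-8"; }
};
struct ucs4 {
    using unit = char32_t;
    static const char* name() { return "iso-10646-ucs-4"; }
};

// A tag added by a user, outside the library.
struct koi8_r {
    using unit = char;
    static const char* name() { return "koi8-r"; }
};

// String whose charset is part of its type; strings in different
// charsets are distinct types and cannot be mixed by accident.
template <typename Charset>
class cs_string {
public:
    using unit_type = typename Charset::unit;

    explicit cs_string(std::basic_string<unit_type> units)
        : units_(std::move(units)) {}

    static std::string charset_name() { return Charset::name(); }
    const std::basic_string<unit_type>& units() const { return units_; }

private:
    std::basic_string<unit_type> units_;
};
```

This covers only the compile-time side; a run-time tag would still need some mapping from charset names to these types, which the sketch does not attempt.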

Regards,

Phil.