Re: [boost] Strings tagged with their character set

26 Sep 2007

Phil Endecott wrote:
...
Sebastian Redl wrote:
...
If you want, I can package up what I've done so far (not really much,
but a lot of comments containing concepts) and put it somewhere.
Yes please.
Here you are:
http://windmuehlgasse.getdesigned.at/characters.zip

Note two things about this archive:
1) The converters have a terrible interface. It's unfriendly and still
not powerful enough to do what I want. That part has to be completely
redesigned. That is not to say that there aren't some worthwhile ideas
there, though.
2) I make a very strict distinction between the terms "character set"
and "encoding". A character set is a mapping of abstract characters to
code points, which are integral values. ISO 10646 and Unicode define
such a mapping. US-ASCII is such a mapping, too. Early versions of the
ISO-8859 family of standards defined such mappings. An encoding is a way
to map these code points to sequences of bytes. UTF-8, UTF-16 and UTF-32
are all encodings of Unicode. US-ASCII is its own encoding. New
revisions of the ISO-8859 family are defined in terms of Unicode; they
are encodings of that character set, though incomplete ones.
This distinction goes quite against common (mis)usage: from MIME content
types (Content-Type: text/html; charset=...) over Java
(String.getBytes(..., String charsetName), the entire java.nio.charset
package), to GCC's compiler flags (-fexec-charset=...) - they all use
charset or character set for what is really the encoding. Argue about
"common usage" all you want - it still doesn't make sense to call UTF-8
a character set, because it isn't. XML, for example, gets it right:
<?xml version="1.0" encoding="UTF-8"?>

Ah, well. Rant over.
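
To make the distinction concrete, here is a minimal sketch (not code from the archive; the names code_point, encode_utf8 and encode_utf16be are made up for illustration). The character set assigns the integer U+20AC to the euro sign; the encodings then turn that one integer into different byte sequences. The sketch assumes the input is a valid Unicode scalar value and does no error checking.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// A code point is just the integer the character set (here: Unicode)
// assigns to an abstract character. Hypothetical alias for illustration.
using code_point = std::uint32_t;

static std::uint8_t low_byte(std::uint32_t x) {
    return static_cast<std::uint8_t>(x & 0xFF);
}

// UTF-8: maps one code point to 1..4 bytes.
// Assumes cp is a valid scalar value (<= 0x10FFFF, not a surrogate).
std::vector<std::uint8_t> encode_utf8(code_point cp) {
    std::vector<std::uint8_t> out;
    if (cp < 0x80) {
        out.push_back(low_byte(cp));
    } else if (cp < 0x800) {
        out.push_back(low_byte(0xC0 | (cp >> 6)));
        out.push_back(low_byte(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
        out.push_back(low_byte(0xE0 | (cp >> 12)));
        out.push_back(low_byte(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(low_byte(0x80 | (cp & 0x3F)));
    } else {
        out.push_back(low_byte(0xF0 | (cp >> 18)));
        out.push_back(low_byte(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(low_byte(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(low_byte(0x80 | (cp & 0x3F)));
    }
    return out;
}

// UTF-16, serialized big-endian: one or two 16-bit units per code point.
std::vector<std::uint8_t> encode_utf16be(code_point cp) {
    std::vector<std::uint8_t> out;
    auto put_unit = [&out](std::uint32_t unit) {
        out.push_back(low_byte(unit >> 8));
        out.push_back(low_byte(unit));
    };
    if (cp < 0x10000) {
        put_unit(cp);
    } else {
        cp -= 0x10000;                      // 20 bits remain
        put_unit(0xD800 | (cp >> 10));      // high surrogate
        put_unit(0xDC00 | (cp & 0x3FF));    // low surrogate
    }
    return out;
}

// Same character, same code point, two different byte sequences.
void check_euro_sign() {
    const code_point euro = 0x20AC;
    assert((encode_utf8(euro) == std::vector<std::uint8_t>{0xE2, 0x82, 0xAC}));
    assert((encode_utf16be(euro) == std::vector<std::uint8_t>{0x20, 0xAC}));
}
```

The code point never changes; only the bytes do. That is why "charset=UTF-8" names an encoding, not a character set.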
...
Consider processing a MIME email.  It may have several parts each with 
a different character set.  I would imagine a flow something like this:
read in message as a sequence-of-bytes
for each message part {
   find the character set
   put the body in a run-time-tagged string
   do something with the body
}
I disagree with this flow. My flow is:
read in message as a sequence-of-bytes
for each message part {
  find the type
  do I want to do something with its string form?
  yes {
    find the character set
    put the body in a compile-time-tagged string, converting from the
found character set
    do something with the body
  }
  no {
    do something with the body as a byte sequence
  }
}
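
A rough C++ sketch of the flow above, to show where the conversion sits. Everything here is hypothetical and not taken from the archive: message_part, tagged_string, decode_to_utf8 and process are illustration names, the "do I want its string form?" test is simplified to "is it text/*", and the converter only knows three encoding names and does no validation.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <vector>

typedef std::vector<unsigned char> bytes;

// Hypothetical compile-time tag and tagged string.
struct utf8 {};
template <typename Encoding>
struct tagged_string { std::string data; };

// Hypothetical, already-parsed MIME part.
struct message_part {
    std::string type;     // e.g. "text/plain"
    std::string charset;  // value of the charset= parameter, lower-cased
    bytes body;           // raw bytes, untouched
};

// The run-time encoding name is consumed here and nowhere else:
// what comes out is always UTF-8, known at compile time.
tagged_string<utf8> decode_to_utf8(const bytes& in, const std::string& charset) {
    tagged_string<utf8> out;
    if (charset == "utf-8" || charset == "us-ascii") {
        out.data.assign(in.begin(), in.end());  // sketch: no validation
    } else if (charset == "iso-8859-1") {
        // ISO-8859-1 byte values equal the Unicode code points U+0000..U+00FF.
        for (bytes::size_type i = 0; i < in.size(); ++i) {
            const unsigned char b = in[i];
            if (b < 0x80) {
                out.data.push_back(static_cast<char>(b));
            } else {
                out.data.push_back(static_cast<char>(0xC0 | (b >> 6)));
                out.data.push_back(static_cast<char>(0x80 | (b & 0x3F)));
            }
        }
    } else {
        throw std::runtime_error("unsupported encoding: " + charset);
    }
    return out;
}

struct processed {
    std::vector<tagged_string<utf8> > texts;  // parts handled as strings
    std::vector<bytes> blobs;                 // parts handled as raw bytes
};

processed process(const std::vector<message_part>& parts) {
    processed result;
    for (std::vector<message_part>::size_type i = 0; i < parts.size(); ++i) {
        const message_part& p = parts[i];
        if (p.type.compare(0, 5, "text/") == 0) {
            // yes: find the character set, convert, work on the string
            result.texts.push_back(decode_to_utf8(p.body, p.charset));
        } else {
            // no: work on the byte sequence as it is
            result.blobs.push_back(p.body);
        }
    }
    return result;
}

void check_flow() {
    std::vector<message_part> parts;
    parts.push_back(message_part{"text/plain", "iso-8859-1", {0x63, 0x61, 0x66, 0xE9}});
    parts.push_back(message_part{"image/png", "", {0x89, 0x50, 0x4E, 0x47}});
    const processed r = process(parts);
    assert(r.texts.size() == 1 && r.texts[0].data == "caf\xC3\xA9");
    assert(r.blobs.size() == 1 && r.blobs[0].size() == 4);
}
```

Note that no run-time-tagged string type appears anywhere: the encoding name lives only as an argument to the conversion.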
...
Now, "do something with the body" might be "save it in a file", i.e.
That would be something I do with the byte sequence.
...
f << "content-type: text/plain; charset=\"" << body.charset << "\"\n"
   << "\n"
   << body.data;
That actually doesn't make any sense, sorry. You can't just write a
runtime-tagged string to a text stream, not with C++ iostreams being
what they are. If they're open in binary mode, you should just push the
bytes through (and you shouldn't use the formatted I/O operators) - all
the bytes, including the MIME headers. If they're open in text mode,
then it all gets really weird. Either you actually convert the string to
some output encoding (in which case, why do you write the original
encoding into the file?), or you don't, in which case you risk
corruption. Oh, and did I mention that if the thing were a wide stream,
the output operator would have to convert the runtime-tagged string to a
wide string anyway? And then the file buffer would convert it back.
Meh. Just stay with the raw bytes.
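
A minimal sketch of what "stay with the raw bytes" could look like, using only unformatted output. write_raw is a hypothetical helper name; the stream is assumed to have been opened with std::ios::binary when it is a file.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Push bytes through unchanged: unformatted write(), no operator<<,
// so no formatting and no reinterpretation of the bytes as text.
void write_raw(std::ostream& out, const std::vector<unsigned char>& data) {
    if (data.empty()) return;
    out.write(reinterpret_cast<const char*>(&data[0]),
              static_cast<std::streamsize>(data.size()));
}

// For a file, the stream would be opened like this (not run here):
//   std::ofstream f("part.bin", std::ios::out | std::ios::binary);
//   write_raw(f, headers_and_body);

void check_write_raw() {
    std::ostringstream sink;
    const std::vector<unsigned char> data = {0x41, 0x0A, 0x00, 0xE9};
    write_raw(sink, data);
    const std::string written = sink.str();
    assert(written.size() == 4);  // the embedded 0x00 survives
    assert(static_cast<unsigned char>(written[3]) == 0xE9);
}
```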
...
In this case, it would be wasteful to convert to and from a 
compile-time-fixed character set.
Yes. It would also be wasteful to construct a runtime-tagged string,
when you could just access a section of the raw byte stream.
...
So some 
method of representing run-time-tagged data - if only temporarily, 
before conversion - is needed.
This representation is my converting input stream.
...
I have a small project in progress which needs a subset of this 
functionality, and I'm planning to use it as a testbed for these 
ideas.  I'll post again when I have something more concrete.  The area 
where I would most appreciate some input is in how to provide a 
"user-extensible enum or type tag" for character sets.
Maybe the archive I uploaded will help. I'm thinking of type tags with
some metafunctions to specialize.
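
Roughly what I mean by type tags with metafunctions, as a sketch only. None of these names (encoding_traits, is_fixed_width, the *_tag types) are from the archive, and the set of traits is just a guess at what a real design would need. A user extends the set by defining a new tag type and specializing the traits for it.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <type_traits>
#include <vector>

// Tags are empty types; they exist only as compile-time names.
struct utf8_tag {};
struct utf32_tag {};

// The metafunction to specialize. The primary template is left undefined,
// so using an unregistered tag is a compile-time error.
template <typename Tag>
struct encoding_traits;

template <>
struct encoding_traits<utf8_tag> {
    typedef char code_unit;
    static constexpr std::size_t max_units_per_code_point = 4;
    static const char* name() { return "UTF-8"; }
};

template <>
struct encoding_traits<utf32_tag> {
    typedef char32_t code_unit;
    static constexpr std::size_t max_units_per_code_point = 1;
    static const char* name() { return "UTF-32"; }
};

// User extension: a new tag plus one specialization, no central enum to edit.
struct latin1_tag {};

template <>
struct encoding_traits<latin1_tag> {
    typedef unsigned char code_unit;
    static constexpr std::size_t max_units_per_code_point = 1;
    static const char* name() { return "ISO-8859-1"; }
};

// A derived metafunction built on top of the traits.
template <typename Tag>
struct is_fixed_width
    : std::integral_constant<
          bool, encoding_traits<Tag>::max_units_per_code_point == 1> {};

// A string type picks its storage from the tag.
template <typename Tag>
struct tagged_string {
    std::vector<typename encoding_traits<Tag>::code_unit> units;
};

template <typename Tag>
std::string describe() {
    return std::string(encoding_traits<Tag>::name()) +
           (is_fixed_width<Tag>::value ? " (fixed width)" : " (variable width)");
}

void check_tags() {
    static_assert(!is_fixed_width<utf8_tag>::value, "UTF-8 is variable width");
    static_assert(is_fixed_width<latin1_tag>::value, "Latin-1 is fixed width");
    assert(describe<utf8_tag>() == "UTF-8 (variable width)");
}
```

The run-time side (mapping a name like "iso-8859-1" from a MIME header to one of these tags) is a separate problem that this sketch does not touch.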

Sebastian Redl