
Phil Endecott wrote:
Sebastian Redl wrote:
If you want, I can package up what I've done so far (not really much, but a lot of comments containing concepts) and put it somewhere.
Yes please.
Here you are: http://windmuehlgasse.getdesigned.at/characters.zip

Note two things about this archive:

1) The converters have a terrible interface. It's unfriendly and still not powerful enough to do what I want. That part has to be completely redesigned. That is not to say that there aren't some worthwhile ideas there, though.

2) I make a very strict distinction between the terms "character set" and "encoding". A character set is a mapping of abstract characters to code points, which are integral values. ISO-10646 and Unicode define such a mapping. US-ASCII is such a mapping, too. Early versions of the ISO-8859 family of standards defined such mappings. An encoding is a way to map these code points to sequences of bytes. UTF-8, UTF-16 and UTF-32 are all encodings of Unicode. US-ASCII is its own encoding. New revisions of the ISO-8859 family are defined in terms of Unicode; they are encodings of that character set, though incomplete ones.

This distinction goes quite against common (mis)usage: from MIME content types (Content-Type: text/html; charset=...) over Java (String.getBytes(..., String charsetName), the entire java.nio.charset package), to GCC's compiler flags (-fexec-charset=...) - they all use charset or character set for what is really the encoding. Argue about "common usage" all you want - it still doesn't make sense to call UTF-8 a character set, because it isn't. XML, for example, gets it right:

<?xml version="1.0" encoding="UTF-8"?>

Ah, well. Rant over.
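To make the distinction concrete, here is a minimal sketch (not code from the archive; all names are made up for illustration). The character set assigns the code point, an integer; each encoding then turns that same integer into a different byte sequence.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// A code point is just the integer the character set (here: Unicode) assigns.
typedef std::uint32_t code_point;

// True for Unicode scalar values: in range and not a surrogate.
bool is_scalar_value(code_point cp) {
    return cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
}

// Encoding 1: code point -> UTF-8 byte sequence (1 to 4 bytes).
std::string encode_utf8(code_point cp) {
    assert(is_scalar_value(cp));
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Appends one 16-bit code unit, most significant byte first.
void append_be16(std::string& out, code_point unit) {
    out += static_cast<char>((unit >> 8) & 0xFF);
    out += static_cast<char>(unit & 0xFF);
}

// Encoding 2: the same code point -> UTF-16, big-endian byte order.
std::string encode_utf16be(code_point cp) {
    assert(is_scalar_value(cp));
    std::string out;
    if (cp < 0x10000) {
        append_be16(out, cp);
    } else {
        const code_point v = cp - 0x10000;    // 20 bits, split over a surrogate pair
        append_be16(out, 0xD800 | (v >> 10));
        append_be16(out, 0xDC00 | (v & 0x3FF));
    }
    return out;
}

// U+20AC EURO SIGN: one code point, two different byte sequences.
void example() {
    assert(encode_utf8(0x20AC) == "\xE2\x82\xAC");
    assert(encode_utf16be(0x20AC) == std::string("\x20\xAC", 2));
}
```

Neither function knows anything about which abstract character 0x20AC stands for; that knowledge belongs to the character set alone.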
Consider processing a MIME email. It may have several parts each with a different character set. I would imagine a flow something like this:
read in message as a sequence-of-bytes
for each message part {
    find the character set
    put the body in a run-time-tagged string
    do something with the body
}
I disagree with this flow. My flow is:

read in message as a sequence-of-bytes
for each message part {
    find the type
    do I want to do something with its string form?
    yes {
        find the character set
        put the body in a compile-time-tagged string, converting from the found character set
        do something with the body
    }
    no {
        do something with the body as a byte sequence
    }
}
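A rough C++ rendering of that flow, as a sketch only. The names tagged_string, utf8_tag, message_part, to_utf8 and decode_if_text are hypothetical and not taken from the archive; the charset name is assumed to be already lowercased, and only three charsets are handled. The point is that the run-time charset name is consulted in exactly one place, the conversion, and everything after that works with a type whose encoding is fixed at compile time.

```cpp
#include <cassert>
#include <string>

// Hypothetical tag naming the single encoding chosen at compile time.
struct utf8_tag {};

// Hypothetical compile-time-tagged string: the encoding is part of the type.
template <typename EncodingTag>
struct tagged_string {
    std::string bytes;
};

// Hypothetical view of one MIME part: header values plus undecoded bytes.
struct message_part {
    std::string type;     // e.g. "text/plain"
    std::string charset;  // charset parameter, assumed lowercase here
    std::string body;     // raw bytes
};

enum source_kind { src_ascii, src_latin1, src_utf8 };

// Converts raw bytes to UTF-8. Returns false for an unsupported charset
// name or for input that is invalid in the named charset.
bool to_utf8(const std::string& raw, const std::string& charset,
             tagged_string<utf8_tag>& out) {
    source_kind kind;
    if (charset == "us-ascii") kind = src_ascii;
    else if (charset == "iso-8859-1") kind = src_latin1;
    else if (charset == "utf-8") kind = src_utf8;
    else return false;

    out.bytes.clear();
    for (std::string::size_type i = 0; i < raw.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(raw[i]);
        if (c < 0x80 || kind == src_utf8) {
            out.bytes += raw[i];          // UTF-8 input is passed through unvalidated
        } else if (kind == src_latin1) {
            out.bytes += static_cast<char>(0xC0 | (c >> 6));
            out.bytes += static_cast<char>(0x80 | (c & 0x3F));
        } else {
            return false;                 // byte >= 0x80 is not US-ASCII
        }
    }
    return true;
}

// Decide by type first; convert only when the string form is wanted.
// A false result means: keep working with p.body as a byte sequence.
bool decode_if_text(const message_part& p, tagged_string<utf8_tag>& text) {
    if (p.type.compare(0, 5, "text/") != 0) return false;
    return to_utf8(p.body, p.charset, text);
}

void example() {
    message_part p = {"text/plain", "iso-8859-1", "caf\xE9"};
    tagged_string<utf8_tag> text;
    assert(decode_if_text(p, text));
    assert(text.bytes == "caf\xC3\xA9");
}
```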
Now, "do something with the body" might be "save it in a file", i.e.
That would be something I do with the byte sequence.
f << "content-type: text/plain; charset=\"" << body.charset << "\"\n" << "\n" << body.data;
That actually doesn't make any sense, sorry. You can't just write a runtime-tagged string to a text stream, not with C++ iostreams being what they are.

If the stream is open in binary mode, you should just push the bytes through (and you shouldn't use the formatted I/O operators) - all the bytes, including the MIME headers.

If it's open in text mode, then it all gets really weird. Either you actually convert the string to some output encoding (in which case, why do you write the original encoding into the file?), or you don't, in which case you risk corruption.

Oh, and did I mention that if the thing were a wide stream, the output operator would have to convert the runtime-tagged string to a wide string anyway? And then the file buffer would convert it back. Meh. Just stay with the raw bytes.
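Staying with the raw bytes could look roughly like this. It's a sketch with made-up function names (write_raw, save_part): the part is copied exactly as received, headers included, through unformatted output on a stream opened in binary mode.

```cpp
#include <cassert>
#include <fstream>
#include <ostream>
#include <sstream>
#include <string>

// Copies one message part verbatim using unformatted output.
// std::ostream::write does no formatting and does not stop at NUL bytes.
bool write_raw(std::ostream& out, const std::string& raw_part) {
    out.write(raw_part.data(), static_cast<std::streamsize>(raw_part.size()));
    return !out.fail();
}

// Opens the file in binary mode so the stream does no newline translation.
bool save_part(const char* path, const std::string& raw_part) {
    std::ofstream file(path, std::ios::out | std::ios::binary);
    if (!file) return false;
    return write_raw(file, raw_part);
}

// An in-memory stream stands in for the file to show nothing is altered.
void example() {
    const std::string part =
        std::string("Content-Type: text/plain; charset=\"iso-8859-1\"\r\n\r\n") +
        std::string("caf\xE9\0!", 6);   // body with a Latin-1 byte and a NUL
    std::ostringstream sink;
    assert(write_raw(sink, part));
    assert(sink.str() == part);
}
```

No charset is ever inspected here, which is the whole idea: the header already in the bytes stays truthful because the body bytes are untouched.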
In this case, it would be wasteful to convert to and from a compile-time-fixed character set.
Yes. It would also be wasteful to construct a runtime-tagged string, when you could just access a section of the raw byte stream.
So some method of representing run-time-tagged data - if only temporarily, before conversion - is needed.
This representation is my converting input stream.
I have a small project in progress which needs a subset of this functionality, and I'm planning to use it as a testbed for these ideas. I'll post again when I have something more concrete. The area where I would most appreciate some input is in how to provide a "user-extensible enum or type tag" for character sets.
Maybe the archive I uploaded will help. I'm thinking of type tags with some metafunctions to specialize.

Sebastian Redl
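For illustration, "type tags with some metafunctions to specialize" might take a shape like the following. This is a guess at the general pattern, not the design in the archive; code_unit, is_fixed_width, unit_size and the tag names are all hypothetical. Each encoding is an empty tag type, and a user adds a new one by declaring a tag and specializing the metafunctions for it.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

// Metafunction: tag -> code unit type. The primary template is left
// undefined, so using an unregistered tag fails at compile time.
template <typename Tag> struct code_unit;

// Metafunction: does every code point take the same number of code units?
// Defaults to the conservative answer.
template <typename Tag> struct is_fixed_width : std::false_type {};

// Tags a library might ship with.
struct utf8_tag {};
struct utf16_tag {};

template <> struct code_unit<utf8_tag>  { typedef char type; };
template <> struct code_unit<utf16_tag> { typedef char16_t type; };

// User extension: a new tag in the user's own namespace...
namespace user {
struct latin1_tag {};
}

// ...registered by specializing the metafunctions at the library's scope.
template <> struct code_unit<user::latin1_tag> { typedef unsigned char type; };
template <> struct is_fixed_width<user::latin1_tag> : std::true_type {};

// Generic code sees only the tag and asks the metafunctions.
template <typename Tag>
std::size_t unit_size() {
    return sizeof(typename code_unit<Tag>::type);
}

void example() {
    assert(unit_size<utf8_tag>() == 1);
    assert(unit_size<utf16_tag>() == sizeof(char16_t));
    assert(is_fixed_width<user::latin1_tag>::value);
    assert(!is_fixed_width<utf8_tag>::value);
}
```

Unlike an enum, the set of tags is open: nothing in the library has to change when user::latin1_tag is added.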