
Sebastian Redl wrote:
Phil Endecott wrote:
Dear All,
Something that I have been thinking about for a while is storing strings tagged with their character set. Since I now have a practical need for this I plan to try to implement something. Your feedback would be appreciated.
Hi,
I've played around with this concept a lot already. I basically think that encoding-bound strings are a MUST for proper, safe, internationalized string handling. Everything else, in particular the current situation, is a mess.
If you want, I can package up what I've done so far (not really much, but a lot of comments containing concepts) and put it somewhere.
One thing: I think runtime-tagged strings are useless. Programming should happen with one or at most two fixed encodings, known at compile time. Because encodings differ in behaviour (base units of 8, 16 or 32 bits, multi-byte units in either byte order, fixed-length vs. variable-length encodings, ...), it is not a good idea to write one type that handles them all at runtime. I think that runtime-specified string conversion should be an I/O question. In other words, when character data enters your program, you convert it to the encoding you use internally; when it leaves the program, you convert it to an external encoding. In between, you use whatever your program uses, and you specify it at compile time.
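To make the "fixed encoding inside, conversion at the boundary" idea concrete, here is a minimal sketch. The names `tagged_string`, `latin1`, `utf8`, `to_utf8` and `byte_length` are hypothetical and do not come from any existing library or from the code discussed in this thread; the sketch only assumes that the external data is Latin-1 and the internal encoding is UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical encoding tags (illustration only).
struct latin1 { using unit = char; };
struct utf8   { using unit = char; };

// A string whose encoding is part of its type, so mixing encodings
// by accident becomes a compile-time error.
template <class Encoding>
class tagged_string {
public:
    using storage = std::basic_string<typename Encoding::unit>;

    tagged_string() = default;
    explicit tagged_string(storage units) : units_(std::move(units)) {}

    const storage& units() const { return units_; }

private:
    storage units_;
};

// Boundary conversion. Each Latin-1 byte is the code point of the same
// value, so it becomes one UTF-8 byte (below 0x80) or two (0x80..0xFF).
inline tagged_string<utf8> to_utf8(const tagged_string<latin1>& in) {
    std::string out;
    out.reserve(in.units().size());
    for (char c : in.units()) {
        const unsigned cp = static_cast<unsigned char>(c);
        if (cp < 0x80) {
            out.push_back(static_cast<char>(cp));
        } else {
            out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
    }
    assert(out.size() >= in.units().size());  // UTF-8 never shrinks Latin-1
    return tagged_string<utf8>(std::move(out));
}

// Internal code is written against the one internal encoding only.
inline std::size_t byte_length(const tagged_string<utf8>& s) {
    return s.units().size();
}
```

With this shape, passing a `tagged_string<latin1>` to `byte_length` does not compile; the caller has to go through `to_utf8` first, which is where the boundary conversion happens.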
Well, having I/O facilities provide the only means for converting strings of different encodings would make it pretty awkward to use compiled libraries whose string encoding differs from my program's, wouldn't it?

I agree that the "runtime tagging" suggestion seems overkill. Maybe providing lazily evaluated, possibly cached, compile-time and runtime "string views" is a good idea, however (and it would probably give a nice framework for implementing encoding conversions as well).

Examples:

    // ...given some strings a, b, and c
    string<utf8> s = a + b + ":" + c;
    // can get away with exactly one allocation since operator+ can
    // return a compile-time string view

    string<utf8> s = "world";
    string_view<utf8> v = "Hello " + s + "!";
    std::cout << v << std::endl;
    s = "you";
    std::cout << v << std::endl;
    // Output:
    // Hello world!
    // Hello you!

    // For a more "real-world" use case of runtime string_views
    // consider a lexer taking apart an in-memory file, with SBO
    // applied to the string_view template...

Regards,
Tobias
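A rough sketch of how the lazy string view from the examples above might work, using plain `std::string` instead of the proposed `string<utf8>`. All names here (`literal_ref`, `string_ref`, `concat`, `lit`, `view_of`, `materialize`) are hypothetical, invented for illustration. The assumption is that `operator+` builds a small expression object that only refers to its operands, and that the characters are copied once, into a buffer of known size, when the view is turned into a string.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <type_traits>

// Leaf: refers to a string literal without copying it.
struct literal_ref {
    const char* data;
    std::size_t len;
    std::size_t size() const { return len; }
    void append_to(std::string& out) const { out.append(data, len); }
};

// Leaf: refers to a live std::string, so later changes stay visible.
struct string_ref {
    const std::string* str;
    std::size_t size() const { return str->size(); }
    void append_to(std::string& out) const { out.append(*str); }
};

// Inner node: the concatenation of two sub-expressions, still lazy.
template <class L, class R>
struct concat {
    L lhs;
    R rhs;
    std::size_t size() const { return lhs.size() + rhs.size(); }
    void append_to(std::string& out) const {
        lhs.append_to(out);
        rhs.append_to(out);
    }
};

// Trait that restricts operator+ to the expression types above.
template <class T> struct is_expr : std::false_type {};
template <> struct is_expr<literal_ref> : std::true_type {};
template <> struct is_expr<string_ref> : std::true_type {};
template <class L, class R> struct is_expr<concat<L, R>> : std::true_type {};

template <std::size_t N>
literal_ref lit(const char (&s)[N]) { return literal_ref{s, N - 1}; }

// The referenced string must outlive the view; do not pass a temporary.
inline string_ref view_of(const std::string& s) { return string_ref{&s}; }

template <class L, class R,
          class = typename std::enable_if<is_expr<L>::value &&
                                          is_expr<R>::value>::type>
concat<L, R> operator+(L lhs, R rhs) { return concat<L, R>{lhs, rhs}; }

// Copies the characters exactly once, after reserving the total size,
// so at most one buffer allocation is needed.
template <class Expr>
std::string materialize(const Expr& e) {
    std::string out;
    out.reserve(e.size());
    e.append_to(out);
    assert(out.size() == e.size());
    return out;
}
```

Given `std::string s = "world";` and `auto v = lit("Hello ") + view_of(s) + lit("!");`, calling `materialize(v)` yields "Hello world!", and after `s = "you";` the same `v` yields "Hello you!", which matches the behaviour in the second example. Caching the result, as suggested above, is left out of this sketch.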