
I've done some research, and it looks like it would require little effort to create an os::string_t type that uses the current locale, and to assume that all raw std::strings containing eight-bit values are encoded in that locale instead.
Design-wise, ascii_t would need to change slightly after this, to throw on anything that can't fit into a *seven*-bit value, rather than eight-bit. I'll add the default-character option to both types as well, and maybe make other improvements as I have time.
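A minimal sketch of what the seven-bit check and the default-character option described above might look like. The function names here are hypothetical and are not the actual ascii_t interface; the sketch only tests whether each byte fits in seven bits and makes no claim about which encoding the bytes are in.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper: true if every byte fits in seven bits (0x00-0x7F).
bool is_seven_bit(const std::string& s)
{
    for (unsigned char c : s)
        if (c > 0x7F) return false;
    return true;
}

// Hypothetical strict variant: throw if any byte needs the eighth bit.
std::string require_seven_bit(const std::string& s)
{
    if (!is_seven_bit(s))
        throw std::range_error("value does not fit into seven bits");
    return s;
}

// Hypothetical lenient variant: replace each offending byte with a
// caller-supplied default character instead of throwing.
std::string to_seven_bit(std::string s, char fallback)
{
    for (char& c : s)
        if (static_cast<unsigned char>(c) > 0x7F) c = fallback;
    return s;
}
```

Note that the lenient variant replaces byte by byte, so a two-byte UTF-8 sequence becomes two default characters.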
Unfortunately, this is not the correct approach either. For example, why do you think it is safe to pass the ASCII subset of UTF-8 to the current non-UTF-8 locale? Shift-JIS, which is in use on Windows through the ANSI API, has a different subset in the 0-127 range. It is not ASCII!

Also, if you want to use the std::codecvt facets: don't rely on them unless you know where they come from!

1. By default they are a no-op in the default C locale.
2. Under most compilers they are not implemented properly.

OS \ Compiler   MSVC   GCC    SunOS/stlport   SunOS/standard
-------------------------------------------------------------------
Windows         ok     none   -               -
Linux           -      ok     ?               ?
Mac OS X        -      none   -               -
FreeBSD         -      none   -               -
Solaris         -      none   buggy!          ok-but-non-standard

Bottom line: don't rely on the "current locale" :-)
Artyom, since you seem to have more experience with this stuff than I, what do you think? Would those alterations take care of your objections?
The rule of thumb is the following:

- When you handle strings as text storage, just use std::string.
- When you make a system call:
  a) on POSIX, pass the string as is;
  b) on Windows, convert it from UTF-8 and call the Wide API.
- When handling text as text (i.e. formatting, collation, etc.), use a good library.

I would strongly recommend reading Pavel Radzivilovsky's answer on Stack Overflow:

http://stackoverflow.com/questions/1049947/should-utf-16-be-considered-harmf...

He is a hard-core Windows programmer, designer, architect and developer, and still he has chosen UTF-8!

The problem is that the issue is so complicated that there is only one way to be both completely general and correct: decide what you are working with and stick with it.

In the CppCMS project I work on (and I developed Boost.Locale because of it), I stick with UTF-8 by default and use plain std::string. It works like a charm.

Inventing "special Unicode strings or storage" improves neither anybody's understanding of Unicode nor its handling.

Best,
Artyom
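As an illustration of the "convert from UTF-8 at the Windows boundary" rule above, here is a minimal sketch of a UTF-8 to UTF-16 conversion. The helper name utf8_to_utf16 is hypothetical and is not taken from Boost.Locale or CppCMS. On Windows one would more typically call MultiByteToWideChar with CP_UTF8 and hand the result to a "W" function such as CreateFileW; a hand-written decoder is used here only so that the sketch compiles on any platform.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper: decode a UTF-8 std::string into UTF-16,
// throwing std::invalid_argument on malformed input.
std::u16string utf8_to_utf16(const std::string& in)
{
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        std::uint32_t cp;
        std::size_t len;
        if (lead < 0x80)                { cp = lead;        len = 1; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; len = 2; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; len = 3; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; len = 4; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (i + len > in.size())
            throw std::invalid_argument("truncated UTF-8 sequence");
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }

        // Reject overlong forms, surrogate code points, and values past U+10FFFF.
        static const std::uint32_t min_cp[] = {0, 0, 0x80, 0x800, 0x10000};
        if (cp < min_cp[len] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("invalid UTF-8 code point");

        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Code points above the BMP are encoded as a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
        i += len;
    }
    return out;
}
```

With this shape, the rest of the program keeps plain std::string holding UTF-8, and the conversion happens only at the point where a Windows system call is made; on POSIX the string is passed through unchanged.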