
The system I'm now using for my programs might interest you.
I have four classes: ascii_t, utf8_t, utf16_t, and utf32_t. Assigning one type to another automatically converts it to the target type during the copy. (Converting to ascii_t will throw an exception if a resulting character won't fit into eight bits.)
If so (and this is what I see in the code), the name ASCII is misleading. A type that accepts any eight-bit value should be called Latin1/ISO-8859-1, not ASCII.
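To make the distinction concrete, here is a minimal sketch of the two range checks. The helper names fits_ascii and fits_latin1 are made up for illustration and are not taken from the classes under discussion; the sketch assumes a C++11 compiler.

```cpp
#include <cassert>

// Hypothetical helpers for illustration only.
// ASCII defines only the code points 0x00..0x7F (seven bits).
constexpr bool fits_ascii(char32_t cp) { return cp <= 0x7F; }

// ISO-8859-1 (Latin-1) maps the code points 0x00..0xFF onto single bytes.
constexpr bool fits_latin1(char32_t cp) { return cp <= 0xFF; }

// Small self-check showing where the two ranges differ.
inline void check_ranges() {
    const char32_t e_acute = 0x00E9;  // U+00E9, e with acute accent
    const char32_t euro    = 0x20AC;  // U+20AC, euro sign

    assert(fits_ascii(U'A'));
    assert(!fits_ascii(e_acute));   // needs the eighth bit, so not ASCII
    assert(fits_latin1(e_acute));   // but it does fit in Latin-1
    assert(!fits_latin1(euro));     // fits in neither
}
```

A conversion that only throws when a character does not fit into eight bits is applying the second check, which is why the Latin-1 name describes it better.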
An std::string is assumed to be ASCII-encoded. If you really do have UTF-8-encoded data to get into the system, you either assign it to a utf8_t using operator*, or use a static function utf8_t::precoded. std::wstring is assumed to be utf16_t- or utf32_t-encoded already, depending on the underlying character width for the OS.
This is a very bad assumption. To be honest, I've written lots of code with UTF-8 strings directly in it (the Boost.Locale tests), and it works fine all the time with the MSVC, GCC and Intel compilers, as long as I work with char * and not L"". The encoding should not be assumed: a std::string should be treated as a byte string, which may or may not be UTF-8. There are two cases where we need to deal with strings and their encoding:

1. We handle human language or text: collation, formatting, etc.
2. We want to access the Windows Wide API, which is not locale agnostic.
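As a rough illustration of the byte-string view, the sketch below stores UTF-8 in a plain std::string and leaves the bytes untouched. The function names are hypothetical, and the text is written with explicit byte escapes so that the result does not depend on the source file encoding; it assumes C++11.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// "Gruesse" with u-umlaut and sharp s, spelled as UTF-8 bytes:
// U+00FC is C3 BC and U+00DF is C3 9F. The literal is split before the
// final "e" so that it is not read as part of the \x9F escape.
inline std::string greeting_utf8() {
    return "Gr\xC3\xBC\xC3\x9F" "e";
}

// Counts code points by skipping UTF-8 continuation bytes (10xxxxxx).
// Assumes the input is well-formed UTF-8.
inline std::size_t count_code_points(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) ++n;
    }
    return n;
}

// The string holds seven bytes that encode five code points; std::string
// itself neither knows nor cares which encoding they are in.
inline void check_byte_string() {
    const std::string s = greeting_utf8();
    assert(s.size() == 7);
    assert(count_code_points(s) == 5);
}
```

Only code that falls into one of the two cases above needs to interpret the bytes; everything else can copy, compare and store them as they are.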
For portable OS-interface functions, there's a typedef (os::native_t) to the type that the OS's API functions need. For Linux-based systems, it's utf8_t; for Windows, utf16_t. There's also a typedef (os::unicode_t) that is utf32_t on Linux and utf16_t on Windows, but I'm not sure there's a need for that.
When you work with Linux, or Unix in general, you should not change the encoding at all. There were discussions about it. For example, the following code:

    #include <cassert>
    #include <cstdio>
    #include <fstream>
    #include <string>

    int main()
    {
        {
            // Create a file whose name is an arbitrary byte sequence.
            std::ofstream t("\xFF\xFF.txt");
            if (!t) {
                // Not a valid name for this OS - Mac OS X
                return 0;
            }
            t << "test";
            t.close();
        }
        {
            // Reopen it under the same byte sequence and read it back.
            std::ifstream t("\xFF\xFF.txt");
            std::string s;
            t >> s;
            assert(s == "test");
            t.close();
        }
        std::remove("\xFF\xFF.txt");
    }

is valid and works regardless of the current locale on POSIX platforms. With your API it would fail, because the API makes assumptions about the encoding.
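The reason is that "\xFF\xFF.txt" is not valid UTF-8: the byte 0xFF never appears in a well-formed UTF-8 sequence. I have not seen how your classes handle such input, so the following is only a sketch of the kind of check a UTF-8-assuming wrapper would end up applying. The function is_valid_utf8 is my own illustration, not part of any library, and it assumes C++11.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Returns true if s is well-formed UTF-8. Rejects stray or missing
// continuation bytes, overlong forms, surrogates, and values above U+10FFFF.
inline bool is_valid_utf8(const std::string& s) {
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        if (c < 0x80) { ++i; continue; }  // single-byte (ASCII) character

        std::size_t len;   // total length of the sequence in bytes
        char32_t cp;       // code point being decoded
        char32_t lowest;   // smallest code point allowed for this length
        if ((c & 0xE0) == 0xC0)      { len = 2; cp = c & 0x1F; lowest = 0x80; }
        else if ((c & 0xF0) == 0xE0) { len = 3; cp = c & 0x0F; lowest = 0x800; }
        else if ((c & 0xF8) == 0xF0) { len = 4; cp = c & 0x07; lowest = 0x10000; }
        else return false;  // stray continuation byte, or 0xF8..0xFF

        if (n - i < len) return false;  // sequence is truncated
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char t = static_cast<unsigned char>(s[i + k]);
            if ((t & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (t & 0x3F);
        }
        if (cp < lowest || cp > 0x10FFFF) return false;   // overlong or too big
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;   // surrogate
        i += len;
    }
    return true;
}

// The file name from the example is rejected; ordinary names pass.
inline void check_file_names() {
    assert(!is_valid_utf8("\xFF\xFF.txt"));
    assert(is_valid_utf8("test.txt"));
}
```

A layer that insists on this check has to either refuse the name or alter it, while the plain char * interface simply passes the bytes to the OS.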
There are some parts of the code that could use polishing, but I like the overall design, and I'm finding it pretty easy to work with. Anyone interested in seeing the code?
IMHO, inventing new strings or new text containers is not the way to go. std::string is perfectly fine as long as you code in a consistent way.

Artyom