Re: [boost] [General] Always treat std::strings as UTF-8

16 Jan 2011

...
The system I'm now using for my programs might interest you.
I have four classes: ascii_t, utf8_t, utf16_t, and utf32_t. Assigning
one type to another automatically converts it to the target type during
the copy. (Converting to ascii_t will throw an exception if a resulting
character won't fit into eight bits.)

If so (and this is what I see in the code), "ASCII" is misleading: ASCII
is a 7-bit encoding. A type that accepts anything that fits into eight
bits should be called Latin1/ISO-8859-1, not ASCII.
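
To make the distinction concrete, here is a minimal sketch of the two narrowing checks. The names to_latin1 and to_ascii are hypothetical and are not taken from the code under discussion; the sketch only assumes that the input is a sequence of Unicode code points.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper: narrow code points to ISO-8859-1 (Latin1).
// Every code point up to U+00FF fits into a single 8-bit byte.
std::string to_latin1(const std::u32string& in)
{
    std::string out;
    for (char32_t cp : in) {
        if (cp > 0xFF)
            throw std::range_error("code point does not fit into eight bits");
        out.push_back(static_cast<char>(static_cast<unsigned char>(cp)));
    }
    return out;
}

// Hypothetical helper: narrow code points to real ASCII, which is 7-bit,
// so anything above U+007F has to be rejected.
std::string to_ascii(const std::u32string& in)
{
    std::string out;
    for (char32_t cp : in) {
        if (cp > 0x7F)
            throw std::range_error("code point is not ASCII");
        out.push_back(static_cast<char>(cp));
    }
    return out;
}
```

With these definitions, U+00E9 (e with acute accent) passes the eight-bit check but is rejected by the ASCII check, which is why a type with an eight-bit limit is better described as Latin1.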
...
An std::string is assumed to be ASCII-encoded. If you really do have
UTF-8-encoded data to get into the system, you either assign it to a
utf8_t using operator*, or use a static function utf8_t::precoded.
std::wstring is assumed to be utf16_t- or utf32_t-encoded already,
depending on the underlying character width for the OS.

This is a very bad assumption. To be honest, I've written lots
of code with UTF-8 strings embedded directly in it (the Boost.Locale
tests), and it worked perfectly well with the MSVC, GCC and Intel
compilers, as long as I used char * literals and not L"" ones. It works
fine all the time.

The encoding should not be assumed at all: a std::string should be
treated as a byte string, which may or may not be UTF-8.
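
As a rough illustration of UTF-8 living in a plain std::string, the sketch below stores a Hebrew word as explicit UTF-8 bytes. The hex escapes are an assumption made here so that the example does not depend on the source file encoding, and count_code_points is a hypothetical helper that is only meaningful when the bytes really are valid UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: count code points in a string that is assumed to
// hold valid UTF-8. Every byte that is not a continuation byte
// (bit pattern 10xxxxxx) starts a new code point.
std::size_t count_code_points(const std::string& s)
{
    std::size_t n = 0;
    for (unsigned char b : s) {
        if ((b & 0xC0) != 0x80)
            ++n;
    }
    return n;
}

// Hebrew "shalom" (U+05E9 U+05DC U+05D5 U+05DD) as UTF-8 bytes.
// std::string stores the bytes untouched; it has no notion of encoding.
const std::string shalom = "\xD7\xA9\xD7\x9C\xD7\x95\xD7\x9D";
```

Here shalom.size() reports the eight stored bytes, while count_code_points(shalom) reports four code points. The string itself stays a byte container either way.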

There are two cases in which we need to care about strings and their
encoding:

1. We handle human language or text: collation, formatting, etc.
2. We want to access the Windows Wide API that is not locale agnostic.
...
For portable OS-interface functions, there's a typedef (os::native_t)
to the type that the OS's API functions need. For Linux-based systems,
it's utf8_t; for Windows, utf16_t. There's also a typedef
(os::unicode_t) that is utf32_t on Linux and utf16_t on Windows, but
I'm not sure there's a need for that.

When you work with Linux or Unix, you should not change the encoding at
all. There were discussions about it. For example, consider the following
code:

    #include <cassert>
    #include <cstdio>
    #include <fstream>
    #include <string>

    int main()
    {
        // "\xFF\xFF" is not valid UTF-8, but it is a legal POSIX file name.
        const char *name = "\xFF\xFF.txt";
        {
            std::ofstream t(name);
            if(!t) {
                // Name not valid for this OS, e.g. Mac OS X
                return 0;
            }
            t << "test";
            t.close();
        }
        {
            std::ifstream t(name);
            std::string s;
            t >> s;
            assert(s == "test");
            t.close();
        }
        std::remove(name);
    }

This is valid code, and it works regardless of the current locale on POSIX
platforms.

With your API it would fail, because the API makes assumptions about the
encoding.
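
To illustrate why any type that insists on valid UTF-8 cannot carry such a file name, here is a minimal sketch of a well-formedness check. is_valid_utf8 is a hypothetical helper written for this example; it is not part of the API under discussion or of any Boost library.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns true if s is well-formed UTF-8
// (no overlong forms, no surrogates, nothing above U+10FFFF).
bool is_valid_utf8(const std::string& s)
{
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char b = static_cast<unsigned char>(s[i]);
        std::size_t extra;   // number of continuation bytes expected
        char32_t cp;         // code point being decoded
        char32_t lowest;     // smallest code point allowed for this length
        if (b < 0x80)                { extra = 0; cp = b;        lowest = 0; }
        else if ((b & 0xE0) == 0xC0) { extra = 1; cp = b & 0x1F; lowest = 0x80; }
        else if ((b & 0xF0) == 0xE0) { extra = 2; cp = b & 0x0F; lowest = 0x800; }
        else if ((b & 0xF8) == 0xF0) { extra = 3; cp = b & 0x07; lowest = 0x10000; }
        else return false;   // 0x80..0xBF and 0xF8..0xFF cannot start a sequence

        if (s.size() - i <= extra)
            return false;    // sequence is truncated
        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80)
                return false;            // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < lowest || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;    // overlong, out of range, or surrogate
        i += extra + 1;
    }
    return true;
}
```

The byte 0xFF never occurs in UTF-8, so is_valid_utf8("\xFF\xFF.txt") is false. A string class that validates or converts on construction would have to reject or alter that name, while a plain byte string passes it to the OS unchanged.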
...
There are some parts of the code that could use polishing, but I like
the overall design, and I'm finding it pretty easy to work with. Anyone
interested in seeing the code?

IMHO, inventing new strings or new text containers is not the way to
go. std::string is perfectly fine as long as you code in a consistent
way.

Artyom