Re: [boost] [General] Always treat std::strings as UTF-8

17 Jan 2011

      ...
I've done some research, and it looks like it would require little
effort to create an os::string_t type that uses the current locale, and
assume all raw std::strings that contain eight-bit values are coded in
that instead.
Design-wise, ascii_t would need to change slightly after this, to throw
on anything that can't fit into a *seven*-bit value, rather than
eight-bit. I'll add the default-character option to both types as well,
and maybe make other improvements as I have time.
Unfortunately, this is not the correct approach either.

For example, why do you think it is safe to pass the ASCII subset of
UTF-8 to the current non-UTF-8 locale?

For example, Shift-JIS, which is in use on Windows via the ANSI API, has
a different subset in the 0-127 range: it is not ASCII!

Also, if you want to use std::codecvt facets...
Don't rely on them unless you know where they come from!

1. By default (in the default C locale) they are no-ops.

2. Under most compilers they are not implemented properly.

   OS \ Compiler      MSVC       GCC      SunOS/stlport  SunOS/standard
   -------------------------------------------------------------------
   Windows             ok         none        -              -
   Linux               -           ok         ?              ?
   Mac OS X            -          none        -              -
   FreeBSD             -          none        -              -
   Solaris             -          none      buggy!      ok-but-non-standard
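
Point 1 is easy to check. The following is only a minimal sketch (the
function names are made up for illustration): unless the program installs
another locale itself, the "current" global locale is the classic C
locale, so any facet taken from it is the C-locale one.

```cpp
#include <locale>
#include <string>

// Hypothetical helper: name of the global locale as seen by code that
// never called std::locale::global().
std::string global_locale_name()
{
    return std::locale().name();
}

// Hypothetical helper: true while the global locale is still the
// classic "C" locale, i.e. nothing has been taken from the environment.
bool global_locale_is_classic()
{
    return std::locale() == std::locale::classic();
}
```

A program gets the user's locale only if it asks for it explicitly, for
example with std::locale::global(std::locale("")), and even then the
quality of the facets depends on the standard library, as the table shows.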

Bottom line: don't rely on the "current locale" :-)
...
Artyom, since you seem to have more experience with this stuff than I,
what do you think? Would those alterations take care of your objections?
The rule of thumb is the following:

- When you handle strings as text storage, just use std::string

- When you do a system call

  a) on Posix - pass it as is
  b) on Windows - convert from UTF-8 and call the Wide API

- When handling text as text (i.e. formatting, collation etc.)
  use a good library.
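
The system-call rule can be sketched roughly as follows. This is only an
illustration, not code from any library: open_utf8 and utf8_to_wide are
hypothetical names, and error handling is reduced to the bare minimum.
MultiByteToWideChar and _wfopen are the real Windows calls.

```cpp
#include <cstdio>
#include <string>

#ifdef _WIN32
#include <windows.h>

// Hypothetical helper: convert UTF-8 to UTF-16 for the Wide API.
static std::wstring utf8_to_wide(const std::string& utf8)
{
    if (utf8.empty())
        return std::wstring();
    // The first call only asks for the required length in wchar_t units.
    int len = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                  utf8.data(), static_cast<int>(utf8.size()),
                                  NULL, 0);
    if (len <= 0)
        return std::wstring(); // invalid UTF-8; real code should report it
    std::wstring wide(len, L'\0');
    MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                        utf8.data(), static_cast<int>(utf8.size()),
                        &wide[0], len);
    return wide;
}
#endif

// Hypothetical wrapper: the caller always passes UTF-8 in a plain std::string.
std::FILE* open_utf8(const std::string& path, const char* mode)
{
#ifdef _WIN32
    // Windows: never hand UTF-8 to the ANSI API, go through the Wide API.
    return _wfopen(utf8_to_wide(path).c_str(), utf8_to_wide(mode).c_str());
#else
    // POSIX: the path is just a sequence of bytes, so pass it as is.
    return std::fopen(path.c_str(), mode);
#endif
}
```

The rest of the program then never needs to know which platform it runs
on: it stores and passes UTF-8 in std::string everywhere, and only this
thin boundary layer differs.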

I would strongly recommend reading Pavel Radzivilovsky's answer
on Stack Overflow:

http://stackoverflow.com/questions/1049947/should-utf-16-be-considered-harmf...

He is a hard-core Windows programmer, designer, architect and developer,
and still he chose UTF-8!

The problem is that the issue is so complicated that there is
only one way to make it both absolutely general and correct:
decide what you are working with and stick with it.

In the CppCMS project I work on (I developed Boost.Locale
because of it), I stick with UTF-8 by default and use plain
std::string; it works like a charm.
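
As a small illustration of plain std::string as UTF-8 storage (a sketch
only; count_code_points is a made-up helper and it assumes the input is
already valid UTF-8): the string holds bytes, and anything that needs
code points can be computed on top of it when required.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: count code points in a valid UTF-8 string by
// skipping continuation bytes (those of the form 10xxxxxx).
std::size_t count_code_points(const std::string& utf8)
{
    std::size_t n = 0;
    for (std::string::size_type i = 0; i < utf8.size(); ++i) {
        unsigned char b = static_cast<unsigned char>(utf8[i]);
        if ((b & 0xC0) != 0x80)
            ++n;
    }
    return n;
}
```

For example, a four-letter Hebrew word encoded as UTF-8 occupies eight
bytes in the std::string (size() is 8), while count_code_points returns 4.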

Inventing "special Unicode strings or storage" does not
improve anybody's understanding of Unicode, nor does it
improve its handling.

Best,
  Artyom