
On Sun, 16 Jan 2011 12:56:23 -0800 (PST) Artyom <artyomtnk@yahoo.com> wrote:
The system I'm now using for my programs might interest you.
I have four classes: ascii_t, utf8_t, utf16_t, and utf32_t. Assigning one type to another automatically converts it to the target type during the copy. (Converting to ascii_t will throw an exception if a resulting character won't fit into eight bits.)
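(Chad's implementation isn't shown in the thread, but a minimal sketch of the convert-on-assignment idea, with hypothetical stand-ins for two of the four classes, might look like this; the class bodies are my own guess, not his code.)

    #include <stdexcept>
    #include <string>

    // Hypothetical stand-ins showing the convert-on-assignment idea.
    class utf32_t {
    public:
        utf32_t(const char32_t* s) : data_(s) {}
        const std::u32string& data() const { return data_; }
    private:
        std::u32string data_;
    };

    class ascii_t {
    public:
        // Assigning a utf32_t converts it during the copy; any code point
        // that won't fit into eight bits throws.
        ascii_t& operator=(const utf32_t& source) {
            std::string converted;
            for (char32_t c : source.data()) {
                if (c > 0xFF)
                    throw std::range_error("character won't fit into eight bits");
                converted += static_cast<char>(c);
            }
            data_ = converted;
            return *this;
        }
    private:
        std::string data_;
    };

    int main() {
        ascii_t narrow;
        narrow = utf32_t(U"caf\u00E9");   // 0xE9 fits in eight bits, so this succeeds
        // narrow = utf32_t(U"\u0416");   // U+0416 would throw
    }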
If so (and this is what I see in the code), calling it ASCII is misleading. It should be called Latin-1/ISO-8859-1, not ASCII.
Probably, but latin1_t isn't very obvious, and iso_8859_1_t is a little awkward to type. ;-) As I've said, this code was written solely for my company; I'd make a number of changes if I were going to submit it to Boost.
An std::string is assumed to be ASCII-encoded. If you really do have UTF-8-encoded data to get into the system, you either assign it to a utf8_t using operator*, or use a static function utf8_t::precoded. std::wstring is assumed to be utf16_t- or utf32_t-encoded already, depending on the underlying character width for the OS.
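(Again as a hedged sketch rather than the real interface: the description suggests two entry points into utf8_t, one that re-encodes assumed ASCII/Latin-1 input and one, utf8_t::precoded, that accepts bytes already known to be UTF-8. The internals below are my guess; I've left out the operator* route since its exact form isn't shown.)

    #include <string>

    // Hypothetical stand-in for utf8_t showing the two entry points described
    // above: a plain std::string is assumed to be ASCII/Latin-1 and is
    // re-encoded, while utf8_t::precoded() accepts bytes that are already UTF-8.
    class utf8_t {
    public:
        utf8_t() = default;

        // Assumed-ASCII/Latin-1 input: each byte becomes one code point,
        // encoded as one or two UTF-8 bytes.
        utf8_t& operator=(const std::string& latin1) {
            data_.clear();
            for (unsigned char c : latin1) {
                if (c < 0x80) {
                    data_ += static_cast<char>(c);
                } else {
                    data_ += static_cast<char>(0xC0 | (c >> 6));
                    data_ += static_cast<char>(0x80 | (c & 0x3F));
                }
            }
            return *this;
        }

        // Bytes already known to be valid UTF-8 are stored unchanged.
        static utf8_t precoded(const std::string& utf8_bytes) {
            utf8_t result;
            result.data_ = utf8_bytes;
            return result;
        }

    private:
        std::string data_;   // always holds UTF-8 internally
    };

    int main() {
        utf8_t a;
        a = "caf\xE9";                              // 0xE9 is re-encoded as C3 A9
        utf8_t b = utf8_t::precoded("caf\xC3\xA9"); // stored exactly as given
    }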
This is a very bad assumption. To be honest, I've written lots of code with direct UTF-8 strings in it (Boost.Locale tests) and it worked perfectly well with MSVC, GCC, and Intel compilers (as long as I work with char *, not L""), and this works fine all the time.
It is a bad assumption; the string should be treated as a byte string which may or may not be UTF-8.
But if you assigned that byte string to a utf*_t type, how would you treat it? I had to either make some assumption, or disallow assigning from an std::string and char* entirely. And it's just too convenient to use those assignments, for things like constants, to give that up. The way I designed it, you're supposed to feed it only ASCII (or Latin-1, if you prefer) text when you make an assignment that way. If you have some differently-coded text, you'd feed it in through another class, one that knows its coding and is designed to decode to UTF-32 the way that utf8_t and utf16_t are, so that the templated conversion functions know how to handle it.
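(A sketch of what that extension point might look like; the class names and the to_utf32 hook below are hypothetical, since the thread doesn't show how Chad's templated conversion functions are actually written.)

    #include <string>
    #include <utility>

    // Hypothetical sketch of the "decode to UTF-32" hook described above: any
    // class that knows its own encoding and can decode itself to UTF-32 can
    // feed the templated conversion machinery without that machinery knowing
    // the source encoding.
    class utf32_t {
    public:
        utf32_t() = default;

        // Templated conversion: works for any source type providing to_utf32().
        template <typename SourceString>
        utf32_t& operator=(const SourceString& source) {
            data_ = source.to_utf32();
            return *this;
        }

    private:
        std::u32string data_;
    };

    // A user-supplied encoding class (hypothetical): Latin-1 decodes trivially,
    // one byte per code point.
    class latin1_t {
    public:
        explicit latin1_t(std::string bytes) : bytes_(std::move(bytes)) {}

        std::u32string to_utf32() const {
            std::u32string out;
            for (unsigned char c : bytes_)
                out += static_cast<char32_t>(c);
            return out;
        }

    private:
        std::string bytes_;
    };

    int main() {
        utf32_t unicode;
        unicode = latin1_t("caf\xE9");   // the template only needs to_utf32()
    }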
There are two cases in which we need to treat strings and encodings:

1. We handle human language or text: collation, formatting, etc.
2. We want to access the Windows Wide API, which is not locale-agnostic.
I'm not sure where you're coming from. Those are two broad categories of uses for that code, but arguably not the only two.
For portable OS-interface functions, there's a typedef (os::native_t) to the type that the OS's API functions need. For Linux-based systems, it's utf8_t; for Windows, utf16_t. There's also a typedef (os::unicode_t) that is utf32_t on Linux and utf16_t on Windows, but I'm not sure there's a need for that.
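(A rough sketch of how those typedefs might be selected; the _WIN32 test and layout below are my assumption, not Chad's actual code, and utf8_t, utf16_t, and utf32_t stand for the classes described earlier.)

    // Forward declarations standing in for the classes described above.
    class utf8_t;
    class utf16_t;
    class utf32_t;

    namespace os {

    #if defined(_WIN32)
        typedef utf16_t native_t;    // Windows API functions take UTF-16
        typedef utf16_t unicode_t;
    #else
        typedef utf8_t  native_t;    // Linux system calls take UTF-8 byte strings
        typedef utf32_t unicode_t;
    #endif

    } // namespace os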
When you work with Linux and Unix, you should not change the encoding at all. There were discussions about it. [...] Using your API, it would fail, as it holds some assumptions about the encoding.
Why would you feed "\xFF\xFF.txt" into a utf8_t type, if you didn't want it encoded as UTF-8? If you have a function that requires some different encoding, you'd use that encoding instead. For filenames, you'd treat the strings entered by the user or obtained from the file system as opaque blocks of bytes. In any case, all modern Linux OSes use UTF-8 by default, so I haven't seen any need to worry about other forms yet. I'm not even sure how I'd tell what code-page a Linux system is set to use; so far I've never needed to know that. Though if a Russian customer comes along and tells me my code doesn't work right on his Linux system, I'll re-think that.
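(For what it's worth, on a POSIX system the encoding the current locale is configured to use can be queried with nl_langinfo(CODESET); a minimal example, as an aside rather than part of the library under discussion.)

    #include <clocale>
    #include <cstdio>
    #include <langinfo.h>   // POSIX

    // Prints the character encoding the current locale is configured to use;
    // on most modern Linux systems this is "UTF-8".
    int main() {
        std::setlocale(LC_ALL, "");                 // adopt the environment's locale
        std::printf("%s\n", nl_langinfo(CODESET));  // e.g. "UTF-8" or "ISO-8859-5"
    }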
There are some parts of the code that could use polishing, but I like the overall design, and I'm finding it pretty easy to work with. Anyone interested in seeing the code?
IMHO, I don't think that inventing new strings or new text containers is the way to go. std::string is perfectly fine as long as you code in a consistent way.
I have to respectfully disagree. std::string says nothing about the encoding of the data within it. If you're using more than one type of encoding in your program, like Latin-1 and UTF-8, then using std::strings is like using void pointers -- no type safety, no way to automate conversions when necessary, and no way to select overloaded functions based on the encoding. A C++ solution pretty much requires that they be unique types.

--
Chad Nelson
Oak Circle Software, Inc.
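(A minimal illustration of that last point; the types and function below are hypothetical stand-ins, not Chad's actual classes. With distinct encoding types, the compiler selects the right overload for each encoding, which plain std::string cannot do.)

    #include <iostream>
    #include <string>

    // Two wrapper types standing in for "a string known to be Latin-1" and
    // "a string known to be UTF-8". With plain std::string, both would be
    // the same type and nothing would stop you from mixing them up.
    struct latin1_string { std::string bytes; };
    struct utf8_string   { std::string bytes; };

    void write_out(const utf8_string& s) {
        std::cout << "UTF-8, " << s.bytes.size() << " bytes\n";
    }

    void write_out(const latin1_string& s) {
        // A real implementation might transcode to the output encoding here.
        std::cout << "Latin-1, " << s.bytes.size() << " bytes\n";
    }

    int main() {
        latin1_string a{"caf\xE9"};
        utf8_string   b{"caf\xC3\xA9"};
        write_out(a);   // picks the Latin-1 overload
        write_out(b);   // picks the UTF-8 overload
    }

* * *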