
On Sun, 16 Jan 2011 12:56:23 -0800 (PST) Artyom <artyomtnk@yahoo.com> wrote:
The system I'm now using for my programs might interest you.
I have four classes: ascii_t, utf8_t, utf16_t, and utf32_t. Assigning one type to another automatically converts it to the target type during the copy. (Converting to ascii_t will throw an exception if a resulting character won't fit into eight bits.)
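(Chad's implementation isn't shown in the thread, but a minimal sketch of the convert-on-assignment idea, with hypothetical stand-ins for two of the four classes, might look like this; the class bodies are my own guess, not his code.)

    #include <stdexcept>
    #include <string>

    // Hypothetical stand-ins showing the convert-on-assignment idea.
    class utf32_t {
    public:
        utf32_t(const char32_t* s) : data_(s) {}
        const std::u32string& data() const { return data_; }
    private:
        std::u32string data_;
    };

    class ascii_t {
    public:
        // Assigning a utf32_t converts it during the copy; any code point
        // that won't fit into eight bits throws.
        ascii_t& operator=(const utf32_t& source) {
            std::string converted;
            for (char32_t c : source.data()) {
                if (c > 0xFF)
                    throw std::range_error("character won't fit into eight bits");
                converted += static_cast<char>(c);
            }
            data_ = converted;
            return *this;
        }
    private:
        std::string data_;
    };

    int main() {
        ascii_t narrow;
        narrow = utf32_t(U"caf\u00E9");   // 0xE9 fits in eight bits, so this succeeds
        // narrow = utf32_t(U"\u0416");   // U+0416 would throw
    }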
If so (and this is what I see in the code), calling it ASCII is misleading. It should be called Latin-1/ISO-8859-1, not ASCII.
Probably, but latin1_t isn't very obvious, and iso_8859_1_t is a little awkward to type. ;-) As I've said, this code was written solely for my company; I'd make a number of changes if I were going to submit it to Boost.
An std::string is assumed to be ASCII-encoded. If you really do have UTF-8-encoded data to get into the system, you either assign it to a utf8_t using operator*, or use a static function utf8_t::precoded. std::wstring is assumed to be utf16_t- or utf32_t-encoded already, depending on the underlying character width for the OS.
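(Again as a hedged sketch rather than the real interface: the description suggests two entry points into utf8_t, one that re-encodes assumed ASCII/Latin-1 input and one, utf8_t::precoded, that accepts bytes already known to be UTF-8. The internals below are my guess; I've left out the operator* route since its exact form isn't shown.)

    #include <string>

    // Hypothetical stand-in for utf8_t showing the two entry points described
    // above: a plain std::string is assumed to be ASCII/Latin-1 and is
    // re-encoded, while utf8_t::precoded() accepts bytes that are already UTF-8.
    class utf8_t {
    public:
        utf8_t() = default;

        // Assumed-ASCII/Latin-1 input: each byte becomes one code point,
        // encoded as one or two UTF-8 bytes.
        utf8_t& operator=(const std::string& latin1) {
            data_.clear();
            for (unsigned char c : latin1) {
                if (c < 0x80) {
                    data_ += static_cast<char>(c);
                } else {
                    data_ += static_cast<char>(0xC0 | (c >> 6));
                    data_ += static_cast<char>(0x80 | (c & 0x3F));
                }
            }
            return *this;
        }

        // Bytes already known to be valid UTF-8 are stored unchanged.
        static utf8_t precoded(const std::string& utf8_bytes) {
            utf8_t result;
            result.data_ = utf8_bytes;
            return result;
        }

    private:
        std::string data_;   // always holds UTF-8 internally
    };

    int main() {
        utf8_t a;
        a = "caf\xE9";                              // 0xE9 is re-encoded as C3 A9
        utf8_t b = utf8_t::precoded("caf\xC3\xA9"); // stored exactly as given
    }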
This is a very bad assumption. To be honest, I've written lots of code with direct UTF-8 strings in it (Boost.Locale tests) and it worked perfectly well with MSVC, GCC, and Intel compilers (as long as I work with char *, not L""), and this works fine all the time.
It is a bad assumption; the string should be treated as a byte string which may or may not be UTF-8.
But if you assigned that byte string to a utf*_t type, how would you treat it? I had to either make some assumption, or disallow assigning from an std::string and char* entirely. And it's just too convenient to use those assignments, for things like constants, to give that up. The way I designed it, you're supposed to feed it only ASCII (or Latin-1, if you prefer) text when you make an assignment that way. If you have some differently-coded text, you'd feed it in through another class, one that knows its coding and is designed to decode to UTF-32 the way that utf8_t and utf16_t are, so that the templated conversion functions know how to handle it.
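(A sketch of what that extension point might look like; the class names and the to_utf32 hook below are hypothetical, since the thread doesn't show how Chad's templated conversion functions are actually written.)

    #include <string>
    #include <utility>

    // Hypothetical sketch of the "decode to UTF-32" hook described above: any
    // class that knows its own encoding and can decode itself to UTF-32 can
    // feed the templated conversion machinery without that machinery knowing
    // the source encoding.
    class utf32_t {
    public:
        utf32_t() = default;

        // Templated conversion: works for any source type providing to_utf32().
        template <typename SourceString>
        utf32_t& operator=(const SourceString& source) {
            data_ = source.to_utf32();
            return *this;
        }

    private:
        std::u32string data_;
    };

    // A user-supplied encoding class (hypothetical): Latin-1 decodes trivially,
    // one byte per code point.
    class latin1_t {
    public:
        explicit latin1_t(std::string bytes) : bytes_(std::move(bytes)) {}

        std::u32string to_utf32() const {
            std::u32string out;
            for (unsigned char c : bytes_)
                out += static_cast<char32_t>(c);
            return out;
        }

    private:
        std::string bytes_;
    };

    int main() {
        utf32_t unicode;
        unicode = latin1_t("caf\xE9");   // the template only needs to_utf32()
    }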
There are two cases in which we need to treat strings and encodings:

1. We handle human language or text: collation, formatting, etc.
2. We want to access the Windows Wide API, which is not locale-agnostic.
I'm not sure where you're coming from. Those are two broad categories of uses for that code, but arguably not the only two.
For portable OS-interface functions, there's a typedef (os::native_t) to the type that the OS's API functions need. For Linux-based systems, it's utf8_t; for Windows, utf16_t. There's also a typedef (os::unicode_t) that is utf32_t on Linux and utf16_t on Windows, but I'm not sure there's a need for that.
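(A rough sketch of how those typedefs might be selected; the _WIN32 test and layout below are my assumption, not Chad's actual code, and utf8_t, utf16_t, and utf32_t stand for the classes described earlier.)

    // Forward declarations standing in for the classes described above.
    class utf8_t;
    class utf16_t;
    class utf32_t;

    namespace os {

    #if defined(_WIN32)
        typedef utf16_t native_t;    // Windows API functions take UTF-16
        typedef utf16_t unicode_t;
    #else
        typedef utf8_t  native_t;    // Linux system calls take UTF-8 byte strings
        typedef utf32_t unicode_t;
    #endif

    } // namespace os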
When you work with Linux and Unix, you should not change the encoding at all. There were discussions about it. [...] Using your API, it would fail, as it holds some assumptions about the encoding.
Why would you feed "\xFF\xFF.txt" into a utf8_t type, if you didn't want it encoded as UTF-8? If you have a function that requires some different encoding, you'd use that encoding instead. For filenames, you'd treat the strings entered by the user or obtained from the file system as opaque blocks of bytes. In any case, all modern Linux OSes use UTF-8 by default, so I haven't seen any need to worry about other forms yet. I'm not even sure how I'd tell what code-page a Linux system is set to use; so far I've never needed to know that. Though if a Russian customer comes along and tells me my code doesn't work right on his Linux system, I'll re-think that.
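(For what it's worth, on a POSIX system the encoding the current locale is configured to use can be queried with nl_langinfo(CODESET); a minimal example, as an aside rather than part of the library under discussion.)

    #include <clocale>
    #include <cstdio>
    #include <langinfo.h>   // POSIX

    // Prints the character encoding the current locale is configured to use;
    // on most modern Linux systems this is "UTF-8".
    int main() {
        std::setlocale(LC_ALL, "");                 // adopt the environment's locale
        std::printf("%s\n", nl_langinfo(CODESET));  // e.g. "UTF-8" or "ISO-8859-5"
    }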
There are some parts of the code that could use polishing, but I like the overall design, and I'm finding it pretty easy to work with. Anyone interested in seeing the code?
IMHO, I don't think that inventing new strings or new text containers is the way to go. std::string is perfectly fine as long as you code in a consistent way.
I have to respectfully disagree. std::string says nothing about the encoding of the data within it. If you're using more than one type of encoding in your program, like Latin-1 and UTF-8, then using std::strings is like using void pointers -- no type safety, no way to automate conversions when necessary, and no way to select overloaded functions based on the encoding. A C++ solution pretty much requires that they be unique types.

--
Chad Nelson
Oak Circle Software, Inc.
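(A minimal illustration of that last point; the types and function below are hypothetical stand-ins, not Chad's actual classes. With distinct encoding types, the compiler selects the right overload for each encoding, which plain std::string cannot do.)

    #include <iostream>
    #include <string>

    // Two wrapper types standing in for "a string known to be Latin-1" and
    // "a string known to be UTF-8". With plain std::string, both would be
    // the same type and nothing would stop you from mixing them up.
    struct latin1_string { std::string bytes; };
    struct utf8_string   { std::string bytes; };

    void write_out(const utf8_string& s) {
        std::cout << "UTF-8, " << s.bytes.size() << " bytes\n";
    }

    void write_out(const latin1_string& s) {
        // A real implementation might transcode to the output encoding here.
        std::cout << "Latin-1, " << s.bytes.size() << " bytes\n";
    }

    int main() {
        latin1_string a{"caf\xE9"};
        utf8_string   b{"caf\xC3\xA9"};
        write_out(a);   // picks the Latin-1 overload
        write_out(b);   // picks the UTF-8 overload
    }

* * *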