
Mathias Gaunard wrote:
Andrey Semashev wrote:
I'd rather stick to UTF-16 if I had to use Unicode.
UTF-16 is a variable-length encoding too.
But anyway, Unicode itself is a variable-length format, even with the UTF-32 encoding, simply because of grapheme clusters.
Technically, yes. But most of the widely used character sets fit into the Basic Multilingual Plane, where UTF-16 uses a single code unit per character. That means that, having said that my app is localized to languages A, B and C, I may treat UTF-16 as a fixed-length encoding if these languages fit in it. If they don't, I'd consider moving to UTF-32.
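To make that assumption concrete, here is a minimal sketch (C++11 for the char16_t literals; count_code_points is a hypothetical helper): characters inside the BMP take one UTF-16 code unit each, while anything outside it takes a surrogate pair, and the encoding turns variable-length again.

    #include <cstddef>
    #include <iostream>

    // Counts code points in a well-formed UTF-16 buffer. A unit in
    // 0xDC00..0xDFFF is the low half of a surrogate pair and does not
    // start a new code point.
    std::size_t count_code_points(const char16_t* s, std::size_t n)
    {
        std::size_t count = 0;
        for (std::size_t i = 0; i < n; ++i)
        {
            if (s[i] < 0xDC00 || s[i] > 0xDFFF)
                ++count;
        }
        return count;
    }

    int main()
    {
        const char16_t cyr[] = u"Привет";       // BMP: one unit per character
        const char16_t clef[] = u"\U0001D11E";  // U+1D11E: a surrogate pair

        std::cout << count_code_points(cyr, 6) << '\n';  // 6 code points in 6 units
        std::cout << count_code_points(clef, 2) << '\n'; // 1 code point in 2 units
    }

As long as every supported language stays within the BMP, units and characters coincide; the first astral character silently breaks the fixed-length assumption.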
I'm not saying that we don't need Unicode support. We do! I'm only saying that in many cases plain ASCII does its job perfectly well: logging, system messages, simple text formatting, texts in restricted character sets, like numbers, phone numbers, identifiers of all kinds, etc.
Identifiers of all kinds aren't text, they're just bytes.
Not always. I may receive such an identifier from a text-based protocol primitive, and thus I can handle it as text. That assumption may open up opportunities for various optimizations.
As for logging, I'm not too sure whether it should be localized or not.
I can think of only a single case where logging needs i18n: when you have to log external data, such as client app queries or DB responses. That need is questionable in the first place, because it may introduce serious security holes (log injection, for one). As for regular logging, I feel quite fine with narrow logs and don't see why I would want to make them wide.
And I don't understand what you mean by system messages.
Error and warning descriptions that may come either from your application or from the OS, a third-party API or the language runtime. I may agree that these messages could be localized too, but to my mind it's overkill. Generally, I don't need std::bad_alloc::what() to return a Russian or Chinese description.
I still don't understand why you want to work with other character sets.
Because I have the impression that it may be done more efficiently and at less expense. I don't want to pay for what I don't need - IMHO, the founding principle of C++.
That will just mean duplicating the tables and algorithms needed to process the text correctly.
What algorithms do you mean and why would they need duplication?
See http://www.unicode.org/reports/tr10/ for an idea of the complexity of collation, which allows comparison and ordering of strings. As you can see, it has little to do with the encoding, yet the tables etc. require the use of the Unicode character set, preferably in a canonical form so that processing can be quite efficient.
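For a feel of what that looks like through the standard library, a minimal sketch using std::collate (the de_DE.UTF-8 locale name is an assumption and may not be installed on every system; the actual ordering depends on the platform's collation tables, which TR10 describes in far greater depth):

    #include <iostream>
    #include <locale>
    #include <string>

    int main()
    {
        std::string a = "\xC3\x84rger";  // "Ärger" in UTF-8
        std::string b = "Zebra";

        // Byte-wise comparison: 0xC3 sorts after every ASCII letter,
        // so "Zebra" lands before "Ärger".
        std::cout << (b < a) << '\n';  // 1

        // Locale-aware collation groups "Ä" with "A".
        std::locale de("de_DE.UTF-8");  // throws if the locale is missing
        const std::collate<char>& coll =
            std::use_facet<std::collate<char> >(de);
        int r = coll.compare(a.data(), a.data() + a.size(),
                             b.data(), b.data() + b.size());
        std::cout << r << '\n';  // typically negative: "Ärger" before "Zebra"
    }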
Collation is just one approach to string comparison and ordering. I don't see how it relates to the efficiency questions I mentioned. Besides, comparison is not the only operation on strings. I expect the complexity of iterating over a string, or of operator[], to rise significantly once we assume that the underlying string holds variable-length characters.
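To illustrate the cost I mean, a sketch in terms of UTF-8 (the helpers are hypothetical and assume well-formed input): with variable-length characters, indexing by character number degrades from one pointer addition to a linear scan.

    #include <cstddef>
    #include <string>

    // Advance one code point in well-formed UTF-8; the lead byte
    // determines the sequence length (1 to 4 bytes).
    std::size_t next_code_point(const std::string& s, std::size_t pos)
    {
        unsigned char lead = static_cast<unsigned char>(s[pos]);
        if (lead < 0x80) return pos + 1;  // ASCII
        if (lead < 0xE0) return pos + 2;  // 2-byte sequence
        if (lead < 0xF0) return pos + 3;  // 3-byte sequence
        return pos + 4;                   // 4-byte sequence
    }

    // What a character-indexed operator[] would have to do under the
    // hood: O(n) instead of O(1).
    std::size_t byte_offset(const std::string& s, std::size_t char_index)
    {
        std::size_t pos = 0;
        for (std::size_t i = 0; i < char_index; ++i)
            pos = next_code_point(s, pos);
        return pos;
    }

Sequential iteration remains linear overall, so forward scans stay cheap; it is random access and length-in-characters queries that lose their constant-time guarantees.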
There are cases where i18n is not needed at all - mostly server-side apps with minimal UI.
Any application that processes or displays non-trivial text (meaning something other than options) should have internationalization.
I have to disagree. I18n is good when it's needed, i.e. when there are users who will appreciate it or when it's required by the application domain and functionality. Otherwise, IMO, it's a waste of effort at the development stage and of system resources at run time.
What encoding translation are you talking about?
Let's assume my app works with a narrow text file stream. If the stream uses Unicode internally, it has to translate between the file encoding and its internal encoding every time I input or output something. I don't think that's the way it should be. I'd rather have the opportunity to choose the encoding I want to work with and have it carried through the whole formatting/streaming/IO tool chain with no extra overhead. That doesn't mean, though, that I wouldn't want some day to perform encoding translations with the same tools. PS: I have a slight feeling that we have a misunderstanding at this point...
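To show roughly where that translation happens in the standard iostreams, a sketch (the file name and locale name are placeholders; availability of the locale is platform-dependent):

    #include <fstream>
    #include <locale>
    #include <string>

    int main()
    {
        // Narrow stream: bytes from the file arrive unmodified.
        std::ifstream narrow("data.txt");
        std::string line;
        std::getline(narrow, line);

        // Wide stream: every character is converted between the file's
        // narrow encoding and the internal wide encoding through the
        // imbued std::codecvt facet.
        std::wifstream wide("data.txt");
        wide.imbue(std::locale("en_US.UTF-8"));  // before any input
        std::wstring wline;
        std::getline(wide, wline);
    }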