
Mathias Gaunard wrote:
Andrey Semashev wrote:
I'd rather stick to UTF-16 if I had to use Unicode.
UTF-16 is a variable-length encoding too.
But anyway, Unicode itself is a variable-length format, even with the UTF-32 encoding, simply because of grapheme clusters.
Technically, yes. But most of the widely used character sets fit into the Basic Multilingual Plane, where UTF-16 uses a single code unit per character. That means that, having said that my app is localized to languages A, B and C, I may treat UTF-16 as a fixed-length encoding if these languages fit in it. If they don't, I'd consider moving to UTF-32.
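To make that assumption concrete, here is a minimal sketch (C++11 for the char16_t literals; count_code_points is a hypothetical helper): characters inside the BMP take one UTF-16 code unit each, while anything outside it takes a surrogate pair, and the encoding turns variable-length again.

    #include <cstddef>
    #include <iostream>

    // Counts code points in a well-formed UTF-16 buffer. A unit in
    // 0xDC00..0xDFFF is the low half of a surrogate pair and does not
    // start a new code point.
    std::size_t count_code_points(const char16_t* s, std::size_t n)
    {
        std::size_t count = 0;
        for (std::size_t i = 0; i < n; ++i)
        {
            if (s[i] < 0xDC00 || s[i] > 0xDFFF)
                ++count;
        }
        return count;
    }

    int main()
    {
        const char16_t cyr[] = u"Привет";       // BMP: one unit per character
        const char16_t clef[] = u"\U0001D11E";  // U+1D11E: a surrogate pair

        std::cout << count_code_points(cyr, 6) << '\n';  // 6 code points in 6 units
        std::cout << count_code_points(clef, 2) << '\n'; // 1 code point in 2 units
    }

As long as every supported language stays within the BMP, units and characters coincide; the first astral character silently breaks the fixed-length assumption.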
I'm not saying that we don't need Unicode support. We do! I'm only saying that in many cases plain ASCII does its job perfectly well: logging, system messages, simple text formatting, texts in restricted character sets, like numbers, phone numbers, identifiers of all kinds, etc.
Identifiers of all kinds aren't text, they're just bytes.
Not always. I may receive such an identifier from a text-based protocol primitive, and thus I can handle it as text. That assumption may open up opportunities for various optimizations.
As for logging, I'm not too sure whether it should be localized or not.
I can think of only a single case where logging needs i18n: when you have to log external data, such as client app queries or DB responses. That need is questionable in the first place, because it may introduce serious security holes (log injection, for one). As for regular logging, I feel quite fine with narrow logs and don't see why I would want to make them wide.
And I don't understand what you mean by system messages.
Error and warning descriptions that may come either from your application or from the OS, a third-party API or the language runtime. I may agree that these messages could be localized too, but to my mind it's overkill. Generally, I don't need std::bad_alloc::what() to return a Russian or Chinese description.
I still don't understand why you want to work with other character sets.
Because I have the impression that it may be done more efficiently and at less expense. I don't want to pay for what I don't need - IMHO, the founding principle of C++.
That will just mean duplicating the tables and algorithms needed to process the text correctly.
What algorithms do you mean and why would they need duplication?
See http://www.unicode.org/reports/tr10/ for an idea of the complexity of collation, which allows comparison and ordering of strings. As you can see, it has little to do with the encoding, yet the tables etc. require the use of the Unicode character set, preferably in a canonical form so that processing can be quite efficient.
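For a feel of what that looks like through the standard library, a minimal sketch using std::collate (the de_DE.UTF-8 locale name is an assumption and may not be installed on every system; the actual ordering depends on the platform's collation tables, which TR10 describes in far greater depth):

    #include <iostream>
    #include <locale>
    #include <string>

    int main()
    {
        std::string a = "\xC3\x84rger";  // "Ärger" in UTF-8
        std::string b = "Zebra";

        // Byte-wise comparison: 0xC3 sorts after every ASCII letter,
        // so "Zebra" lands before "Ärger".
        std::cout << (b < a) << '\n';  // 1

        // Locale-aware collation groups "Ä" with "A".
        std::locale de("de_DE.UTF-8");  // throws if the locale is missing
        const std::collate<char>& coll =
            std::use_facet<std::collate<char> >(de);
        int r = coll.compare(a.data(), a.data() + a.size(),
                             b.data(), b.data() + b.size());
        std::cout << r << '\n';  // typically negative: "Ärger" before "Zebra"
    }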
Collation is just one approach to string comparison and ordering. I don't see how it relates to the efficiency questions I mentioned. Besides, comparison is not the only operation on strings. I expect the complexity of iterating over a string, or of operator[], to rise significantly once we assume that the underlying string holds variable-length characters.
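To illustrate the cost I mean, a sketch in terms of UTF-8 (the helpers are hypothetical and assume well-formed input): with variable-length characters, indexing by character number degrades from one pointer addition to a linear scan.

    #include <cstddef>
    #include <string>

    // Advance one code point in well-formed UTF-8; the lead byte
    // determines the sequence length (1 to 4 bytes).
    std::size_t next_code_point(const std::string& s, std::size_t pos)
    {
        unsigned char lead = static_cast<unsigned char>(s[pos]);
        if (lead < 0x80) return pos + 1;  // ASCII
        if (lead < 0xE0) return pos + 2;  // 2-byte sequence
        if (lead < 0xF0) return pos + 3;  // 3-byte sequence
        return pos + 4;                   // 4-byte sequence
    }

    // What a character-indexed operator[] would have to do under the
    // hood: O(n) instead of O(1).
    std::size_t byte_offset(const std::string& s, std::size_t char_index)
    {
        std::size_t pos = 0;
        for (std::size_t i = 0; i < char_index; ++i)
            pos = next_code_point(s, pos);
        return pos;
    }

Sequential iteration remains linear overall, so forward scans stay cheap; it is random access and length-in-characters queries that lose their constant-time guarantees.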
There are cases where i18n is not needed at all - mostly server-side apps with minimal UI.
Any application that processes or displays non-trivial text (meaning something other than options) should have internationalization.
I have to disagree. I18n is good when it's needed, i.e. when there are users who will appreciate it or when it's required by the application domain and functionality. Otherwise, IMO, it's a waste of effort at the development stage and of system resources at run time.
What encoding translation are you talking about?
Let's assume my app works with a narrow text file stream. If the stream uses Unicode internally, it has to translate between the file encoding and its internal encoding every time I input or output something. I don't think that's the way it should be. I'd rather have the opportunity to choose the encoding I want to work with and have it carried through the whole formatting/streaming/IO tool chain with no extra overhead. That doesn't mean, though, that I wouldn't want some day to perform encoding translations with the same tools. PS: I have a slight feeling that we have a misunderstanding at this point...
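To show roughly where that translation happens in the standard iostreams, a sketch (the file name and locale name are placeholders; availability of the locale is platform-dependent):

    #include <fstream>
    #include <locale>
    #include <string>

    int main()
    {
        // Narrow stream: bytes from the file arrive unmodified.
        std::ifstream narrow("data.txt");
        std::string line;
        std::getline(narrow, line);

        // Wide stream: every character is converted between the file's
        // narrow encoding and the internal wide encoding through the
        // imbued std::codecvt facet.
        std::wifstream wide("data.txt");
        wide.imbue(std::locale("en_US.UTF-8"));  // before any input
        std::wstring wline;
        std::getline(wide, wline);
    }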