
On Sat, Aug 13, 2011 at 23:24, Robert Ramey <ramey@rrsd.com> wrote:
Dave Abrahams wrote:
std::string represents a sequence of "char" objects that happens to be useful for text processing. It can represent a text in any encoding.
The question is how we treat this sequence... And this is a matter of policy and requirements of the library.
I think I agree with Artyom here. *Somebody* has to decide how that datatype will be interpreted when we receive it. Unless we refuse altogether to accept std::string in our interfaces (which sounds like a bad idea to me), why not make the decision that it's UTF-8?
hmmm - why can't we just leave it at "std::string represents a sequence of 'char'" and define some derivative class which defines it as "a refinement of std::string which supports UTF-8 functionality"?
Because we are talking here about what 'a sequence of char' means, and you *must* define it somehow. Even when wrapping it you must still define the conversions from 'sequences of chars'. Here we come to the original problem.
On Mon, Aug 15, 2011 at 16:19, Stewart, Robert <Robert.Stewart@sig.com> wrote:
[...] As soon as the client did a cast, the client made the claim that non_utf_string met the requirements of the text class' constructor. The problem is that of the client misusing the class by an ill-advised cast. What's more, I think Soares indicated a debug-build validation that the argument indeed was UTF-8.
I don't see a problem in that design, once the constructor is explicit.
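(Roughly, the design being described might look like this; the names are only illustrative, and is_valid_utf8 is an assumed helper, one possible version of which is sketched further below.)

    #include <cassert>
    #include <string>

    bool is_valid_utf8(const std::string& s);   // assumed helper

    class text
    {
    public:
        // Explicit: the caller has to spell out the claim "these bytes are UTF-8".
        explicit text(const std::string& utf8) : bytes_(utf8)
        {
            assert(is_valid_utf8(bytes_));      // debug-build check, compiled out under NDEBUG
        }

        const std::string& utf8() const { return bytes_; }

    private:
        std::string bytes_;
    };

    // The "ill-advised cast" from the quote above:
    //   text t(some_cp1251_string);   // compiles; the claim is the caller's to get right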
I don't want to do any explicit casts. I want UTF-8 by default, at least as an optional feature for me and others who think like me. I can afford the risk of writing wrong code, which is really small if you know what you're doing. And I'm saying this as a maintainer of a ~1MLOC codebase which uses this convention on *windows*. Regarding UTF-8 validation, it's not bullet-proof: many non-UTF-8 sequences may pass the validation, and text in 8-bit encodings that don't coincide with ASCII is even more likely to produce false positives.
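(To make the false-positive point concrete, here is one possible simplified validation, my own sketch rather than the validation referred to above, together with inputs from other encodings that it accepts.)

    #include <cassert>
    #include <cstddef>
    #include <string>

    // Simplified structural check: lead-byte patterns and continuation bytes
    // only; a stricter validator would also reject overlong forms, surrogates
    // and code points above U+10FFFF.
    bool is_valid_utf8(const std::string& s)
    {
        for (std::size_t i = 0; i < s.size(); ) {
            unsigned char c = static_cast<unsigned char>(s[i]);
            std::size_t len;
            if      (c < 0x80)           len = 1;   // ASCII
            else if ((c & 0xE0) == 0xC0) len = 2;   // 110xxxxx
            else if ((c & 0xF0) == 0xE0) len = 3;   // 1110xxxx
            else if ((c & 0xF8) == 0xF0) len = 4;   // 11110xxx
            else return false;                      // stray continuation or invalid lead byte
            if (i + len > s.size()) return false;
            for (std::size_t j = 1; j < len; ++j)
                if ((static_cast<unsigned char>(s[i + j]) & 0xC0) != 0x80)
                    return false;                   // continuation byte must be 10xxxxxx
            i += len;
        }
        return true;
    }

    int main()
    {
        // ASCII-only text passes no matter which 8-bit encoding it came from.
        assert(is_valid_utf8("plain ASCII taken from a Latin-1 file"));

        // The two bytes 0xC2 0xA9 are the Windows-1252 characters "Â©", but
        // they are also the valid UTF-8 encoding of U+00A9, so this
        // Windows-1252 input slips through as a false positive.
        assert(is_valid_utf8("\xC2\xA9"));
        return 0;
    }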
Besides, it does not harm you in any way.
It does. I already use UTF-8 for all my strings, even on windows, and I don't want the code-bloat of all these conversions (even if they're no-ops).
What code bloat do you get from NOPs? Sure, there is more compilation time for the compiler to parse the text code and then for the optimizer to streamline it into a NOP, but even that is very likely negligible.
I'm talking about source-code bloat. About the boilerplate code I have to write even if I already use UTF-8 everywhere:

    std::string str = some_utf_8_string;
    boost::utf8_function(text(str));  // Yes, I like UTF-8
    boost2::utf8_function(str);       // but I like it more when it's the default.

-- Yakov