Re: [boost] [gsoc] Request Feedback for Boost.Ustr Unicode String Adapter

Yakov Galka wrote:
On Fri, Aug 12, 2011 at 15:04, Matus Chochlik <chochlik@gmail.com> wrote:
On Fri, Aug 12, 2011 at 1:08 PM, Yakov Galka <ybungalobill@gmail.com> wrote:
On Fri, Aug 12, 2011 at 12:00, Matus Chochlik <chochlik@gmail.com> wrote:
On Fri, Aug 12, 2011 at 9:57 AM, Daniel James <dnljms@gmail.com> wrote:
On 11 August 2011 12:57, Artyom Beilis <artyomtnk@yahoo.com> wrote:
// by default expect UTF8
text(const std::string& str)
{
    assert(is_utf8(str.begin(), str.end()));
    store(str);
}
What you are doing is, in fact, forcing the assumed encoding of std::string to UTF-8. You just said you think it's a bad idea.
No, I'm proposing to implement a *new* class that will store the text in UTF8 encoding and if during the construction no encoding is specified, then it is assumed that the particular std::string is already in UTF8.
This is *very* different from imposing an encoding on std::string which is already used in many situations with other encodings. i.e. my approach does not break any existing code.
Sorry, your arguments start to look non-constructive to me. Correct me where I'm wrong in the following reasoning.
(1) You object to UTF-8 strings in boost interface because someone may pass something other than UTF-8 there and it's going to be undetected at compile time:
namespace boost { void func(const std::string& a); } // UTF-8
boost::func(non_utf_string); //oops
You're proposing a `text` class that is meant to somehow overcome this problem. So you change the boost interface to accept `text` but user code is left unchanged...:
namespace boost { void func(const text& a); }
boost::func(non_utf_string); //oops, the implicit converting constructor from std::string is called.
Yes, you can make this constructor explicit, so the above code stops compiling and the user must write explicitly: boost::func(text(non_utf_string));
But then there is nothing in your proposal that makes std::string utf-8 encoded by 'default'. Default == implicit.
As soon as the client did a cast, the client made the claim that non_utf_string met the requirements of the text class' constructor. The problem is that of the client misusing the class by an ill-advised cast. What's more, I think Soares indicated a debug-build validation that the argument indeed was UTF-8. I don't see a problem in that design, once the constructor is explicit.
Besides, it does not harm you in any way
It does. I already use UTF-8 for all my strings, even on windows, and I don't want the code-bloat of all these conversions (even if they're no-ops).
What code bloat do you get from NOPs? Sure, there is more compilation time for the compiler to parse the text code and then for the optimizer to streamline it into a NOP, but even that is very likely negligible.
_____
Rob Stewart robert.stewart@sig.com

On Mon, Aug 15, 2011, Stewart, Robert wrote:
You're proposing a `text` class that is meant to somehow overcome this problem. So you change the boost interface to accept `text` but user code is left unchanged...:
namespace boost { void func(const text& a); }
boost::func(non_utf_string); //oops, the implicit converting constructor from std::string is called.
Yes, you can make this constructor explicit, so the above code stops compiling and the user must write explicitly: boost::func(text(non_utf_string));
But then there is nothing in your proposal that makes std::string utf-8 encoded by 'default'. Default == implicit.
As soon as the client did a cast, the client made the claim that non_utf_string met the requirements of the text class' constructor. The problem is that of the client misusing the class by an ill-advised cast. What's more, I think Soares indicated a debug-build validation that the argument indeed was UTF-8.
In my design the programmer must explicitly choose an encoding in the template instantiation and also explicitly call the constructor to build the Unicode string. As long as we make it hard enough that the programmer has to call the constructor consciously, the rest of the responsibility falls on the programmer. Remember, the goal is only to prevent accidental misuse, not intentional misuse.
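A minimal sketch of the idea described above, assuming C++11. The names `utf8_tag`, `utf16_tag`, `unicode_string`, `utf8_text` and `byte_length` are hypothetical and are not taken from Boost.Ustr; the sketch only illustrates how an encoding chosen in the template argument plus an explicit constructor turns an accidental conversion into a compile error.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <type_traits>
#include <utility>

// Hypothetical encoding tags (illustrative names, not a real library API).
struct utf8_tag {};
struct utf16_tag {};

// Hypothetical adapter: the encoding is part of the type, and the
// constructor is explicit, so a std::string never converts silently.
template <typename Encoding>
class unicode_string {
public:
    explicit unicode_string(std::string bytes) : bytes_(std::move(bytes)) {}
    const std::string& raw() const { return bytes_; }
private:
    std::string bytes_;
};

typedef unicode_string<utf8_tag> utf8_text;

// A library function that only accepts text declared as UTF-8.
std::size_t byte_length(const utf8_text& t) { return t.raw().size(); }

// Passing a bare std::string would not compile: no implicit conversion exists.
static_assert(!std::is_convertible<std::string, utf8_text>::value,
              "std::string must not convert implicitly");
// Writing utf8_text(str) at the call site is allowed.
static_assert(std::is_constructible<utf8_text, std::string>::value,
              "explicit construction must be possible");
// Text tagged with another encoding cannot be passed by accident either.
static_assert(!std::is_convertible<unicode_string<utf16_tag>, utf8_text>::value,
              "different encodings are distinct types");
```

With these definitions, `byte_length(some_std_string)` is rejected by the compiler, while `byte_length(utf8_text(some_std_string))` compiles and records at the call site that the caller vouches for the encoding.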

On Sat, Aug 13, 2011 at 23:24, Robert Ramey <ramey@rrsd.com> wrote:
Dave Abrahams wrote:
std::string represents a sequence of "char" objects that happens to be useful for text processing. It can represent a text in any encoding.
The question is how we treat this sequence... And this is a matter of policy and requirements of the library.
I think I agree with Artyom here. *Somebody* has to decide how that datatype will be interpreted when we receive it. Unless we refuse altogether to accept std::string in our interfaces (which sounds like a bad idea to me), why not make the decision that it's UTF-8?
hmmm - why can't we just leave it at "std::string represents a sequence of "char""
Because we are talking here about what 'a sequence of char' means, and you *must* define it somehow.
and define some derivative class which defines it as "a refinement of std::string which supports UTF-8 functionality"?
Even when wrapping it you must still define the conversions from 'sequences of chars'. Here we come to the original problem.

On Mon, Aug 15, 2011 at 16:19, Stewart, Robert <Robert.Stewart@sig.com> wrote:
[...] As soon as the client did a cast, the client made the claim that non_utf_string met the requirements of the text class' constructor. The problem is that of the client misusing the class by an ill-advised cast. What's more, I think Soares indicated a debug-build validation that the argument indeed was UTF-8.
I don't see a problem in that design, once the constructor is explicit.
I don't want to do any explicit casts. I want UTF-8 by default, at least as an optional feature for me and others who think like me. I can afford the risk of writing wrong code, which is really small if you know what you're doing. And I'm saying this as a maintainer of ~1MLOC codebase which uses this convention on *windows*. Regarding UTF-8 validation, it's not bullet-proof. Many non-UTF8 sequences may pass the validation. 8-bit encodings that don't coincide with ASCII are even more likely to result in false positives.
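The point about validation can be illustrated with a small sketch. The `is_utf8` below is a hypothetical stand-alone helper written for this example (the thread only names such a function, it does not show one), and `false_positive_demo` is likewise illustrative. It checks well-formedness only, so any byte sequence that happens to be well-formed UTF-8 passes, whatever encoding the author actually intended.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: returns true if s is well-formed UTF-8
// (rejects truncated sequences, overlong forms, surrogates, > U+10FFFF).
bool is_utf8(const std::string& s) {
    // Smallest code point that legitimately needs a sequence of each length.
    static const unsigned long min_cp[] = {0, 0, 0x80, 0x800, 0x10000};
    std::size_t i = 0;
    const std::size_t n = s.size();
    while (i < n) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        std::size_t len;
        unsigned long cp;
        if (c < 0x80)                { len = 1; cp = c; }
        else if ((c & 0xE0) == 0xC0) { len = 2; cp = c & 0x1F; }
        else if ((c & 0xF0) == 0xE0) { len = 3; cp = c & 0x0F; }
        else if ((c & 0xF8) == 0xF0) { len = 4; cp = c & 0x07; }
        else return false;                 // stray continuation or invalid lead byte
        if (n - i < len) return false;     // sequence cut off at end of input
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char cc = static_cast<unsigned char>(s[i + k]);
            if ((cc & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (cc & 0x3F);
        }
        if (cp < min_cp[len]) return false;               // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;   // surrogate range
        if (cp > 0x10FFFF) return false;                  // beyond Unicode
        i += len;
    }
    return true;
}

void false_positive_demo() {
    // Bytes 0xC3 0xA9 read as the two characters "Ã©" in ISO-8859-1,
    // but they are also the well-formed UTF-8 encoding of "é".
    assert(is_utf8(std::string("\xC3\xA9")));
    // ISO-8859-1 "café" ends in a lone 0xE9, which is not well-formed UTF-8.
    assert(!is_utf8(std::string("caf\xE9")));
    // Pure ASCII passes, no matter which ASCII-compatible encoding produced it.
    assert(is_utf8(std::string("plain ASCII")));
}
```

The first assertion is the false positive: the validator has no way to tell that the two bytes were meant as ISO-8859-1 text rather than UTF-8.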
Besides, it does not harm you in any way
It does. I already use UTF-8 for all my strings, even on windows, and I don't want the code-bloat of all these conversions (even if they're no-ops).
What code bloat do you get from NOPs? Sure, there is more compilation time for the compiler to parse the text code and then for the optimizer to streamline it into a NOP, but even that is very likely negligible.
I'm talking about source-code bloat. About the boilerplate code I have to write even if I already use UTF-8 everywhere:
std::string str = some_utf_8_string;
boost::utf8_function(text(str)); // Yes, I like UTF-8
boost2::utf8_function(str); // but I like it more when it's the default.
--
Yakov
participants (3)
- Soares Chen Ruo Fei
- Stewart, Robert
- Yakov Galka