
At Thu, 20 Jan 2011 06:43:48 +0200, Peter Dimov wrote:
> Dave Abrahams wrote:
>> IIUC, you're talking about changing the abstraction presented by std::string to "sequence of individually addressable and mutable chars that by convention represents text encoded as utf-8."
> Something like that. string is just char[] with value semantics. It doesn't necessarily hold a valid UTF-8 sequence.
Right.
>> I would prefer to be handling something that presents the abstraction "character string." I'm not sure exactly what that looks like, but I'm pretty sure the "individually addressable and mutable chars" part should go. I'd like to see an interface that prevents corrupting the underlying data such that it no longer represents a valid sequence of characters (or at least makes it highly unlikely that such corruption could happen accidentally). Furthermore, there are lots of string-y things I'd want to do that aren't provided (or aren't provided well) by std::string, e.g.
>>
>>     if (s1.starts_with(s2)) {...}
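For illustration, the starts_with test wished for above can be sketched as a free function over std::string. The name and signature are assumptions made for this example (std::string itself only gained a starts_with member in C++20), and the comparison is byte-wise:

```cpp
#include <string>

// Hypothetical helper, not part of any library discussed in this thread:
// returns true if s begins with prefix, comparing raw bytes.
inline bool starts_with(std::string const& s, std::string const& prefix)
{
    return s.size() >= prefix.size()
        && s.compare(0, prefix.size(), prefix) == 0;
}
```

Because the comparison never modifies either string, it cannot corrupt the underlying data, whatever encoding the bytes happen to be in.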
>> Does this make more sense?
> It makes sense in the abstract. But there is no way to protect against corruption without also setting an invariant that the sequence is not corrupted (represents valid UTF-8), and I don't usually need such a string in the interfaces we're discussing, although it can certainly be useful on its own. The interfaces that talk to the OS need to be able to carry arbitrary char sequences (in the POSIX case).
Yup. Then they should be handling raw_string, right?
> Even an interface that displays the string, one that by necessity must interpret it as UTF-8, should preferably handle invalid UTF-8 and display some placeholders instead of the invalid subsequence - it's better for the user to see parts of the string than nothing at all.
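As a rough sketch of the placeholder idea, assuming a policy of one U+FFFD replacement character per offending byte (only one of several possible recovery policies), something like the following would do. The names utf8_sequence_length and display_form are hypothetical and invented for this example:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: length (1-4) of the well-formed UTF-8 sequence that
// starts at s[i], or 0 if the bytes there are not well-formed. Overlong
// forms, surrogates and values above U+10FFFF are rejected.
inline std::size_t utf8_sequence_length(std::string const& s, std::size_t i)
{
    unsigned char const b0 = static_cast<unsigned char>(s[i]);
    if (b0 < 0x80) return 1;                 // plain ASCII

    std::size_t len = 0;
    unsigned char lo = 0x80, hi = 0xBF;      // allowed range of the 2nd byte
    if (b0 >= 0xC2 && b0 <= 0xDF) {
        len = 2;
    } else if (b0 >= 0xE0 && b0 <= 0xEF) {
        len = 3;
        if (b0 == 0xE0) lo = 0xA0;           // rejects overlong 3-byte forms
        if (b0 == 0xED) hi = 0x9F;           // rejects surrogates
    } else if (b0 >= 0xF0 && b0 <= 0xF4) {
        len = 4;
        if (b0 == 0xF0) lo = 0x90;           // rejects overlong 4-byte forms
        if (b0 == 0xF4) hi = 0x8F;           // rejects values above U+10FFFF
    } else {
        return 0;                            // stray continuation or bad lead
    }

    if (i + len > s.size()) return 0;        // truncated sequence
    unsigned char const b1 = static_cast<unsigned char>(s[i + 1]);
    if (b1 < lo || b1 > hi) return 0;
    for (std::size_t k = 2; k < len; ++k) {
        unsigned char const b = static_cast<unsigned char>(s[i + k]);
        if (b < 0x80 || b > 0xBF) return 0;
    }
    return len;
}

// Hypothetical helper: copies well-formed sequences through unchanged and
// substitutes U+FFFD (bytes EF BF BD) for every byte that is not part of one.
inline std::string display_form(std::string const& s)
{
    std::string out;
    for (std::size_t i = 0; i < s.size(); ) {
        std::size_t const n = utf8_sequence_length(s, i);
        if (n == 0) { out += "\xEF\xBF\xBD"; ++i; }
        else        { out.append(s, i, n); i += n; }
    }
    return out;
}
```

The input string is left untouched, so the same arbitrary char sequence can still be handed back to the OS after it has been displayed.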
Yep. Then I guess that should be handling raw_string, too.
> It's even worse to abort the whole operation with an invalid_utf8 exception.
Yowp.

So you want a "resilient utf-8 string:" something that can represent any sequence of chars and, when interpretation is necessary, will interpret them as utf-8, using some kind of best-effort error recovery to avoid hard errors. Then you can have an is_valid_utf_8() routine that is used to check for validity when/if you need it.

I can understand the argument that there's not much to be gained from the type system here. I still like the idea of using something with a real string interface:

    namespace boost
    {
      struct text
      {
          explicit text(std::string);
          operator std::string const&() const { return storage; }
          ...
          bool startswith(text const& s) const;
          bool endswith(text const& s) const;
          text trim() const;
          ...
       private:
          std::string storage;
      };
    }

but I do wonder whether it's worth writing (or paying for the copy in)

    x.startswith(text(some_std_string))

and in general whether the cost of copying std::strings into text::storage is too high.

-- 
Dave Abrahams
BoostPro Computing
http://www.boostpro.com
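A minimal runnable completion of the text sketch above, under several assumptions that are not in the original: it lives in a hypothetical namespace sketch rather than boost, the elided members are simply left out, comparisons are byte-wise, trim strips ASCII whitespace only, and the constructor uses C++11 std::move.

```cpp
#include <string>
#include <utility>

namespace sketch {  // hypothetical namespace, chosen for this example only

class text
{
public:
    // Takes the string by value and moves it into storage (assumes C++11).
    explicit text(std::string s) : storage(std::move(s)) {}

    operator std::string const&() const { return storage; }

    // Byte-wise prefix test.
    bool startswith(text const& s) const
    {
        return storage.size() >= s.storage.size()
            && storage.compare(0, s.storage.size(), s.storage) == 0;
    }

    // Byte-wise suffix test.
    bool endswith(text const& s) const
    {
        return storage.size() >= s.storage.size()
            && storage.compare(storage.size() - s.storage.size(),
                               s.storage.size(), s.storage) == 0;
    }

    // Assumption: "trim" means stripping ASCII whitespace from both ends.
    text trim() const
    {
        char const* const ws = " \t\n\r\f\v";
        std::string::size_type const first = storage.find_first_not_of(ws);
        if (first == std::string::npos) return text(std::string());
        std::string::size_type const last = storage.find_last_not_of(ws);
        return text(storage.substr(first, last - first + 1));
    }

private:
    std::string storage;
};

} // namespace sketch
```

With move support, text(std::move(s)) avoids the copy when the caller is done with s, but x.startswith(text(some_std_string)) still copies into the temporary, so the cost question raised above remains for that spelling.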