
Dave Abrahams wrote:
IIUC, you're talking about changing the abstraction presented by std::string to "sequence of individually addressable and mutable chars that by convention represents text encoded as utf-8."
Something like that. string is just char[] with value semantics. It doesn't necessarily hold a valid UTF-8 sequence.
I would prefer to be handling something that presents the abstraction "character string." I'm not sure exactly what that looks like, but I'm pretty sure the "individually addressable and mutable chars" part should go. I'd like to see an interface that prevents corrupting the underlying data such that it no longer represents a valid sequence of characters (or at least makes it highly unlikely that such corruption could happen accidentally). Furthermore, there are lots of string-y things I'd want to do that aren't provided—or aren't provided well—by std::string, e.g. if (s1.starts_with(s2)) {...}
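For illustration, here is a minimal sketch of the starts_with check mentioned above, written as a free function over std::string. The free-function form and the byte-wise comparison are assumptions made for this example, not something proposed in the thread.

```cpp
#include <string>

// Hypothetical helper, for illustration only.
// Returns true if `s` begins with `prefix`, comparing raw chars.
inline bool starts_with(const std::string& s, const std::string& prefix)
{
    // compare(pos, len, str) compares s[pos, pos+len) against str.
    return s.size() >= prefix.size()
        && s.compare(0, prefix.size(), prefix) == 0;
}
```

With this, the test reads if (starts_with(s1, s2)) {...}. When both strings hold valid UTF-8, a byte-wise prefix match is also a code-point prefix match, though it does not account for Unicode normalization.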
Does this make more sense?
It makes sense in the abstract. But there is no way to protect against corruption without also establishing an invariant that the sequence is not corrupted (that is, that it represents valid UTF-8), and I don't usually need such a string in the interfaces we're discussing, although it can certainly be useful on its own.
The interfaces that talk to the OS need to be able to carry arbitrary char sequences (in the POSIX case). Even an interface that displays the string, one that by necessity must interpret it as UTF-8, should preferably handle invalid UTF-8 and display placeholders in place of the invalid subsequence; it's better for the user to see parts of the string than nothing at all. Aborting the whole operation with an invalid_utf8 exception would be even worse.
I don't particularly like string's mutable chars, but they don't mutate themselves without my telling them to, so things tend to work out fine. :-)
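To make the display case concrete, here is a rough sketch of a function that accepts an arbitrary char sequence and produces something safe to show, substituting U+FFFD for bytes that do not form well-formed UTF-8. The names utf8_sequence_length and display_utf8 are hypothetical, and emitting one placeholder per offending byte is just one possible policy.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: length (1-4) of the well-formed UTF-8 sequence
// starting at s[i], or 0 if the bytes there are not well-formed.
inline std::size_t utf8_sequence_length(const std::string& s, std::size_t i)
{
    const unsigned char b0 = static_cast<unsigned char>(s[i]);
    std::size_t len = 0;
    unsigned char lo = 0x80, hi = 0xBF;  // allowed range for the second byte

    if (b0 <= 0x7F) return 1;
    else if (b0 >= 0xC2 && b0 <= 0xDF) { len = 2; }
    else if (b0 >= 0xE0 && b0 <= 0xEF) {
        len = 3;
        if (b0 == 0xE0) lo = 0xA0;       // reject overlong forms
        if (b0 == 0xED) hi = 0x9F;       // reject surrogates
    }
    else if (b0 >= 0xF0 && b0 <= 0xF4) {
        len = 4;
        if (b0 == 0xF0) lo = 0x90;       // reject overlong forms
        if (b0 == 0xF4) hi = 0x8F;       // reject values above U+10FFFF
    }
    else return 0;                       // stray continuation or invalid lead

    if (s.size() - i < len) return 0;    // truncated sequence
    for (std::size_t k = 1; k < len; ++k) {
        const unsigned char b = static_cast<unsigned char>(s[i + k]);
        if (b < lo || b > hi) return 0;
        lo = 0x80; hi = 0xBF;            // later bytes use the full range
    }
    return len;
}

// Hypothetical helper: copies well-formed sequences unchanged and emits
// U+FFFD (encoded as UTF-8) for each byte that is not part of one.
inline std::string display_utf8(const std::string& in)
{
    static const char placeholder[] = "\xEF\xBF\xBD";
    std::string out;
    out.reserve(in.size());
    std::size_t i = 0;
    while (i < in.size()) {
        const std::size_t n = utf8_sequence_length(in, i);
        if (n == 0) { out += placeholder; ++i; }
        else        { out.append(in, i, n); i += n; }
    }
    return out;
}
```

The input stays a plain std::string throughout, so the same value can still be handed to the OS untouched; only the copy that is shown to the user is sanitized, and nothing throws.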