
On Wed, Jan 26, 2011 at 3:47 PM, Matus Chochlik <chochlik@gmail.com> wrote:
On Fri, Jan 21, 2011 at 1:07 PM, Dean Michael Berris <mikhailberis@gmail.com> wrote:
Mostly I'm interested in seeing a string class that is:
1. Immutable. No if's or but's about it. I don't want a string to be modifiable. Period. You can create it, and once it's created, that's it.
2. Has real value semantics. This means, once you've copied it, that's really copied. No funky copy-on-write reference-counting mumbo-jumbo.
I also prefer nothing too fancy. But most of these things are implementation details, let us get the interface right first and focus on the optimizations afterwards.
Actually, it's not an implementation detail. Value semantics has everything to do with the interface and not the implementation. It's just that, at the time I was thinking about and writing this reply, I really wanted something lightweight that allowed for unbridled cross-thread access. That original assumption of mine that reference counting was a bad thing has since been clarified by others in the ensuing threads.
3. Has all the algorithms that apply to it defined externally.
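To make the three points above concrete, here is a minimal sketch of what such a class could look like. The names immutable_string, same_contents and concat are made up for illustration and are not part of any proposed Boost interface; the sketch also assumes plain deep copies rather than any shared representation.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Hypothetical immutable string: the contents are fixed at construction
// and no member function can change them afterwards.
class immutable_string {
public:
    immutable_string(const char* s) : data_(s) {}
    explicit immutable_string(const std::string& s) : data_(s) {}

    // Read-only access only.
    const char* begin() const { return data_.data(); }
    const char* end() const { return data_.data() + data_.size(); }
    std::size_t size() const { return data_.size(); }

private:
    std::string data_;  // copying an immutable_string copies this buffer
};

// Algorithms are free functions defined outside the class.
inline bool same_contents(const immutable_string& a, const immutable_string& b) {
    return a.size() == b.size() && std::equal(a.begin(), a.end(), b.begin());
}

// Concatenation builds a third string; neither argument is touched.
inline immutable_string concat(const immutable_string& a, const immutable_string& b) {
    std::string out(a.begin(), a.end());
    out.append(b.begin(), b.end());
    return immutable_string(out);
}
```

Assigning to a variable of this type rebinds the variable to another value; it never edits the characters of an existing string.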
[snip/]
Encoding is a matter of external interpretation and I think should not be part of a string's interface. You can have wrappers that interpret a string as a UTF-* string.
I am all for a generalized-*string* class in the pedantic interpretation of the word i.e. a sequence of chars, char16_ts, bytes, octets, words, dwords, etc. without any enforced encoding for use-cases that call for it, but again,
the reason why I participate in this whole discussion is because I think that C++ deserves also a class focused on the "everyday", *nice* and *convenient* handling of text, without having to worry about how do I need to "view" that raw-chunk-of-binary-data in this call to an OS API function and how do I have to "view" it in that other library call, explicitly specifying to which encoding I want to convert it using *ugly* :-) tag types, etc. (as much as this is possible).
But we already have these everyday nice and convenient text handling algorithms in Boost.Algorithm's String_algo library. As a matter of fact, *all* the implementations cited about dealing with UTF-8 and UTF-16 have everything to do with wrapping raw data into a view of it that (unfortunately) allows for mutating transformations. Note also that I wasn't even going into the generic point of strings being a sequence of anything other than characters to be read. That's a different topic that I don't want to get into at this time. But even the pedantic definition of a string doesn't include mutability as an intrinsic requirement.
Another important concern for me is portability. I'd like (being very self-centered :-P) for example the following:
boost::string s = "Mat" + code_point(0x00FA/*u with acute*/) + code_point(0x0161/*s with caron*/);
std::cout << s << std::endl;
(everywhere where the terminal can handle it) to print: Matúš // hope your email client can handle that :)
instead of: Mat$#@!% or completely upsetting the terminal.
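For what it's worth, the concatenation in that example can be sketched with std::string standing in for the hypothetical boost::string. The code_point type and to_utf8 function below are assumptions made for this sketch (only the name code_point comes from the example), and the sketch assumes the bytes are to be stored as UTF-8. Whether the terminal then renders them correctly is a separate matter that this code does not address.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical code_point type; the definition is assumed for this sketch.
struct code_point {
    std::uint32_t value;
    explicit code_point(std::uint32_t v) : value(v) {}
};

// Encode one code point as UTF-8. Surrogate values are not rejected here;
// a real implementation would validate more carefully.
inline std::string to_utf8(code_point cp) {
    const std::uint32_t c = cp.value;
    assert(c <= 0x10FFFF);
    std::string out;
    if (c < 0x80) {
        out += static_cast<char>(c);
    } else if (c < 0x800) {
        out += static_cast<char>(0xC0 | (c >> 6));
        out += static_cast<char>(0x80 | (c & 0x3F));
    } else if (c < 0x10000) {
        out += static_cast<char>(0xE0 | (c >> 12));
        out += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (c & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (c >> 18));
        out += static_cast<char>(0x80 | ((c >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (c & 0x3F));
    }
    return out;
}

// Each operator returns a new string; nothing is modified in place.
inline std::string operator+(const std::string& lhs, code_point cp) {
    return lhs + to_utf8(cp);
}
inline std::string operator+(const char* lhs, code_point cp) {
    return std::string(lhs) + to_utf8(cp);
}
```

With these definitions, `"Mat" + code_point(0x00FA) + code_point(0x0161)` yields the seven bytes `4D 61 74 C3 BA C5 A1`.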
A few things here:
1. This is totally fine with an immutable string implementation. I don't see any mutations going on here.
2. A string class that "works correctly while immutable" allows for dealing with arbitrary data interpreted as some thunk that is obtained from a given source (as long as you have a length of the data, that is).
3. String I/O can be defined independently of the string, especially if you're dealing with C++ streams. I don't see why the above would be a problem with an immutable string implementation.
4. I don't see why a hypothetical boost::string implementation that is immutable would have portability problems when it just deals with immutable thunks of memory that can be viewed in a different manner depending on the encoding you want at the point where you need to be dealing with a specific encoding.
Also, while I see that for example this
auto it = encoded<utf8_encoding>(original_string), end = encoded<utf8_encoding>();
is perfectly generic and well-designed for some use-cases, the first reaction of the-average-joe-programmer-inside-me when seeing it was, *yuck*. Sorry :-)
So you'd say yuck to any STL algorithm that dealt with iterators? Have you used the Boost.Iterator library yet? Because then you'd be calling all those chaining/wrapping operations "yucky" too. ;)
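As a rough illustration of the "encoding as an external view" idea being argued over, here is a sketch of a read-only decoder that walks the bytes of a string without modifying them. The names utf8_view and code_points are made up for this sketch and are not the encoded<utf8_encoding> interface quoted above; the sketch also assumes well-formed UTF-8 input and does no validation.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical read-only view: decodes UTF-8 on the fly and never writes
// to the underlying bytes. Assumes the input is well-formed UTF-8.
class utf8_view {
public:
    explicit utf8_view(const std::string& bytes)
        : it_(bytes.data()), end_(bytes.data() + bytes.size()) {}

    bool done() const { return it_ == end_; }

    // Decode the next code point and advance past it.
    std::uint32_t next() {
        assert(it_ != end_);
        const unsigned char lead = static_cast<unsigned char>(*it_++);
        int extra = 0;
        std::uint32_t cp = lead;
        if (lead >= 0xF0)      { extra = 3; cp = lead & 0x07; }
        else if (lead >= 0xE0) { extra = 2; cp = lead & 0x0F; }
        else if (lead >= 0xC0) { extra = 1; cp = lead & 0x1F; }
        for (; extra > 0 && it_ != end_; --extra) {
            cp = (cp << 6) | (static_cast<unsigned char>(*it_++) & 0x3F);
        }
        return cp;
    }

private:
    const char* it_;
    const char* end_;
};

// Convenience wrapper: collect all code points of a byte string.
inline std::vector<std::uint32_t> code_points(const std::string& bytes) {
    std::vector<std::uint32_t> out;
    for (utf8_view v(bytes); !v.done(); ) {
        out.push_back(v.next());
    }
    return out;
}
```

The same bytes could be handed to a different view (say, one that treats them as Latin-1) without the string itself knowing or caring about either interpretation.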
Sometimes it is more important for the code and people writing/maintaining it to be nice and easy to understand than to be really-really-generic and smart. That said, it *is* perfectly valid if someone uses the generic version above. Let's do both.
But the problem there is "nice" is really subjective. I absolutely abhor code like this:
boost::string s = "Foo"; s.append("Bar").append("Baz");
When I can express it succinctly and with fewer characters with this instead:
boost::string s = "Foo" ^ "Bar" ^ "Baz";
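A minimal sketch of how such a concatenation operator could be defined follows. The type fixed_string is a made-up stand-in for the proposed boost::string, not an actual interface. One caveat: C++ does not allow overloading an operator when both operands are plain pointers, so the leftmost literal has to be wrapped in the class type for the expression to compile.

```cpp
#include <string>

// Hypothetical stand-in for the proposed immutable string.
class fixed_string {
public:
    fixed_string(const char* s) : data_(s) {}
    explicit fixed_string(const std::string& s) : data_(s) {}

    // Only const access is exposed.
    const std::string& str() const { return data_; }

private:
    std::string data_;
};

// Concatenation builds and returns a new string; both operands stay as they were.
inline fixed_string operator^(const fixed_string& a, const fixed_string& b) {
    return fixed_string(a.str() + b.str());
}
```

Usage would then look like `fixed_string s = fixed_string("Foo") ^ "Bar" ^ "Baz";`. Note that `^` binds less tightly than `==` and `<<`, so mixed expressions need parentheses.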
The reason why I want to call it (std::)string is that many not-so-pedantic people would react to the question "What is your first thought when you hear 'string type'?" with "Some kind of type for handling text, eh?" and not with "Some kind of generalized sequence of elements without any intrinsic encoding having the following properties...". But if there is so much resistance to calling it that then I vote for (boost|std)::text (however this sounds a little awkward to me, I don't know why).
I think you're missing something here though. The point of creating a new string implementation is so that you can generalize a whole family of string-related algorithms around a well-defined abstraction. In this case there's really no question that a string of characters is used to represent "text" -- although it can very well represent a lot of other things too. However you cut it, though, the abstraction is borne out of algorithms that have something to do with strings, like: concatenation, compression, ordering, encoding, decoding, rendering, sub-string, parsing, lexical analysis, search, etc. These algorithms are applied to strings and there are a ton of algorithms dealing with different kinds of strings. Encoding (or interpreting) a string as UTF-8 is just one algorithm, and it would be naive IMO to design a string implementation just around the idea that any string will need an encoding defined, when the algorithms that deal with strings are much more general in reality.
Let us keep the basic_string<CharT> as that generalized string (I never suggested dumping it, just that std::string would be another type and not defined as typedef std::basic_string<char>).
Like I said though, I think we're talking at different levels. I for one think that solving the std::string problem brings more to the world than just solving the encoding problem. Bold statement I know. ;) Also, last time I checked, there are already a ton of Unicode-encoding libraries out there, I don't see why there's a need for yet-another-encoding-library for character strings. This is why I think I'm liking the way Boost.Locale is handling it because it conveys that the library is about making a common interface through which different back-ends can be plugged into. If Boost.Locale dealt with iterators then I think having a string library that is better than std::string in more ways than one gives us a good way of tackling the cross-platform string encoding issue. But there I stress, I think C++ needs a better string implementation than the standard one.
Regarding #1 above and the following ...
x = "Hello,"; x = x ^ " World!";
... would you be against, if the interface in addition also included a few convenience/backward compatibility member functions like ...
string& append(const string& s) { *this = *this ^ s; return *this; }
string& prepend(const string& s) { *this = s ^ *this; return *this; }
... etc? For the same reasons as above: clarity, simplicity (it may not be obvious what a fancy operator expression does, it is more obvious when using names like append, prepend, ...) and people are used to that programming style.
I think this is a slippery slope though. If we make the boost::string look like something that is mutable without it being really mutable, then you have a disconnect between the interface and the semantics you want to convey. Having member functions like 'append' and 'prepend' makes you think that you're modifying the string when in fact you're really building another string. I've already pointed out that string construction can very well be handled by the string streams so I don't think we want to encourage people to think of strings as state-ful objects with mutable semantics because that's not the original intention of the string. That users are building a string instead of "modifying an existing string" *should* be conveyed in the interface. This is largely an issue of documentation though. The short answer to your question would be "yes, I am opposed to having member functions similar to what you have pointed out above". :)
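To illustrate the point about string streams handling construction, here is a small sketch using only the standard library. The function name build_greeting is made up for the example, and std::string stands in for the hypothetical immutable string.

```cpp
#include <sstream>
#include <string>

// All the "mutation" happens inside the stream; the caller only ever
// sees the finished value.
inline std::string build_greeting(const std::string& name, int count) {
    std::ostringstream out;
    out << "Hello, " << name << "! You have " << count << " new messages.";
    return out.str();
}
```

The caller can bind the result to a const object, e.g. `const std::string g = build_greeting("Matus", 3);`, so the builder is the only place where state changes.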
BR,
Thanks for taking the time and I hope this helps! -- Dean Michael Berris about.me/deanberris