
On Wed, Jan 26, 2011 at 9:25 AM, Dean Michael Berris <mikhailberis@gmail.com> wrote:
On Wed, Jan 26, 2011 at 3:47 PM, Matus Chochlik <chochlik@gmail.com> wrote:
On Fri, Jan 21, 2011 at 1:07 PM, Dean Michael Berris <mikhailberis@gmail.com> wrote:
[snip/] I also prefer nothing too fancy. But most of these things are implementation details, let us get the interface right first and focus on the optimizations afterwards.
Actually, it's not an implementation detail. Value semantics has everything to do with the interface and not the implementation.
It's just that, at the time I was thinking about and writing this reply, I really wanted something lightweight that allowed for unbridled cross-thread access. That original assumption of mine that reference counting was a bad thing has since been clarified by others in the ensuing threads.
I didn't say that I regard the immutability or value semantics to be an implementation detail. But some part of the discussion focused on if we should employ COW, how to implement it, etc. Value semantics - a part of the interface specification - can be implemented in a number of ways.
3. Has all the algorithms that apply to it defined externally.
[snip/]
Encoding is a matter of external interpretation and I think should not be part of a string's interface. You can have wrappers that interpret a string as a UTF-* string.
OK, I give up :) I do not insist any more on calling it 'string'.
[snip/]
But we already have these everyday nice and convenient text handling algorithms in Boost.Algorithm's String_algo library.
But still it is encoding agnostic, which is bad in many cases.
As a matter of fact, *all* the implementations cited about dealing with UTF-8 and UTF-16 have everything to do with wrapping raw data into a view of it that (unfortunately) allows for mutating transformations.
Note also that I wasn't even going into the generic point of strings not being a sequence of anything other than characters to be read. That's a different topic that I don't want to get into at this time. But even the pedantic definition of a string doesn't include mutability as an intrinsic requirement.
I really do not have anything against the immutability and the value semantics, see above. I think you misunderstood me :)
Another important concern for me is portability. I'd like (being very self-centered :-P) for example the following:
boost::string s = "Mat" + code_point(0x00FA/*u with acute*/) + code_point(0x0161/*s with caron*/);
std::cout << s << std::endl;
(everywhere where the terminal can handle it) to print: Matúš // hope your email client can handle that :)
instead of: Mat$#@!% or completely upsetting the terminal.
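To make the encoding concern concrete, here is a minimal sketch of what a code_point-style helper would have to do if the target encoding is UTF-8. The name code_point_utf8 is hypothetical and is not part of any proposed boost::string; it assumes a valid Unicode scalar value and a terminal that expects UTF-8.

```cpp
#include <string>

// Hypothetical helper (an assumption for illustration): encode one Unicode
// code point as a UTF-8 byte sequence. Assumes cp is a valid scalar value
// (not a surrogate, not above 0x10FFFF); no validation is done.
std::string code_point_utf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

With this, "Mat" + code_point_utf8(0x00FA) + code_point_utf8(0x0161) yields the seven bytes of "Matúš" in UTF-8. Whether they display correctly still depends on the terminal's encoding, which is exactly the portability question raised here.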
A few things here:
1. This is totally fine with an immutable string implementation. I don't see any mutations going on here.
Me neither :-) What I see however is that it fails because of encoding.
2. A string class that "works correctly while immutable" allows for dealing with arbitrary data interpreted as some thunk that is obtained from a given source (as long as you have a length of the data that is).
Agreed
3. String I/O can be defined independently of the string especially if you're dealing with C++ streams. I don't see why the above would be a problem with an immutable string implementation.
Agreed, but again it has to be convenient. [snip/]
Also, while I see that for example this
auto it = encoded<utf8_encoding>(original_string), end = encoded<utf8_encoding>();
is perfectly generic and well-designed for some use-cases, the first reaction of the-average-joe-programmer-inside-me when seeing it was, *yuck*. Sorry :-)
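For readers unfamiliar with the idea of an "encoded view", the following is a rough sketch of what such a view computes when the bytes are interpreted as UTF-8. It is a plain function, not the iterator pair shown above, and decode_utf8 is a hypothetical name; it assumes well-formed input and does no validation.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for an encoded<utf8_encoding> view: reads the bytes
// of an unmodified string and yields the code points they encode.
// Assumes well-formed UTF-8; malformed input is not detected.
std::vector<char32_t> decode_utf8(const std::string& bytes) {
    std::vector<char32_t> out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        unsigned char lead = static_cast<unsigned char>(bytes[i]);
        // Sequence length is determined by the lead byte.
        std::size_t len = lead < 0x80 ? 1 : lead < 0xE0 ? 2 : lead < 0xF0 ? 3 : 4;
        // Keep only the payload bits of the lead byte.
        char32_t cp = len == 1 ? lead : lead & (0xFF >> (len + 1));
        // Each continuation byte contributes six more bits.
        for (std::size_t k = 1; k < len && i + k < bytes.size(); ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(bytes[i + k]) & 0x3F);
        out.push_back(cp);
        i += len;
    }
    return out;
}
```

The source string is only read, never changed, which is why this kind of interpretation layer is compatible with an immutable string.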
So you'd say yuck to any STL algorithm that dealt with iterators? Have you used the Boost.Iterators library yet because then you'd be calling all those chaining/wrapping operations "yucky" too. ;)
Some of them ? Yes, in many situations. [snip/]
But the problem there is "nice" is really subjective. I absolutely abhor code like this:
boost::string s = "Foo"; s.append("Bar").append("Baz");
When I can express it with fewer characters, and more succinctly, like this instead:
boost::string s = "Foo" ^ "Bar" ^ "Baz";
Agreed, this is a matter of opinion and while I see the beauty of what you propose, it may not be clear what you mean by "Foo" ^ "Bar". If I learned something from this whole discussion, then it is that it's not nice to shove anything (programming style included) down anyone's throat :-)
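As a rough illustration of concatenation with value semantics, here is a minimal sketch of an immutable string with operator^. The class name istring and its members are assumptions made for this example and not the proposed boost::string. One caveat it makes visible: C++ does not allow overloading an operator when both operands are built-in types, so "Foo" ^ "Bar" on two string literals cannot compile as written; at least the leftmost operand has to be of the class type.

```cpp
#include <memory>
#include <string>
#include <utility>

// Hypothetical immutable string (illustration only). Copies share the same
// read-only buffer, so passing by value is cheap.
class istring {
public:
    istring(const char* s) : data_(std::make_shared<std::string>(s)) {}
    explicit istring(std::string s)
        : data_(std::make_shared<std::string>(std::move(s))) {}

    const std::string& str() const { return *data_; }

private:
    std::shared_ptr<const std::string> data_;  // contents never change
};

// Concatenation builds a new value; neither operand is modified.
inline istring operator^(const istring& a, const istring& b) {
    return istring(a.str() + b.str());
}
```

Usage would look like istring s = istring("Foo") ^ "Bar" ^ "Baz"; where the literals on the right convert implicitly and the original operands stay unchanged.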
The reason why I want to call it (std::)string is that many not-so-pedantic people would react to the question "What is your first thought when you hear 'string type'?" with "Some kind of type for handling text, eh?" and not with "Some kind of generalized sequence of elements without any intrinsic encoding having the following properties...". But if there is so much resistance to calling it that then I vote for (boost|std)::text (however this sounds a little awkward to me, I don't know why).
I think you're missing something here though.
The point of creating a new string implementation is so that you can generalize a whole family of string-related algorithms around a well-defined abstraction. In this case there's really no question that a string of characters is used to represent "text" -- although it can very well represent a lot of other things too. However you cut it though the abstraction bears out of algorithms that have something to do with strings like: concatenation, compression, ordering, encoding, decoding, rendering, sub-string, parsing, lexical analysis, search, etc.
And I think you misunderstand me, I *do not* want to stop us from doing such an implementation of string. But just as it is important for you to have the generic string class, it is important for me to have the "nice" 'text' class :) I don't even have anything against boost::text being implemented as a special case of boost::string if it is possible/wise.
[snip/]
Like I said though, I think we're talking in different levels.
I have exactly the same feeling :)
I for one think that solving the std::string problem brings more to the world than just solving the encoding problem. Bold statement I know. ;)
For you (and others) not for me (and others).
Also, last time I checked, there are already a ton of Unicode-encoding libraries out there, I don't see why there's a need for yet-another-encoding-library for character strings. This is why I think I'm liking the way Boost.Locale is handling it because it conveys that the library is about making a common interface through which different back-ends can be plugged into. If Boost.Locale dealt with iterators then I think having a string library that is better than std::string in more ways than one gives us a good way of tackling the cross-platform string encoding issue. But there I stress, I think C++ needs a better string implementation than the standard one.
And what is their level of acceptance by different APIs ?
Regarding #1 above and the following ...
x = "Hello,"; x = x ^ " World!";
... would you be against, if the interface in addition also included a few convenience/backward compatibility member functions like ...
[snip/]
... etc? For the same reasons as above: clarity, simplicity (it may not be obvious what a fancy operator expression does, it is more obvious when using names like append, prepend, ...) and people are used to that programming style.
I think this is a slippery slope though. If we make the boost::string look like something that is mutable without it being really mutable, then you have a disconnect between the interface and the semantics you want to convey.
Having member functions like 'append' and 'prepend' makes you think that you're modifying the string when in fact you're really building another string. I've already pointed out that string construction can very well be handled by the string streams so I don't think we want to encourage people to think of strings as state-ful objects with mutable semantics because that's not the original intention of the string.
That users of the string are building a string instead of "modifying an existing string" *should* be conveyed in the interface. This is largely an issue of documentation though.
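The concern can be sketched in code. In the following hypothetical imm_string (a name and interface assumed purely for illustration), append and prepend are const members that return a new value, implemented as thin wrappers around operator^. Calling s.append("Bar") and discarding the result leaves s untouched, which is the gap between what the name suggests and what actually happens.

```cpp
#include <string>
#include <utility>

// Hypothetical immutable string used only for illustration.
class imm_string {
public:
    imm_string(const char* s) : data_(s) {}
    explicit imm_string(std::string s) : data_(std::move(s)) {}

    // Named wrappers around operator^; both are const and return a new value.
    imm_string append(const imm_string& tail) const;
    imm_string prepend(const imm_string& head) const;

    const std::string& str() const { return data_; }

private:
    std::string data_;  // no member function modifies this
};

// The primary interface: build a new string from two existing ones.
inline imm_string operator^(const imm_string& a, const imm_string& b) {
    return imm_string(a.str() + b.str());
}

inline imm_string imm_string::append(const imm_string& tail) const {
    return *this ^ tail;
}

inline imm_string imm_string::prepend(const imm_string& head) const {
    return head ^ *this;
}
```

In C++17 and later, marking such members [[nodiscard]] lets the compiler warn when the returned string is ignored, which is one way to narrow that gap without dropping the named functions.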
Again, this is a matter of taste. Is the enforcing of our "superior" interface design really that much more important than the level of acceptability by other people who do not share the same opinion ? Nobody forces you to use append/prepend and you should not force others to use the operator ^. IMO in this case you are even at an advantage, because append/prepend/etc. would be wrappers around "your" :) interface. And, yes, they should be clearly documented as such.
Best,
Matus