Re: [boost] [string] proposal

26 Jan 2011

On Wed, Jan 26, 2011 at 9:25 AM, Dean Michael Berris
<mikhailberis@gmail.com> wrote:
...
On Wed, Jan 26, 2011 at 3:47 PM, Matus Chochlik <chochlik@gmail.com> wrote:
...
On Fri, Jan 21, 2011 at 1:07 PM, Dean Michael Berris
<mikhailberis@gmail.com> wrote:
...
[snip/]
I also prefer nothing too fancy. But most of these things
are implementation details, let us get the interface
right first and focus on the optimizations afterwards.
Actually, it's not an implementation detail. Value semantics has
everything to do with the interface and not the implementation.
It's just that, at the time I was thinking about and writing this
reply, I was just really wanting something lightweight and allowed for
unbridled cross-thread access. That original assumption of mine that
reference counting was a bad thing has since been clarified by others
in the ensuing threads.
I didn't say that I regard the immutability or value semantics to
be an implementation detail. But some part of the discussion
focused on whether we should employ COW, how to implement it,
etc. Value semantics - a part of the interface specification -
can be implemented in a number of ways.
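
To illustrate that last point, here is a minimal sketch of one of those ways: an immutable buffer shared between copies by reference counting. The name immutable_string and its members are made up for this example; it is not the interface proposed in this thread, only one possible shape under the stated assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>

// Hypothetical type for illustration only; not a proposed boost::string.
class immutable_string {
public:
    immutable_string() : data_(std::make_shared<std::string>()) {}
    immutable_string(const char* s) : data_(std::make_shared<std::string>(s)) {}
    immutable_string(const std::string& s) : data_(std::make_shared<std::string>(s)) {}

    // Read-only access; there is deliberately no mutating member.
    const std::string& str() const { return *data_; }
    std::size_t size() const { return data_->size(); }

    // Exposed only so the sharing can be observed in this sketch.
    bool shares_buffer_with(const immutable_string& other) const {
        return data_ == other.data_;
    }

private:
    // The buffer is never modified after construction, so copies can
    // share it without the sharing being visible through the interface.
    std::shared_ptr<const std::string> data_;
};

inline bool operator==(const immutable_string& a, const immutable_string& b) {
    return a.str() == b.str();
}

// Small self-check: a copy compares equal and reuses the original's buffer.
inline void check_copy_shares() {
    immutable_string a("abc");
    immutable_string b = a;
    assert(a == b && b.shares_buffer_with(a));
}
```

The same interface could equally be backed by a plain deep copy; callers would not be able to tell the difference, which is the sense in which the sharing strategy is an implementation detail while the value semantics are not.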
...
...
...
3. Has all the algorithms that apply to it defined externally.
[snip/]
...
Encoding is a matter of external interpretation and I think should not
be part of a string's interface. You can have wrappers that interpret
a string as a UTF-* string.
OK, I give up :) I do not insist any more on calling it 'string'.
...
...
[snip/]
...
But we already have these everyday nice and convenient text handling
algorithms in Boost.Algorithm's String_algo library.
But still it is encoding agnostic, which is bad in many cases.
...
As a matter of fact, *all* the implementations cited about dealing
with UTF-8 and UTF-16 have everything to do with wrapping raw data
into a view of it that (unfortunately) allows for mutating
transformations.
Note also that I wasn't even going into the generic point of strings not
being a sequence of anything other than characters to be read. That's
a different topic that I don't want to get into at this time. But even
the pedantic definition of a string doesn't include mutability as an
intrinsic requirement.
I really do not have anything against the immutability
and the value semantics, see above. I think you
misunderstood me :)
...
...
Another important concern for me is portability.
I'd like (being very self-centered :-P) for example
the following:
boost::string s = "Mat" + code_point(0x00FA/*u with acute*/) +
code_point(0x0161/*s with caron*/);
std::cout << s << std::endl;
(everywhere where the terminal can handle it) to print:
Matúš // hope your email client can handle that :)
instead of:
Mat$#@!%
or completely upsetting the terminal.
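
For comparison, a rough sketch of what it takes to get that output with a plain std::string today, assuming the terminal expects UTF-8. The helper append_utf8 is hypothetical (it is neither a Boost nor a standard library function) and the encoding is done by hand.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: append the UTF-8 encoding of code point cp to out.
// Returns false (and appends nothing) for surrogates and values above U+10FFFF.
inline bool append_utf8(std::string& out, unsigned long cp) {
    if (cp >= 0xD800 && cp <= 0xDFFF) return false;
    if (cp > 0x10FFFF) return false;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return true;
}

// Builds "Mat" followed by U+00FA and U+0161, encoded as UTF-8.
inline std::string make_name() {
    std::string s = "Mat";
    const bool ok = append_utf8(s, 0x00FA)    // u with acute
                 && append_utf8(s, 0x0161);   // s with caron
    assert(ok);
    (void)ok;  // avoids an unused-variable warning when NDEBUG disables assert
    return s;
}
```

Writing make_name() to std::cout prints the expected name only where the terminal interprets the bytes as UTF-8, which is exactly the portability concern: the string itself carries no record of its encoding.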
A few things here:
1. This is totally fine with an immutable string implementation. I
don't see any mutations going on here.
Me neither :-) What I see however is that it fails because
of encoding.
...
2. A string class that "works correctly while immutable" allows for
dealing with arbitrary data interpreted as some thunk that is obtained
from a given source (as long as you have a length of the data that
is).
Agreed
...
3. String I/O can be defined independently of the string especially if
you're dealing with C++ streams. I don't see why the above would be a
problem with an immutable string implementation.
Agreed, but again it has to be convenient.

[snip/]
...
...
...
Also, while I see that for example this
auto it = encoded<utf8_encoding>(original_string), end =
encoded<utf8_encoding>();
is perfectly generic and well-designed
for some use-cases, the first reaction of
the-average-joe-programmer-inside-me
when seeing it was, *yuck*. Sorry :-)
So you'd say yuck to any STL algorithm that dealt with iterators? Have
you used the Boost.Iterators library yet because then you'd be calling
all those chaining/wrapping operations "yucky" too. ;)
Some of them ? Yes, in many situations.
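
For readers who have not seen the iterator-wrapping style being debated, here is a rough sketch of what such a decoding view could look like. The class utf8_iterator is made up for this illustration and is not the encoded<utf8_encoding> interface from the thread; it assumes well-formed UTF-8 input and performs no validation.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical read-only view that walks a std::string as UTF-8 code points.
// Assumes the input is well-formed UTF-8; no validation is attempted.
class utf8_iterator {
public:
    utf8_iterator(const std::string& s, std::size_t pos) : s_(&s), pos_(pos) {
        assert(pos <= s.size());
    }

    // Decode the code point that starts at the current position.
    unsigned long operator*() const {
        const unsigned char lead = static_cast<unsigned char>((*s_)[pos_]);
        const std::size_t n = length(lead);
        // Keep only the payload bits of the lead byte (7, 5, 4 or 3 bits).
        unsigned long cp = (n == 1) ? lead : (lead & (0xFFu >> (n + 1)));
        for (std::size_t i = 1; i < n; ++i)
            cp = (cp << 6) | (static_cast<unsigned char>((*s_)[pos_ + i]) & 0x3Fu);
        return cp;
    }

    // Advance by one whole code point, not by one byte.
    utf8_iterator& operator++() {
        pos_ += length(static_cast<unsigned char>((*s_)[pos_]));
        return *this;
    }

    bool operator==(const utf8_iterator& o) const { return s_ == o.s_ && pos_ == o.pos_; }
    bool operator!=(const utf8_iterator& o) const { return !(*this == o); }

private:
    // Number of bytes in the sequence introduced by lead byte b.
    static std::size_t length(unsigned char b) {
        if (b < 0x80) return 1;
        if ((b >> 5) == 0x06) return 2;  // 110xxxxx
        if ((b >> 4) == 0x0E) return 3;  // 1110xxxx
        return 4;                        // 11110xxx
    }

    const std::string* s_;
    std::size_t pos_;
};

// Counts code points rather than bytes by walking the view end to end.
inline std::size_t count_code_points(const std::string& s) {
    std::size_t n = 0;
    for (utf8_iterator it(s, 0), end(s, s.size()); it != end; ++it) ++n;
    return n;
}
```

The underlying string is never modified; the view only changes how its bytes are interpreted. Whether the begin/end pair at the call site reads as elegant or "yucky" is the matter of taste discussed above.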

[snip/]
...
But the problem there is "nice" is really subjective. I absolutely
abhor code like this:
 boost::string s = "Foo";
 s.append("Bar").append("Baz");
When I can express it entirely with fewer characters and succinctly
with this instead:
 boost::string s = "Foo" ^ "Bar" ^ "Baz";
Agreed, this is a matter of opinion and while
I see the beauty of what you propose, it may
not be clear what you mean by "Foo" ^ "Bar".
If I learned something from this whole discussion,
then it is that it's not nice to shove anything (programming
style included) down anyone's throat :-)
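
As a concrete note on the "Foo" ^ "Bar" spelling, a minimal sketch follows. The class istring is a hypothetical stand-in, not the proposed boost::string. One thing that is certain from the language rules is that an overloaded operator needs at least one operand of class or enumeration type, so an expression made of string literals only cannot compile as written.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for the proposed string; only what the example needs.
class istring {
public:
    istring(const char* s) : data_(s) {}
    explicit istring(const std::string& s) : data_(s) {}
    const std::string& str() const { return data_; }

private:
    std::string data_;  // never modified after construction
};

// Concatenation builds a new string and leaves both operands untouched.
inline istring operator^(const istring& a, const istring& b) {
    return istring(a.str() + b.str());
}

inline istring foo_bar_baz() {
    // istring s = "Foo" ^ "Bar" ^ "Baz";  // would not compile: both operands
    //                                     // of the first ^ are built-in types
    istring s = istring("Foo") ^ "Bar" ^ "Baz";
    assert(s.str() == "FooBarBaz");
    return s;
}
```

Another readability point: ^ binds more loosely than << in C++, so std::cout << a ^ b parses as (std::cout << a) ^ b and needs parentheses around the concatenation.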
...
...
The reason why I want to call it (std::)string
is that many not-so-pedantic people would react
to the question "What is your first thought when
you hear 'string type'?" with "Some kind of type
for handling text, eh?" and not with "Some kind
of generalized sequence of elements without any
intrinsic encoding having the following
properties...". But if there is so much resistance
to calling it that then I vote for (boost|std)::text
(however this sounds a little awkward to me, I don't
know why).
I think you're missing something here though.
The point of creating a new string implementation is so that you can
generalize a whole family of string-related algorithms around a
well-defined abstraction. In this case there's really no question that
a string of characters is used to represent "text" -- although it can
very well represent a lot of other things too. However you cut it
though, the abstraction is born out of algorithms that have something to
do with strings like: concatenation, compression, ordering, encoding,
decoding, rendering, sub-string, parsing, lexical analysis, search,
etc.
And I think you misunderstand me, I *do not* want to stop us
from doing such implementation of string. But just as it is important
for you to have the generic string class, it is important for me to have
the "nice" 'text' class :) I don't even have anything against
boost::text being implemented as a special case of boost::string
if it is possible/wise.
...
[snip/]
...
Like I said though, I think we're talking in different levels.
I have exactly the same feeling :)
...
I for one think that solving the std::string problem brings more to
the world than just solving the encoding problem. Bold statement I
know. ;)
For you (and others) not for me (and others).
...
Also, last time I checked, there are already a ton of Unicode-encoding
libraries out there, I don't see why there's a need for
yet-another-encoding-library for character strings. This is why I
think I'm liking the way Boost.Locale is handling it because it
conveys that the library is about making a common interface through
which different back-ends can be plugged into. If Boost.Locale dealt
with iterators then I think having a string library that is better
than std::string in more ways than one gives us a good way of tackling
the cross-platform string encoding issue. But there I stress, I think
C++ needs a string implementation better than the standard one.
And what is their level of acceptance by different APIs ?
...
...
Regarding #1 above and the following ...
...
x = "Hello,";
x = x ^ " World!";
... would you be against, if the interface in addition also
included a few convenience/backward compatibility
member functions like ...
[snip/]
...
...
... etc? For the same reasons as above: clarity,
simplicity (it may not be obvious what a fancy
operator expression does, it is more obvious
when using names like append, prepend, ...) and
people are used to that programming style.
I think this is a slippery slope though. If we make the boost::string
look like something that is mutable without it being really mutable,
then you have a disconnect between the interface and the semantics you
want to convey.
Having member functions like 'append' and 'prepend' makes you think
that you're modifying the string when in fact you're really building
another string. I've already pointed out that string construction can
very well be handled by the string streams so I don't think we want to
encourage people to think of strings as state-ful objects with mutable
semantics because that's not the original intention of the string.
The fact that users of the string are building a new
string instead of "modifying an existing string" *should* be conveyed
in the interface. This is largely an issue of documentation though.
Again, this is a matter of taste.
Is the enforcing of our "superior" interface design really that much
more important than the level of acceptability to other people who
do not share the same opinion ? Nobody forces you to use append/
prepend and you should not force others to use the operator ^.
IMO in this case you are even at an advantage, because append/
prepend/etc. would be wrappers around "your" :) interface.
And, yes, they should be clearly documented as such.
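
A minimal sketch of what I mean by wrappers, under the assumption that the string is immutable and ^ is the primary concatenation operation. The name imm_string and its members are hypothetical and only serve this illustration.

```cpp
#include <cassert>
#include <string>

// Hypothetical immutable string, just enough to show the wrappers.
class imm_string {
public:
    imm_string(const char* s) : data_(s) {}
    explicit imm_string(const std::string& s) : data_(s) {}
    const std::string& str() const { return data_; }

    // Named wrappers: const members that return a new value
    // and never modify *this.
    imm_string append(const imm_string& tail) const;
    imm_string prepend(const imm_string& head) const;

private:
    std::string data_;  // never modified after construction
};

// The primary operation: build a new string from two existing ones.
inline imm_string operator^(const imm_string& a, const imm_string& b) {
    return imm_string(a.str() + b.str());
}

// The wrappers simply forward to operator^.
inline imm_string imm_string::append(const imm_string& tail) const { return *this ^ tail; }
inline imm_string imm_string::prepend(const imm_string& head) const { return head ^ *this; }

inline imm_string greeting() {
    const imm_string x = "Hello,";
    const imm_string y = x.append(" World!");  // x itself is unchanged
    assert(x.str() == "Hello,");
    return y;
}
```

Because both members are const and return a new object, a call whose result is discarded has no effect, and that is the kind of thing the documentation would have to state clearly.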

Best,

Matus