
On Thu, Jan 27, 2011 at 12:43 AM, Matus Chochlik <chochlik@gmail.com> wrote:
On Wed, Jan 26, 2011 at 5:06 PM, Dean Michael Berris <mikhailberis@gmail.com> wrote:
On Wed, Jan 26, 2011 at 11:19 PM, Matus Chochlik <chochlik@gmail.com> wrote:

[snip/]
Right, but others seem to want to know about the implementation details to try and work out whether the overall interface being designed is actually going to be a viable implementation. So while I say "value semantics" others have asked how that would be implemented and -- being the gratuitous typer that I am ;) -- I would respond. :D
OK :)
:D
So what would be the point of implementing a string "wrapper" that knew its encoding as part of the type if you didn't want to know the encoding in most of the cases? I think I'm missing the logic there.
The logic would be that you no longer would have to be concerned whether Á (A with acute), etc., is encoded as ISO-8859-2 says, or as UTF-8 says, etc. You would *always* handle the string as a sequence of *Unicode* code points, or even "logical characters", and not as a sequence of bytes that happen to be encoded somehow (generally). I can imagine use cases where it would still be OK to get the underlying byte sequence (read-only) for things that are encoding-independent.
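A tiny illustration of the point, using only std::string and hard-coded byte values (the encoding-aware wrapper itself is not shown here): the same logical character has different byte sequences under ISO-8859-2 and UTF-8, and that is exactly the detail such a type would hide.

    #include <iostream>
    #include <string>

    int main() {
        // 'Á' (LATIN CAPITAL LETTER A WITH ACUTE, code point U+00C1)
        std::string iso_8859_2_bytes = "\xC1";      // one byte in ISO-8859-2
        std::string utf8_bytes       = "\xC3\x81";  // two bytes in UTF-8
        // The byte sequences differ, but both denote the same single code
        // point; an encoding-aware text type would present each as one
        // logical character rather than as raw bytes.
        std::cout << iso_8859_2_bytes.size() << " vs "
                  << utf8_bytes.size() << " bytes\n";  // prints: 1 vs 2
    }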
So really this wrapper is the 'view' that I talk about that carries with it an encoding and the underlying data. Right?
So we're obviously talking about two different strings here -- your "text" that knows the encoding and the immutable string that you may or may not build upon. How then do you design the algorithms if you *didn't* want to explicitly specify the encoding you want the algorithms to use?
By saying that the *implicit* encoding is UTF-8 and that should I need to use another encoding I will treat it as a special case.
I don't see the value in requiring that it be part of the 'text', though. I could easily write something like:

    typedef view<utf8_encoded> utf8;

And have something like this be possible:

    utf8 u("The quick brown fox jumps over the lazy dog.");

Now, that's your default utf8-encoded view of the underlying string. Right?
Every time I do not specify an encoding, it is assumed by default to be UTF-8; i.e., when I'm reading text from a TCP connection or from a file, I expect that it is already UTF-8 encoded and would like the string (optionally or always) to validate that for me.
Hmmm... So then it's just a matter of using a type similar to what I pointed out above as the default then?
Then there are two cases: a) The default encoding of std::string, which depends upon std::locale, and the encoding of std::wstring, which is by default treated as UTF-16 on Windows and as UTF-32 on Linux, for example. For these I would love to have some simple means of saying to 'boost::text': give me your representation in the encoding that std::string is expected to use, or "build" yourself from the native encoding that std::string is supposed to be using; plus the same for wstring (a rough interface sketch follows after case (b) below).
b) Every other encoding. For example, if I really needed to convert my string to IBM CP850 because I want to send it to an old printer, then only in that case should I be required (obviously) to specify the encoding explicitly.
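A minimal sketch of the kind of interface case (a) seems to ask for; the names and the boost::text placeholder are purely hypothetical, not an existing API.

    #include <string>

    namespace boost { class text {}; }  // placeholder only, for illustration

    // Hypothetical conversions between the proposed text type and whatever
    // encoding std::string / std::wstring are conventionally expected to
    // carry on the platform: the std::locale narrow encoding for std::string;
    // UTF-16 wchar_t on Windows, UTF-32 wchar_t on Linux for std::wstring.
    std::string  to_native(boost::text const& t);
    std::wstring to_native_wide(boost::text const& t);
    boost::text  from_native(std::string const& s);
    boost::text  from_native_wide(std::wstring const& s);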
I don't see why the default and the other encoding case are really that different from an interface perspective. The underlying string will still be a series of bytes in memory, and encoding is just a matter of viewing it a given way. Right?
In one of the previous messages I laid out an algorithm template like so:
    template <class String>
    void foo(String s) {
        view<encoding> encoded(s);
        // deal with encoded from here on out
    }
Of course, from foo's user perspective, she wouldn't have to do anything with her string to be passed in. From the algorithm implementer's perspective, you would know exactly what encoding was wanted and how to go about implementing the algorithm, potentially even having something like this as well:
    template <class Encoding>
    void foo(view<Encoding> encoded) {
        // deal with the encoded string appropriately here
    }
And you get the benefits in either case of being able to either explicitly or implicitly deal with strings depending on whether they have been explicitly encoded already or whether it's just a raw set of bytes.
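Putting the two templates above together, here is a self-contained sketch of how both calling styles could coexist; the view type and the utf8_encoding tag are stand-ins invented for the example, not a real interface.

    #include <string>

    struct utf8_encoding {};  // illustrative encoding tag

    template <class Encoding>
    struct view {
        explicit view(std::string const& s) : bytes(s) {}
        std::string bytes;  // stand-in for the underlying immutable string
    };

    // The algorithm imposes the encoding it wants internally.
    template <class String>
    void foo(String const& s) {
        view<utf8_encoding> encoded(s);
        // deal with encoded from here on out
    }

    // The algorithm accepts an already-encoded view.
    template <class Encoding>
    void foo(view<Encoding> const& encoded) {
        // deal with the encoded string appropriately here
    }

    int main() {
        std::string raw = "The quick brown fox";
        foo(raw);                        // implicit: foo chooses the encoding
        foo(view<utf8_encoding>(raw));   // explicit: caller encoded it already
    }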
I see that this is OK for many use cases. But having a single, pre-defined default encoding also has its advantages, because usually you can skip the whole view<Encoding> part.
So if `typedef view<Encoding> utf8` were there, how far would that be from the default-encoding case? And why does it have to be UTF in particular, for that matter?
So, if there was a way to "encode" (there's that word again) the data in an immutable string into an acceptably-rendered `char const *`, would that solve the problem? The whole point of my assertion (and Dave's question) is whether c_str() would have to be intrinsic to the string, which, as I pointed out in a different message (not too long ago), could very well be an external algorithm.
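As a rough illustration of c_str() as an external algorithm rather than a member; the segmented representation below is made up purely for the sketch.

    #include <cstddef>
    #include <string>
    #include <vector>

    // Made-up stand-in for an immutable string stored as several chunks.
    struct immutable_string {
        std::vector<std::string> chunks;
    };

    // Free-function c_str(): linearizes into a caller-provided buffer and
    // returns a pointer to NUL-terminated data. A real implementation could
    // cache the result, or hand back the internal buffer directly when the
    // data already happens to be contiguous.
    char const* c_str(immutable_string const& s, std::string& buffer) {
        buffer.clear();
        for (std::size_t i = 0; i < s.chunks.size(); ++i)
            buffer += s.chunks[i];
        return buffer.c_str();
    }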
Generally speaking, the syntax is not that important to me; I can get used to almost everything :) so c_str(my_str) is OK with me, as long as it does not just copy the string regardless of the internal representation. As Robert said, if the internal string data is already contiguous then this should be a no-op.
    boost::string s = get_huge_string();
    s = s ^ get_another_huge_string();
    s = s ^ get_yet_another_huge_string();
    std::string(s).c_str()
is too inefficient for my taste.
Why is it inefficient when there's no need for an actual copy to be involved? s ^ get_huge_string() would basically yield a lazily composed concatenation which could just hold references to the original strings (again, with potential for optimizations depending on the length of the strings, etc.). So you can layer that up and only need to linearize it when it's actually required -- in the conversion to std::string in this case. And if you really wanted to just linearize the string into a void * buffer somewhere, that should be perfectly fine as well. I guess assuming that actual temporaries get built when concatenating strings (like std::string would have you believe) makes it look really inefficient, but there should be a way of making it more efficient *because* the string is immutable.
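A toy sketch of that lazily composed concatenation, using a hand-rolled lazy_string with std::shared_ptr for brevity; this is not the proposed library, it only shows that operator^ can hold references to its operands and defer any copying until the result is linearized.

    #include <iostream>
    #include <memory>
    #include <string>

    class lazy_string {
    public:
        lazy_string(std::string s)
            : leaf_(std::make_shared<std::string>(std::move(s))) {}

        // Concatenation builds a node that merely references its operands;
        // no character data is copied here.
        friend lazy_string operator^(lazy_string const& a, lazy_string const& b) {
            lazy_string r;
            r.left_  = std::make_shared<lazy_string>(a);
            r.right_ = std::make_shared<lazy_string>(b);
            return r;
        }

        // Linearize on demand, e.g. when a contiguous std::string is needed.
        std::string str() const {
            if (leaf_) return *leaf_;
            return left_->str() + right_->str();
        }

    private:
        lazy_string() {}
        std::shared_ptr<std::string const> leaf_;
        std::shared_ptr<lazy_string const> left_, right_;
    };

    int main() {
        lazy_string s = lazy_string("huge ") ^ lazy_string("another huge ")
                                             ^ lazy_string("yet another huge");
        std::cout << s.str() << '\n';  // the pieces are copied only here
    }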
Right. This is Boost anyway, and I've always viewed libraries that get proposed to and accepted into Boost as the kinds of libraries that are developed to eventually be made part of the C++ standard library.
So while out of the gate the string implementation may very well not be called std::string, I don't see why the current std::string can't be deprecated later on (look at std::auto_ptr) and a different implementation put in its place? :D Of course that may very well be C++21xx, so I don't think I need to worry about it having to be a std::string killer at the outset. ;)
If you pull this off (replacing std::string without having a transition period with a backward compatible interface) then you will be my personal hero. :-)
Well, don't hold your breath for that because, well, you won't have 'erase' and other things that std::string supports, so it won't be backward compatible with std::string. :)
Wait... provided that the encoding-related stuff I described above will be part of the string :) or that there will be some wrapper around it providing that functionality.
    typedef view<utf8_encoding> utf8;

I don't see why that shouldn't work for your requirements. :)

--
Dean Michael Berris
about.me/deanberris