
On Wed, Jan 26, 2011 at 5:06 PM, Dean Michael Berris <mikhailberis@gmail.com> wrote:
On Wed, Jan 26, 2011 at 11:19 PM, Matus Chochlik <chochlik@gmail.com> wrote: [snip/]
Right, but others seem to want to know about the implementation details to try and work out whether the overall interface being designed is actually going to be a viable implementation. So while I say "value semantics" others have asked how that would be implemented and -- being the gratuitous typer that I am ;) -- I would respond. :D
OK :)
I still don't understand this though. What does encoding have to do with the string? Isn't encoding a separate process?
Hm, my ability to express myself obviously totally su*ks :) You are completely right that the encoding is a separate process, and I'm saying that I want it to be *completely* hidden from my sight, unless it is absolutely necessary for me to be concerned about it :-)
So what would be the point of implementing a string "wrapper" that knew its encoding as part of the type if you didn't want to know the encoding in most of the cases? I think I'm missing the logic there.
The logic would be that you would no longer have to be concerned whether Á (A with acute), etc., is encoded as ISO-8859-2 says or as UTF-8 says. You would *always* handle the string as a sequence of *Unicode* code points, or even "logical characters", and not as a sequence of bytes that are somehow encoded (generally). I can imagine use cases where it would still be OK to get the underlying byte sequence (read-only) for things that are encoding-independent.
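Just to make the byte-level difference concrete, here is a minimal standalone illustration (the byte values are the standard ones for these two encodings):

    #include <cstdio>

    int main() {
        // The same logical character, U+00C1 (A with acute), as raw bytes:
        unsigned char const latin2[] = { 0xC1 };        // ISO-8859-2: one byte
        unsigned char const utf8[]   = { 0xC3, 0x81 };  // UTF-8: two bytes
        // Byte-oriented code sees one vs. two "characters" here; a
        // code-point-oriented string would present both as a single
        // code point and hide the difference entirely.
        std::printf("%zu byte(s) vs. %zu bytes\n", sizeof(latin2), sizeof(utf8));
        return 0;
    }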
The means for this would be: let us build a string that may (or may not) be based on your general (encoding-agnostic) string, and that handles the transcoding in most cases without me ever looking at the underlying byte sequence, and without functors that require me to specify explicitly, *every time*, which encoding I want. By default I want UTF-8; if I talk to the OS I say I want the string in the encoding that the OS expects, not that I want it in UTF-16, ISO-8859-2, KOI8-R, etc. If and only if I want to handle the string in an encoding other than Unicode should I have to specify that explicitly.
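Something like this interface sketch is what I have in mind. To be clear, none of these names exist anywhere; this is purely a hypothetical illustration of the behavior:

    #include <string>

    class encoding; // opaque handle for "some other encoding" (hypothetical)

    // Hypothetical interface sketch -- not a real API, it only pins
    // down the behavior described above.
    class text {
    public:
        // bytes are assumed to be UTF-8 and validated on construction
        explicit text(char const* utf8_bytes);

        // the common case: no encoding mentioned anywhere
        text& operator+=(text const& other);

        // the OS boundary: "whatever the platform expects", never a
        // hard-coded UTF-16 / UTF-32 / ISO-8859-x
        std::string  to_native_narrow() const;
        std::wstring to_native_wide() const;

        // the escape hatch for everything else (e.g. an old printer)
        std::string to_bytes(encoding const& enc) const;
    };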
So we're obviously talking about two different strings here -- your "text" that knows the encoding and the immutable string that you may or may not build upon. How then do you design the algorithms if you *didn't* want to explicitly specify the encoding you want the algorithms to use?
By saying that the *implicit* encoding is UTF-8 and that, should I need to use another encoding, I will treat it as a special case. Whenever I do not specify an encoding it is assumed by default to be UTF-8, i.e. when I'm reading text from a TCP connection or from a file I expect that it already is UTF-8 encoded, and I would like the string (optionally or always) to validate it for me. Then there are two cases:

a) The default encoding of std::string, which depends upon std::locale, and the encoding of std::wstring, which is for example on Windows by default treated as UTF-16 and on Linux as UTF-32. For these I would love to have some simple means of saying to 'boost::text': give me your representation in the encoding that std::string is expected to be encoded in, or "build" yourself from the native encoding that std::string is supposed to be using; plus the same for std::wstring.

b) Every other encoding. For example, if I really needed to convert my string to IBM CP850 because I want to send it to an old printer, then only in this case should I be required (obviously) to specify the encoding explicitly.
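In code, again purely hypothetical and building on the sketch from my previous mail, the two cases would look something like:

    // Hypothetical usage only; 'text' and 'encoding' are the sketched
    // types from before, not a real API.
    text t("Matu\xC5\xA1");                 // UTF-8 in, validated

    // a) native std::string / std::wstring: the platform's encoding is
    //    implied, I never spell out UTF-16/UTF-32/locale charset
    std::string  narrow = t.to_native_narrow();
    std::wstring wide   = t.to_native_wide();

    // b) anything else is the explicit special case
    std::string printer = t.to_bytes(encoding("IBM CP850"));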
In one of the previous messages I laid out an algorithm template like so:
    template <class String>
    void foo(String s) {
        view<encoding> encoded(s);
        // deal with encoded from here on out
    }
Of course then, from foo's user's perspective, she wouldn't have to do anything with her string to be passed in. From the algorithm implementer's perspective you would know exactly what encoding was wanted and how to go about implementing the algorithm, potentially even having something like this as well:
    template <class Encoding>
    void foo(view<Encoding> encoded) {
        // deal with the encoded string appropriately here
    }
And you get the benefits in either case of being able to either explicitly or implicitly deal with strings depending on whether they have been explicitly encoded already or whether it's just a raw set of bytes.
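To make the two call styles concrete (sketch only; 'view', 'utf8_encoding' and 'boost::string' here are the hypothetical types from the templates above):

    boost::string raw = get_bytes_from_somewhere(); // just bytes, no encoding yet

    foo(raw);                     // implicit: foo itself picks the encoding
                                  // it needs via view<encoding>(s)

    view<utf8_encoding> v(raw);   // explicit: the caller fixes the
    foo(v);                       //           encoding up front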
I see that this is OK for many use cases. But having a single pre-defined default encoding also has its advantages, because usually you can skip the whole view<Encoding> part.
[snip/]
This is a different matter. Again I may be wrong, but I live under the impression that RangeEx has been implemented to hide the ugliness of complex STL iterator-based algorithms.
Of course the proof will be in the pudding. ;)
I think we need to qualify what you refer to as APIs. If just judging from the amount of code that's written against Qt or MFC for example then I'd say "they're pretty well accepted". If you look at the libraries that use ICU as a backend I'd say we already have one in Boost called Boost.Regex. And there's all these other libraries in the Linux arena that have their own little niche to play in the Unicode game -- there's Glib, the GNOME and KDE libraries, ad nauseam.
Besides what you mentioned, an API for me is for example the WINAPI, POSIX API, OpenGL API, OpenSSL API, etc. -- basically all the functions "exported" by the various C/C++ libraries that I cannot imagine my life without :) and which expect not a generic iterator range or a view or whatnot, but a plain and simple pointer (const char*) to a contiguous block in memory containing a zero-terminated C string, or, if we are luckier, a std::string.
So, if there was a way to "encode" (there's that word again) the data in an immutable string into an acceptably-rendered `char const *`, would that solve the problem? The whole point of my assertion (and Dave's question) is whether c_str() would have to be intrinsic to the string; I have pointed out in a different message (not too long ago) that it could very well be an external algorithm.
Generally speaking, the syntax is not that important for me; I can get used to almost everything :) so c_str(my_str) is OK with me, provided it does not boil down to just copying the string whatever the internal representation is. As Robert said, if the internal string data already is contiguous then this should be a no-op.

    boost::string s = get_huge_string();
    s = s ^ get_another_huge_string();
    s = s ^ get_yet_another_huge_string();
    std::string(s).c_str();

This is too inefficient for my taste.
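Something along these lines would be fine with me (sketch only; the member functions queried here are assumptions about the immutable string's interface, not an existing API):

    #include <string>

    template <class String>
    char const* c_str(String const& s, std::string& storage) {
        // If the bytes already sit in one zero-terminated block,
        // just hand them out -- this is the no-op case.
        if (s.is_contiguous() && s.is_zero_terminated())
            return s.data();
        // Otherwise linearize once into caller-provided storage
        // instead of copying through a temporary std::string.
        storage.assign(s.begin(), s.end());
        return storage.c_str();
    }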
Right. This is Boost anyway, and I've always viewed libraries that get proposed to and accepted into Boost as the kinds of libraries that are developed to eventually be made part of the C++ standard library.
So while out of the gate the string implementation can very well be not called std::string, I don't see why the current std::string can't be deprecated later on (look at std::auto_ptr) and a different implementation be put in its place? :D Of course that may very well be C++21xx so I don't think I need to worry about it having to be a std::string killer in the outset. ;)
If you pull this off (replacing std::string without a transition period with a backward-compatible interface) then you will be my personal hero. :-) Wait... provided that the encoding-related stuff I said above will be part of the string :) or there will be some wrapper around it providing that functionality.

Matus