
On Thu, Jan 27, 2011 at 12:43 AM, Matus Chochlik <chochlik@gmail.com> wrote:
On Wed, Jan 26, 2011 at 5:06 PM, Dean Michael Berris <mikhailberis@gmail.com> wrote:
On Wed, Jan 26, 2011 at 11:19 PM, Matus Chochlik <chochlik@gmail.com> wrote:

[snip/]
Right, but others seem to want to know about the implementation details to try and work out whether the overall interface being designed is actually going to be a viable implementation. So while I say "value semantics" others have asked how that would be implemented and -- being the gratuitous typer that I am ;) -- I would respond. :D
OK :)
:D
So what would be the point of implementing a string "wrapper" that knew its encoding as part of the type if you didn't want to know the encoding in most of the cases? I think I'm missing the logic there.
The logic would be that you no longer would have to be concerned whether Á (A with acute), etc., is encoded as ISO-8859-2 says, or as UTF-8 says, etc. You would *always* handle the string as a sequence of *Unicode* code points, or even "logical characters", and not as a sequence of bytes that happen to be encoded somehow (generally). I can imagine use cases where it would still be OK to get the underlying byte sequence (read-only) for things that are encoding-independent.
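A tiny illustration of the point, using only std::string and hard-coded byte values (the encoding-aware wrapper itself is not shown here): the same logical character has different byte sequences under ISO-8859-2 and UTF-8, and that is exactly the detail such a type would hide.

    #include <iostream>
    #include <string>

    int main() {
        // 'Á' (LATIN CAPITAL LETTER A WITH ACUTE, code point U+00C1)
        std::string iso_8859_2_bytes = "\xC1";      // one byte in ISO-8859-2
        std::string utf8_bytes       = "\xC3\x81";  // two bytes in UTF-8
        // The byte sequences differ, but both denote the same single code
        // point; an encoding-aware text type would present each as one
        // logical character rather than as raw bytes.
        std::cout << iso_8859_2_bytes.size() << " vs "
                  << utf8_bytes.size() << " bytes\n";  // prints: 1 vs 2
    }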
So really this wrapper is the 'view' that I talk about that carries with it an encoding and the underlying data. Right?
So we're obviously talking about two different strings here -- your "text" that knows the encoding and the immutable string that you may or may not build upon. How then do you design the algorithms if you *didn't* want to explicitly specify the encoding you want the algorithms to use?
By saying that the *implicit* encoding is UTF-8 and that should I need to use another encoding I will treat it as a special case.
I don't see the value in requiring that it be part of the 'text', though. I could easily write something like:

    typedef view<utf8_encoded> utf8;

And have something like this be possible:

    utf8 u("The quick brown fox jumps over the lazy dog.");

Now, that's your default utf8-encoded view of the underlying string. Right?
Every time I do not specify an encoding, it is assumed by default to be UTF-8; i.e., when I'm reading text from a TCP connection or from a file, I expect that it is already UTF-8 encoded and would like the string (optionally or always) to validate that for me.
Hmmm... So then it's just a matter of using a type similar to what I pointed out above as the default then?
Then there are two cases: a) The default encoding of std::string, which depends upon std::locale, and the encoding of std::wstring, which is by default treated as UTF-16 on Windows and as UTF-32 on Linux, for example. For these I would love to have some simple means of saying to 'boost::text': give me your representation in the encoding that std::string is expected to use, or "build" yourself from the native encoding that std::string is supposed to be using; plus the same for wstring (a rough interface sketch follows after case (b) below).
b) Every other encoding. For example, if I really needed to convert my string to IBM CP850 because I want to send it to an old printer, then only in that case should I be required (obviously) to specify the encoding explicitly.
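A minimal sketch of the kind of interface case (a) seems to ask for; the names and the boost::text placeholder are purely hypothetical, not an existing API.

    #include <string>

    namespace boost { class text {}; }  // placeholder only, for illustration

    // Hypothetical conversions between the proposed text type and whatever
    // encoding std::string / std::wstring are conventionally expected to
    // carry on the platform: the std::locale narrow encoding for std::string;
    // UTF-16 wchar_t on Windows, UTF-32 wchar_t on Linux for std::wstring.
    std::string  to_native(boost::text const& t);
    std::wstring to_native_wide(boost::text const& t);
    boost::text  from_native(std::string const& s);
    boost::text  from_native_wide(std::wstring const& s);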
I don't see why the default and the other encoding case are really that different from an interface perspective. The underlying string will still be a series of bytes in memory, and encoding is just a matter of viewing it a given way. Right?
In one of the previous messages I laid out an algorithm template like so:
    template <class String>
    void foo(String s) {
        view<encoding> encoded(s);
        // deal with encoded from here on out
    }
Of course, from foo's user perspective, she wouldn't have to do anything with her string to be passed in. From the algorithm implementer's perspective, you would know exactly what encoding was wanted and how to go about implementing the algorithm, potentially even having something like this as well:
    template <class Encoding>
    void foo(view<Encoding> encoded) {
        // deal with the encoded string appropriately here
    }
And you get the benefits in either case of being able to either explicitly or implicitly deal with strings depending on whether they have been explicitly encoded already or whether it's just a raw set of bytes.
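Putting the two templates above together, here is a self-contained sketch of how both calling styles could coexist; the view type and the utf8_encoding tag are stand-ins invented for the example, not a real interface.

    #include <string>

    struct utf8_encoding {};  // illustrative encoding tag

    template <class Encoding>
    struct view {
        explicit view(std::string const& s) : bytes(s) {}
        std::string bytes;  // stand-in for the underlying immutable string
    };

    // The algorithm imposes the encoding it wants internally.
    template <class String>
    void foo(String const& s) {
        view<utf8_encoding> encoded(s);
        // deal with encoded from here on out
    }

    // The algorithm accepts an already-encoded view.
    template <class Encoding>
    void foo(view<Encoding> const& encoded) {
        // deal with the encoded string appropriately here
    }

    int main() {
        std::string raw = "The quick brown fox";
        foo(raw);                        // implicit: foo chooses the encoding
        foo(view<utf8_encoding>(raw));   // explicit: caller encoded it already
    }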
I see that this is OK for many use cases. But having a single, pre-defined default encoding also has its advantages, because usually you can skip the whole view<Encoding> part.
So if `typedef view<Encoding> utf8` were there, how far would that be from the default-encoding case? And why does it have to be UTF in particular, for that matter?
So, if there was a way to "encode" (there's that word again) the data in an immutable string into an acceptably-rendered `char const *`, would that solve the problem? The whole point of my assertion (and Dave's question) is whether c_str() would have to be intrinsic to the string, which, as I pointed out in a different message (not too long ago), could very well be an external algorithm.
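As a rough illustration of c_str() as an external algorithm rather than a member; the segmented representation below is made up purely for the sketch.

    #include <cstddef>
    #include <string>
    #include <vector>

    // Made-up stand-in for an immutable string stored as several chunks.
    struct immutable_string {
        std::vector<std::string> chunks;
    };

    // Free-function c_str(): linearizes into a caller-provided buffer and
    // returns a pointer to NUL-terminated data. A real implementation could
    // cache the result, or hand back the internal buffer directly when the
    // data already happens to be contiguous.
    char const* c_str(immutable_string const& s, std::string& buffer) {
        buffer.clear();
        for (std::size_t i = 0; i < s.chunks.size(); ++i)
            buffer += s.chunks[i];
        return buffer.c_str();
    }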
Generally speaking, the syntax is not that important to me; I can get used to almost everything :) so c_str(my_str) is OK with me, as long as it does not just copy the string regardless of the internal representation. As Robert said, if the internal string data is already contiguous then this should be a no-op.
    boost::string s = get_huge_string();
    s = s ^ get_another_huge_string();
    s = s ^ get_yet_another_huge_string();
    std::string(s).c_str()
is too inefficient for my taste.
Why is it inefficient when there's no need for an actual copy to be involved? s ^ get_huge_string() would basically yield a lazily composed concatenation which could just hold references to the original strings (again, with potential for optimizations depending on the length of the strings, etc.). So you can layer that up and only need to linearize it when it's actually required -- in the conversion to std::string in this case. And if you really wanted to just linearize the string into a void * buffer somewhere, that should be perfectly fine as well. I guess assuming that actual temporaries get built when concatenating strings (like std::string would have you believe) makes it look really inefficient, but there should be a way of making it more efficient *because* the string is immutable.
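A toy sketch of that lazily composed concatenation, using a hand-rolled lazy_string with std::shared_ptr for brevity; this is not the proposed library, it only shows that operator^ can hold references to its operands and defer any copying until the result is linearized.

    #include <iostream>
    #include <memory>
    #include <string>

    class lazy_string {
    public:
        lazy_string(std::string s)
            : leaf_(std::make_shared<std::string>(std::move(s))) {}

        // Concatenation builds a node that merely references its operands;
        // no character data is copied here.
        friend lazy_string operator^(lazy_string const& a, lazy_string const& b) {
            lazy_string r;
            r.left_  = std::make_shared<lazy_string>(a);
            r.right_ = std::make_shared<lazy_string>(b);
            return r;
        }

        // Linearize on demand, e.g. when a contiguous std::string is needed.
        std::string str() const {
            if (leaf_) return *leaf_;
            return left_->str() + right_->str();
        }

    private:
        lazy_string() {}
        std::shared_ptr<std::string const> leaf_;
        std::shared_ptr<lazy_string const> left_, right_;
    };

    int main() {
        lazy_string s = lazy_string("huge ") ^ lazy_string("another huge ")
                                             ^ lazy_string("yet another huge");
        std::cout << s.str() << '\n';  // the pieces are copied only here
    }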
Right. This is Boost anyway, and I've always viewed libraries that get proposed to and accepted into Boost as the kinds of libraries that are developed to eventually be made part of the C++ standard library.
So while out of the gate the string implementation may very well not be called std::string, I don't see why the current std::string can't be deprecated later on (look at std::auto_ptr) and a different implementation put in its place? :D Of course that may very well be C++21xx, so I don't think I need to worry about it having to be a std::string killer at the outset. ;)
If you pull this off (replacing std::string without having a transition period with a backward compatible interface) then you will be my personal hero. :-)
Well, don't hold your breath for that because, well, you won't have 'erase' and other things that std::string supports, so it won't be backward compatible with std::string. :)
Wait... provided that the encoding-related stuff I described above will be part of the string :) or that there will be some wrapper around it providing that functionality.
    typedef view<utf8_encoding> utf8;

I don't see why that shouldn't work for your requirements. :)

--
Dean Michael Berris
about.me/deanberris