
On Wed, Jan 26, 2011 at 5:06 PM, Dean Michael Berris <mikhailberis@gmail.com> wrote:
On Wed, Jan 26, 2011 at 11:19 PM, Matus Chochlik <chochlik@gmail.com> wrote: [snip/]
Right, but others seem to want to know about the implementation details to try and work out whether the overall interface being designed is actually going to be a viable implementation. So while I say "value semantics" others have asked how that would be implemented and -- being the gratuitous typer that I am ;) -- I would respond. :D
OK :)
I still don't understand this though. What does encoding have to do with the string? Isn't encoding a separate process?
Hm, my ability to express myself obviously totally su*ks :) You are completely right that the encoding is a separate process, and I'm saying that I want it to be *completely* hidden from my sight, unless it is absolutely necessary for me to be concerned about it :-)
So what would be the point of implementing a string "wrapper" that knew its encoding as part of the type if you didn't want to know the encoding in most of the cases? I think I'm missing the logic there.
The logic would be that you would no longer have to be concerned whether Á (A with acute), etc., is encoded as ISO-8859-2 says or as UTF-8 says. You would *always* handle the string as a sequence of *Unicode* code points, or even "logical characters", and not as a sequence of bytes that are somehow encoded (generally). I can imagine use cases where it would still be OK to get the underlying byte sequence (read-only) for things that are encoding-independent.
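Just to make the byte-level difference concrete, here is a minimal standalone illustration (the byte values are the standard ones for these two encodings):

    #include <cstdio>

    int main() {
        // The same logical character, U+00C1 (A with acute), as raw bytes:
        unsigned char const latin2[] = { 0xC1 };        // ISO-8859-2: one byte
        unsigned char const utf8[]   = { 0xC3, 0x81 };  // UTF-8: two bytes
        // Byte-oriented code sees one vs. two "characters" here; a
        // code-point-oriented string would present both as a single
        // code point and hide the difference entirely.
        std::printf("%zu byte(s) vs. %zu bytes\n", sizeof(latin2), sizeof(utf8));
        return 0;
    }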
The means for this would be: let us build a string that may (or may not) be based on your general (encoding-agnostic) string, and that handles the transcoding in most cases without me ever looking at the underlying byte sequence, and without functors that require me to specify explicitly, *every time*, which encoding I want. By default I want UTF-8; if I talk to the OS I say I want the string in the encoding that the OS expects, not that I want it in UTF-16, ISO-8859-2, KOI8-R, etc. If and only if I want to handle the string in an encoding other than Unicode should I have to specify that explicitly.
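Something like this interface sketch is what I have in mind. To be clear, none of these names exist anywhere; this is purely a hypothetical illustration of the behavior:

    #include <string>

    class encoding; // opaque handle for "some other encoding" (hypothetical)

    // Hypothetical interface sketch -- not a real API, it only pins
    // down the behavior described above.
    class text {
    public:
        // bytes are assumed to be UTF-8 and validated on construction
        explicit text(char const* utf8_bytes);

        // the common case: no encoding mentioned anywhere
        text& operator+=(text const& other);

        // the OS boundary: "whatever the platform expects", never a
        // hard-coded UTF-16 / UTF-32 / ISO-8859-x
        std::string  to_native_narrow() const;
        std::wstring to_native_wide() const;

        // the escape hatch for everything else (e.g. an old printer)
        std::string to_bytes(encoding const& enc) const;
    };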
So we're obviously talking about two different strings here -- your "text" that knows the encoding and the immutable string that you may or may not build upon. How then do you design the algorithms if you *didn't* want to explicitly specify the encoding you want the algorithms to use?
By saying that the *implicit* encoding is UTF-8 and that, should I need to use another encoding, I will treat it as a special case. Whenever I do not specify an encoding it is assumed by default to be UTF-8, i.e. when I'm reading text from a TCP connection or from a file I expect that it already is UTF-8 encoded, and I would like the string (optionally or always) to validate it for me. Then there are two cases:

a) The default encoding of std::string, which depends upon std::locale, and the encoding of std::wstring, which is for example on Windows by default treated as UTF-16 and on Linux as UTF-32. For these I would love to have some simple means of saying to 'boost::text': give me your representation in the encoding that std::string is expected to be encoded in, or "build" yourself from the native encoding that std::string is supposed to be using; plus the same for std::wstring.

b) Every other encoding. For example, if I really needed to convert my string to IBM CP850 because I want to send it to an old printer, then only in this case should I be required (obviously) to specify the encoding explicitly.
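In code, again purely hypothetical and building on the sketch from my previous mail, the two cases would look something like:

    // Hypothetical usage only; 'text' and 'encoding' are the sketched
    // types from before, not a real API.
    text t("Matu\xC5\xA1");                 // UTF-8 in, validated

    // a) native std::string / std::wstring: the platform's encoding is
    //    implied, I never spell out UTF-16/UTF-32/locale charset
    std::string  narrow = t.to_native_narrow();
    std::wstring wide   = t.to_native_wide();

    // b) anything else is the explicit special case
    std::string printer = t.to_bytes(encoding("IBM CP850"));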
In one of the previous messages I laid out an algorithm template like so:
    template <class String>
    void foo(String s) {
        view<encoding> encoded(s);
        // deal with encoded from here on out
    }
Of course then, from foo's user's perspective, she wouldn't have to do anything with her string to be passed in. From the algorithm implementer's perspective you would know exactly what encoding was wanted and how to go about implementing the algorithm, potentially even having something like this as well:
    template <class Encoding>
    void foo(view<Encoding> encoded) {
        // deal with the encoded string appropriately here
    }
And you get the benefits in either case of being able to either explicitly or implicitly deal with strings depending on whether they have been explicitly encoded already or whether it's just a raw set of bytes.
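To make the two call styles concrete (sketch only; 'view', 'utf8_encoding' and 'boost::string' here are the hypothetical types from the templates above):

    boost::string raw = get_bytes_from_somewhere(); // just bytes, no encoding yet

    foo(raw);                     // implicit: foo itself picks the encoding
                                  // it needs via view<encoding>(s)

    view<utf8_encoding> v(raw);   // explicit: the caller fixes the
    foo(v);                       //           encoding up front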
I see that this is OK for many use cases. But having a single pre-defined default encoding also has its advantages, because usually you can skip the whole view<Encoding> part.
[snip/]
This is a different matter. Again I may be wrong, but I live under the impression that RangeEx has been implemented to hide the ugliness of complex STL iterator-based algorithms.
Of course the proof will be in the pudding. ;)
I think we need to qualify what you refer to as APIs. If just judging from the amount of code that's written against Qt or MFC for example then I'd say "they're pretty well accepted". If you look at the libraries that use ICU as a backend I'd say we already have one in Boost called Boost.Regex. And there's all these other libraries in the Linux arena that have their own little niche to play in the Unicode game -- there's Glib, the GNOME and KDE libraries, ad nauseam.
Besides what you mentioned, an API for me is for example the WINAPI, POSIX API, OpenGL API, OpenSSL API, etc. -- basically all the functions "exported" by the various C/C++ libraries that I cannot imagine my life without :) and which expect not a generic iterator range or a view or whatnot, but a plain and simple pointer (const char*) to a contiguous block in memory containing a zero-terminated C string, or, if we are luckier, a std::string.
So, if there was a way to "encode" (there's that word again) the data in an immutable string into an acceptably-rendered `char const *`, would that solve the problem? The whole point of my assertion (and Dave's question) is whether c_str() would have to be intrinsic to the string; I have pointed out in a different message (not too long ago) that it could very well be an external algorithm.
Generally speaking, the syntax is not that important for me; I can get used to almost everything :) so c_str(my_str) is OK with me, provided it does not boil down to just copying the string whatever the internal representation is. As Robert said, if the internal string data already is contiguous then this should be a no-op.

    boost::string s = get_huge_string();
    s = s ^ get_another_huge_string();
    s = s ^ get_yet_another_huge_string();
    std::string(s).c_str();

This is too inefficient for my taste.
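Something along these lines would be fine with me (sketch only; the member functions queried here are assumptions about the immutable string's interface, not an existing API):

    #include <string>

    template <class String>
    char const* c_str(String const& s, std::string& storage) {
        // If the bytes already sit in one zero-terminated block,
        // just hand them out -- this is the no-op case.
        if (s.is_contiguous() && s.is_zero_terminated())
            return s.data();
        // Otherwise linearize once into caller-provided storage
        // instead of copying through a temporary std::string.
        storage.assign(s.begin(), s.end());
        return storage.c_str();
    }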
Right. This is Boost anyway, and I've always viewed libraries that get proposed to and accepted into Boost as the kinds of libraries that are developed to eventually be made part of the C++ standard library.
So while out of the gate the string implementation can very well be not called std::string, I don't see why the current std::string can't be deprecated later on (look at std::auto_ptr) and a different implementation be put in its place? :D Of course that may very well be C++21xx so I don't think I need to worry about it having to be a std::string killer in the outset. ;)
If you pull this off (replacing std::string without a transition period with a backward-compatible interface) then you will be my personal hero. :-) Wait... provided that the encoding-related stuff I said above will be part of the string :) or there will be some wrapper around it providing that functionality.

Matus