
On Fri, Jan 28, 2011 at 10:31 PM, Dean Michael Berris <mikhailberis@gmail.com> wrote:
On Sat, Jan 29, 2011 at 5:13 AM, Matus Chochlik <chochlik@gmail.com> wrote:
On Fri, Jan 28, 2011 at 9:46 PM, Dean Michael Berris
The whole discussion started because we need UTF-8 in strings, and now we are back to the beginning?
No, the discussion started because we need a UTF-8 view of data. You missed the point I was making. And you didn't understand the document I wrote.
Sorry, but no. The discussion started with the proposal that we should by default treat std::strings as if they were UTF-8 encoded. Artyom should know because he was the one who made the original proposal. The whole 'view' idea was brought up only much later.
And the point I was making was that doing precisely this is the "wrong" way of doing it. Assuming a default encoding is "unnecessary", as an encoding is ultimately a matter of how the data is interpreted.
I was attempting to solve the problem that is std::string. In the process I'm moving the issue away from the underlying data and making it a matter of interpretation. To do that in a way that makes sense as I see it means moving it into a view of the data that is held in a string. The string would be the data structure, the view an interpretation of it.
I never precluded that the string can hold UTF-8 encoded data, but saying that it is the default achieves nothing and is ultimately unnecessary. In the design I've been proposing, the point is that interpreting data in a given encoding is separate from how the data is actually stored. Now let's say you have a UTF-8 string builder: what else would that write in memory aside from UTF-8 encoded data? It will still yield a string, though, which could be interpreted many different ways -- I just don't see the encoding as something intrinsic to the string. That means a string can hold UTF-8 encoded data and I can wrap that in a view for UTF-16 and see that it will not validate correctly -- unless I wrap the string with a view for UTF-8 first, then pass that into a view for UTF-16, and transcoding can happen on the fly.
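As a rough illustration of the string/view split described above, here is a minimal sketch. Every name in it (utf8_encoding_tag, utf16le_encoding_tag, validate, view, sample_bytes, demo) is an assumption made for this example, not an existing Boost or standard interface, and the chained UTF-8 to UTF-16 transcoding is left out. It only shows that the same stored bytes can pass validation under one interpretation and fail under another, for this particular input.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical tags naming an interpretation; they carry no data.
struct utf8_encoding_tag {};
struct utf16le_encoding_tag {};

inline unsigned byte_at(const std::string& s, std::size_t i) {
    return static_cast<unsigned char>(s[i]);
}

// UTF-8 check: lead/continuation structure, no overlong forms,
// no surrogates, nothing above U+10FFFF.
inline bool validate(const std::string& s, utf8_encoding_tag) {
    const std::size_t n = s.size();
    for (std::size_t i = 0; i < n;) {
        const unsigned b = byte_at(s, i);
        std::size_t len;
        unsigned long cp, min;
        if (b < 0x80)                { len = 1; cp = b;        min = 0; }
        else if ((b & 0xE0) == 0xC0) { len = 2; cp = b & 0x1F; min = 0x80; }
        else if ((b & 0xF0) == 0xE0) { len = 3; cp = b & 0x0F; min = 0x800; }
        else if ((b & 0xF8) == 0xF0) { len = 4; cp = b & 0x07; min = 0x10000; }
        else return false;              // stray continuation or invalid lead byte
        if (i + len > n) return false;  // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned c = byte_at(s, i + k);
            if ((c & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return false;
        i += len;
    }
    return true;
}

// UTF-16 (little-endian) check: whole code units, surrogates properly paired.
inline bool validate(const std::string& s, utf16le_encoding_tag) {
    const std::size_t n = s.size();
    if (n % 2 != 0) return false;
    for (std::size_t i = 0; i < n; i += 2) {
        const unsigned unit = byte_at(s, i) | (byte_at(s, i + 1) << 8);
        if (unit >= 0xD800 && unit <= 0xDBFF) {          // high surrogate
            if (i + 3 >= n) return false;                // nothing follows it
            const unsigned next = byte_at(s, i + 2) | (byte_at(s, i + 3) << 8);
            if (next < 0xDC00 || next > 0xDFFF) return false;
            i += 2;                                      // skip the low surrogate
        } else if (unit >= 0xDC00 && unit <= 0xDFFF) {
            return false;                                // unpaired low surrogate
        }
    }
    return true;
}

// The view stores no bytes of its own; it only interprets someone else's.
template <class Encoding>
class view {
public:
    explicit view(const std::string& data) : data_(&data) {}
    bool valid() const { return validate(*data_, Encoding()); }
    const std::string& data() const { return *data_; }
private:
    const std::string* data_;
};

// 'a', U+0627 (encoded as D8 A7), 'z' stored as UTF-8.
inline std::string sample_bytes() { return std::string("a\xD8\xA7z"); }

inline void demo() {
    const std::string bytes = sample_bytes();
    assert(view<utf8_encoding_tag>(bytes).valid());      // valid as UTF-8
    assert(!view<utf16le_encoding_tag>(bytes).valid());  // same bytes, invalid as UTF-16LE
}
```

Note that the string itself never changes; only the tag passed to view does. Other byte sequences can happen to validate under both interpretations, so validation alone does not identify an encoding.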
Writing algorithms that deal with strings is different from writing algorithms that deal with encoded text. Those are two different levels.
Explaining the whole point of the matter again and again makes me sound like a broken record. If you still don't get what I'm saying then I guess I'm going to have to try a different route and just show what I mean in terms of code at some point.
Dean, believe me, I got what you said the first time you said it, like 200 posts ago. I know that the string data is ultimately stored in memory as a sequence of bytes. But then you proposed to solve my problem by suggesting the view<Encoding> template. Then, like 50 posts ago, we finally agreed on typedef-ing and naming it 'text', since using something called view<encoding_tag> is not acceptable for me.

Now, if this typedef view<utf8_encoding_tag> text; is the only line of code where I see the encoding, and I'll be able to do all the text handling, i.e. searching for code points/characters (not only bytes), searching for words, concatenation, splitting, writing it into a file, socket, etc. and reading it from a file, socket, etc., using it with some c_str-like adapter with C APIs, etc., basically doing (nearly) everything that I was able to do with std::string *without* ever mentioning the encoding again, then you already have me convinced. If I cannot do those things without specifying the encoding (unless necessary), then this is useless for me for text handling.

Peace, Love, Best regards,
Matus
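The requirement above, that the typedef be the only place where the encoding is named, could look roughly like the following sketch. It is purely illustrative: the view interface here (length, find, operator+, c_str), the decode helper, npos, and greeting are assumptions invented for this example rather than the design proposed in the thread, and decode assumes its input is already well-formed UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <utility>
#include <vector>

struct utf8_encoding_tag {};  // hypothetical tag

const std::size_t npos = static_cast<std::size_t>(-1);

// Decode UTF-8 into code points. Assumes well-formed input; no validation.
inline std::vector<char32_t> decode(const std::string& s, utf8_encoding_tag) {
    std::vector<char32_t> out;
    for (std::size_t i = 0; i < s.size();) {
        const unsigned char b = static_cast<unsigned char>(s[i]);
        const std::size_t len = b < 0x80 ? 1
                              : (b & 0xE0) == 0xC0 ? 2
                              : (b & 0xF0) == 0xE0 ? 3 : 4;
        // Keep only the payload bits of the lead byte.
        char32_t cp = (len == 1) ? b : (b & (0xFF >> (len + 1)));
        for (std::size_t k = 1; k < len && i + k < s.size(); ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3F);
        out.push_back(cp);
        i += len;
    }
    return out;
}

template <class Encoding>
class view {
public:
    explicit view(std::string data) : data_(std::move(data)) {}

    // Length in code points, not bytes.
    std::size_t length() const { return decode(data_, Encoding()).size(); }

    // Code-point index of the first occurrence of cp, or npos.
    std::size_t find(char32_t cp) const {
        const std::vector<char32_t> cps = decode(data_, Encoding());
        for (std::size_t i = 0; i < cps.size(); ++i)
            if (cps[i] == cp) return i;
        return npos;
    }

    // Concatenating two views of the same encoding is plain byte concatenation.
    friend view operator+(const view& a, const view& b) {
        return view(a.data_ + b.data_);
    }

    // c_str-like adapter for C APIs; exposes the stored bytes unchanged.
    const char* c_str() const { return data_.c_str(); }

private:
    std::string data_;
};

// The only line that names the encoding.
typedef view<utf8_encoding_tag> text;

// Client code from here on never mentions UTF-8.
inline text greeting() {
    return text("Gr\xC3\xBC\xC3\x9F") + text(" dich");
}
```

With this, greeting().length() counts code points while std::strlen(greeting().c_str()) counts stored bytes, and greeting().find(0xDF) returns a code-point index. Word search, splitting, and I/O would need similar members and are not sketched here.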