
On Sat, Jan 29, 2011 at 5:24 PM, Artyom <artyomtnk@yahoo.com> wrote:
From: Dean Michael Berris <mikhailberis@gmail.com>
On Sat, Jan 29, 2011 at 3:02 PM, Artyom <artyomtnk@yahoo.com> wrote:
It would turn away 90% of users.
It might turn you away because you obviously love std::string. Generalizing is a different matter and is largely a hot-air blowing exercise that is futile for convincing anybody.
I would say it more clearly:
1. All users that use C libraries and need c_str() at boundaries. And this is a huge number of users who need to communicate with modules that are already working and ready, but written in C.
And this is about half of the libraries out there; C is the lowest-level API that allows easy bindings to all languages.
But c_str() doesn't have to be part of the string's interface.
2. All users of GUI toolkits like GTK, Qt, wxWidgets, MFC, as they require conversion of "boost::what_ever_is_it_called" to QString, ustring, wxString, CString, and it is done via a C string.
So, what was the point again?
3. All users who actually use operating system APIs that use C strings and require char *.
Plenty, isn't it?
I know, so what is your point?
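For the record, the boundary conversions in points 1-3 amount to something like this - a minimal sketch, assuming the application keeps its text as UTF-8 in std::string, with legacy_c_api standing in for any C function:

#include <string>

extern "C" void legacy_c_api(const char *text);   // hypothetical C-side function

void hand_over(const std::string &utf8_text)
{
    // C boundary: c_str() yields the NUL-terminated char* the C side wants.
    legacy_c_api(utf8_text.c_str());

    // A GUI boundary (Qt, headers not pulled in here) looks much the same:
    //   QString qt_text = QString::fromUtf8(utf8_text.c_str());
}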
Please take a look at frequent cases of string usage and you'll see how often you indeed need a rope-like structure and how often a normal string.
Don't forget that almost all string implementations in all languages are contiguous single memory chunks.
So what if all other string implementations in all languages are contiguous? Does that mean that's the *only* way to do it? Look, in my paper -- if you read and *understood* it -- I pointed out that linearizing a string is an algorithm that deals with a string. Much like how std::copy is an algorithm that is external to a container, I see linearization as something not part of the string interface. That was towards the end part. I never said that a string shouldn't be linearizable.
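A minimal sketch of what I mean by that - linearization as a free algorithm over a hypothetical segmented string type (the size() and chunks() interface is made up for illustration):

#include <string>
#include <vector>

// Works for any segmented string exposing size() and chunks(), where each
// chunk is a contiguous range of char. Both are assumptions for this sketch.
template <class SegmentedString>
std::vector<char> linearize(const SegmentedString &s)
{
    std::vector<char> flat;
    flat.reserve(s.size() + 1);
    for (const auto &chunk : s.chunks())
        flat.insert(flat.end(), chunk.begin(), chunk.end());
    flat.push_back('\0');        // so the result can cross a C boundary
    return flat;
}
// Usage (hypothetical): auto buf = linearize(str); legacy_c_api(buf.data());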
2. In such a case it would be even better to have non-shared strings
Weh?
Because of memory locality; think of parts of the string referencing "other memory".
Memory locality is solved by making it available to the cache. If you have a contiguous chunk of 4kb *that never ever changes* then accessing that memory from all the cores in a NUMA machine is largely a matter of the cache reading part of that and making it available. Making copies of the string is *unnecessarily wasteful*.
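A rough sketch of sharing an immutable chunk by reference count instead of copying it - the names here are mine, for illustration only:

#include <cstddef>
#include <memory>
#include <string>

using immutable_chunk = std::shared_ptr<const std::string>;

immutable_chunk make_chunk(std::string text)
{
    // The data is written once here and never mutated afterwards.
    return std::make_shared<const std::string>(std::move(text));
}

std::size_t reader(const immutable_chunk &chunk)
{
    // Any number of threads/cores can call this concurrently: the chunk is
    // read-only, so it is shared through the caches instead of being copied.
    return chunk->size();
}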
I beg your pardon? It is efficient, as all functions are as efficient as memcpy, with the exception of when overflow/underflow happens, which requires some virtual function calls that are pretty fast as well...
Also, 99% of issues are simply solved with reserve() (and I work with text parsing, combining and processing a lot).
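(For context, the reserve() pattern being referred to is roughly this - a minimal sketch:)

#include <string>
#include <vector>

std::string join(const std::vector<std::string> &parts)
{
    std::size_t total = 0;
    for (const auto &p : parts)
        total += p.size();

    std::string out;
    out.reserve(total);    // one allocation up front...
    for (const auto &p : parts)
        out += p;          // ...so the appends never reallocate
    return out;
}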
And you obviously don't work with systems that have to do this many thousands of times per second, or you would know what the effects of NUMA are and why allocating a contiguous chunk of memory is the performance killer that it is.
I know, but I hadn't suggested that streambuf should use a single memory chunk.
So then what's the point of making strings use a single contiguous memory chunk if it's not necessary?
This article was written from a wrong understanding of the real problems - instead of solving a problem it suggests some idea for some cases without looking at the problem as a whole.
[snip]
The article was written from the understanding that the real problem stems from how std::string is broken. It already identifies why it's broken. It seems that you're just happy to attack people and the work they do more than you are interested in solving problems.
If you disagree with what's being said argue on the merits of "why". Mud-slinging and sitting on a high horse and just saying "blech, you're wrong" is not helping solve any technical problems.
I'm sorry, but I think that the much more real problems are these:
- My father-in-law can't use Thunderbird because he defined a non-ASCII user name and Thunderbird fails to open the profile. So he needs to create a new account, because half of the other programs are broken when Unicode paths are used!
So fix Thunderbird.
- That Acrobat Reader can't open files with Unicode file names that the user has (at least that was the case the last time I tried it).
So go work for Adobe and fix Acrobat Reader.
- That you can't write cross-platform Unicode-aware code using a simple std::string/char *, or whatever other encoding-unaware string is there.
You can, there are already libraries for that sort of thing if you insist on using std::string.
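One example is Boost.Locale, which works over UTF-8 held in a plain std::string - a minimal sketch, assuming a locale backend that supports "en_US.UTF-8":

#include <boost/locale.hpp>
#include <iostream>
#include <string>

int main()
{
    boost::locale::generator gen;
    std::locale loc = gen("en_US.UTF-8");

    std::string text = u8"grüße";   // UTF-8 kept in a plain std::string
    // Locale-aware, full-Unicode upper-casing of a std::string:
    std::cout << boost::locale::to_upper(text, loc) << "\n";
}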
What you are doing is a classical example of micro-optimization that is concentrated on string storage.
No. If you really did read the document and understood it, I was talking at a high level about how to fix the problem by going through the rationale for immutable strings. The storage issue is a necessary component of the implementation efficiency concerns. If you design something without thinking about the efficiency of the solution, then you're doing art not engineering. I'm not an artist and I sure want to think I'm an engineer by training and by trade. Notice that I haven't mentioned any micro-optimizations or micro-benchmarks in the document as well so I don't know where you're coming from when you say I'm micro-optimizing anything.
Note: all the strings in the toolkits around don't do much beyond what std::string does in terms of storage (QString, ustring, UnicodeString, wxString); they all use the same storage model, some immutable, which I can accept, but all:
1. Are Unicode aware
2. Use a single memory chunk
There is a very good reason for this, but it seems that you just don't get why all string designs revolve around this same principle.
I do know why it's designed the same way: because that's the naive thing to do. Someone thought "oh well, we can malloc a chunk of memory and put a \0 in the end and call that a string". That worked up to a certain level, and then when people started to look at a better way of doing things, they saw that this isn't enough. Notice how the applications you mention use a segmented data structure when dealing with things like edit buffers or similar things. Strings are largely reserved for "short" data and anytime you need anything "longer" you'd use something else -- the question I'm trying to address is why can't you use one data structure that will be efficient for both cases? That's the point of the title which aims to come up with a singular way of explaining how strings should be so that they're suitable for short and long "strings of characters".
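A very rough sketch of the kind of structure I'm talking about - all names are hypothetical and this is not the proposal's actual interface, just the shape of the idea:

#include <cstddef>
#include <memory>
#include <string>
#include <vector>

class chunked_string {
    // Immutable blocks: short strings use one block, long ones use many.
    std::vector<std::shared_ptr<const std::string>> blocks_;
public:
    chunked_string() = default;
    explicit chunked_string(std::string s)
    {
        blocks_.push_back(std::make_shared<const std::string>(std::move(s)));
    }

    // Concatenation shares the existing blocks and copies no character data.
    friend chunked_string concat(const chunked_string &a, const chunked_string &b)
    {
        chunked_string r;
        r.blocks_ = a.blocks_;
        r.blocks_.insert(r.blocks_.end(), b.blocks_.begin(), b.blocks_.end());
        return r;
    }

    std::size_t size() const
    {
        std::size_t n = 0;
        for (const auto &b : blocks_)
            n += b->size();
        return n;
    }
};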
Take a look at these fundamental operations you had written:
- Concatenation (generally ok)
- Substring - should be Unicode aware in most cases
- Filtration - should be Unicode and Locale aware
- Tokenization - should be Unicode and Locale aware
- Search/Pattern Matching - should be Unicode and Locale aware.
So please, if you don't understand why these are the fundamental operations and why the string should relate to encoding, then you need to reread this thread.
But these are *algorithms* that should be aware of the encoding, *not the string*. If you don't understand that point then you need to read the document *again*. The point is that if you view a string a given way, then that's largely an implementation of the view. I haven't gotten to the explanation of the view but I hinted that interpretation is a matter of composition. So if you "wrap" a string and say that it should be interpreted one way, then that's the whole point of enforcing an encoding on the view of the string. The string itself doesn't *have* to be encoded a certain way. Now if strings are just values then the encoding in which they come is largely a matter of implementation. Think of an int -- you don't really know if it's big- or little-endian -- or a float -- whether it's IEEE xxx or yyy. All that matters though is how the operations on these strings are defined. The point of the abstraction is that you have one way of dealing with the string as a value and write algorithms around that abstraction.
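A minimal sketch of what I mean, with utf8_view as a purely hypothetical name - the bytes stay a plain value, and the view imposes the interpretation (only a code-point count is shown):

#include <cstddef>
#include <string>

struct utf8_view {
    const std::string &bytes;   // the underlying, encoding-agnostic value

    // An algorithm on the *view*: count code points by skipping UTF-8
    // continuation bytes (those of the form 10xxxxxx).
    std::size_t code_points() const
    {
        std::size_t n = 0;
        for (unsigned char c : bytes)
            if ((c & 0xC0) != 0x80)
                ++n;
        return n;
    }
};
// Usage: std::string raw = u8"héllo";
//        std::size_t n = utf8_view{raw}.code_points();   // 5, not 6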
A new string that does not solve any of the Unicode issues has no place - and this is the *real* problem.
You missed the point. It's not the string you want, it's the view of the string you want when you're talking about encoding.
Please don't write theories on C++ strings if you do not see what a string is - human-readable text that is much more complex than a set of byte chunks.
See, I defined what a string is in that document. If you don't agree with that definition then I can't help you. As much as humans want to think that computers see the world the same way, unfortunately that's not the case. A string is a data structure. How you view a string in a given encoding is a matter of algorithm. If you don't see that then I'm sorry for you.

--
Dean Michael Berris
about.me/deanberris