
From: Dean Michael Berris <mikhailberis@gmail.com>
On Sat, Jan 29, 2011 at 11:25 PM, Artyom <artyomtnk@yahoo.com> wrote:
From: Dean Michael Berris <mikhailberis@gmail.com> On Sat, Jan 29, 2011 at 8:06 PM, Artyom <artyomtnk@yahoo.com> wrote:
No, it's not obvious. Here's why:
fd = creat(file.c_str(), 0666);
What does c_str() here imply? It implies that there's a buffer somewhere, exposed as a `char const *`, which is either created on demand and returned, or already held internally by whatever 'file' is.
It implies that a const 'file' owns a const buffer holding a NUL-terminated string that can be passed to a "char const *" API.
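A minimal sketch of the ownership claim above, assuming C++11 or later and std::string as the type of 'file'. The helper name is hypothetical and is not taken from the thread; it only checks that the pointer returned by c_str() refers to a buffer owned by the string and terminated at position size().

```cpp
#include <string>

// Hypothetical helper (not from the thread): checks two properties of
// std::string::c_str() on an unmodified string, assuming C++11 or later:
//  - repeated calls return the same pointer (the string owns the buffer)
//  - the character at index size() is the terminating NUL
bool c_str_is_owned_and_terminated(std::string const& s) {
    char const* first = s.c_str();
    char const* second = s.c_str();
    return first == second && first[s.size()] == '\0';
}
```

The pointer stays valid only while the string is alive and no non-const member function is called on it.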
Yes, which is the problem in the first place. Every instance of string would then need to have that same buffer even if that string is just a temporary or worse just a copy.
It would not happen if it holds the data as a single linear chunk :-)
Now let's say 'file' is changed in a different thread: what happens to the buffer pointed to by the pointer returned by c_str()? Explain that to me, because *it is possible and it can happen in real code*.
I'm sorry, but a string, like anything else with value semantics, is:
- safe for "const" access from multiple threads
- safe for mutable access from a single thread
I don't see why a string should be different from any other value type like "int", because the following
x+=y + y
is not safe for an integer either.
The code I showed is **perfectly** safe as long as the string has value semantics (which std::string has).
Now the point I was making was that, in the case of a string that is immutable, you don't worry about the string changing *ever*, and you don't need a contiguous buffer where one isn't explicitly required. It's value semantics *plus* immutability.
So you should either forbid
your_string const &operator=(your_string const &)
which makes it even more useless (IMHO), or your statement is wrong, because assignment can happen, for example from another thread, like
str = str + " suffix"
and this would change the string at run time. Your statement is false unless I'm missing something, or you want to put a mutex or some other atomic variable inside the string. I don't see any reason not to treat a string like any other value.
I think both of us, and 95% of C++ programmers who use the STL, know what the semantics of std::string::c_str() are.
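The value-semantics point can be illustrated with a short single-threaded sketch, assuming std::string. The function name is hypothetical. It shows that assignment changes only the variable being assigned to, while a copy taken earlier keeps its old value.

```cpp
#include <string>

// Hypothetical illustration (not from the thread): with value semantics,
// reassigning 'str' leaves an earlier copy untouched.
bool copy_survives_reassignment() {
    std::string str = "name";
    std::string const snapshot = str;  // independent value, not an alias
    str = str + " suffix";             // changes 'str' only
    return snapshot == "name" && str == "name suffix";
}
```

This says nothing about concurrent writers: as with an int, unsynchronized mutation of the same object from two threads is still a data race.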
I like this better:
char * filename = (char *)calloc(256, sizeof(char)); // I know I want 255 characters max, plus the terminating NUL
if (filename == NULL) {
  // deal with the error here
}
linearize(substr(file, 0, 255), filename);
fd = creat(filename, 0666);
Sorry? Is this better than:
fd=create(filename.substr(0,256).c_str(),O_EXCL...)
Which, by the way, is 100% thread-safe as well (but may still throw). Though I can't see any reason to cut the string to 256 bytes before create.
No, not better because std::string's substr() will return a temporary, which means it will be a copy -- meaning another allocation and a call to memcpy(...). You don't get the benefit of COW on this one because you need to cut the string down to a "maximum size".
Actually, a small note: you will have to copy anyway, because the C API expects a NUL-terminated string, so you can't avoid a memory copy if the byte following the substring is not NUL. So there is no difference there.
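A small sketch of why the copy is hard to avoid, assuming std::string. The helper name is hypothetical. A prefix of a longer string has an ordinary character where the C API would need a NUL, so a terminated prefix has to live in its own buffer.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not from the thread): returns a separate string
// holding at most max_len characters of 'file'. substr() copies the
// characters, and the result's c_str() is NUL-terminated at the cut.
std::string bounded_copy(std::string const& file, std::size_t max_len) {
    return file.substr(0, max_len);
}
```

For a 300-character name cut to 255, the original still has a regular character at index 255, while the copy has its terminator there.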
I don't know if you know, but them C APIs from OSes have a defined maximum on the lengths of filenames and things like that...
Not exactly: the maximum length is something much more complicated, determined at run time and OS-specific, but that is another story.
And then the problem of the unnecessary contiguous buffer is still not addressed.
It is a good idea to have some non-linear data storage, but it should be used only in very specific cases.
Also, what is a really large string for you, one that would gain a performance advantage from not being stored linearly?
Talk to me in numbers.
Say on a machine+OS combo that has 4kb pages, a large string would be something that spans more than one memory page -- i.e. >4kb.
Now a "short" string is one that can fit within a page. For concatenated strings, it's fine to compact/copy the substrings into a growable/shrinkable block. The overhead for a short string would be constant, as the concatenation tree would be a pointer and a length with a reference count integer.
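A rough sketch of the kind of structure being described, under stated assumptions: the names Node, Rope, leaf and concat are hypothetical, std::shared_ptr stands in for the reference count, and linearize here returns a std::string rather than filling a caller-supplied buffer as in the earlier snippet. It is not the actual design proposed in the thread.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <utility>

// Hypothetical immutable node: either a leaf holding text, or a
// concatenation of two children. shared_ptr provides the reference count.
struct Node {
    std::string text;                   // used by leaves only
    std::shared_ptr<Node const> left;   // used by concatenation nodes
    std::shared_ptr<Node const> right;  // used by concatenation nodes
    std::size_t length;                 // total characters below this node
};

using Rope = std::shared_ptr<Node const>;

Rope leaf(std::string s) {
    std::size_t n = s.size();  // read the size before moving from s
    return std::make_shared<Node>(Node{std::move(s), nullptr, nullptr, n});
}

// Concatenation copies no characters: it stores two pointers and a length.
Rope concat(Rope a, Rope b) {
    std::size_t n = a->length + b->length;
    return std::make_shared<Node>(
        Node{std::string(), std::move(a), std::move(b), n});
}

// Appends the characters below r to out, left to right.
void append_to(Rope const& r, std::string& out) {
    if (!r) return;
    if (r->left) {
        append_to(r->left, out);
        append_to(r->right, out);
    } else {
        out += r->text;
    }
}

// Produces the contiguous form only when a caller actually needs it.
std::string linearize(Rope const& r) {
    std::string out;
    out.reserve(r ? r->length : 0);
    append_to(r, out);
    return out;
}
```

For example, concat(leaf("/tmp/"), leaf("file.txt")) has length 13, and the full path is materialized only when linearize is called.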
Actually, I meant benchmarks: how long should a string be before it becomes more efficient to store it as chunks?

So you suggest 4k, which:

1. Covers almost all full file path names on every operating system
2. Covers almost all messages in text dialogs
3. Covers all possible user data, etc.

So basically a single memory chunk is efficient for 99% of use cases.

Now let's take it to the extreme:

- Longest Wikipedia article: 388k = ~97 pages
- Book: War and Peace: text size... 3.1MB = ~800 pages.

How frequent is this case? Very rare.

See, text is basically something quite short; long books were successfully written in the days when 640K was more than enough.

So the real benefits of a non-linear data structure are rare; on the other hand, it adds too much complexity for everyday use.

So basically:

1. 99% of use cases fit in a single page.
2. Very few cases would actually make use of a multi-page architecture.
3. The extreme cases that may benefit from this data structure are very rare and should probably use their own structures.

Just to clear things up:

1. I do think that what you suggest is an interesting and fine data structure.
2. I do think that in certain cases it would be very useful.
3. I do think that you have good experience with situations where such a structure may be very useful.
4. I know your work (netlib) and I really appreciate what you do.

However, I still think:

1. Such a structure does not give much benefit in mainstream cases for a string data structure in its text meaning (which is what a string is for 99% of programmers). I mean, non-linear memory is not really that useful for the common use case.
2. I think you should take a look at the major string use cases to decide what is better for the "next C++ string", if you want to develop one.

Also, as you probably know, I'm the author of several projects, most of which are strongly tied to text processing and handling:

1. Boost.Locale - strongly text and Unicode oriented.
2. CppCMS - a C++ web framework that deals with strings and networking in most of its code. That, by the way, was the reason Boost.Locale was developed.
3. BidiTeX - bidirectional support for LaTeX/Hebrew, which mostly deals with text.

So I do have some basic view of what the use cases of strings are. I don't say I know them all, but I have developed some "feeling" for what text processing needs, and in my opinion you just miss the major use case.

I think I'll stop trying to convince you that it is the wrong way to look at strings, just because I think that users will know what to pick in real applications.

Best Regards and Good Luck,
Artyom