Re: [boost] [string] proposal

30 Jan 2011

      ...
From: Dean Michael Berris <mikhailberis@gmail.com>
On Sat, Jan 29, 2011 at 11:25 PM, Artyom <artyomtnk@yahoo.com> wrote:
...
...
From: Dean Michael Berris <mikhailberis@gmail.com>
On Sat, Jan 29, 2011 at 8:06 PM, Artyom <artyomtnk@yahoo.com> wrote:
No, it's not obvious. Here's why:
fd = creat(file.c_str(), 0666);
What does c_str() here imply? It implies that there's a buffer
somewhere that is a `char const *` which is either created and
returned and then held internally by whatever 'file' is.
It implies that const file owns const buffer that holds null terminated
string that can be passed to "char const *" API.
Yes, which is the problem in the first place. Every instance of string
would then need to have that same buffer even if that string is just a
temporary or worse just a copy.
It would not happen if it holds the data as linear single chunk :-)
...
...
...
Now let's say
what if file changed in a different thread, what happens to the buffer
pointed to by the pointer returned by c_str()? Explain to me that
because *it is possible and it can happen in real code*.
I'm sorry, but a string, like anything else, has value semantics,
that is:
- safe for "const" access from multiple threads
- safe for mutable access from a single thread
I don't see why a string should be different from
any other value type like "int", because the following
x += y + y
is not safe for an integer either.
The code I had shown is **perfectly** safe with
a string that has value semantics (which std::string has).
Now the point I was making was, in the case of a string that is
immutable, you don't worry about the string changing *ever*, and don't
need a contiguous buffer for something that isn't explicitly required.
It's value semantics *plus* immutability.
So either you should forbid

    your_string const &operator=(your_string const &)

Which makes it even more useless (IMHO)

Or your statement is wrong because assignment
can happen for example from other thread like

   str = str + " suffix"

And this would change the string at run time.

Your statement is false unless I am missing
something, or you want to put a mutex inside
a string or some other atomic variable.

I don't see any reason not to treat
a string as any other value.
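
To make the assignment point concrete, here is a minimal sketch. The
class immutable_string is hypothetical and is not the interface anybody
proposed in this thread; it only assumes "immutable" means the character
data never changes after construction. The variable can still be
re-bound by assignment, so another thread assigning to it needs the
same synchronization as for any other value:

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Hypothetical immutable string: the character data is never modified
// after construction, but the handle itself can still be re-assigned.
class immutable_string {
public:
    explicit immutable_string(std::string s)
        : data_(std::make_shared<const std::string>(std::move(s))) {}

    const std::string& str() const { return *data_; }

    // Concatenation builds a new value; neither operand is modified.
    friend immutable_string operator+(const immutable_string& a, const char* b) {
        return immutable_string(a.str() + b);
    }

    // The implicitly generated copy assignment re-binds data_ to the
    // other object's buffer. That is a write to this object.
private:
    std::shared_ptr<const std::string> data_;
};

// Returns true if assignment changed what the variable holds while an
// earlier copy kept the old contents.
inline bool assignment_rebinds_variable() {
    immutable_string str("name");
    immutable_string snapshot = str;   // shares the old buffer
    str = str + " suffix";             // str now refers to a new buffer
    return str.str() == "name suffix" && snapshot.str() == "name";
}
```

The buffer is immutable, the variable is not. Unless operator= is
removed, readers of `str` in other threads still see it change.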
...
...
I think both of us, and 95% of C++ programmers who use the STL,
know what the semantics of std::string::c_str() are
...
I like this better:
// I know I want 255 characters max
char * filename = (char *)malloc(255 * sizeof(char));
if (filename == NULL) {
    // deal with the error here
}
linearize(substr(file, 0, 255), filename);
fd = creat(filename, 0666);
Sorry? Is this better than:
fd=create(filename.substr(0,256).c_str(),O_EXCL...)
Which by the way is 100% thread safe as well (but still may throw).
Even though I can't see any reason to cut at 256 bytes before create
No, not better because std::string's substr() will return a temporary,
which means it will be a copy -- meaning another allocation and a call
to memcpy(...). You don't get the benefit of COW on this one because
you need to cut the string down to a "maximum size".
Actually, a small note: you will have to copy anyway, because the C API
expects a NUL-terminated string, so you can't avoid the memory
copy if the byte after the last character is not NUL.

So no difference there
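
A small sketch of why the copy is unavoidable. The helper
terminated_prefix is a made-up name for illustration, and it assumes
the source is just a pointer plus a length. A prefix cut out of a
longer buffer is followed by the next character, not by NUL, so
something has to copy it and append the terminator before a
"char const *" API can take it:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical helper: returns a NUL-terminated copy of at most max_len
// bytes of [data, data + len). The std::string constructor copies the
// bytes into its own storage, and c_str() on the result is terminated.
inline std::string terminated_prefix(const char* data, std::size_t len,
                                     std::size_t max_len) {
    const std::size_t n = len < max_len ? len : max_len;
    return std::string(data, n);
}
```

Usage would be something like
`creat(terminated_prefix(p, n, 255).c_str(), 0666)`. Whether the data
came from one chunk or many, the bytes get copied once.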
...
I don't know if you know, but the C APIs from OSes have a defined
maximum on the lengths of filenames and things like that...
Not exactly. Maximal length is something much more complicated,
run-time and OS dependent, but that is another story.
...
...
...
And then the problem of the unnecessary contiguous
buffer is still not addressed.
It is a good idea to have some non-linear data storage, but
it should be used only in very specific cases.
Also, what is a really large string for you, one that would have
a performance advantage from not being stored linearly?
Talk to me in numbers?
Say on a machine+OS combo that has 4kb pages, a large string would be
something that spans more than one memory page -- i.e. >4kb.
Now a "short" string is one that can fit within a page. For
concatenated strings, it's fine to compact/copy the substrings into a
growable/shrinkable block. The overhead for a short string would be
constant, as the concatenation tree would be a pointer and a length
with a reference count integer.
Actually, I meant benchmarks: how long should a string be before it
becomes more efficient to store it as chunks?

So you suggest 4k.

Which:

1. Covers almost all full file path names on every operating system
2. Covers almost all messages shown in text dialogs
3. Covers all possible user data etc.

So basically a single memory chunk is efficient for 99% of use
cases.

Now let's take it to the extreme:

- Longest Wikipedia article: 388k = ~97 pages
- Book: War and Peace: text size... 3.1MB = 800 pages.

How frequent is this case? Very rare.
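
For anyone who wants to redo the page arithmetic, a tiny sketch. It
assumes the 4k page size from the discussion above, and pages_needed
is just an illustrative name:

```cpp
#include <cassert>
#include <cstddef>

// Assumed page size, taken from the 4k figure in the discussion.
constexpr std::size_t kPageSize = 4096;

// Number of pages needed to hold `bytes` bytes, rounded up.
constexpr std::size_t pages_needed(std::size_t bytes) {
    return (bytes + kPageSize - 1) / kPageSize;
}
```

Taking 3.1MB as 3.1 * 2^20 bytes, this gives 794 pages, in line with
the rough figure of 800.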

See, text is basically something quite short;
long books were successfully written in the days
when 640K was more than enough. So the real
benefits of a non-linear data structure are rare;
on the other hand, it adds too much complexity
for everyday use.

So basically:

1. 99% of use cases fit in a single page.
2. Very few cases would actually make use of a multi-page
   architecture.
3. The extreme cases that may benefit from this data
   structure are very rare and probably should use their
   own structures.

Just to clear things up:

1. I do think that what you suggest is an interesting
   and fine data structure.
2. I do think that in certain cases it would be very
   useful.
3. I do think that you have good experience with situations
   where such a structure may be very useful.
4. I know your work (netlib) and I really appreciate
   what you do.

However I still think:

1. Such a structure does not give much benefit in mainstream
   cases for a string data structure in its text
   meaning (which is what a string is for 99% of programmers).

   I mean non-linear memory is not really that
   useful for the common use case.

2. I think you should take a look at the major string
   use cases to decide what is better for the
   "next C++ string" if you want to develop one.

Also, as you probably know, I'm the author of several
projects, most of which are strongly tied to text
processing and handling:

1. Boost.Locale - strongly text and Unicode
   oriented.

2. CppCMS - C++ web framework that deals with
   strings and networking in most of its code.

   That, by the way, was the reason for Boost.Locale
   to be developed.

3. BidiTeX - bidirectional support for LaTeX/Hebrew,
   which mostly deals with text.

So I do have some basic view of what the use
cases of strings are. I don't say I know them all,
but I have developed some "feeling" for what text
processing needs, and in my opinion you
just miss the major use case.

I think I'll stop trying to convince you
that it is the wrong way to look at strings,
just because I think that users will
know what to pick in real applications.

Best Regards and Good Luck,
  Artyom