
On Sat, Jan 29, 2011 at 5:24 PM, Artyom <artyomtnk@yahoo.com> wrote:
From: Dean Michael Berris <mikhailberis@gmail.com>
On Sat, Jan 29, 2011 at 3:02 PM, Artyom <artyomtnk@yahoo.com> wrote:
It would turn away 90% of users.
It might turn you away because you obviously love std::string. Generalizing is a different matter and is largely a hot-air blowing exercise that is futile for convincing anybody.
I would say it more clearly:
1. All users that use C libraries and need c_str() at boundaries. And this is a huge number of users who need to communicate with modules that are already working and ready, but written in C.
And this is about half of the libraries out there; C is the lowest-level API that allows easy bindings to all languages.
But c_str() doesn't have to be part of the string's interface.
2. All users of GUI toolkits like GTK, Qt, wxWidgets, MFC, as they require conversion of "boost::what_ever_is_it_called" to QString, ustring, wxString, CString, and it is done via a C string.
So, what was the point again?
3. All users who actually use operating system APIs that use C strings and require char *.
Plenty, isn't it?
I know, so what is your point?
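For the record, the boundary conversions in points 1-3 amount to something like this - a minimal sketch, assuming the application keeps its text as UTF-8 in std::string, with legacy_c_api standing in for any C function:

#include <string>

extern "C" void legacy_c_api(const char *text);   // hypothetical C-side function

void hand_over(const std::string &utf8_text)
{
    // C boundary: c_str() yields the NUL-terminated char* the C side wants.
    legacy_c_api(utf8_text.c_str());

    // A GUI boundary (Qt, headers not pulled in here) looks much the same:
    //   QString qt_text = QString::fromUtf8(utf8_text.c_str());
}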
Please take a look at frequent cases of string usage and you'll see how often you indeed need a rope-like structure and how often a normal string.
Don't forget that almost all string implementations in all languages are contiguous single memory chunks.
So what if all other string implementations in all languages are contiguous? Does that mean that's the *only* way to do it? Look, in my paper -- if you read and *understood* it -- I pointed out that linearizing a string is an algorithm that deals with a string. Much like how std::copy is an algorithm that is external to a container, I see linearization as something not part of the string interface. That was towards the end part. I never said that a string shouldn't be linearizable.
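A minimal sketch of what I mean by that - linearization as a free algorithm over a hypothetical segmented string type (the size() and chunks() interface is made up for illustration):

#include <string>
#include <vector>

// Works for any segmented string exposing size() and chunks(), where each
// chunk is a contiguous range of char. Both are assumptions for this sketch.
template <class SegmentedString>
std::vector<char> linearize(const SegmentedString &s)
{
    std::vector<char> flat;
    flat.reserve(s.size() + 1);
    for (const auto &chunk : s.chunks())
        flat.insert(flat.end(), chunk.begin(), chunk.end());
    flat.push_back('\0');        // so the result can cross a C boundary
    return flat;
}
// Usage (hypothetical): auto buf = linearize(str); legacy_c_api(buf.data());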
2. In such a case it would be even better to have non-shared strings
Weh?
Because of memory locality; think of parts of the string referencing "other memory".
Memory locality is solved by making it available to the cache. If you have a contiguous chunk of 4kb *that never ever changes* then accessing that memory from all the cores in a NUMA machine is largely a matter of the cache reading part of that and making it available. Making copies of the string is *unnecessarily wasteful*.
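A rough sketch of sharing an immutable chunk by reference count instead of copying it - the names here are mine, for illustration only:

#include <cstddef>
#include <memory>
#include <string>

using immutable_chunk = std::shared_ptr<const std::string>;

immutable_chunk make_chunk(std::string text)
{
    // The data is written once here and never mutated afterwards.
    return std::make_shared<const std::string>(std::move(text));
}

std::size_t reader(const immutable_chunk &chunk)
{
    // Any number of threads/cores can call this concurrently: the chunk is
    // read-only, so it is shared through the caches instead of being copied.
    return chunk->size();
}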
I beg your pardon? It is efficient, as all functions are as efficient as memcpy, with the exception of when overflow/underflow happens, which requires some virtual function calls that are pretty fast as well...
Also, 99% of issues are simply solved with reserve() (and I work with text parsing, combining and processing a lot).
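(For context, the reserve() pattern being referred to is roughly this - a minimal sketch:)

#include <string>
#include <vector>

std::string join(const std::vector<std::string> &parts)
{
    std::size_t total = 0;
    for (const auto &p : parts)
        total += p.size();

    std::string out;
    out.reserve(total);    // one allocation up front...
    for (const auto &p : parts)
        out += p;          // ...so the appends never reallocate
    return out;
}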
And you obviously don't work with systems that have to do this many thousands of times per second, or you would know what the effects of NUMA are and why allocating a contiguous chunk of memory is the performance killer that it is.
I know, but I hadn't suggested that streambuf should use a single memory chunk.
So then what's the point of making strings use a single contiguous memory chunk if it's not necessary?
This article was written from a wrong understanding of the real problems - instead of solving a problem it suggests some idea for some cases without looking at the problem as a whole.
[snip]
The article was written from the understanding that the real problem stems from how std::string is broken. It already identifies why it's broken. It seems that you're just happy to attack people and the work they do more than you are interested in solving problems.
If you disagree with what's being said argue on the merits of "why". Mud-slinging and sitting on a high horse and just saying "blech, you're wrong" is not helping solve any technical problems.
I'm sorry, but I think that the much more real problems are these:
- My father-in-law can't use Thunderbird because he defined a non-ASCII user name and Thunderbird fails to open the profile. So he needs to create a new account, because half of the other programs are broken when Unicode paths are used!
So fix Thunderbird.
- That Acrobat Reader can't open files with Unicode file names that the user has (at least that was the case the last time I tried it).
So go work for Adobe and fix Acrobat Reader.
- That you can't write cross-platform Unicode-aware code using a simple std::string/char *, or whatever other encoding-unaware string is there.
You can, there are already libraries for that sort of thing if you insist on using std::string.
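One example is Boost.Locale, which works over UTF-8 held in a plain std::string - a minimal sketch, assuming a locale backend that supports "en_US.UTF-8":

#include <boost/locale.hpp>
#include <iostream>
#include <string>

int main()
{
    boost::locale::generator gen;
    std::locale loc = gen("en_US.UTF-8");

    std::string text = u8"grüße";   // UTF-8 kept in a plain std::string
    // Locale-aware, full-Unicode upper-casing of a std::string:
    std::cout << boost::locale::to_upper(text, loc) << "\n";
}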
What you are doing is a classical example of micro-optimization that is concentrated on string storage.
No. If you really did read the document and understood it, I was talking at a high level about how to fix the problem by going through the rationale for immutable strings. The storage issue is a necessary component of the implementation efficiency concerns. If you design something without thinking about the efficiency of the solution, then you're doing art not engineering. I'm not an artist and I sure want to think I'm an engineer by training and by trade. Notice that I haven't mentioned any micro-optimizations or micro-benchmarks in the document as well so I don't know where you're coming from when you say I'm micro-optimizing anything.
Note: all the strings in the toolkits around don't do much beyond what std::string does in terms of storage (QString, ustring, UnicodeString, wxString); they all use the same storage model, some immutable, which I can accept, but all:
1. Are Unicode aware
2. Use a single memory chunk
There is a very good reason for this, but it seems that you just don't get why all string designs revolve around this same principle.
I do know why it's designed the same way: because that's the naive thing to do. Someone thought "oh well, we can malloc a chunk of memory and put a \0 in the end and call that a string". That worked up to a certain level, and then when people started to look at a better way of doing things, they saw that this isn't enough. Notice how the applications you mention use a segmented data structure when dealing with things like edit buffers or similar things. Strings are largely reserved for "short" data and anytime you need anything "longer" you'd use something else -- the question I'm trying to address is why can't you use one data structure that will be efficient for both cases? That's the point of the title which aims to come up with a singular way of explaining how strings should be so that they're suitable for short and long "strings of characters".
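A very rough sketch of the kind of structure I'm talking about - all names are hypothetical and this is not the proposal's actual interface, just the shape of the idea:

#include <cstddef>
#include <memory>
#include <string>
#include <vector>

class chunked_string {
    // Immutable blocks: short strings use one block, long ones use many.
    std::vector<std::shared_ptr<const std::string>> blocks_;
public:
    chunked_string() = default;
    explicit chunked_string(std::string s)
    {
        blocks_.push_back(std::make_shared<const std::string>(std::move(s)));
    }

    // Concatenation shares the existing blocks and copies no character data.
    friend chunked_string concat(const chunked_string &a, const chunked_string &b)
    {
        chunked_string r;
        r.blocks_ = a.blocks_;
        r.blocks_.insert(r.blocks_.end(), b.blocks_.begin(), b.blocks_.end());
        return r;
    }

    std::size_t size() const
    {
        std::size_t n = 0;
        for (const auto &b : blocks_)
            n += b->size();
        return n;
    }
};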
Take a look at these fundamental operations you had written:
- Concatenation (generally ok)
- Substring - should be Unicode aware in most cases
- Filtration - should be Unicode and Locale aware
- Tokenization - should be Unicode and Locale aware
- Search/Pattern Matching - should be Unicode and Locale aware.
So please, if you don't understand why these are the fundamental operations and why the string should relate to encoding, then you need to reread this thread.
But these are *algorithms* that should be aware of the encoding, *not the string*. If you don't understand that point then you need to read the document *again*. The point is that if you view a string a given way, then that's largely an implementation of the view. I haven't gotten to the explanation of the view but I hinted that interpretation is a matter of composition. So if you "wrap" a string and say that it should be interpreted one way, then that's the whole point of enforcing an encoding on the view of the string. The string itself doesn't *have* to be encoded a certain way. Now if strings are just values then the encoding in which they come is largely a matter of implementation. Think of an int -- you don't really know if it's big- or little-endian -- or a float -- whether it's IEEE xxx or yyy. All that matters though is how the operations on these strings are defined. The point of the abstraction is that you have one way of dealing with the string as a value and write algorithms around that abstraction.
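A minimal sketch of what I mean, with utf8_view as a purely hypothetical name - the bytes stay a plain value, and the view imposes the interpretation (only a code-point count is shown):

#include <cstddef>
#include <string>

struct utf8_view {
    const std::string &bytes;   // the underlying, encoding-agnostic value

    // An algorithm on the *view*: count code points by skipping UTF-8
    // continuation bytes (those of the form 10xxxxxx).
    std::size_t code_points() const
    {
        std::size_t n = 0;
        for (unsigned char c : bytes)
            if ((c & 0xC0) != 0x80)
                ++n;
        return n;
    }
};
// Usage: std::string raw = u8"héllo";
//        std::size_t n = utf8_view{raw}.code_points();   // 5, not 6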
A new string that does not solve any of the Unicode issues has no place - and this is the *real* problem.
You missed the point. It's not the string you want, it's the view of the string you want when you're talking about encoding.
Please don't write theories on C++ strings if you do not see what a string is - human-readable text that is much more complex than a set of byte chunks.
See, I defined what a string is in that document. If you don't agree with that definition then I can't help you. As much as humans want to think that computers see the world the same way, unfortunately that's not the case. A string is a data structure. How you view a string in a given encoding is a matter of algorithm. If you don't see that then I'm sorry for you.

--
Dean Michael Berris
about.me/deanberris