
On Fri, Jan 28, 2011 at 1:30 PM, Steven Watanabe <watanabesj@gmail.com> wrote:
> On 1/27/2011 2:52 AM, Dean Michael Berris wrote:
>> 5. Because of the contiguous requirement, using it for any "text" that's larger than a memory page's worth of data will kill your cache coherency -- and when you modify parts of it, you can thank your virtual memory manager once the modifications are done. Then you see that you would have to implement your own segmented data structure to act as a string, and you realize you're better off not using std::string in situations where the amount of data you're going to deal with is potentially larger than a cache line.
> I beg your pardon, but this makes no sense to me. Would you mind explaining exactly what kind of usage makes the hardware unhappy, and why?
On multi-core set-ups with a NUMA architecture, one thread allocating a block that spans multiple pages -- from the memory controller attached to a given core/CPU -- gives your OS a hard time finding enough contiguous memory pages (especially when memory is a scarce resource). That's the virtual memory manager at work, and it's a performance killer on most modern (and even not-so-modern) platforms. Let's say you have 4KB memory pages and you need a contiguous string that is, say, 9KB: your OS has to find 3 contiguous memory pages to fit that single string. Say that again -- a single string needing 3 *contiguous* memory pages.

If this allocation happens on one core, and the process is then migrated for whatever reason to a different core that controls a different memory module, then just accessing that string -- even if it's in L3 cache -- becomes a problem. When you change anything in those pages, your write-through cache will see it, and the OS virtual memory manager will mark the pages dirty. The next time you access memory in the same page from a different core, especially if the data isn't in L3 cache yet, the OS sees a dirty page and does all the page management necessary to mark it clean, just to read the data.

That's fine as long as the string doesn't grow. But now let's say you concatenate a 4KB string onto a 9KB string: you now need 4 *contiguous* pages to fit all 13KB of data in a single string, on top of the storage already required by either source string. Then all hell breaks loose as your VMM has to find 4 pages that sit right next to each other, potentially paging things out just to satisfy this simple concatenation request. The time that takes is simply unacceptable whenever you need to do anything remotely interesting on a system that should be handling thousands of transactions per second -- or even just on a user's machine, where you have absolutely no idea what other programs are running and what kind of architecture they have. :)
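Here's a quick, rough illustration of that reallocation (the 9KB/4KB sizes just mirror the example above; the exact growth factor is implementation-defined, so the numbers you see will vary). Appending past the current capacity forces the library to allocate a new, larger contiguous block and copy the whole string across:

#include <cstdio>
#include <string>

int main() {
    std::string s(9 * 1024, 'x');  // ~9KB payload in one contiguous block
    std::string t(4 * 1024, 'y');  // ~4KB payload

    std::printf("before: data=%p capacity=%zu\n",
                static_cast<const void*>(s.data()), s.capacity());

    s += t;  // all 13KB must now live in ONE contiguous block

    std::printf("after:  data=%p capacity=%zu\n",
                static_cast<const void*>(s.data()), s.capacity());
    // If the pointer changed, the original 9KB was copied into a freshly
    // allocated, larger contiguous region -- the reallocation described above.
}

On typical implementations the capacity roughly doubles on growth, so the allocator is actually asked for an even bigger contiguous region than the 13KB strictly requires.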
> I'm not even sure what this has to do with cache coherency (http://en.wikipedia.org/wiki/Cache_coherence).
In the case where you have two threads accessing the same string, and the string is changed by a thread running on one processor, the hardware has to keep the caches on both processors up-to-date, especially if the data is in L1 cache. That means if you're building a string in one thread and simultaneously reading it from another, you're stressing the cache coherence mechanisms these multi-core machines use. On single-core machines that's largely not a problem unless you run into non-local memory access and have to swap things in and out of the cache -- the case when you have large contiguous chunks of memory that have to be accessed randomly (like strings and vectors).

HTH

-- 
Dean Michael Berris
about.me/deanberris
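P.S. The string case itself is hard to demonstrate safely (unsynchronized concurrent reads and writes of a std::string are undefined behaviour), but the classic stand-in for the same coherence traffic is false sharing: two threads hammering counters that share a cache line versus counters padded onto separate lines. A rough C++11 sketch -- the 64-byte line size and the iteration count are assumptions, and timings will vary by machine:

#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

struct SameLine {            // both counters likely share one cache line
    std::atomic<long> a{0};
    std::atomic<long> b{0};
};

struct SeparateLines {       // one counter per (assumed) 64-byte line
    alignas(64) std::atomic<long> a{0};
    alignas(64) std::atomic<long> b{0};
};

template <class Counters>
long long time_ms() {
    Counters c;
    auto hammer = [](std::atomic<long>& n) {
        for (long i = 0; i < 50000000; ++i)
            n.fetch_add(1, std::memory_order_relaxed);
    };
    auto start = std::chrono::steady_clock::now();
    std::thread t1(hammer, std::ref(c.a));
    std::thread t2(hammer, std::ref(c.b));
    t1.join();
    t2.join();
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::milliseconds>(stop - start)
        .count();
}

int main() {
    // Each thread touches only its own counter, yet when the counters share
    // a line every write invalidates the other core's copy -- the coherence
    // "ping-pong" described above.
    std::printf("same cache line:      %lld ms\n", time_ms<SameLine>());
    std::printf("separate cache lines: %lld ms\n", time_ms<SeparateLines>());
}

On a multi-core machine the padded version typically runs several times faster, because each core keeps its own line in cache exclusively; on a single core the two are roughly indistinguishable, which matches the point above.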