
On Fri, Jan 28, 2011 at 1:30 PM, Steven Watanabe <watanabesj@gmail.com> wrote:
> On 1/27/2011 2:52 AM, Dean Michael Berris wrote:
>> 5. Because of the contiguous requirement, using it for any "text" that's larger than a memory page's worth of data will kill your cache coherency -- and when you modify parts of it, you can thank your virtual memory manager once the modifications are done. Then you see that you would have to implement your own segmented data structure to act as a string, and you realize you're better off not using std::string in situations where the amount of data you're going to deal with is potentially larger than a cache line.
> I beg your pardon, but this makes no sense to me. Would you mind explaining exactly what kind of usage makes the hardware unhappy, and why?
On multi-core set-ups with a NUMA architecture, one thread allocating a block that spans multiple pages -- from the memory controller attached to a given core/CPU -- gives your OS a hard time finding enough contiguous memory pages (especially when memory is a scarce resource). That's the virtual memory manager at work, and it's a performance killer on most modern (and even not-so-modern) platforms. Let's say you have 4KB memory pages and you need a contiguous string that is, say, 9KB: your OS has to find 3 contiguous memory pages to fit that single string. Say that again -- a single string needing 3 *contiguous* memory pages.

If this allocation happens on one core, and the process is then migrated for whatever reason to a different core that controls a different memory module, then just accessing that string -- even if it's in L3 cache -- becomes a problem. When you change anything in those pages, your write-through cache will see it, and the OS virtual memory manager will mark the pages dirty. The next time you access memory in the same page from a different core, especially if the data isn't in L3 cache yet, the OS sees a dirty page and does all the page management necessary to mark it clean, just to read the data.

That's fine as long as the string doesn't grow. But now let's say you concatenate a 4KB string onto a 9KB string: you now need 4 *contiguous* pages to fit all 13KB of data in a single string, on top of the storage already required by either source string. Then all hell breaks loose as your VMM has to find 4 pages that sit right next to each other, potentially paging things out just to satisfy this simple concatenation request. The time that takes is simply unacceptable whenever you need to do anything remotely interesting on a system that should be handling thousands of transactions per second -- or even just on a user's machine, where you have absolutely no idea what other programs are running and what kind of architecture they have. :)
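Here's a quick, rough illustration of that reallocation (the 9KB/4KB sizes just mirror the example above; the exact growth factor is implementation-defined, so the numbers you see will vary). Appending past the current capacity forces the library to allocate a new, larger contiguous block and copy the whole string across:

#include <cstdio>
#include <string>

int main() {
    std::string s(9 * 1024, 'x');  // ~9KB payload in one contiguous block
    std::string t(4 * 1024, 'y');  // ~4KB payload

    std::printf("before: data=%p capacity=%zu\n",
                static_cast<const void*>(s.data()), s.capacity());

    s += t;  // all 13KB must now live in ONE contiguous block

    std::printf("after:  data=%p capacity=%zu\n",
                static_cast<const void*>(s.data()), s.capacity());
    // If the pointer changed, the original 9KB was copied into a freshly
    // allocated, larger contiguous region -- the reallocation described above.
}

On typical implementations the capacity roughly doubles on growth, so the allocator is actually asked for an even bigger contiguous region than the 13KB strictly requires.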
> I'm not even sure what this has to do with cache coherency (http://en.wikipedia.org/wiki/Cache_coherence).
In the case where you have two threads accessing the same string, and the string is changed by a thread running on one processor, the hardware has to keep the caches on both processors up-to-date, especially if the data is in L1 cache. That means if you're building a string in one thread and simultaneously reading it from another, you're stressing the cache coherence mechanisms these multi-core machines use. On single-core machines that's largely not a problem unless you run into non-local memory access and have to swap things in and out of the cache -- the case when you have large contiguous chunks of memory that have to be accessed randomly (like strings and vectors).

HTH

-- 
Dean Michael Berris
about.me/deanberris
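P.S. The string case itself is hard to demonstrate safely (unsynchronized concurrent reads and writes of a std::string are undefined behaviour), but the classic stand-in for the same coherence traffic is false sharing: two threads hammering counters that share a cache line versus counters padded onto separate lines. A rough C++11 sketch -- the 64-byte line size and the iteration count are assumptions, and timings will vary by machine:

#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

struct SameLine {            // both counters likely share one cache line
    std::atomic<long> a{0};
    std::atomic<long> b{0};
};

struct SeparateLines {       // one counter per (assumed) 64-byte line
    alignas(64) std::atomic<long> a{0};
    alignas(64) std::atomic<long> b{0};
};

template <class Counters>
long long time_ms() {
    Counters c;
    auto hammer = [](std::atomic<long>& n) {
        for (long i = 0; i < 50000000; ++i)
            n.fetch_add(1, std::memory_order_relaxed);
    };
    auto start = std::chrono::steady_clock::now();
    std::thread t1(hammer, std::ref(c.a));
    std::thread t2(hammer, std::ref(c.b));
    t1.join();
    t2.join();
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::milliseconds>(stop - start)
        .count();
}

int main() {
    // Each thread touches only its own counter, yet when the counters share
    // a line every write invalidates the other core's copy -- the coherence
    // "ping-pong" described above.
    std::printf("same cache line:      %lld ms\n", time_ms<SameLine>());
    std::printf("separate cache lines: %lld ms\n", time_ms<SeparateLines>());
}

On a multi-core machine the padded version typically runs several times faster, because each core keeps its own line in cache exclusively; on a single core the two are roughly indistinguishable, which matches the point above.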