
On Sun, Jan 30, 2011 at 1:23 AM, Phil Endecott <spam_from_boost_dev@chezphil.org> wrote:
Dean Michael Berris wrote:
Yes, and this is the problem with sbrk: if you rely on your allocator to use sbrk, sbrk is not guaranteed to just expand the available segment in place -- which means that, no matter what you do, the VMM will have to find memory for you and hand you that chunk. This means potentially swapping pages in/out.
Let me try to understand that; are you saying the same as this:
sbrk is not guaranteed to expand into memory that is physically contiguous with the previously-allocated memory. Which means that the OS will have to spend time studying its memory layout data structures to find a suitable area of physical memory to map for you.
If that's what you're saying, I believe it, though I'm not convinced that the time taken to do this is often a bottleneck. References would be appreciated. But you might mean something else.
Pretty much the same thing. I'm not sure where that quote came from, but just doing `man sbrk` on Linux will give a lot of information on what `sbrk` does in cases where it can't grow the current segment for the application. Also, finding suitable memory isn't the *only* problem: on systems where you have many threads competing to allocate/deallocate memory, since threads basically share the address space of the spawning process, you can end up with each thread calling sbrk and growing the heap more times than you really want -- and although I'm positive ptmalloc does some locking to serialize changes to a process's address space, stranger things have happened before. ;)
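For reference, here's a minimal sketch of the raw mechanism (Linux; purely illustrative -- real allocators like ptmalloc layer locking, arenas, and free lists on top of this):

    #include <unistd.h>   // sbrk
    #include <cstdio>

    int main() {
        void* before = sbrk(0);     // current program break
        void* chunk  = sbrk(4096);  // grow the heap by one page
        void* after  = sbrk(0);
        std::printf("break moved from %p to %p (chunk at %p)\n",
                    before, after, chunk);
        // Nothing here guarantees the new page is *physically* contiguous
        // with earlier allocations; the kernel only promises virtually
        // contiguous addresses within the data segment.
        return 0;
    }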
This means potentially swapping pages in/out.
That's the crucial bit. Why do you believe that this allocation could possibly lead to swapping? Do you mean that only on systems that are already low on RAM, or are you suggesting that this could happen even when the overall amount of free RAM is reasonable?
So let's do a thought experiment here: say you have an application that's already built and you run it on any given system.

The best case is that it's the only thing running on the system, in which case you have absolute control and can use all the memory available to you -- console games are like this. Even in this situation, where you know the physical limits and characteristics of each machine, you may or may not be talking to a virtual memory manager (I don't know whether Sony's PS3, the Xbox, etc. have an OS service handling memory, but let's assume they do). Even in this best-case scenario your VMM can choose to page areas of memory in/out, remapping the virtual address space to physical addresses, depending on the time of day (effectively at random) or, with multi-threaded code, on the order in which operations actually get scheduled. The best thing you can do even in this best-case scenario is to be prudent with your use of memory -- and if you have a library that does that for you in the best case, imagine how it would work in the worst case. ;)

Now take the worst-case scenario: your application runs on a system that has very little RAM available, and it's not the only program running. Worse, the VMM isn't very smart about handling multiple applications that demand RAM. If you design for this case and it actually works, imagine how it would work in the best case. :D
You limit this likelihood by asking for page-sized and page-aligned chunks
When you say "page-sized", do you really mean "page-sized", or do you mean "multiple of the page size"? If you really mean "page-sized", you're suggesting that there should be one call to sbrk or mmap for each 4 kbytes of RAM, right? That seems very wrong to me.
No, see, when you call sbrk you typically don't know how much RAM is actually given to your application, and you usually only call sbrk (in the allocator implementation) when you don't have any more available heap space to deal with. :) What I'm suggesting is: given that you already have enough heap, make your data structure use page-sized, page-aligned chunks, so that 1) since each chunk is aligned on a page boundary, a page's worth of data never spills into another (contiguous) page, and 2) when the VMM pages data in/out between physical RAM and the cache, it has all the data it needs to deal with in that page. And even if you don't have enough heap and have to get more (with sbrk or mmap), you use the same strategy at a higher level, so that your data structures exploit the cache and the paging mechanism of the VMM and the hardware you're running on. Does that make sense? We can split off this discussion if it proves to be more general than the context of a string implementation. ;)
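Concretely, something like this is what I mean (a minimal sketch for POSIX systems; posix_memalign and sysconf are standard, the rest is illustrative):

    #include <unistd.h>   // sysconf
    #include <stdlib.h>   // posix_memalign, free
    #include <cstdio>

    int main() {
        long page = sysconf(_SC_PAGESIZE);  // typically 4096 on x86 Linux
        void* chunk = nullptr;
        // Ask for exactly one page, aligned on a page boundary, so the
        // chunk never straddles two pages.
        if (posix_memalign(&chunk, page, page) != 0)
            return 1;
        std::printf("page size %ld, chunk at %p\n", page, chunk);
        free(chunk);
        return 0;
    }

A data structure built from such chunks never has a logical page of data split across two physical pages, which is the point of 1) and 2) above.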
[snip]
Note that there's a hardware page size and a virtual page size.
Really? Can you give an example?
http://en.wikipedia.org/wiki/Page_size -- too lazy to give a specific example, I hope this helps.
Try a simple benchmark: [mmap] a 1GB randomly generated file on a 64-bit multi-core Intel Nehalem-based machine, spawn 1,000 threads on Linux, and have each thread repeatedly change a value at a randomly picked offset. This will show you how cache misses, VMM paging, and the cost of mutating data will actually kill your efficiency.
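A minimal sketch of that benchmark (the file name, thread count, and per-thread iteration count are placeholders; error handling is omitted):

    #include <sys/mman.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <thread>
    #include <vector>
    #include <random>

    int main() {
        const size_t size = 1UL << 30;        // 1 GB
        int fd = open("random.dat", O_RDWR);  // pre-generated random file
        char* data = static_cast<char*>(
            mmap(nullptr, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0));

        std::vector<std::thread> threads;
        for (int t = 0; t < 1000; ++t) {
            threads.emplace_back([=] {
                std::mt19937_64 rng(std::random_device{}());
                std::uniform_int_distribution<size_t> pick(0, size - 1);
                for (int i = 0; i < 100000; ++i)
                    data[pick(rng)]++;  // random write: cache miss + dirtied page
            });
        }
        for (auto& th : threads) th.join();

        munmap(data, size);
        close(fd);
        return 0;
    }

The read-only variant mentioned below is the same program with the increment replaced by a plain read.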
Right, it will be slow.
Now if you change the program to just randomly read from a part of memory, you'll see how immutability helps.
Right, it will be much faster.
But if your program actually needs to change the data then it needs to change the data; you would need to compare e.g.
std::string s; ... s[n] = c;
vs.
immutable_string s; ... s = s.substr(0,n) + c + s.substr(n+1);
Now if you can demonstrate that being faster, I'll be impressed.
Note that in the immutable-string case you won't actually be allocating much more memory -- what you've just built is a concatenation tree that refers to the appropriate immutable data blocks or internal tree nodes. You may need to allocate an extra block (or use a "free store" block for individual character storage), but since what you're really doing is building a new string, it's not really representative of the cost of mutation, right? :D If you really need to mutate data, the point is you don't use an immutable data structure to do it. :D
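To make the concatenation-tree idea concrete, here's a minimal sketch (this node design is hypothetical; real rope implementations add substring nodes, tree balancing, small-leaf packing, and so on):

    #include <memory>
    #include <string>
    #include <cstddef>

    // A node is either a leaf holding immutable character data, or a
    // concatenation of two subtrees. Nodes are never modified in place.
    struct node {
        std::shared_ptr<const node> left, right;  // set for concat nodes
        std::string data;                         // set for leaves
        std::size_t length;
    };

    using node_ptr = std::shared_ptr<const node>;

    node_ptr leaf(std::string s) {
        std::size_t n = s.size();
        return std::make_shared<const node>(node{nullptr, nullptr, std::move(s), n});
    }

    node_ptr concat(node_ptr l, node_ptr r) {
        // O(1): build one new internal node referring to both halves;
        // no character data is copied.
        std::size_t n = l->length + r->length;
        return std::make_shared<const node>(node{std::move(l), std::move(r), std::string(), n});
    }

    // s.substr(0,n) + c + s.substr(n+1) then becomes, roughly:
    //   new_s = concat(concat(prefix_of_s, leaf("c")), suffix_of_s);
    // where prefix_of_s and suffix_of_s are substring views into the old
    // tree (omitted here), so the bulk of the text is shared, not copied.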
Then change it yet again to [mmap] 1GB worth of data, but this time spread across multiple files, breaking the data up into smaller pieces. The dramatic improvement in performance should be evident there. :)
That's an interesting suggestion. Have you actually done this?
Yep. And so do other open-source projects like MongoDB. :)
Can you post some numbers? I would try it myself, but I don't have a suitable NUMA test system right now. Presumably the idea is that the files are distributed randomly between the different processors' memory controllers.
Not just that -- what you're doing is giving the VMM a chance to work with an appropriate set of smaller virtual page-table chunks, instead of the single 1GB allocation you'd otherwise hand it in one go. :)
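Roughly this layout, sketched below (file names, chunk count, and chunk size are placeholders; the files are assumed to exist and to be at least chunk_size bytes; error handling omitted):

    #include <sys/mman.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <cstdio>
    #include <vector>

    int main() {
        const size_t chunk_size = 200UL << 20;  // ~200 MB per chunk
        std::vector<char*> chunks;
        for (int i = 0; i < 5; ++i) {           // 5 x 200 MB, ~1 GB total
            char name[32];
            std::snprintf(name, sizeof name, "data.%d", i);
            int fd = open(name, O_RDWR);        // a separate fd per chunk
            chunks.push_back(static_cast<char*>(
                mmap(nullptr, chunk_size, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0)));
            close(fd);                          // the mapping outlives the fd
        }
        // ... threads can now work on different chunks independently; each
        // mapping can be paged and unmapped without touching the others ...
        for (char* c : chunks) munmap(c, chunk_size);
        return 0;
    }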
Would you get a similar benefit in the case of the monolithic 1 GB file by mmapping it in e.g. 4 or 8 chunks? In any case, I'm not convinced that this is relevant to the memory allocation issue.
From the OS's view it's an optimization. From a developer/user perspective, in some cases it might be good, in others it might be bad.

Assuming an SSD storage device, where random access is pretty much uniform and the bus speed and the controller's throughput are the limits on how fast data moves from "disk" to main memory, what you should think about is really the virtual page table that's reserved for the 1GB file. If you mmap it in 4 or 8 chunks, the kernel will see that it's the same fd and just give you different addresses that are still contiguous -- it's going to do the math and conclude "well, you're using the same fd, so I'll treat these mmaps the same way I would normally treat a single mmap".

However, when you pass in different fds, the kernel can avoid the synchronization it would do if you were using a single fd. Your contiguous virtual memory chunks then get pretty much unbridled, "equal opportunity" access to more memory with very little synchronization. That's why systems like MongoDB typically use a chunk size of 200MB or thereabouts: if you want to read data in a given chunk, you get almost unsynchronized access to that 200MB chunk, as compared to mapping a whole XXGB file. And ~200MB is not just a magic number: at least on Linux there is a way of doing a "slab allocation" with the kernel's cooperation, where you can get large chunks of address space for your process without giving other processes too much of a hassle.

This is an interesting discussion and it's largely very low-level and system-specific, but the wisdom in the whole discussion really is: prudence with memory use pays dividends, especially if you work with the VMM instead of against it.

There's a good write-up along these same lines at ACM Queue, written by Poul-Henning Kamp, espousing the way we should think about effectively using the memory pages the OS gives to applications. It's an interesting read, which you can find here: http://queue.acm.org/detail.cfm?id=1814327

HTH :)

--
Dean Michael Berris
about.me/deanberris