
On Sun, Jan 30, 2011 at 1:23 AM, Phil Endecott <spam_from_boost_dev@chezphil.org> wrote:
Dean Michael Berris wrote:
Yes, and this is the problem with sbrk: if you rely on your allocator to use sbrk, sbrk is not guaranteed to just expand the available segment in place -- which means that, no matter what you do, the VMM will have to find memory for you and hand you that chunk. This means potentially swapping pages in/out.
Let me try to understand that; are you saying the same as this:
sbrk is not guaranteed to expand into memory that is physically contiguous with the previously-allocated memory. Which means that the OS will have to spend time studying its memory layout data structures to find a suitable area of physical memory to map for you.
If that's what you're saying, I believe it, though I'm not convinced that the time taken to do this is often a bottleneck. References would be appreciated. But you might mean something else.
Pretty much the same thing. I'm not sure where that quote came from, but just doing `man sbrk` on Linux will give a lot of information on what `sbrk` does in cases where it can't grow the current segment for the application. Also, finding suitable memory isn't the *only* problem: on systems where you have many threads competing to allocate/deallocate memory, since threads basically share the address space of the spawning process, you can end up with each thread calling sbrk and growing the heap more times than you really want -- and although I'm positive ptmalloc does some locking to serialize changes to a process's address space, stranger things have happened before. ;)
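For reference, here's a minimal sketch of the raw mechanism (Linux; purely illustrative -- real allocators like ptmalloc layer locking, arenas, and free lists on top of this):

    #include <unistd.h>   // sbrk
    #include <cstdio>

    int main() {
        void* before = sbrk(0);     // current program break
        void* chunk  = sbrk(4096);  // grow the heap by one page
        void* after  = sbrk(0);
        std::printf("break moved from %p to %p (chunk at %p)\n",
                    before, after, chunk);
        // Nothing here guarantees the new page is *physically* contiguous
        // with earlier allocations; the kernel only promises virtually
        // contiguous addresses within the data segment.
        return 0;
    }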
This means potentially swapping pages in/out.
That's the crucial bit. Why do you believe that this allocation could possibly lead to swapping? Do you mean that only on systems that are already low on RAM, or are you suggesting that this could happen even when the overall amount of free RAM is reasonable?
So let's do a thought experiment here: say you have an application that's already built and you run it on any given system.

The best case is that it's the only thing running on the system, in which case you have absolute control and can use all the memory available to you -- console games are like this. Even in this situation, where you know the physical limits and characteristics of each machine, you may or may not be talking to a virtual memory manager (I don't know whether Sony's PS3, the Xbox, etc. have an OS service handling memory, but let's assume they do). Even in this best-case scenario your VMM can choose to page areas of memory in/out, remapping the virtual address space to physical addresses, depending on the time of day (effectively at random) or, with multi-threaded code, on the order in which operations actually get scheduled. The best thing you can do even in this best-case scenario is to be prudent with your use of memory -- and if you have a library that does that for you in the best case, imagine how it would work in the worst case. ;)

Now take the worst-case scenario: your application runs on a system that has very little RAM available, and it's not the only program running. Worse, the VMM isn't very smart about handling multiple applications that demand RAM. If you design for this case and it actually works, imagine how it would work in the best case. :D
You limit this likelihood by asking for page-sized and page-aligned chunks
When you say "page-sized", do you really mean "page-sized", or do you mean "multiple of the page size"? If you really mean "page-sized", you're suggesting that there should be one call to sbrk or mmap for each 4 kbytes of RAM, right? That seems very wrong to me.
No, see, when you call sbrk you typically don't know how much RAM is actually given to your application, and you usually only call sbrk (in the allocator implementation) when you don't have any more available heap space to deal with. :) What I'm suggesting is: given that you already have enough heap, make your data structure use page-sized, page-aligned chunks, so that 1) since each chunk is aligned on a page boundary, a page's worth of data never spills into another (contiguous) page, and 2) when the VMM pages data in/out between physical RAM and the cache, it has all the data it needs to deal with in that page. And even if you don't have enough heap and have to get more (with sbrk or mmap), you use the same strategy at a higher level, so that your data structures exploit the cache and the paging mechanism of the VMM and the hardware you're running on. Does that make sense? We can split off this discussion if it proves to be more general than the context of a string implementation. ;)
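Concretely, something like this is what I mean (a minimal sketch for POSIX systems; posix_memalign and sysconf are standard, the rest is illustrative):

    #include <unistd.h>   // sysconf
    #include <stdlib.h>   // posix_memalign, free
    #include <cstdio>

    int main() {
        long page = sysconf(_SC_PAGESIZE);  // typically 4096 on x86 Linux
        void* chunk = nullptr;
        // Ask for exactly one page, aligned on a page boundary, so the
        // chunk never straddles two pages.
        if (posix_memalign(&chunk, page, page) != 0)
            return 1;
        std::printf("page size %ld, chunk at %p\n", page, chunk);
        free(chunk);
        return 0;
    }

A data structure built from such chunks never has a logical page of data split across two physical pages, which is the point of 1) and 2) above.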
[snip]
Note that there's a hardware page size and a virtual page size.
Really? Can you give an example?
http://en.wikipedia.org/wiki/Page_size -- too lazy to give a specific example, I hope this helps.
Try a simple benchmark: [mmap] a 1GB randomly generated file on a 64-bit multi-core Intel Nehalem-based machine, spawn 1,000 threads on Linux, and have each thread repeatedly change a value at a randomly picked offset. This will show you how cache misses, VMM paging, and the cost of mutating data will actually kill your efficiency.
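A minimal sketch of that benchmark (the file name, thread count, and per-thread iteration count are placeholders; error handling is omitted):

    #include <sys/mman.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <thread>
    #include <vector>
    #include <random>

    int main() {
        const size_t size = 1UL << 30;        // 1 GB
        int fd = open("random.dat", O_RDWR);  // pre-generated random file
        char* data = static_cast<char*>(
            mmap(nullptr, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0));

        std::vector<std::thread> threads;
        for (int t = 0; t < 1000; ++t) {
            threads.emplace_back([=] {
                std::mt19937_64 rng(std::random_device{}());
                std::uniform_int_distribution<size_t> pick(0, size - 1);
                for (int i = 0; i < 100000; ++i)
                    data[pick(rng)]++;  // random write: cache miss + dirtied page
            });
        }
        for (auto& th : threads) th.join();

        munmap(data, size);
        close(fd);
        return 0;
    }

The read-only variant mentioned below is the same program with the increment replaced by a plain read.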
Right, it will be slow.
Now if you change the program to just randomly read from a part of memory, you'll see how immutability helps.
Right, it will be much faster.
But if your program actually needs to change the data then it needs to change the data; you would need to compare e.g.
std::string s; ... s[n] = c;
vs.
immutable_string s; ... s = s.substr(0,n) + c + s.substr(n+1);
Now if you can demonstrate that being faster, I'll be impressed.
Note that in the immutable-string case you won't actually be allocating much more memory -- what you've just built is a concatenation tree that refers to the appropriate immutable data blocks or internal tree nodes. You may need to allocate an extra block (or use a "free store" block for individual character storage), but since what you're really doing is building a new string, it's not really representative of the cost of mutation, right? :D If you really need to mutate data, the point is you don't use an immutable data structure to do it. :D
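To make the concatenation-tree idea concrete, here's a minimal sketch (this node design is hypothetical; real rope implementations add substring nodes, tree balancing, small-leaf packing, and so on):

    #include <memory>
    #include <string>
    #include <cstddef>

    // A node is either a leaf holding immutable character data, or a
    // concatenation of two subtrees. Nodes are never modified in place.
    struct node {
        std::shared_ptr<const node> left, right;  // set for concat nodes
        std::string data;                         // set for leaves
        std::size_t length;
    };

    using node_ptr = std::shared_ptr<const node>;

    node_ptr leaf(std::string s) {
        std::size_t n = s.size();
        return std::make_shared<const node>(node{nullptr, nullptr, std::move(s), n});
    }

    node_ptr concat(node_ptr l, node_ptr r) {
        // O(1): build one new internal node referring to both halves;
        // no character data is copied.
        std::size_t n = l->length + r->length;
        return std::make_shared<const node>(node{std::move(l), std::move(r), std::string(), n});
    }

    // s.substr(0,n) + c + s.substr(n+1) then becomes, roughly:
    //   new_s = concat(concat(prefix_of_s, leaf("c")), suffix_of_s);
    // where prefix_of_s and suffix_of_s are substring views into the old
    // tree (omitted here), so the bulk of the text is shared, not copied.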
Then change it yet again to [mmap] 1GB worth of data, but this time spread across multiple files, breaking the data up into smaller pieces. The dramatic improvement in performance should be evident there. :)
That's an interesting suggestion. Have you actually done this?
Yep. And so do other open-source projects like MongoDB. :)
Can you post some numbers? I would try it myself, but I don't have a suitable NUMA test system right now. Presumably the idea is that the files are distributed randomly between the different processors' memory controllers.
Not just that -- what you're doing is giving the VMM a chance to work with an appropriate set of smaller virtual page-table chunks, instead of the single 1GB allocation you'd otherwise hand it in one go. :)
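Roughly this layout, sketched below (file names, chunk count, and chunk size are placeholders; the files are assumed to exist and to be at least chunk_size bytes; error handling omitted):

    #include <sys/mman.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <cstdio>
    #include <vector>

    int main() {
        const size_t chunk_size = 200UL << 20;  // ~200 MB per chunk
        std::vector<char*> chunks;
        for (int i = 0; i < 5; ++i) {           // 5 x 200 MB, ~1 GB total
            char name[32];
            std::snprintf(name, sizeof name, "data.%d", i);
            int fd = open(name, O_RDWR);        // a separate fd per chunk
            chunks.push_back(static_cast<char*>(
                mmap(nullptr, chunk_size, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0)));
            close(fd);                          // the mapping outlives the fd
        }
        // ... threads can now work on different chunks independently; each
        // mapping can be paged and unmapped without touching the others ...
        for (char* c : chunks) munmap(c, chunk_size);
        return 0;
    }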
Would you get a similar benefit in the case of the monolithic 1 GB file by mmapping it in e.g. 4 or 8 chunks? In any case, I'm not convinced that this is relevant to the memory allocation issue.
From the OS's view it's an optimization. From a developer/user perspective, in some cases it might be good, in others it might be bad.

Assuming an SSD storage device, where random access is pretty much uniform and the bus speed and the controller's throughput are the limits on how fast data moves from "disk" to main memory, what you should think about is really the virtual page table that's reserved for the 1GB file. If you mmap it in 4 or 8 chunks, the kernel will see that it's the same fd and just give you different addresses that are still contiguous -- it's going to do the math and conclude "well, you're using the same fd, so I'll treat these mmaps the same way I would normally treat a single mmap".

However, when you pass in different fds, the kernel can avoid the synchronization it would do if you were using a single fd. Your contiguous virtual memory chunks then get pretty much unbridled, "equal opportunity" access to more memory with very little synchronization. That's why systems like MongoDB typically use a chunk size of 200MB or thereabouts: if you want to read data in a given chunk, you get almost unsynchronized access to that 200MB chunk, as compared to mapping a whole XXGB file. And ~200MB is not just a magic number: at least on Linux there is a way of doing a "slab allocation" with the kernel's cooperation, where you can get large chunks of address space for your process without giving other processes too much of a hassle.

This is an interesting discussion and it's largely very low-level and system-specific, but the wisdom in the whole discussion really is: prudence with memory use pays dividends, especially if you work with the VMM instead of against it.

There's a good write-up along these same lines at ACM Queue, written by Poul-Henning Kamp, espousing the way we should think about effectively using the memory pages the OS gives to applications. It's an interesting read, which you can find here: http://queue.acm.org/detail.cfm?id=1814327

HTH :)

--
Dean Michael Berris
about.me/deanberris