Re: [Boost-users] shared memory overhead

Modern systems use on-demand allocation: you can allocate, say, a 32 MB SHM chunk, but the actual resource usage (RAM) will correspond to what you actually use. For example:
0    1M                        32M
[****|.........................]
  |            |
  |            +- unused part
  |
  +- used part of the SHM segment

As long as your program does not touch (neither reads nor writes) the unused part, the actual physical memory usage will be 1M plus a small amount for page tables (worst case: 4kB of page tables per 4MB of virtual address space). This is at least how SYSV SHM works on Solaris 10 (look up DISM - dynamic intimate shared memory); I would expect it to work the same way on recent Linux kernels too. I'm not sufficiently acquainted with the NT kernel to be able to comment on it.
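For reference, a minimal sketch of that behaviour (assuming an ordinary SYSV SHM setup on Solaris or Linux; the segment size, the 1 MB touched region, and the pause on getchar() are purely illustrative):

#include <sys/ipc.h>
#include <sys/shm.h>
#include <cstddef>
#include <cstdio>
#include <cstring>

int main()
{
    const std::size_t seg_size  = 32u * 1024 * 1024;  // reserved: 32 MB
    const std::size_t used_size =  1u * 1024 * 1024;  // actually touched: 1 MB

    int id = shmget(IPC_PRIVATE, seg_size, IPC_CREAT | 0600);
    if (id < 0) return 1;

    char* p = static_cast<char*>(shmat(id, NULL, 0));
    if (p == reinterpret_cast<char*>(-1)) return 1;

    // Touch only the first 1 MB; the resident set should grow by roughly
    // 1 MB, not by the full 32 MB of the segment.
    std::memset(p, 0xAB, used_size);

    std::getchar();  // pause here and inspect memory usage externally

    shmdt(p);
    shmctl(id, IPC_RMID, NULL);
    return 0;
}

While the program waits, tools such as pmap or top should show the resident set growing only by the touched 1 MB, although the exact accounting depends on the kernel's overcommit and page-size settings.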
Interesting. I hadn't thought about that. I tried a test program (running on Windows XP) and had a number of separate processes each allocate a new managed_windows_shared_memory object (with a different name for each process) with size 2^29 bytes (= 512 MB). I'm not exactly sure what resources it allocates; using TaskInfo to view resource usage, each process's "virtual KB" usage goes up by 512 MB, but its "working set KB" usage doesn't increase until I actually allocate memory within the shared memory segment.

Sounds good, but the 6th one of these failed, and I got a warning saying my system was low on virtual memory. So it sounds like there is a 4 GB total system limit for WinXP even for just reserving virtual address space -- which seems silly, since each process should have its own address space, and therefore, as long as I don't actually allocate the memory and each process's reserved address space doesn't exceed 2^32 (or 2^31 or whatever the per-process limit is), I should be able to reserve an unlimited amount of total address space. No can do. :(

So strategy #1 -- being profligate in choosing the shared memory segment size -- fails on WinXP; there's a significant resource cost even if you don't actually allocate any memory. Drat.
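For completeness, a minimal sketch of the test program described above, assuming Boost.Interprocess is available; the segment name prefix, the command-line suffix, and the pause on getchar() are illustrative details, not part of the original program:

#include <boost/interprocess/managed_windows_shared_memory.hpp>
#include <boost/interprocess/exceptions.hpp>
#include <cstdio>
#include <iostream>
#include <string>

namespace bip = boost::interprocess;

int main(int argc, char* argv[])
{
    // Each process is launched with a unique suffix so the segment names differ.
    const std::string name = std::string("shm_probe_") + (argc > 1 ? argv[1] : "0");

    try {
        // Ask for a 2^29-byte (512 MB) managed segment. On Windows XP this was
        // observed to charge the full size against the system commit limit
        // (page file) up front, before anything is allocated inside it.
        bip::managed_windows_shared_memory segment(bip::create_only,
                                                   name.c_str(), 1u << 29);
        std::cout << "created " << name
                  << ", free bytes: " << segment.get_free_memory() << "\n";
        std::getchar();  // keep the process (and hence the segment) alive
    } catch (const bip::interprocess_exception& e) {
        std::cout << "creation failed: " << e.what() << "\n";
        return 1;
    }
    return 0;
}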

On Fri, May 30, 2008 at 09:56:45AM -0400, Jason Sachs wrote:
usage, each process's "virtual KB" usage goes up by 512MB, but its "working set KB" usage doesn't increase until I actually allocate memory within the shared memory segment.
Excellent :-) Working set is the actual amount of RAM used, while virtual memory is just the size of the virtual address space, which might not have yet entered the working set, or might not have been "committed" at all. (I believe that "commit" is NT's technical term for first-time faulting in a page and thus also reserving physical RAM.) Is there a separate column for "committed memory"? Virtual is, well, just reserved; committed is actually allocated; working set is what is currently in RAM (usually less than committed). Again, I'm not an NT expert -- please cross-check the above paragraph(s) with other sources.
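The reserve/commit/working-set distinction can be seen directly with the Win32 VirtualAlloc API; a small sketch (sizes illustrative, error handling minimal) that walks through the three stages:

#include <windows.h>
#include <cstdio>
#include <cstring>

int main()
{
    const SIZE_T total = 512u * 1024 * 1024;  // 512 MB of address space
    const SIZE_T used  =   1u * 1024 * 1024;  // 1 MB actually committed

    // Reserve address space only: "virtual size" grows, but nothing is
    // charged against the commit limit (page file) yet.
    void* base = VirtualAlloc(NULL, total, MEM_RESERVE, PAGE_NOACCESS);
    if (!base) { std::printf("reserve failed\n"); return 1; }

    // Commit the first 1 MB: this is charged against the commit limit, but
    // the pages enter the working set only once they are actually touched.
    void* page = VirtualAlloc(base, used, MEM_COMMIT, PAGE_READWRITE);
    if (!page) { std::printf("commit failed\n"); return 1; }

    // Touching the committed pages finally pulls them into the working set.
    std::memset(page, 0, used);

    std::getchar();  // inspect the three counters in Task Manager / TaskInfo

    VirtualFree(base, 0, MEM_RELEASE);
    return 0;
}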
Sounds good, but the 6th one of these failed and I got a warning saying my system was low on virtual memory. So it sounds like there is a 4GB total
I'd rather say that you're low on swap. Each SHM segment needs a corresponding amount of swap space which can be used as backing store, should you decide to really use all of the reserved memory. I.e., when the total working-set size of all programs exceeds the total amount of physical memory (minus kernel memory), some of the pages need to be swapped out to backing store -- in this case, swap.

Also note that the swap space is also just _reserved_ -- the kernel needs to ensure that it's there before it hands out the SHM segment, but it will not be used unless you become short on physical memory. I.e., a mere swap space _reservation_ will not slow down your system or program.

Try increasing the amount of swap space (so that it's [64MB * # of programs] larger[*] than [SHM segment size * # of programs]), repeat the experiment and see what happens. 6 programs x 512MB, so you should be safe at 3GB + amount of physical RAM + an extra ~1GB for everything else on the system.

[*] Rule of thumb. Every process needs additional VM for stack, data, code, etc.
which seems silly since each process should have its own address space and
it does.
the per-process limit is), I should be able to reserve an unlimited amount of total address space. No can do. :(
what do you mean by "total address space"? total address space == RAM + swap (and that is, I guess, what NT calls "virtual memory"), so it is not unlimited. it is very reasonable that the kernel refuses to overcommit memory (i.e. does not allow you to reserve more than the "total address space"); simulation of truly unlimited memory quickly leads to nasty situations (read about linux's out-of-memory killer).
So strategy #1 of being profligate in choosing shared memory segment size fails on WinXP; there's a significant resource cost even if you don't actually allocate any memory. Drat.
Well, the only resource cost that I can see is disk space reserved for swap. Given today's disks, I don't see that being a problem if it buys you a simpler programming model. (And to make it clear, just in case: this is my comment on your particular application; I do *not* recommend this approach as a general programming practice!)

I haven't followed this whole thread, but I seem to recall that HDF5 supports MPI with Parallel HDF5. http://www.hdfgroup.org/HDF5/PHDF5/ Or does that not solve your requirements?
Alas, Parallel HDF5 != concurrent file access. As I understand it, parallel HDF5 = cooperating threads within a process writing in parallel, and I need one process to write & others to monitor/display the data.
Could you maybe use a raw memory-mapped file instead, and convert it to HDF5 off-line?
Well, technically yes, but for robustness reasons I want to decouple the HDF5 logging from the shared memory logging. I'm very happy with the file format's storage efficiency and robustness, and I have not had to worry about file corruption (though oddly enough, the "official" HDF5 editor from the HDF5 maintainers has caused corruption in a few logs when I added some attributes after the fact), so I would like to maintain independent paths: the HDF5 file as a (possibly) permanent record, and my shared memory structure, which could possibly become corrupt if I have one of those impossible-to-reproduce bugs -- but I don't care, since I have the log file.

I'm also dealing with a very wide range of storage situations; most are going to be consecutive packets of data that are written to the file and left there, but in some cases I may actually delete portions of previously-written data that have been deemed discardable, in order to make room for a long test run... more complicated than a vector that grows with time, or a circular buffer. I've defined structures within the HDF5 file which handle this fine; in the shared memory I was going to do essentially the same thing and have a boost::interprocess::list<> or map<> of moderately-sized data chunks (64K-256K) that I can keep or discard.

But back to the topic at hand -- let me restate my problem: Suppose you have N processes, where each process i=0,1,...,N-1 is going to need a pool of related memory with a maximum usage of sz[i] bytes. This size sz[i] is not known beforehand but is guaranteed to be less than some maximum M; it has a mean expected value of m, where m is much smaller than M. From a programmer's standpoint, the best way to handle this would be to reserve a single shared memory segment and ask Boost.Interprocess to make the segment size equal to M. If I do this, then my resource usage in the page file (or on disk if I use a memory-mapped file) is N*M, which is much higher than I need. (I figured out the source of this: windows_shared_memory pre-commits space in the page file equal to the requested size.)

So what's a reasonable way to architect shared memory use to support this kind of demand? I guess maybe I could use a vector of shared memory segments, starting with something like 256KB and adding additional segments as needed. It just seems like a pain to have to maintain separate memory segments and have to remember which items live where.

Just for numbers, I may have an occasional log going on that needs to be in the 512MB range (though most of the time it will be in the 50-500K range, occasionally several megabytes), and I can have 4-6 of these going on at once (though usually just one or two). On my own computer I have increased my max swap file size from 3GB to 7GB (so the hard limit is somewhat adjustable), though it didn't take effect until I restarted my PC. I'm going to be using my programs on several computers, and it seems silly to have to go to this extent.
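A rough sketch of the "vector of shared memory segments" idea mentioned above, assuming managed_windows_shared_memory and a fixed chunk size; the class name, naming scheme, and grow policy are illustrative, and it deliberately leaves open exactly the bookkeeping problem raised here (remembering which segment a given item lives in, deallocation, and readers attaching to each segment by name):

#include <boost/interprocess/managed_windows_shared_memory.hpp>
#include <boost/shared_ptr.hpp>
#include <cstddef>
#include <new>      // std::nothrow
#include <sstream>
#include <string>
#include <vector>

namespace bip = boost::interprocess;

// Grows by whole segments: try the newest segment first, and if it cannot
// satisfy the request, create another segment and allocate from that.
class growable_shm_pool
{
public:
    growable_shm_pool(const std::string& base_name, std::size_t chunk_size)
        : base_name_(base_name), chunk_size_(chunk_size) {}

    void* allocate(std::size_t nbytes)
    {
        if (!segments_.empty()) {
            void* p = segments_.back()->allocate(nbytes, std::nothrow);
            if (p) return p;
        }
        add_segment();
        return segments_.back()->allocate(nbytes);  // throws bad_alloc if even
                                                    // a fresh segment is too small
    }

private:
    typedef boost::shared_ptr<bip::managed_windows_shared_memory> segment_ptr;

    void add_segment()
    {
        std::ostringstream name;
        name << base_name_ << "_" << segments_.size();
        segments_.push_back(segment_ptr(new bip::managed_windows_shared_memory(
            bip::create_only, name.str().c_str(), chunk_size_)));
    }

    std::string base_name_;
    std::size_t chunk_size_;
    std::vector<segment_ptr> segments_;
};

A writer might construct growable_shm_pool("mylog", 256 * 1024) and call allocate() as data arrives; page-file space is then charged one 256 KB segment at a time instead of the full M up front.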

On Fri, May 30, 2008 at 12:25:25PM -0400, Jason Sachs wrote:
Could you maybe use a raw memory-mapped file instead, and convert it to HDF5 off-line?
well, technically yes, but for robustness reasons I want to decouple
ok, i just wanted to know more about your problem.
Boost::interprocess to make the segment size equal to M. If I do this then my resource usage in the page file (or on disk if I use a memory-mapped file) is N*M which is much higher than I need. (I
Which is much higher than you need on the average. I fail to see why having a 10GB, or even a 20GB, swap-file is a problem for you. Too much of a hassle to configure it on all workstations?
of these going on at once (though usually just one or two). On my own computer I have increased my max swap file size from 3GB to 7GB (so the hard limit is somewhat adjustable), though it didn't take effect until I restarted my PC. I'm going to be using my programs on several computers + it seems silly to have to go to this extent.
Ok, and you'll run your job on a machine with e.g. 1GB of swap[*], and this particular instance will need 4GB of swap. What will happen when the allocation fails? Note that growing the SHM segment in small chunks will not help you with insufficient virtual memory, so you might as well allocate M*N at once and exit immediately if the memory is not available.

[*] I'm using "swap" somewhat imprecisely to refer to total virtual memory (RAM + swap).

Next-best solution: use binary search to find the maximum size you can allocate and use that instead of M*N. Neither way is particularly friendly towards other processes on the machine (I assumed that you were running the jobs on dedicated machines), but it is the least painful.

Which is less expensive: your time spent developing multi-chunk SHM management, or just allocating a big chunk and reconfiguring all computers *once*? (I'm sorry, I'm very pragmatic, and I don't seem to have enough info to really understand why you're making such a fuss over the swap size issue. I'm afraid I can't offer you any further suggestions, since I consider this a non-problem unless you have further constraints.)
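A sketch of that binary-search fallback, assuming probe segments can be created under a temporary name and destroyed again immediately; the step size and the probe-and-teardown approach are assumptions, not an endorsed recipe:

#include <boost/interprocess/managed_windows_shared_memory.hpp>
#include <boost/interprocess/exceptions.hpp>
#include <cstddef>

namespace bip = boost::interprocess;

// Binary-search the largest segment size (to within `step` bytes) that can
// currently be created. Each probe segment is destroyed again immediately;
// on Windows the underlying mapping goes away when the object is destroyed,
// so the name can be reused on the next probe.
std::size_t probe_max_segment_size(const char* name,
                                   std::size_t lo,            // assumed to succeed
                                   std::size_t hi,            // assumed to fail
                                   std::size_t step = 1u << 20)
{
    std::size_t best = lo;
    while (hi - lo > step) {
        const std::size_t mid = lo + (hi - lo) / 2;
        try {
            bip::managed_windows_shared_memory probe(bip::create_only, name, mid);
            best = mid;   // success: try something bigger next
            lo = mid;
        } catch (const bip::interprocess_exception&) {
            hi = mid;     // failure: try something smaller next
        }
    }
    return best;
}

The real segment would then be created at (or somewhat below) the returned size instead of M*N, leaving headroom for the rest of the system.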

I havn't followed this whole thread, but I seem to recall that HDF5 supports MPI with Parallel HDF5. http://www.hdfgroup.org/HDF5/PHDF5/ Or does that not solve your requirements?
Alas, Parallel HDF5 != concurrent file access. As I understand it, parallel HDF5 = cooperating threads within a process writing in parallel, and I need one process to write & others to monitor/display the data.
Coincidentally, on the HDF5 mailing list today there were a few possibly related comments (one user indicates that they do simultaneous writes, which may or may not be similar to what is needed here):

================

Hello; i've got an implementation which uses HL API and i run multiple writers and possibly one reader. The writers go to the same os file but different hdf files. in use case scenario, the reader and writer are operational on same hdf asset at the same time. this reader is also written in a manner that if it reaches EOF, then it'll wait sometime and then proceed reading. all this is for win32/vc++... not sure if the same applies to *nix. and it works fine. the only thing i needed to do was to enable multi-threading building of HDF5 and HL. i think there is a link on how to do that... i believe one need only define the symbol "H5_HAVE_THREADSAFE" and uncomment some commented out lines in H5pubconf.h. not sure that answers your questions... and... hope it helps.

regards,
Sheshadri

Jason Sachs wrote:
I was wondering where I could find some more technical details about concurrent reading/writing.
The FAQ discusses it briefly (http://www.hdfgroup.org/hdf5-quest.html#grdwt):
<excerpt> It is possible for multiple processes to read an HDF5 file when it is being written to, and still read correct data. (The following steps should be followed, EVEN IF the dataset that is being written to is different than the datasets that are read.)
Here's what needs to be done:
* Call H5Fflush() from the writing process.
* The writing process _must_ wait until either a copy of the file is made for the reading process, or the reading process is done accessing the file (so that more data isn't written to the file, giving the reader an inconsistent view of the file's state).
* The reading process _must_ open the file (it cannot have the file open before the writing process flushes its information, or it runs the risk of having its data cached in memory being incorrect with respect to the state of the file) and read whatever information it wants.
* The reading process must close the file.
* The writing process may now proceed to write more data to the file.
There must also be some mechanism for the writing process to signal the reading process that the file is ready for reading and some way for the reading process to signal the writing process that the file may be written to again. </excerpt>
Could someone elaborate in a more technical manner? e.g. SWMR (single-writer multiple-reader) can occur if the following is true (not sure if I have this correct; I use "process" rather than "threads" here & am not sure if HDF5 in-memory caches have thread affinity):
1. At all times the file is in one of the following states: (a) unmodified (b) modified (written to, but not flushed)
2. In the unmodified state, zero or more processes may have the file open. No process may write to the data.
3. In the modified state, exactly one process may have the file open. This is the process that can write to it.
4. A successful transition from the unmodified state -> modified state takes place when exactly one process has the file open and begins writing to it.
5. A successful transition from the modified state -> unmodified state takes place when the process that has written to the file completes a successful call to H5Fflush().
The facilities to ensure that only one process has the file open for (4) above are not provided by the HDF5 library and must be provided by OS-specific facilities e.g. mutexes/semaphores/messaging/etc.
==================
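The steps in the quoted FAQ excerpt map onto roughly the following sketch; the HDF5 calls (H5Fflush/H5Fopen/H5Fclose) are the real C API, while the four signal/wait stubs are hypothetical placeholders for the OS-specific mechanism (named event, semaphore, message, ...) that the excerpt says the application must supply itself:

#include <hdf5.h>

// Placeholder synchronization hooks; these empty stubs only mark where the
// application-provided mechanism would go.
static void signal_file_is_flushed()     { /* e.g. set a named event */ }
static void wait_until_reader_done()     { /* e.g. wait on a second event */ }
static void wait_until_file_is_flushed() { /* reader-side counterpart */ }
static void signal_reader_done()         { /* reader-side counterpart */ }

// Writer side: flush everything, hand the file over, and do not write again
// until the reader has closed it (steps 1, 2 and 5 of the excerpt).
void writer_checkpoint(hid_t file)
{
    H5Fflush(file, H5F_SCOPE_GLOBAL);
    signal_file_is_flushed();
    wait_until_reader_done();
}

// Reader side: open only after the flush, read, close, then signal back
// (steps 3 and 4 of the excerpt).
void reader_snapshot(const char* path)
{
    wait_until_file_is_flushed();
    hid_t file = H5Fopen(path, H5F_ACC_RDONLY, H5P_DEFAULT);
    if (file >= 0) {
        /* ... read whatever datasets/attributes are of interest ... */
        H5Fclose(file);
    }
    signal_reader_done();
}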
participants (3)
- Jason Sachs
- Ray Burkholder
- Zeljko Vrba