
Hi Luke, [...]
Luke> Peak memory will be a good metric. Do you have access to VTune? You seem to struggle to identify the cause of performance loss and are reduced to guesswork.
Peak memory measured by an external tool is no good, as boost::pool and boost::fast_pool both leak memory. I need to be able to sample the memory used at certain times in the application, in a cross-platform way. I'll get to this in due time. I have been focused on getting the benchmark results rather than on attempting a complete analysis of their implications.

The latest results are here: http://tinyurl.com/lj6nab. I still can't explain why monotonic is faster at sorting a 500,000-element pre-reserved vector; I have only reported the result and have not investigated deeply.

I have added mean, standard deviation, min and max factors for each of the small, medium, and large benchmark sets. I print a cumulative total at the end of each set, and a summary of all results at the end. These summaries are:

GCC:
scheme  mean   std-dev    min    max
fast    36.3   173        0.25   1.63e+03
pool    27.8   1.02e+04   0.857  897
std     1.69   0.91       0.333  5
tbb     1.59   0.849      0.333  5

MSVC:
scheme  mean   std-dev    min    max
fast    35.4   132        0.603  1.32e+03
pool    27.1   1.13e+04   0.693  878
std     2.7    1.7        0.628  7
tbb     1.44   0.727      0.291  6.4

The mean is the average speedup factor provided by monotonic allocation over the given scheme. So for MSVC, summarised over all tests, monotonic is on average 1.4X faster than TBB, with a standard deviation of 0.7. At its best TBB was 3.4X faster than monotonic (a min factor of 0.291; 1/0.291 ≈ 3.4), and at its worst it was 6.4X slower.

Note that monotonic was on average 35X faster than boost::fast_pool, but notice too that the standard deviation is very high. At its worst, fast_pool was 1,300X slower than monotonic, and at its best it was 1.6X faster. boost::pool fared little better, with an even worse standard deviation of 10,000(!).

One could argue that the tests are skewed, so I invite you to look at them and suggest any changes or additions. See http://tinyurl.com/l6vmgq for all the tests, and http://tinyurl.com/l89llq for the test harness. It is no surprise that TBB performs best across both platforms, with the smallest standard deviation.

I have VTune, but I have yet to apply it. I have been attempting to gather the raw data from which more analysis can be made.

Luke> There also exist memory profiling tools. In Linux you can poll /proc to get memory usage. There are little Linux utilities out there, called memtime and such, that work this way.
These can't be used for reasons mentioned above.
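For completeness, the kind of /proc polling Luke describes looks roughly like the sketch below. It is a sketch only: it reads the VmRSS and VmPeak fields of /proc/self/status (as documented in proc(5)), so it is Linux-only rather than cross-platform, and any peak it reports is inflated by the pool leaks mentioned above.

// Linux-only sketch: sample process memory by reading /proc/self/status.
// The field names (VmRSS, VmPeak) are documented in proc(5); the parsing
// here is illustrative, not production code.
#include <fstream>
#include <sstream>
#include <string>
#include <iostream>

// returns the value, in kB, of the given field from /proc/self/status, or -1
long sample_proc_status_kb(std::string const &field)
{
    std::ifstream status("/proc/self/status");
    std::string line;
    while (std::getline(status, line))
    {
        if (line.compare(0, field.size(), field) == 0)
        {
            // skip "FieldName:" and parse the leading number
            std::istringstream values(line.substr(field.size() + 1));
            long kb = -1;
            values >> kb;
            return kb;
        }
    }
    return -1;
}

int main()
{
    std::cout << "VmRSS:  " << sample_proc_status_kb("VmRSS") << " kB\n";
    std::cout << "VmPeak: " << sample_proc_status_kb("VmPeak") << " kB\n";
}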
Luke> I'd like to suggest another benchmark. This is an example of something I think would be very bad for your allocator. Build a std::map of size n. Loop for O(n^2) iterations. In each iteration insert one random element, then look up one random element with lower_bound and remove it. Precompute the random numbers and don't include the rand() calls in the time measurement of the benchmark.
I have done so. The code is here: http://tinyurl.com/l6vmgq, in a structure called "test_map_erase"; the results are here: http://tinyurl.com/mp6sub. For your test, monotonic was at least as fast as, if not faster than, the other schemes. Of course, for large N memory will be exhausted. But for N=5000, the summary is:

scheme  mean   std-dev  min    max
fast    0.996  0.0307   0.957  1.04
pool    1.2    0.00107  1.16   1.25
std     1.44   0.025    1.41   1.48
tbb     1.06   0.0228   1.03   1.1
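For reference, the shape of the test is roughly the sketch below. It is simplified: the real "test_map_erase" in the linked code runs once per allocation scheme inside the timing harness, whereas this version uses std::allocator, a small pool of precomputed keys, and no timing.

// Simplified sketch of the suggested benchmark: build a map of size n,
// then for O(n^2) iterations insert one random element and lower_bound/
// erase another. rand() is called only during setup, not in the loop.
#include <cstdlib>
#include <map>
#include <utility>
#include <vector>

int main()
{
    const std::size_t n = 1000;

    // precompute random keys so rand() is not part of the measured work
    std::vector<int> keys(2 * n);
    for (std::size_t i = 0; i < keys.size(); ++i)
        keys[i] = std::rand();

    std::map<int, int> m;
    for (std::size_t i = 0; i < n; ++i)
        m.insert(std::make_pair(keys[i], 0));

    // O(n^2) iterations of insert + lower_bound + erase
    std::size_t k = 0;
    for (std::size_t i = 0; i < n * n; ++i)
    {
        m.insert(std::make_pair(keys[k++ % keys.size()], 0));
        std::map<int, int>::iterator found = m.lower_bound(keys[k++ % keys.size()]);
        if (found != m.end())
            m.erase(found);
    }
    return 0;
}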
Luke> Your allocator will use O(n^2) memory since it never recovers deallocated memory, so if you set n close to the size of the L1 or L2 cache you should see dramatic performance loss. This is my typical use case for a std::map.
This is a common enough idiom, yes. But then again, another common idiom is to first populate maps and then use them for fast lookup, like a dictionary. I do not suggest, nor have I ever suggested, that monotonic is a general-purpose allocator that is a drop-in replacement for std::allocator in all cases. I have repeatedly stated the intended limitations of monotonic allocation, including mentioning that it is not designed for continual erasure from and insertion into containers. But it does what it was designed for very well, which is to produce fast and small data structures with very fast allocation. Your test in this case was specifically designed to be pathological for monotonic allocation. Even so, the figures above show that it still stands up very well against the other schemes. I exhausted memory for N=100,000, but it didn't help that boost::pool leaks memory in this test either.
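To illustrate the intended idiom, here is a minimal sketch using a plain std::map. With monotonic allocation only the allocator changes (the exact spelling depends on the monotonic interface); the point is that all allocation happens during the populate phase and nothing is ever erased, so never reclaiming nodes costs nothing. Under your insert/erase workload, by contrast, the storage would grow by one dead node per iteration.

// Sketch of the populate-then-lookup idiom that monotonic allocation
// targets: build the dictionary once, then only read from it.
#include <iostream>
#include <map>
#include <string>

int main()
{
    std::map<std::string, int> dictionary;

    // populate phase: all node allocation happens here, up front
    dictionary["foo"] = 1;
    dictionary["bar"] = 2;
    dictionary["baz"] = 3;

    // lookup phase: no insertions or erasures, so no allocator traffic;
    // a monotonic-backed map pays nothing for its inability to reclaim
    std::map<std::string, int>::const_iterator found = dictionary.find("bar");
    if (found != dictionary.end())
        std::cout << found->first << " -> " << found->second << std::endl;
    return 0;
}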
Luke> Regards, Luke
Thanks for your suggestions and comments, Luke. I will be adding more benchmarks soon, to test this ability of monotonic as well:

{
    length = 4;

    // storage starts on the stack (in this case, 10k of it),
    // then merges into the heap as needed
    monotonic::storage<10*1024> storage;
    for (size_t n = 0; n < length; ++n)
    {
        // create a new int from storage
        int &n0 = storage.create<int>();

        // create a new string (uses correct alignment)
        string const &s1 = storage.create<string>("foo");
        BOOST_ASSERT(s1 == "foo");

        // allocate 37 bytes with alignment 1
        char *array0 = storage.allocate_bytes(37);
        fill_n(array0, 37, 42);

        // allocate 2537 bytes with 64-byte alignment
        char *array1 = storage.allocate_bytes(2537, 64);
        fill_n(array1, 2537, 123);

        // allocate 1283 bytes with optimal machine alignment
        char *array2 = storage.allocate_bytes<1283>();
        fill_n(array2, 1283, 42);

        array<Unaligned, 42> &array3 = storage.create<array<Unaligned, 42> >();

        // destroy objects. this only calls the destructors;
        // it does not release memory
        storage.destroy(s1);

        cout << "storage.stack, heap used: " << storage.fixed_used()
             << ", " << storage.heap_used() << endl;
    }
    // storage is released. if this was only ever on the stack, no work is done
}

//>>>
storage.stack, heap used: 4882, 0
storage.stack, heap used: 9042, 0
storage.stack, heap used: 9298, 3827
storage.stack, heap used: 9554, 7667

This is harder to test, as no other allocation model that I am currently testing can do this. But I'll try to test against auto_buffer<>, boost::pool and boost::object_pool, and see how it goes.

I am interested in your thoughts on the latest results and/or the methodology I used.

Regards,
Christian.