
Hi Luke,

On Sat, Jun 13, 2009 at 3:51 AM, Simonson, Lucanus J <lucanus.j.simonson@intel.com> wrote:
Christian Schladetsch wrote:
Note that your test case is ~8% faster when implemented using monotonic::vector over using std::vector. This is because the monotonic vector is using the stack for storage, where the loop controls and other fields are also.
How do you know? Unless you went in and measured with vtune how many cache misses you get with each you are just making things up and stating them as fact. I highly doubt the difference is due to improved spatial locality from allocation being on the stack.
I agree that I was brash in my statement, and that I have been wrong before. I hypothesised that it would be the case. I argued my case in theory, then someone else wrote a benchmark. I tested that benchmark using my system and using the std:: equivalent. The measurements I took showed my system was faster, so I took that as support for my theory. If you could find the time to look at the test in question, I would be delighted to hear alternative theories.
Frank Mori Hess wrote:
Are you setting _SCL_SECURE to zero on msvc? It's a common cause of performance drops with std containers on msvc.
How many times has someone come to boost with an "optimization" that isn't one, because they don't know about _SCL_SECURE? This is how the stl gets its reputation for being slow, which so damages people's ability to write sensible code. Forcing everyone who is ignorant of what the compiler is doing to write containers with bugs in them, because they don't like the fact that stl is made slow by default in MSVC, is hardly improving security.
I tested it on GCC and VS. I found that yes, my initial report of a 100x improvement was still demonstrable on VS, but it was much more modest, or non-existent, on GCC. So I dug a little deeper, and ran and re-ran the tests. Same result; VS was just much slower than GCC for that test. At first I put it down to a difference in the compilers. After all, I thought, a 100x improvement on VS, even if not reproducible on GCC, was still worthwhile to everyone using VS for realtime code. Including, of course, myself. However, it nagged at me, and later more elaborate tests (which are now in the sandbox with a revised implementation at http://svn.boost.org/svn/boost/sandbox/monotonic) showed that MSVC was getting pathologically poor results. Something was wrong. To cut a short story long, I had once made a change to the project settings to use /Zi, which includes debug symbols in the /O2 optimised build. That setting stayed, and it also just so happened to give results showing the std::containers as far slower than the monotonic::containers on MSVC. A newbie mistake to be sure, compounded by 'wishful thinking', which in this case was compounded by the fact that that specific fault gave just the results I had hoped for.

My current benchmarks for VS2008 /O2 (and *not* /Zi) are:

Incrementing ord = 3.16 ps per iteration
Incrementing ord = 3.18 ps per iteration
STD: Incrementing ord = 3.43 ps per iteration
STD: Incrementing ord = 4.04 ps per iteration
monotonic: 0.001    std: 0.002
monotonic: 0.011    std: 0.019    <<--- big drop from ~1.00!!
test_map_list: mono: 19.628
test_map_list: std: 46.226

And for GCC 4.3.3 -O4:

Incrementing ord = 3.1 ps per iteration
Incrementing ord = 3.1 ps per iteration
STD: Incrementing ord = 2.8 ps per iteration
STD: Incrementing ord = 2.9 ps per iteration
monotonic: 0    std: 0
monotonic: 0.02    std: 0.01    <<--- std:: is actually faster on GCC; deviation is high for this test
test_map_list: mono: 14.5
test_map_list: std: 21.76

So the results are mixed, except for the test_map_list test, which is 2.3x faster using MSVC and 1.5x faster using GCC. That test is the first more-real-world test provided, and I think it is starting to get outside the 'toy benchmark' area. The implementation is at https://svn.boost.org/svn/boost/sandbox/monotonic/libs/monotonic/test/main.c... . This test creates and uses a map<int, list<int> > a few times. It also more accurately fits the original motivations outlined at https://svn.boost.org/svn/boost/sandbox/monotonic/libs/monotonic/doc/index.h... . Is a 50%-200% improvement worth investigating further? Perhaps, perhaps not. It is certainly not 'compelling'.

Storing all the nodes of a tree in a buffer instead of scattered in the heap
is a good idea. Storing all the nodes of a tree in a buffer of a fixed size that causes a crash when you exceed it is a bad idea.
There are many cases in high-performance, realtime code where you can, or should, know and set high-water marks for memory usage, either per-frame or per-configuration or both. In any case, the same argument can be made against exhausting any memory pool, including the heap itself. To claim that the proposed system is faulty because it can exhaust an explicitly-sized allocation buffer seems a little mischievous.
We get massive performance improvements by using a (non-standard) memory buffer allocator in internal code that implements a variant of R*Tree. We don't crash if we keep adding elements to the tree. We find that in toy benchmarks like Christian shows, the performance advantage of storing the nodes of a tree in a contiguous buffer is far smaller than in a real application, because even the system allocator will cluster them together in memory when memory is completely free, and because everything in a toy benchmark fits in cache anyway.
Perhaps. I didn't think "gee, how can I make some wacky boost proposal; I know, I'll come up with a strange allocator and try to befuddle things with poorly executed benchmarks!" Perhaps that's how it has progressed thus far, however. Even so, the original idea and implementation came from performance analysis of real-world game-system code that I write for 360 and PS3. I saw that, especially on the PS3, cache was really important, and I didn't want to be jumping around all over the heap. I wanted to use standard containers (not custom ones), which used either the stack or the heap for memory. I especially liked the idea of using the stack, as it was often the case that these temporary containers only existed for the lifetime of a stack-frame. I also knew that I didn't want memory to be "allocated and freed"; I knew how much memory would be required from a static analysis of the code, so I did not need a "memory allocator" in the traditional sense.
What Christian is trying to do is in general good, but this idea that allocating the buffer on the stack is somehow a great idea seems to be getting in the way of the portion of his overall proposal that is actually good. He also hasn't yet addressed the design flaw which makes his data structures unsafe to use.
If you are referring to the memory alignment issues, the current implementation finally does a very basic alignment, after some wrong-headedness from me. As noted by Thorsten, it should be changed again to use boost::align_memory. Another important part of my motivation, which I haven't repeatedly or explicitly mentioned because it is a "soft" argument and hard to measure, is heap fragmentation. This is a huge issue for any real-time system that has to stay up and running indefinitely, and was also a major motivation for the monotonic proposal. Being able to 'wipe the slate clean' after each frame or configuration state is vitally important for continuous realtime systems. The 'real-world-like' test (test_map_list in the output pasted above), which currently shows 50%-250% performance gains, demonstrates this. On top of the performance gains, the test also demonstrates that the monotonic system does not fragment the heap over time. I haven't been pushing this aspect of my proposal much, because I know it's soft and hard to measure, and because it could be argued to be too 'niche' a concern to be much of an issue. So I've pushed for results, perhaps too aggressively, in other areas like performance and ease of use.
Performance at the expense of safety is easy to achieve, just program in C.
Instead of reinventing the wheel, perhaps we can just find some good allocators people have written and compare them? Since there is no std::R*Tree and no way Christian would implement monotonic::R*Tree
I wouldn't need to implement monotonic::R*Tree, which would just be said hypothetical implementation using a monotonic::allocator.
it would be better to provide a standard compliant allocator people can use on their own containers that are modeled after the standard than re-implement the stl containers with a non-standard compliant allocator.
I haven't re-implemented the std::containers. I have, perhaps in poor judgement, made a proposal that includes a wrapping of them that facilitates using them with a monotonic allocator. The proposal can be used either way:

typedef boost::monotonic::map<int, boost::monotonic::list<int> > Map;

or:

typedef std::map<int, std::list<int, boost::monotonic::allocator<int> >, std::less<int>, boost::monotonic::allocator<int> > Map;

So far, I have avoided a discussion about the merits, and lack thereof, of the design and practical use of std::allocators. monotonic::allocator stores a pointer to storage. If this is an affront to the standard, why can, and indeed must, allocators be compared for equality? Put another way, if allocators really cannot have any state at all, not even a pointer, why must they be compared? If they really cannot have state, then a work-around is to have a static global that points to the storage to use. Hack? Yes, but so are std::allocators.
Luke
Thanks for your comments Luke, I really appreciate your thoughts on the matter. At the end of the day, we all want the same thing: standards-compliant, cross-platform, high-performance programs. If monotonic never sees the light of day, but at least starts a discussion about the issues it tries to address, and something else comes of it completely, then I would still consider that a complete win. I also need to look at using auto_buffer<T> for storage, boost::align_memory, and research R*Trees, and a lot of other things besides, so for me personally it has been very beneficial. So I say thanks guys for all the comments, and I apologise for sometimes being obtuse.

Regards, Christian.