
Hi Luke,

On Sat, Jun 13, 2009 at 3:51 AM, Simonson, Lucanus J <lucanus.j.simonson@intel.com> wrote:
Christian Schladetsch wrote:
Note that your test case is ~8% faster when implemented using monotonic::vector over using std::vector. This is because the monotonic vector is using the stack for storage, where the loop controls and other fields are also.
How do you know? Unless you went in and measured with vtune how many cache misses you get with each you are just making things up and stating them as fact. I highly doubt the difference is due to improved spatial locality from allocation being on the stack.
I agree that I was brash in my statement, and that I have been wrong before. I hypothesised that it would be the case. I argued my case in theory, then someone else wrote a benchmark. I tested that benchmark using my system and using the std:: equivalent. The measurements I took showed my system was faster, so I took that as support for my theory. If you could find the time to look at the test in question, I would be delighted to hear alternative theories.
Frank Mori Hess wrote:
Are you setting _SCL_SECURE to zero on msvc? It's a common cause of performance drops with std containers on msvc.
How many times has someone come to boost with an "optimization" that isn't one, because they don't know about _SCL_SECURE? This is how the stl gets its reputation for being slow, which so damages people's ability to write sensible code. Forcing everyone who is ignorant of what the compiler is doing to write containers with bugs in them, because they don't like the fact that stl is made slow by default in MSVC, is hardly improving security.
I tested it on GCC and VS. I found that yes, my initial report of a 100x improvement was still demonstrable on VS, but it was much more modest, or non-existent, on GCC. So I dug a little deeper, and ran and re-ran the tests. Same result; VS was just much slower than GCC for that test. At first I put it down to a difference in the compilers. After all, I thought, a 100x improvement on VS, even if not reproducible on GCC, was still worthwhile to everyone using VS for realtime code. Including, of course, myself. However, it nagged at me, and later more elaborate tests (which are now in the sandbox with a revised implementation at http://svn.boost.org/svn/boost/sandbox/monotonic) showed that MSVC was getting pathologically poor results. Something was wrong. To cut a short story long, I had once made a change to the project settings to use /Zi, which includes debug symbols in the /O2 optimised build. That setting stayed, and it also just so happened to give results showing the std::containers as far slower than the monotonic::containers on MSVC. A newbie mistake to be sure, compounded by 'wishful thinking', which in this case was compounded by the fact that that specific fault gave just the results I had hoped for.

My current benchmarks for VS2008 /O2 (and *not* /Zi) are:

Incrementing ord = 3.16 ps per iteration
Incrementing ord = 3.18 ps per iteration
STD: Incrementing ord = 3.43 ps per iteration
STD: Incrementing ord = 4.04 ps per iteration
monotonic: 0.001    std: 0.002
monotonic: 0.011    std: 0.019    <<--- big drop from ~1.00!!
test_map_list: mono: 19.628
test_map_list: std: 46.226

And for GCC 4.3.3 -O4:

Incrementing ord = 3.1 ps per iteration
Incrementing ord = 3.1 ps per iteration
STD: Incrementing ord = 2.8 ps per iteration
STD: Incrementing ord = 2.9 ps per iteration
monotonic: 0    std: 0
monotonic: 0.02    std: 0.01    <<--- std:: is actually faster on GCC; deviation is high for this test
test_map_list: mono: 14.5
test_map_list: std: 21.76

So the results are mixed, except for the test_map_list test, which is 2.3x faster using MSVC and 1.5x faster using GCC. That test is the first more-real-world test provided, and I think it is starting to get outside the 'toy benchmark' area. The implementation is at https://svn.boost.org/svn/boost/sandbox/monotonic/libs/monotonic/test/main.c... . This test creates and uses a map<int, list<int> > a few times. It also more accurately fits the original motivations outlined at https://svn.boost.org/svn/boost/sandbox/monotonic/libs/monotonic/doc/index.h... . Is a 50%-200% improvement worth investigating further? Perhaps, perhaps not. It is certainly not 'compelling'.

Storing all the nodes of a tree in a buffer instead of scattered in the heap
is a good idea. Storing all the nodes of a tree in a buffer of a fixed size that causes a crash when you exceed it is a bad idea.
There are many cases in high-performance, realtime code where you can, or should, know and set high-water marks for memory usage, either per-frame or per-configuration or both. In any case, the same argument can be made against exhausting any memory pool, including the heap itself. To claim that the proposed system is faulty because it can exhaust an explicitly-sized allocation buffer seems a little mischievous.
We get massive performance improvements by using a (non-standard) memory buffer allocator in internal code that implements a variant of R*Tree. We don't crash if we keep adding elements to the tree. We find that in toy benchmarks like Christian shows, the performance advantage of storing the nodes of a tree in a contiguous buffer is far smaller than in a real application, because even the system allocator will cluster them together in memory when memory is completely free, and because everything in a toy benchmark fits in cache anyway.
Perhaps. I didn't think "gee, how can I make some wacky boost proposal; I know, I'll come up with a strange allocator and try to befuddle things with poorly executed benchmarks!" Perhaps that's how it has progressed thus far, however. Even so, the original idea and implementation came from performance analysis of real-world game-system code that I write for 360 and PS3. I saw that, especially on the PS3, cache was really important, and I didn't want to be jumping around all over the heap. I wanted to use standard containers (not custom ones), which used either the stack or the heap for memory. I especially liked the idea of using the stack, as it was often the case that these temporary containers only existed for the lifetime of a stack-frame. I also knew that I didn't want memory to be "allocated and freed"; I knew how much memory would be required from a static analysis of the code, so I did not need a "memory allocator" in the traditional sense.
What Christian is trying to do is in general good, but this idea that allocating the buffer on the stack is somehow a great idea seems to be getting in the way of the portion of his overall proposal that is actually good. He also hasn't yet addressed the design flaw which makes his data structures unsafe to use.
If you are referring to the memory alignment issues, the current implementation finally does a very basic alignment, after some wrong-headedness from me. As noted by Thorsten, it should be changed again to use boost::align_memory. Another important part of my motivation, which I haven't repeatedly or explicitly mentioned because it is a "soft" argument and hard to measure, is heap fragmentation. This is a huge issue for any real-time system that has to stay up and running indefinitely, and was also a major motivation for the monotonic proposal. Being able to 'wipe the slate clean' after each frame or configuration state is vitally important for continuous realtime systems. The 'real-world-like' test (test_map_list in the output pasted above), which currently shows 50%-250% performance gains, demonstrates this. On top of the performance gains, the test also demonstrates that the monotonic system does not fragment the heap over time. I haven't been pushing this aspect of my proposal much, because I know it's soft and hard to measure, and because it could be argued to be too 'niche' a concern to be much of an issue. So I've pushed for results, perhaps too aggressively, in other areas like performance and ease of use.
Performance at the expense of safety is easy to achieve, just program in C.
Instead of reinventing the wheel, perhaps we can just find some good allocators people have written and compare them? Since there is no std::R*Tree and no way Christian would implement monotonic::R*Tree
I wouldn't need to implement monotonic::R*Tree, which would just be said hypothetical implementation using a monotonic::allocator.
it would be better to provide a standard compliant allocator people can use on their own containers that are modeled after the standard than re-implement the stl containers with a non-standard compliant allocator.
I haven't re-implemented the std::containers. I have, perhaps in poor judgement, made a proposal that includes a wrapping of them that facilitates using them with a monotonic allocator. The proposal can be used either way:

typedef boost::monotonic::map<int, boost::monotonic::list<int> > Map;

or:

typedef std::map<int, std::list<int, boost::monotonic::allocator<int> >, std::less<int>, boost::monotonic::allocator<int> > Map;

So far, I have avoided a discussion about the merits, and lack thereof, of the design and practical use of std::allocators. monotonic::allocator stores a pointer to storage. If this is an affront to the standard, why can, and indeed must, allocators be compared for equality? Put another way, if allocators really cannot have any state at all, not even a pointer, why must they be compared? If they really cannot have state, then a work-around is to have a static global that points to the storage to use. Hack? Yes, but so are std::allocators.
Luke
Thanks for your comments Luke, I really appreciate your thoughts on the matter. At the end of the day, we all want the same thing: standards-compliant, cross-platform, high-performance programs. If monotonic never sees the light of day, but at least starts a discussion about the issues it tries to address, and something else comes of it completely, then I would still consider that a complete win. I also need to look at using auto_buffer<T> for storage, boost::align_memory, and research R*Trees, and a lot of other things besides, so for me personally it has been very beneficial. So I say thanks guys for all the comments, and I apologise for sometimes being obtuse.

Regards, Christian.