
Dear All,

I have done some quick benchmarks; the code can be found here:

http://chezphil.org/tmp/bench.tgz

This is a "work queue", i.e. it's a queue of boost::function<void(void)> "work items" with float priorities. It's similar to a real application with fast (burst) fill and gradual drain; the pattern here is 10 pushes and 1 pop, repeated until all the work has been added, and then all of the work is popped in sequence. The priorities are essentially random, and there are 100 000 items in total.

Currently my real application uses a std::multimap. I've compared that with a std::vector using std::push|pop_heap, std::priority_queue using std::vector and std::deque, and the proposed heap::priority_queue and heap::b_heap.

The columns below are median-of-3 runtimes in seconds for:

1. 50 runs on a 32-bit 3 GHz AMD Athlon II Linux system, using g++ 4.4.4 -O4.
2. 10 runs on a 32-bit 1.2 GHz VIA C7-M Linux system, using g++ 4.4.2 -O4.
3. 5 runs on a 1.2 GHz ARMv5 Linux system, using g++ 4.4.2 -O4.

std::multimap                      3.20  3.69  3.37
std::vector + std::push|pop_heap   2.35  3.36  2.71
std::priority_queue<vector>        2.58  3.48  2.84
std::priority_queue<deque>         4.27  5.77  3.72
heap::priority_queue               2.49  3.37  2.74
heap::b_heap                       5.14  7.62  5.82

I was unable to try the proposed d_ary_heap because I couldn't find a working syntax for the arity template parameter - any ideas anyone?

The 3 implementations based on std::vector perform similarly, as expected. I'm a bit surprised that mine is the fastest, but the difference is small. Also as expected, the version using multimap is slower, though maybe not by as much as I had expected.

I was expecting the b_heap to be faster, but it is not; it is substantially slower. What is going on there? Does it all come down to size of data, cache size, etc.? Ideally, an "improved" implementation like this would be able to fall back to the naive implementation in cases where it does not perform any better. Or maybe I'm using it wrongly.

My understanding is that the fibonacci, binomial and other heaps only improve things when other operations, e.g. merging, are needed - is that correct?

For anyone who doesn't want to download the tarball, the push/pop pattern is roughly the following sketch (illustrative only; the item type is a stand-in for the real work item):
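    #include <cstdlib>
    #include <queue>
    #include <utility>

    // Sketch of the benchmark pattern: bursts of 10 pushes and 1 pop until
    // all work has been added, then a full drain in priority order.
    typedef std::pair<float, int> item_t;   // (priority, work id) stand-in

    int main()
    {
        std::priority_queue<item_t> q;
        const int total = 100000;
        int i = 0;
        while (i < total) {
            for (int j = 0; j < 10 && i < total; ++j, ++i)   // burst fill
                q.push(item_t(std::rand() / float(RAND_MAX), i));
            q.pop();                                         // gradual drain
        }
        while (!q.empty())
            q.pop();                 // pop the rest, highest priority first
        return 0;
    }

Regards, Phil.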

std::vector + std::push|pop_heap   2.35  3.36  2.71
std::priority_queue<vector>        2.58  3.48  2.84
Hmm... I know that these aren't significantly different times, but they should be much closer together since, AFAIK, std::priority_queue's default implementation is a vector managed with push_heap/pop_heap. There should be virtually no difference between the two (unless I'm misreading something).
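Conceptually the adaptor is just this (a simplified sketch, not the actual libstdc++ source):

    #include <algorithm>
    #include <vector>

    // Simplified sketch of std::priority_queue in terms of the free heap
    // algorithms; the real adaptor adds little beyond this.
    template <class T, class Container = std::vector<T> >
    struct simple_pq {
        Container c;

        void push(const T& x)
        {
            c.push_back(x);
            std::push_heap(c.begin(), c.end());  // sift the new element up
        }
        void pop()
        {
            std::pop_heap(c.begin(), c.end());   // move the max to the back
            c.pop_back();
        }
        const T& top() const { return c.front(); }
        bool empty() const { return c.empty(); }
    };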
I was unable to try the proposed d_ary_heap because I couldn't find a working syntax for the arity template parameter - any ideas anyone?
I think you're supposed to use d_ary_heap<T, arity<N>>. I filed a defect about this a couple of days ago :)
I was expecting the b_heap to be faster, but it is not; it is substantially slower. What is going on there? Does it all come down to size of data, cache size, etc? Ideally, an "improved" implementation like this would be able to fall back to the naive implementation in cases where it does not function any better. Or maybe I'm using it wrongly.
That's a good question. The data structure is supposed to be cache-friendly. What's sizeof(work item)? It looks like it could be pretty big - maybe large enough to break the cache-friendliness.
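A quick way to check (the work_item layout here is my guess at what bench.tgz uses):

    #include <iostream>
    #include <boost/function.hpp>

    struct work_item {                 // assumed layout; bench.tgz may differ
        float priority;
        boost::function<void()> work;
    };

    int main()
    {
        std::cout << "sizeof(boost::function<void()>): "
                  << sizeof(boost::function<void()>) << '\n'
                  << "sizeof(work_item):               "
                  << sizeof(work_item) << '\n';
    }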
My understanding is that the fibonacci, binomial and other heaps only improve things when other operations e.g. merging are needed - is that correct?
That's also my understanding. Their node-based nature makes them less performant, unless you're merging them. Andrew

hi phil,
http://chezphil.org/tmp/bench.tgz
This is a "work queue", i.e. it's a queue of boost::function<void(void)> "work items" with float priorities. It's similar to a real application with fast (burst) fill and gradual drain; the pattern here is 10 pushes and 1 pop, repeated until all the work has been added, and then all of the work is popped in sequence. The priorities are essentially random, and there are 100 000 items in total.
thanks for your benchmark ... the pattern is different from the ones that i have tested.
I was unable to try the proposed d_ary_heap because I couldn't find a working syntax for the arity template parameter - any ideas anyone?
:) the scope operator uses two colons:

    boost::heap::d_ary_heap< value_t,
                             boost:heap::arity<4>,    // This doesn't work
                             boost::heap::compare<compare_priorities> >

    boost::heap::d_ary_heap< value_t,
                             boost::heap::arity<4>,   // This works
                             boost::heap::compare<compare_priorities> >
I was expecting the b_heap to be faster, but it is not; it is substantially slower. What is going on there? Does it all come down to size of data, cache size, etc? Ideally, an "improved" implementation like this would be able to fall back to the naive implementation in cases where it does not function any better. Or maybe I'm using it wrongly.
well, benchmarking this can be tricky ...

first, your work item is a boost::function<> (iirc 3 pointers) and a float ... so 16 bytes on 32-bit or 28 bytes on 64-bit ... on 32-bit systems you will have a better match with the actual physical cache lines ...

second ... synthetic benchmarks are not necessarily the best way to test cache-aware data structures: when stressing the heap, all of its content will be located in the cpu caches ... so you won't necessarily see the benefit of the cache-awareness, since the data of the non-cache-aware data structure is still in the cpu cache, unless you manually add some work to thrash the caches ...

in your case, you will have a maximum of 90000 items in the queue; on 32-bit platforms this is roughly 1.4 MB ... you won't see a big effect if this all fits into the caches ... not sure if there is a good way to artificially thrash the caches in order to observe the effect in a microbenchmark ...
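one crude idea (an untested sketch; buffer size and stride are arbitrary) would be to stream through a buffer larger than the last-level cache between heap operations:

    #include <cstddef>
    #include <vector>

    // crude cache thrasher: touching one byte per cache line of a buffer
    // larger than the LLC should evict most of the heap's data. tune the
    // size (here 8 MB) and the stride (here 64 bytes) for the machine.
    void thrash_caches()
    {
        static std::vector<char> buf(8 * 1024 * 1024);
        for (std::size_t i = 0; i < buf.size(); i += 64)
            ++buf[i];
    }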
My understanding is that the fibonacci, binomial and other heaps only improve things when other operations e.g. merging are needed - is that correct?
more or less ... unlike the container adaptors, the fibonacci heap is the only implementation which provides O(1) insert and increase ... for instance (a quick sketch, assuming the interface of the proposed library):
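    #include <iostream>
    #include <boost/heap/fibonacci_heap.hpp>

    int main()
    {
        boost::heap::fibonacci_heap<int> a, b;
        a.push(1); a.push(5);   // O(1) amortized
        b.push(3); b.push(7);

        a.merge(b);             // O(1); b is empty afterwards

        while (!a.empty()) {
            std::cout << a.top() << '\n';   // prints 7 5 3 1
            a.pop();
        }
    }

cheers, tim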

Tim Blechmann wrote:
I was unable to try the proposed d_ary_heap because I couldn't find a working syntax for the arity template parameter - any ideas anyone?
:) the scope operator uses two colons:
D'oh!
I was expecting the b_heap to be faster, but it is not; it is substantially slower. What is going on there? Does it all come down to size of data, cache size, etc? Ideally, an "improved" implementation like this would be able to fall back to the naive implementation in cases where it does not function any better. Or maybe I'm using it wrongly.
well, benchmarking this can be tricky ... first, your work item is a boost::function<> (iirc 3 pointers) and a float ... so 16 bytes on 32-bit or 28 bytes on 64-bit ... on 32-bit systems you will have a better match with the actual physical cache lines ... second ... synthetic benchmarks are not necessarily the best way to test cache-aware data structures: when stressing the heap, all of its content will be located in the cpu caches ... so you won't necessarily see the benefit of the cache-awareness, since the data of the non-cache-aware data structure is still in the cpu cache, unless you manually add some work to thrash the caches ... in your case, you will have a maximum of 90000 items in the queue; on 32-bit platforms this is roughly 1.4 MB ... you won't see a big effect if this all fits into the caches ...
The data cache sizes on the test systems are, I think:

1. AMD Athlon II X2: 64 kB L1 + 1 MB L2.
2. VIA C7-M: 32 kB L1 + 128 kB L2.
3. ARM v5 (Marvell Kirkwood): 16 kB L1 + 256 kB L2.

So this data set really shouldn't fit in the cache. Even if it did, it would be great if the b_heap implementation performed no worse than the "naive" version.

Also, I've looked at your own benchmark graphs and it seems that there is a fairly constant performance difference between the different heaps - I don't see b_heaps getting faster than the alternatives as the queue size increases. (BTW, I find it almost impossible to distinguish the colours in those graphs...)

Regards, Phil.

well, benchmarking this can be tricky ... first, your work item is a boost::function<> (iirc 3 pointers) and a float ... so 16 bytes on 32-bit or 28 bytes on 64-bit ... on 32-bit systems you will have a better match with the actual physical cache lines ... second ... synthetic benchmarks are not necessarily the best way to test cache-aware data structures: when stressing the heap, all of its content will be located in the cpu caches ... so you won't necessarily see the benefit of the cache-awareness, since the data of the non-cache-aware data structure is still in the cpu cache, unless you manually add some work to thrash the caches ... in your case, you will have a maximum of 90000 items in the queue; on 32-bit platforms this is roughly 1.4 MB ... you won't see a big effect if this all fits into the caches ...
The data cache sizes on the test systems are, I think:
1. AMD Athlon II X2: 64 kB L1 + 1 MB L2.
2. VIA C7-M: 32 kB L1 + 128 kB L2.
3. ARM v5 (Marvell Kirkwood): 16 kB L1 + 256 kB L2.
on machines 2 and 3, i assume the cache line size is 32/16 bytes
So this data set really shouldn't fit in the cache. Even if it did, it would be great if the b_heap implementation performed no worse than the "naive" version.
it cannot: it uses a different tree structure than binary heaps, resulting in a taller tree. with d-ary heaps, the arity is a tradeoff between tree height and number of comparisons (higher arity, lower tree height, but more comparisons per level). in a similar manner, the b-heap adds a tradeoff between memory locality and tree height ...
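to make that concrete, a rough back-of-the-envelope sketch (my own illustration, numbers approximate):

    #include <cmath>
    #include <iostream>

    // tree height shrinks as ceil(log_d(n)), but each sift-down level
    // needs up to d comparisons (d-1 to find the largest child, plus one
    // against the sifted element).
    int main()
    {
        const double n = 100000;
        for (int d = 2; d <= 8; d *= 2) {
            int height = int(std::ceil(std::log(n) / std::log(d)));
            std::cout << "d = " << d << ": height ~ " << height
                      << ", worst-case comparisons per pop ~ " << d * height
                      << '\n';
        }
    }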
Also I've looked at your own benchmark graphs and it seems that there is a fairly constant performance difference between the different heaps - I don't see b_heaps getting faster than the alternatives as the queue size increases.
maybe because my workstation has 8 MB of cache ...
(BTW, I find it almost impossible to distinguish the colours in those graphs...)
gnuplot defaults ... i generally find it hard to visualize multidimensional data ... if the plots are included in the final documentation, i will try to find a better visualization ... tim

Tim Blechmann wrote:
it would be great if the b_heap implementation performed no worse than the "naive" version.
it cannot: it uses a different tree structure than binary heaps, resulting in a taller tree. with d-ary heaps, the arity is a tradeoff between tree height and number of comparisons (higher arity, lower tree height, but more comparisons per level). in a similar manner, the b-heap adds a tradeoff between memory locality and tree height ...
Also I've looked at your own benchmark graphs and it seems that there is a fairly constant performance difference between the different heaps - I don't see b_heaps getting faster than the alternatives as the queue size increases.
maybe because my workstation has 8mb of cache
Tim, you're proposing this b-heap purely as a performance optimisation compared to the "traditional" heap, but we've not yet seen any cases where it is actually faster. If it can be faster we need to see some examples, and the documentation should qualify when it can be expected to be better. If it's not faster, it should be dropped. Regards, Phil.

Tim, you're proposing this b-heap purely as a performance optimisation compared to the "traditional" heap, but we've not yet seen any cases where it is actually faster. If it can be faster we need to see some examples, and the documentation should qualify when it can be expected to be better. If it's not faster, it should be dropped.
I don't think removal would be the right solution, but it certainly shouldn't be recommended as a best choice. Part of what I like about this library is that it provides a framework for evaluating comparative heap performance. While it doesn't appear to perform as well now, it's possible that future architectures may be more amenable to b-heap implementations. If not, then Boost has an implementation that can be used to empirically demonstrate otherwise. The same argument can probably be made for binomial heaps and fibonacci heaps. Andrew

Tim, you're proposing this b-heap purely as a performance optimisation compared to the "traditional" heap, but we've not yet seen any cases where it is actually faster. If it can be faster we need to see some examples, and the documentation should qualify when it can be expected to be better. If it's not faster, it should be dropped.
I don't think removal would be the right solution, but it certainly shouldn't be recommended as a best choice. Part of what I like about this library is that it provides a framework for evaluating comparative heap performance. While it doesn't appear to perform as well now, it's possible that future architectures may be more amenable to b-heap implementations. If not, then Boost has an implementation that can be used to empirically demonstrate otherwise.
i read somewhere that binary heaps are usually the first and the last answer when choosing a heap data structure ... although they have an amortized complexity of O(log(n)) for most operations.
The same argument can probably be made for binomial heaps and fibonacci heaps.
binomial heaps implement most operations in O(log(n)) worst-case and are efficiently mergeable ... fibonacci heaps are of interest since they support `push' and `merge' in O(1) amortized complexity (at the cost of implementing some operations in O(n) worst-case complexity) ...

so the selection of the best heap data structure really depends on:

* the type (efficiency of std::swap and of comparison, e.g. std::less)
* the size of the problem (for small sizes you do not need to care about the amortized complexity so much)
* whether the program is focussed on throughput or worst-case latency
* the use case: different access patterns have different performance characteristics.

so instead of telling people to `use this', i prefer to give people a choice to select the best data structure for their applications. a sketch of what that looks like (header paths as in the proposed library; the typedef name is hypothetical):
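    #include <boost/heap/binomial_heap.hpp>
    #include <boost/heap/d_ary_heap.hpp>
    #include <boost/heap/fibonacci_heap.hpp>

    // since the containers share push/top/pop, the choice can be confined
    // to a single typedef and re-benchmarked per application:
    // typedef boost::heap::fibonacci_heap<float>                  queue_t;
    // typedef boost::heap::binomial_heap<float>                   queue_t;
    typedef boost::heap::d_ary_heap<float, boost::heap::arity<4> > queue_t;

    int main()
    {
        queue_t q;
        q.push(2.0f);
        q.push(1.0f);
        q.pop();   // call sites are unchanged whichever typedef is active
        return 0;
    }

tim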

it would be great if the b_heap implementation performed no worse than the "naive" version.
it cannot: it uses a different tree structure than binary heaps, resulting in a taller tree. with d-ary heaps, the arity is a tradeoff between tree height and number of comparisons (higher arity, lower tree height, but more comparisons per level). in a similar manner, the b-heap adds a tradeoff between memory locality and tree height ...
Also I've looked at your own benchmark graphs and it seems that there is a fairly constant performance difference between the different heaps - I don't see b_heaps getting faster than the alternatives as the queue size increases.
maybe because my workstation has 8mb of cache
Tim, you're proposing this b-heap purely as a performance optimisation compared to the "traditional" heap, but we've not yet seen any cases where it is actually faster. If it can be faster we need to see some examples, and the documentation should qualify when it can be expected to be better. If it's not faster, it should be dropped.
just for the record:

core i7 920, x86_64, linux-3.0-rc2, hyperthreading/frequency scaling disabled, 12 gb triple-channel ddr3 ram

stressed with: stress -m 1 --vm-bytes 4G --io 16 -c 3 and two boinc processes running at nice 10

configured the problem size to 15000000 and changed work_t from boost::function to int (24 byte to 4 byte)

comparison of b-heap and d_ary_heap (with arity 2) ... running all at nice 1 to increase the number of context switches.

Performance counter stats for './bench_boostbheap':

        532245.601009 task-clock               #   0.865 CPUs utilized
              150,066 context-switches         #   0.000 M/sec
                3,497 CPU-migrations           #   0.000 M/sec
            1,333,350 page-faults              #   0.003 M/sec
    1,421,230,219,856 cycles                   #   2.670 GHz
      928,444,569,746 stalled-cycles-frontend  #  65.33% frontend cycles idle
      643,217,557,503 stalled-cycles-backend   #  45.26% backend cycles idle
    1,517,920,671,665 instructions             #   1.07 insns per cycle
                                               #   0.61 stalled cycles per insn
      328,233,864,211 branches                 # 616.696 M/sec
        2,800,247,538 branch-misses            #   0.85% of all branches

        615.223322417 seconds time elapsed

Performance counter stats for './bench_boostdary':

        575427.533478 task-clock               #   0.873 CPUs utilized
              160,863 context-switches         #   0.000 M/sec
                3,841 CPU-migrations           #   0.000 M/sec
            1,333,346 page-faults              #   0.002 M/sec
    1,536,548,826,172 cycles                   #   2.670 GHz
    1,304,937,894,945 stalled-cycles-frontend  #  84.93% frontend cycles idle
      999,244,398,224 stalled-cycles-backend   #  65.03% backend cycles idle
      747,525,733,877 instructions             #   0.49 insns per cycle
                                               #   1.75 stalled cycles per insn
      145,188,482,482 branches                 # 252.314 M/sec
        1,661,903,337 branch-misses            #   1.14% of all branches

        659.415916004 seconds time elapsed

in a second run, i am comparing the cache misses:

Performance counter stats for './bench_boostbheap':

       11,989,045,556 L1-dcache-load-misses      [33.34%]
           96,664,490 L1-dcache-store-misses     [50.00%]
        8,082,318,794 L1-dcache-prefetch-misses  [33.32%]
        5,457,166,960 LLC-load-misses            [49.99%]
           86,358,587 LLC-store-misses           [16.67%]
           72,903,925 LLC-prefetch-misses        [16.66%]

        602.784184115 seconds time elapsed

Performance counter stats for './bench_boostdary':

       14,139,852,204 L1-dcache-load-misses      [33.33%]
           96,895,924 L1-dcache-store-misses     [50.00%]
          752,673,940 L1-dcache-prefetch-misses  [33.33%]
        6,081,358,271 LLC-load-misses            [49.99%]
           85,684,373 LLC-store-misses           [16.67%]
          106,694,619 LLC-prefetch-misses        [16.67%]

        652.591757245 seconds time elapsed

...
now i tweaked the benchmark:

* using 10 heaps instead of one, but only running 5 times instead of 50 - increases cache thrashing
* removed the work item so that the heaps only contain floats (4 byte) - more objects per cache line
* 4 processes on the idle quad-core machine (2 for default stats, 4 for cache-miss events) - increases cache thrashing

results:

Performance counter stats for './bench_boostbheap':

        651624.952248 task-clock               #   0.780 CPUs utilized
              138,831 context-switches         #   0.000 M/sec
                2,543 CPU-migrations           #   0.000 M/sec
              674,172 page-faults              #   0.001 M/sec
    1,740,597,175,799 cycles                   #   2.671 GHz
    1,292,857,574,357 stalled-cycles-frontend  #  74.28% frontend cycles idle
      831,566,179,360 stalled-cycles-backend   #  47.77% backend cycles idle
    1,441,574,215,357 instructions             #   0.83 insns per cycle
                                               #   0.90 stalled cycles per insn
      328,096,551,218 branches                 # 503.505 M/sec
        1,152,407,287 branch-misses            #   0.35% of all branches

        834.898435350 seconds time elapsed

Performance counter stats for './bench_boostdary':

        759395.182963 task-clock               #   0.815 CPUs utilized
              156,451 context-switches         #   0.000 M/sec
                2,781 CPU-migrations           #   0.000 M/sec
              674,170 page-faults              #   0.001 M/sec
    2,028,502,096,477 cycles                   #   2.671 GHz
    1,806,658,397,323 stalled-cycles-frontend  #  89.06% frontend cycles idle
      988,635,679,553 stalled-cycles-backend   #  48.74% backend cycles idle
      698,231,661,607 instructions             #   0.34 insns per cycle
                                               #   2.59 stalled cycles per insn
      145,877,024,263 branches                 # 192.096 M/sec
          907,579,344 branch-misses            #   0.62% of all branches

        931.525434663 seconds time elapsed

Performance counter stats for './bench_boostbheap':

       23,480,288,668 L1-dcache-load-misses      [33.34%]
          264,061,426 L1-dcache-store-misses     [50.00%]
       14,720,355,800 L1-dcache-prefetch-misses  [33.32%]
        7,914,697,865 LLC-load-misses            [49.99%]
           46,737,807 LLC-store-misses           [16.67%]
           25,875,481 LLC-prefetch-misses        [16.67%]

        832.307436282 seconds time elapsed

Performance counter stats for './bench_boostdary':

       27,020,681,775 L1-dcache-load-misses      [33.33%]
          326,151,939 L1-dcache-store-misses     [50.00%]
        3,795,316,756 L1-dcache-prefetch-misses  [33.33%]
       10,296,069,595 LLC-load-misses            [50.00%]
           45,933,195 LLC-store-misses           [16.67%]
           43,923,748 LLC-prefetch-misses        [16.67%]

        949.094734119 seconds time elapsed

:)

tim

Tim Blechmann wrote:
Tim, you're proposing this b-heap purely as a performance optimisation compared to the "traditional" heap, but we've not yet seen any cases where it is actually faster. If it can be faster we need to see some examples, and the documentation should qualify when it can be expected to be better. If it's not faster, it should be dropped.
just for the record: core i7 920, x86_64, linux-3.0-rc2, hyperthreading/frequency scaling disabled, 12gb triple-channel ddr3 ram
stressed with: stress -m 1 --vm-bytes 4G --io 16 -c 3 and two boinc processes running at nice 10
configured the problem size to 15000000 changed work_t from boost::function to int (24byte to 4byte)
comparison of b-heap and dary_heap (with arity 2) ... running all at nice 1 to increase the number of context switches.
Performance counter stats for './bench_boostbheap': [snip] 615.223322417 seconds time elapsed
Performance counter stats for './bench_boostdary': [snip] 659.415916004 seconds time elapsed
That's great. But I was comparing with your priority_queue<> and the std::push|pop_heap, not your d_ary_heap<>. Could you do the same test with one of those?
Performance counter stats for './bench_boostbheap':
    1,421,230,219,856 cycles
      928,444,569,746 stalled-cycles-frontend
      643,217,557,503 stalled-cycles-backend
    1,517,920,671,665 instructions
Performance counter stats for './bench_boostdary':
    1,536,548,826,172 cycles
    1,304,937,894,945 stalled-cycles-frontend
      999,244,398,224 stalled-cycles-backend
      747,525,733,877 instructions
That's the interesting part: the b-heap executes twice as many instructions as the d-ary heap, so the improvement in cache locality needs to be substantial enough to double the instruction throughput before the b-heap comes out ahead. You've had to pull out all the stops to make that happen. Regards, Phil.

Phil Endecott wrote:
I have done some quick benchmarks; the code can be found here:
http://chezphil.org/tmp/bench.tgz
This is a "work queue", i.e. it's a queue of boost::function<void(void)> "work items" with float priorities.
You have specified -O4 as a compile option, but not -DNDEBUG. For me, some of the heaps from Boost.Heap are actually faster than the other options when I specify -DNDEBUG. Not specifying -DNDEBUG makes for an unfair comparison: the extensive use of assertions is not a bad thing in itself, but they need to be compiled out when measuring performance.
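I.e. something like (file name illustrative):

    g++ -O4 -DNDEBUG -o bench_stdheap bench_stdheap.cpp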
The columns below are median-of-3 runtimes in seconds for:
1. 50 runs on a 32-bit 3 GHz AMD Athlon II Linux system, using g++ 4.4.4 -O4.
2. 10 runs on a 32-bit 1.2 GHz VIA C7-M Linux system, using g++ 4.4.2 -O4.
3. 5 runs on a 1.2 GHz ARMv5 Linux system, using g++ 4.4.2 -O4.
std::multimap                      3.20  3.69  3.37
std::vector + std::push|pop_heap   2.35  3.36  2.71
std::priority_queue<vector>        2.58  3.48  2.84
std::priority_queue<deque>         4.27  5.77  3.72
heap::priority_queue               2.49  3.37  2.74
heap::b_heap                       5.14  7.62  5.82
I get the following times on an ubuntu system with gcc-4.4.3 (I also added another heap to the comparison):

time ./bench_seqheap
3.36user 0.18system 0:03.61elapsed 98%CPU (0avgtext+0avgdata 14736maxresident)k
0inputs+0outputs (0major+28908minor)pagefaults 0swaps

time ./bench_boostbheap
11.94user 0.18system 0:12.26elapsed 98%CPU (0avgtext+0avgdata 12064maxresident)k
0inputs+0outputs (0major+22377minor)pagefaults 0swaps

time ./bench_boostqueue
6.20user 0.18system 0:07.51elapsed 85%CPU (0avgtext+0avgdata 12000maxresident)k
0inputs+0outputs (0major+22372minor)pagefaults 0swaps

time ./bench_boostdary
5.90user 0.14system 0:06.13elapsed 98%CPU (0avgtext+0avgdata 12016maxresident)k
0inputs+0outputs (0major+22373minor)pagefaults 0swaps

time ./bench_stdqueuedeque
9.17user 0.14system 0:09.46elapsed 98%CPU (0avgtext+0avgdata 12192maxresident)k
0inputs+0outputs (0major+21257minor)pagefaults 0swaps

time ./bench_stdqueuevec
6.23user 0.42system 0:06.85elapsed 97%CPU (0avgtext+0avgdata 15808maxresident)k
0inputs+0outputs (0major+52910minor)pagefaults 0swaps

time ./bench_stdheap
6.00user 0.15system 0:06.24elapsed 98%CPU (0avgtext+0avgdata 12016maxresident)k
0inputs+0outputs (0major+22373minor)pagefaults 0swaps

time ./bench_map
8.58user 0.04system 0:08.72elapsed 98%CPU (0avgtext+0avgdata 19136maxresident)k
0inputs+0outputs (0major+1250minor)pagefaults 0swaps

My own conclusion is that, apart from b_heap, all the heaps perform quite reasonably. I also see that the d_ary heap performs quite well.

Regards, Thomas
participants (4): Andrew Sutton, Phil Endecott, Thomas Klimpel, Tim Blechmann