Some performance metrics, as requested. For more information, please have a look at the updated docs.

Test system: Intel Core i7-4500U CPU clocked at 1.8 GHz, 8 GB of DDR3 RAM

================= ======= ======= ======= ======= ======= =======
distribution      uniform                 normal
----------------- ----------------------- -----------------------
dimension         1D      3D      6D      1D      3D      6D
================= ======= ======= ======= ======= ======= =======
No. of fills      12M     4M      2M      12M     4M      2M
C++: ROOT [t/s]   0.127   0.199   0.185   0.168   0.143   0.179
C++: boost [t/s]  0.172   0.177   0.155   0.172   0.171   0.150
Py: numpy [t/s]   0.825   0.727   0.436   0.824   0.426   0.401
Py: boost [t/s]   0.209   0.229   0.192   0.207   0.194   0.168
================= ======= ======= ======= ======= ======= =======

Using boost::histogram in Python is considerably faster than using numpy.histogram.

On 05/05/2016 04:36 PM, Thijs van den Berg wrote:
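For reference, the kind of "naive alternative" baseline asked about downthread might look like the hand-rolled fixed-width 1D fill loop below. This is only a sketch of a plausible baseline: the bin count and axis range are illustrative assumptions, not the parameters of the benchmark in the table.

```cpp
#include <cassert>
#include <vector>

// Naive baseline: fixed-width 1D binning. A hand-written loop like
// this is what library fill times are typically compared against.
// The range/bin parameters here are illustrative assumptions.
std::vector<long long> naive_fill(const std::vector<double>& data,
                                  double lo, double hi, int nbins) {
    std::vector<long long> bins(nbins, 0);
    const double scale = nbins / (hi - lo);
    for (double x : data) {
        const int i = static_cast<int>((x - lo) * scale);
        if (i >= 0 && i < nbins) ++bins[i]; // drop under-/overflow
    }
    return bins;
}
```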
On 5 May 2016 at 00:21, Hans Dembinski wrote:
Hi everybody,
I recently added a new library called "histogram" to the Boost Incubator. I would like to advertise it a little here in the hope of finding someone interested in reviewing it. I hope that shameless self-advertisement does not go against some rule of this list, but I am sure you will let me know.
My background is in the analysis of big data in the fields of particle physics and astroparticle physics. Boost is very popular among my peers, since it is a free, high-quality, rich, and very well maintained collection of libraries. There is a growing number of tools for statistical analysis in Boost, and I think this project would fit in nicely and fill a gap. We work with histograms a lot, which is where my interest comes from.
I am a senior programmer in C++ and Python with 10 years of experience. Guiding development through code reviews and tickets, as well as taking on responsibility for continuous maintenance, comes naturally to me. I am willing to commit free time to maintain the project should it be accepted, and to do my share of the work in this community.
I put a lot of thought and effort into this project; the rationale and my design choices are explained in the documentation, which I wrote according to the advice given on the Boost Incubator website. The project is feature complete from my side. What it needs now is input from the Boost community to round off rough edges and to make the interface rich enough for everybody. I am good at considering the user perspective, but I cannot anticipate everyone's needs.
In case you got interested, here are the links:
Incubator link:
http://rrsd.com/blincubator.com/bi_library/histogram-2/?gform_post_id=1582
github link:
https://github.com/HDembinski/histogram
Best regards,
Hans
Hi Hans,
Interesting ideas. I have some algorithmic questions: I'd like to learn about the details behind the friendly "just works" objective so that I can decide whether it will work for me or not, and under what circumstances. One reason I sometimes pick C++ instead of Python is performance, especially when I need to handle large datasets. In those cases the details often matter. So, if I were going to consider using it, it would be helpful to see performance metrics, e.g. compared to some naive alternative.
I've read that you compute the variance: can that computation be switched on/off (e.g. I might not need it)? Also, there are various online (single-pass, weighted) variance algorithms: some are numerically stable, others are not. Which one have you implemented? Does it use std::accumulate? It would be nice to reassure numerically focused users about the quality of the internals.
I would also like to see information about the computational and memory complexity of two other internal algorithms I think I saw mentioned:
1) automatic re-binning: when you modify bins, do you split a single bin, or do you readjust *all* bin boundaries? Do you keep a sorted list inside each bin?
2) sparse storage: I know this is a complex field where lots of trade-offs can be made. E.g. suppose I fill a 10-dimensional histogram with samples that only have elements on a diagonal (a potential worst-case scenario for some methods):

for (int i : {1, 2, 3, 4, 5}) h.fill({i, i, i, i, i, i, i, i, i, i});
would this result in 5 sparse bins (the bins on the diagonal), or 5^10 bins (the outer product of ten axes, each with 5 bins)?
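To make the question concrete: a naive sparse backend could key a map on the tuple of bin indices, in which case the diagonal fill above would touch only 5 entries. The sketch below (with a hypothetical `SparseHist10` type) illustrates that trade-off; it is not a claim about how boost::histogram stores its bins.

```cpp
#include <array>
#include <cassert>
#include <map>

// Hypothetical sparse backend: one map entry per *occupied* bin,
// keyed by the 10-tuple of bin indices. Memory grows with the
// number of filled bins, not with the 5^10 outer product.
using Index10 = std::array<int, 10>;

struct SparseHist10 {
    std::map<Index10, long long> counts;
    void fill(const Index10& idx) { ++counts[idx]; }
};
```

With this layout the diagonal fill costs O(k log k) time and O(k) memory for k occupied bins, at the price of a per-fill lookup that a dense array does not pay.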
Thanks, Thijs
_______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost