proposed new library "histogram"

Hans Dembinski

4 May 2016

10:21 p.m.

Hi everybody,

I recently added a new library called "histogram" to the Boost Incubator. I would like to advertise it a little here in the hope of finding a person interested in reviewing it. I hope that shameless self-advertisement does not go against some rule of this list, but I am sure you will let me know.

My background is in analysis of big data in the fields of particle physics and astroparticle physics. Boost is very popular among my peers, since it is a free, high-quality, rich, and very well maintained collection of libraries. There is a growing number of tools to do statistical analysis in Boost and I think this project would fit in nicely, and fill a gap. We work with histograms a lot, so that is where my interest comes from.

I am a senior programmer in C++ and Python with 10 years of experience. Guiding development through code reviews and tickets, as well as taking on responsibility for continuous maintenance, are natural for me. Naturally, I am willing to commit free time to maintain the project should it be accepted, and do my share of the work in this community.

I put a lot of thought and effort into this project. The rationale and my design choices are explained in the documentation, which I wrote according to the advice given at the Boost Incubator website. The project is feature complete from my side. What it needs now is input from the Boost community to round off possible edges and to make the interface rich enough for everybody. I am good at considering the user perspective, but I cannot anticipate everyone's needs.

In case you are interested, here are the links:

Incubator link: http://rrsd.com/blincubator.com/bi_library/histogram-2/?gform_post_id=1582

github link: https://github.com/HDembinski/histogram

Best regards,
Hans

Klemens Morgenstern

5 May

7:21 a.m.

Am 05.05.2016 um 00:21 schrieb Hans Dembinski:

...

Hi everybody,

I recently added a new library called "histogram" to the Boost Incubator. I would like to advertise it a little here in the hope to find a person interested in reviewing it. I hope that shameless self-advertisement is not going against some rule of this list, but I am sure you will let me know.

Well no, that's part of the purpose of the mailing list. Though be prepared: you may get very harsh criticism here.

...

My background is in analysis of big data in the fields of particle physics and astroparticle physics. Boost is very popular among my peers, since it is a free, high-quality, rich, and very well maintained collection of libraries. There is a growing number of tools to do statistical analysis in Boost and I think this project would fit in nicely, and fill a gap. We work with histograms a lot, so that's why my interest came from.

All this sounds quite interesting. I took a look at your documentation and the tests and I have to say: I have no clue what this library does. That is, yes, it helps you to write histograms, sure, but what does that look like? This might be obvious to you as the developer, but for me it's completely unclear. Maybe you can enhance your examples by providing the actual generated output. If you want interest in your library, people need to have a clue what you're talking about. That does not mean that you need to give every detail, but an overview and a basic idea would be nice.

...

I am a senior programmer in C++ and Python with 10 years of experience. Guiding development through code reviews and tickets, as well as taking on responsibility for continuous maintenance, are natural for me. Naturally, I am willing to commit free time to maintain the project should it be accepted, and do my share of the work in this community.

I put a lot of thought and effort into this project, the rationale and my design choices are explained in the documentation, which I wrote according to the advice given at the Boost Incubator website. The project is feature complete from my side. What it needs now is the input from the Boost community to round off possible edges and to make the interface rich enough for everybody. I am good at considering the user perspective, but I cannot anticipate everyone's needs.

Is it pure C++03? Because (just from looking into it) it seems a lot of stuff in histrogramm.hpp could be done with templates, i.e. without macros. But that's just my impression. Also, things like move constructors seem to be missing, which would make a lot of sense for a histogram. I would also use std::array instead of C arrays, etc.

Also: I think you will still need to provide bjam files, because that's still the way boost is built (though there has been lively discussion about cmake).

...

In case you got interested, here are the links:

Incubator link:

http://rrsd.com/blincubator.com/bi_library/histogram-2/?gform_post_id=1582

github link:

https://github.com/HDembinski/histogram

Best regards,

Hans

_______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost

Hans Dembinski

2:32 p.m.

Dear Klemens,

thank you for your reply and your comments! I started to fill my issue tracker with your suggestions.

On 5/5/16 3:21 AM, Klemens Morgenstern wrote:

...

Am 05.05.2016 um 00:21 schrieb Hans Dembinski:

...
Hi everybody,

I recently added a new library called "histogram" to the Boost Incubator. I would like to advertise it a little here in the hope to find a person interested in reviewing it. I hope that shameless self-advertisement is not going against some rule of this list, but I am sure you will let me know.

Well no, that's part of the purpose of the mailing list. Though be prepared: you may get very harsh criticism here.

I respect the fair warning. I followed the list for a few weeks now, so I got a glimpse of what I am getting into. I think it is worth it and I have no problem with harsh criticism as long as it is fair.

...

...
My background is in analysis of big data in the fields of particle physics and astroparticle physics. Boost is very popular among my peers, since it is a free, high-quality, rich, and very well maintained collection of libraries. There is a growing number of tools to do statistical analysis in Boost and I think this project would fit in nicely, and fill a gap. We work with histograms a lot, so that's why my interest came from.

All this sounds quite interesting. I took a look at your documentation and the tests and I have to say: I have no clue what this library does. That is, yes, it helps you to write histograms, sure, but what does that look like? This might be obvious to you as the developer, but for me it's completely unclear. Maybe you can enhance your examples by providing the actual generated output. If you want interest in your library, people need to have a clue what you're talking about. That does not mean that you need to give every detail, but an overview and a basic idea would be nice.

May I ask what part of the documentation you looked at? If you had a look at the README.md on the github page only, then I understand what you mean. The README.md does not contain usage examples. I will add some there to improve the appeal of the front page and make the point of the project clearer.

However, if you download the repository, you will find a folder docs/html (following the standard directory structure of Boost) which contains extensive documentation written according to the guidelines set out by the Boost Incubator website. It contains much more information, including a section called "Tutorial" which shows examples of how to create and fill histograms, and how to read the content of the bins and their errors. If that is not sufficient, please let me know what usage I should cover.

In principle, using this library is very simple, which was one of the major design goals. The project is not a framework to make histograms; it is just a single histogram class that covers all the use cases for a histogram in a unified manner. I want it to be useful as a daily tool for people who do statistical analysis.

...

...
I am a senior programmer in C++ and Python with 10 years of experience. Guiding development through code reviews and tickets, as well as taking on responsibility for continuous maintenance, are natural for me. Naturally, I am willing to commit free time to maintain the project should it be accepted, and do my share of the work in this community.

I put a lot of thought and effort into this project, the rationale and my design choices are explained in the documentation, which I wrote according to the advice given at the Boost Incubator website. The project is feature complete from my side. What it needs now is the input from the Boost community to round off possible edges and to make the interface rich enough for everybody. I am good at considering the user perspective, but I cannot anticipate everyone's needs.

Is it pure C++03? Because (just from looking into it) it seems a lot of stuff in histrogramm.hpp could be done with templates, i.e. without macros. But that's just my impression. Also, things like move constructors seem to be missing, which would make a lot of sense for a histogram. I would also use std::array instead of C arrays, etc.

Yes, it is pure C++03. I suppose I could replace the internal use of c-arrays with std::array in a few places. Does it matter if they are private members? Shouldn't the internals of a class be a matter of style as long as they work correctly and there is no difference in terms of code readability? About replacing macros with templates: agreed, there are a few places in the interface which could be written in a much nicer way by using variadic templates. I am well aware of the nice C++11 and C++14 features, but this was a motivated decision. AFAIK boost libs are required to run on C++0x compilers, so I went for the less elegant, but more compatible option.
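To make the macro versus variadic-template trade-off concrete, here is a minimal sketch. The names `pack_cxx03`, `pack_cxx11`, and `SKETCH_DEFINE_PACK2` are hypothetical and are not taken from the histogram library; the sketch only contrasts the two styles for a function that accepts several coordinates.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// C++03 style: one overload per arity, typically stamped out by a macro.
// Only the two-argument case is generated here; more arities need more macros.
#define SKETCH_DEFINE_PACK2(T)                   \
    inline std::vector<T> pack_cxx03(T a, T b) { \
        std::vector<T> v;                        \
        v.push_back(a);                          \
        v.push_back(b);                          \
        return v;                                \
    }
SKETCH_DEFINE_PACK2(double)
#undef SKETCH_DEFINE_PACK2

// C++11 style: a single variadic template covers every arity.
template <typename... Coords>
std::vector<double> pack_cxx11(Coords... coords) {
    std::vector<double> v{static_cast<double>(coords)...};
    assert(v.size() == sizeof...(Coords));  // one entry per argument
    return v;
}
```

The C++03 variant compiles on old compilers but has to be repeated for each supported number of arguments, while the C++11 variant needs a compiler with variadic template support.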

...

Also: I think you will still need to provide bjam files, because that's still the way boost is built (though there has been lively discussion about cmake).

I've been following that, but I didn't notice a conclusion that the use of CMake is prohibited. According to the Boost Incubator, I am allowed to use CMake. I am not eager to write the build again using bjam, but I will do it if that is absolutely required.

Best regards,
Hans

Klemens Morgenstern

3:08 p.m.

...

...
All this sounds quite interesting. I took a look at your documentation and the tests and I have to say: I have no clue what this library does. That is, yes, it helps you to write histograms, sure, but what does that look like? This might be obvious to you as the developer, but for me it's completely unclear. Maybe you can enhance your examples by providing the actual generated output. If you want interest in your library, people need to have a clue what you're talking about. That does not mean that you need to give every detail, but an overview and a basic idea would be nice.

May I ask what part of the documentation you looked at? If you had a look at the README.md on the github page only, then I understand what you mean. The README.md does not contain usage examples. I will add some there to improve the appeal of the front page and make the point of the project clearer.

I looked at the html in the repository: https://htmlpreview.github.io/?https://raw.githubusercontent.com/HDembinski/...

That was the link you provided in the incubator.

Since I only looked briefly into that, the only thing I got was that you can calculate the variance and over-/underflow in C++. I did not get the Python example. You should at least provide the output of the example.

...

Yes, it is pure C++03. I suppose I could replace the internal use of c-arrays with std::array in a few places. Does it matter if they are private members? Shouldn't the internals of a class be a matter of style as long as they work correctly and there is no difference in terms of code readability?

Well, it was declared as C++0x in the incubator, so I wondered. C++03 is not a strict requirement for boost (e.g. boost.hana requires C++14) and I personally don't see any benefit in supporting obsolete standards in new libraries. At least C++11 is widely available today. I would really like a class holding data to be movable. And you might be able to make good use of constexpr.
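As an illustration of the two C++11 features mentioned here, the following minimal sketch shows a movable data-holding class and a constexpr helper. The names `counts_sketch` and `bin_width` are hypothetical stand-ins, not the library's types, and the sketch assumes a C++11 compiler.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in for a class that owns bin counts.
class counts_sketch {
public:
    explicit counts_sketch(std::size_t nbins) : bins_(nbins, 0u) {}

    // C++11: defaulted move operations hand over the vector's buffer
    // instead of copying every bin.
    counts_sketch(counts_sketch&&) = default;
    counts_sketch& operator=(counts_sketch&&) = default;
    counts_sketch(const counts_sketch&) = default;
    counts_sketch& operator=(const counts_sketch&) = default;

    void increment(std::size_t i) {
        assert(i < bins_.size());
        ++bins_[i];
    }
    unsigned value(std::size_t i) const { return bins_.at(i); }
    std::size_t size() const { return bins_.size(); }

private:
    std::vector<unsigned> bins_;
};

// constexpr allows the bin width to be evaluated at compile time
// when all arguments are constants.
constexpr double bin_width(double lo, double hi, unsigned nbins) {
    return (hi - lo) / nbins;
}
```

Moving from a `counts_sketch` leaves the source object in a valid but unspecified state, so only the destination should be read afterwards.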

...

About replacing macros with templates: agreed, there are a few places in the interface which could be written in a much nicer way by using variadic templates. I am well aware of the nice C++11 and C++14 features, but this was a motivated decision. AFAIK boost libs are required to run on C++0x compilers, so I went for the less elegant, but more compatible option.

Makes sense for C++03.

...

...
Also: I think you will still need to provide bjam files, because that's still the way boost is built (though there has been lively discussion about cmake).

I've been following that, but I didn't notice a conclusion that the use of CMake is prohibited. According to the Boost Incubator, I am allowed to use CMake. I am not eager to write the build again using bjam, but I will do it if that is absolutely required.

The main build of the boost libraries will run on boost.build. Though that's down the road, and I'll gladly help you when that becomes necessary. Though you might think about adding travis-ci and coveralls to your repository.

Hans Dembinski

3:39 p.m.

On 5/5/16 11:08 AM, Klemens Morgenstern wrote:

...

...
...
All this sounds quite interesting. I took a look at your documentation and the tests and I have to say: I have no clue what this library does. That is, yes, it helps you to write histograms, sure, but what does that look like? This might be obvious to you as the developer, but for me it's completely unclear. Maybe you can enhance your examples by providing the actual generated output. If you want interest in your library, people need to have a clue what you're talking about. That does not mean that you need to give every detail, but an overview and a basic idea would be nice.

May I ask what part of the documentation you looked at? If you had a look at the README.md on the github page only, then I understand what you mean. The README.md does not contain usage examples. I will add some there to improve the appeal of the front page and make the point of the project clearer.

I looked at the html in the repository: https://htmlpreview.github.io/?https://raw.githubusercontent.com/HDembinski/...

That was the link you provided in the incubator.

Since I only looked briefly into that, the only thing I got was that you can calculate the variance and over-/underflow in C++. I did not get the Python example. You should at least provide the output of the example.

Cool, I was not aware of htmlpreview.github.io, that's useful.

Ok, I will add the output of the examples and more explanation of what is actually happening there.

...

...
Yes, it is pure C++03. I suppose I could replace the internal use of c-arrays with std::array in a few places. Does it matter if they are private members? Shouldn't the internals of a class be a matter of style as long as they work correctly and there is no difference in terms of code readability?

Well, it was declared as C++0x in the incubator, so I wondered. C++03 is not a strict requirement for boost (e.g. boost.hana requires C++14) and I personally don't see any benefit in supporting obsolete standards in new libraries. At least C++11 is widely available today. I would really like a class holding data to be movable. And you might be able to make good use of constexpr.

I can use Boost.Move to implement that without sacrificing C++0x compatibility, so no problem.

Support for C++0x is essential for the target audience of this library, which includes big public science projects. All such projects I know still work with old compilers and are not likely to adopt newer ones soon. Some big names I can drop are CERN, the IceCube Experiment, Pierre Auger Observatory. Big experiments like these rely on support for old Linux distributions which are still running on many computing clusters. These old Linux distributions do not have modern compilers that support C++11.

...

...
About replacing macros with templates: agreed, there are a few places in the interface which could be written in a much nicer way by using variadic templates. I am well aware of the nice C++11 and C++14 features, but this was a motivated decision. AFAIK boost libs are required to run on C++0x compilers, so I went for the less elegant, but more compatible option.

Makes sense for C++03.

...
...
Also: I think you will still need to provide bjam files, because that's still the way boost is built (though there has been lively discussion about cmake).

I've been following that, but I didn't notice a conclusion that the use of CMake is prohibited. According to the Boost Incubator, I am allowed to use CMake. I am not eager to write the build again using bjam, but I will do it if that is absolutely required.

The main build of the boost libraries will run on boost.build. Though that's down the road, and I'll gladly help you when that becomes necessary. Though you might think about adding travis-ci and coveralls to your repository.

Thank you for your offer to help with bjam, I really appreciate it. I also think it is something that can be done later. I will look into adding support for travis-ci and coveralls.io. I already added all these suggestions to the issue tracker on github.

Best regards,
Hans

Jason Rhinelander

4:35 p.m.

On 05/05/16 11:39 AM, Hans Dembinski wrote:

...

... without sacrificing C++0x compatibility, ...
                        ^^^^^

Support for C++0x is essential ...
            ^^^^^

...

This project contains an easy-to-use powerful n-dimensional histogram class implemented in C++0x
                                                                                           ^^^^^

Your usage here of "C++0x", both in your documentation and your replies in this thread, appears to be contradicting what you mean: C++0x was the informal name for what eventually became C++11 (see http://www.stroustrup.com/C++11FAQ.html). But what you seem to mean (judging from the context of your replies) is C++03, not C++11.

Jason Rhinelander
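The distinction can also be checked in code through the standard `__cplusplus` macro. The following is a minimal sketch; the helper names `standard_name` and `current_standard` are hypothetical. The value 199711L covers both C++98 and C++03, while 201103L denotes C++11, the standard formerly nicknamed "C++0x". Some compilers report a different value by default, so treat the result as a hint rather than a guarantee.

```cpp
#include <cassert>
#include <string>

// Map a value of the __cplusplus macro to the standard it denotes.
inline std::string standard_name(long value) {
    assert(value > 0);  // __cplusplus is always a positive integer
    if (value >= 201402L) return "C++14 or newer";
    if (value >= 201103L) return "C++11";
    return "C++98/03";
}

// Name of the standard the current translation unit is compiled as.
inline std::string current_standard() { return standard_name(__cplusplus); }
```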

Hans Dembinski

9:26 p.m.

Sorry, yes that's true, I mean C++03.

Mathias Gaunard

10:51 p.m.

On 5 May 2016 at 16:39, Hans Dembinski <hans.dembinski@gmail.com> wrote:

...

Some big names I can drop are CERN, the IceCube Experiment, Pierre Auger Observatory. Big experiments like these rely on support for old Linux distributions which are still running on many computing clusters. These old Linux distributions do not have modern compilers that support C++11.

Slight tangent here, but that seems surprising. Are you sure those don't have GCC 4.8 through RHEL DTS 2?

Hans Dembinski

6 May

4:13 p.m.

Yes, I am sure.

Here is evidence: http://information-technology.web.cern.ch/fr/services/lxplus-service "[The cluster computers] run SLC6 (Scientific Linux CERN 6)"

SLC6 comes with gcc-4.4.7, which has bad support for C++11, see the question of a suffering user (most likely a CERN physicist):

http://stackoverflow.com/questions/15975481/problems-with-c11-library-and-g-...

Currently, I am a member of the IceCube experiment and we are also not allowed to use C++11 features.

Mathias Gaunard

6:13 p.m.

On 6 May 2016 at 17:13, Hans Dembinski <hans.dembinski@gmail.com> wrote:

...

Yes, I am sure.

Here is evidence: http://information-technology.web.cern.ch/fr/services/lxplus-service "[The cluster computers] run SLC6 (Scientific Linux CERN 6)"

SLC6 comes with gcc-4.4.7, which has bad support for C++11, see the question of a suffering user (most likely a CERN physicist):

http://stackoverflow.com/questions/15975481/problems-with-c11-library-and-g-...

Currently, I am a member of the IceCube experiment and we are also not allowed to use C++11 features.

Right, and that is just a variant of RHEL6, whose native compiler is indeed gcc 4.4.x. The various DTS releases are still available to get access to more modern compilers, see http://linux.web.cern.ch/linux/devtoolset/

DTS 2 is gcc 4.8.x, DTS 3 is 4.9.x, and DTS 4 is 5.2.x. All can be installed on SLC6, and any binary compiled with those tools will run on RHEL6/CentOS6/SLC6 without needing to install anything.

Paul A. Bristow

5 May

3:31 p.m.

...

-----Original Message----- From: Boost [mailto:boost-bounces@lists.boost.org] On Behalf Of Hans Dembinski Sent: 05 May 2016 15:33 To: boost@lists.boost.org Subject: Re: [boost] proposed new library "histogram"

...

About replacing macros with templates: agreed, there are a few places in the interface which could be written in a much nicer way by using variadic templates. I am well aware of the nice C++11 and C++14 features, but this was a motivated decision. AFAIK boost libs are required to run on C++0x compilers, so I went for the less elegant, but more compatible option.

For new libraries, there is no *requirement* to support older compilers; just don't use new features gratuitously. If there are real improvements from using C++11 or newer, then you can do this.

...

...
Also: I think you will still need to provide bjam files, because that's still the way boost is built (though there has been lively discussion about cmake).

I've been following that, but I didn't notice a conclusion that the use of CMake is prohibited. According to the Boost Incubator, I am allowed to use CMake. I am not eager to write the build again using bjam, but I will do it if that is absolutely required.

In order to take advantage of the Boost Build system of test running machines, you will need to provide jamfile(s) to run under b2. Ask for help.

But there are many hurdles to jump first ;-)

Paul

---
Paul A. Bristow
Prizet Farmhouse
Kendal UK LA8 8AB
+44 (0) 1539 561830

Hans Dembinski

4:12 p.m.

Dear Paul,

On 5/5/16 11:31 AM, Paul A. Bristow wrote:

...

For new libraries, there is no *requirement* to support older compilers; just don't use new features gratuitously.

If there are real improvements from using C++11 or newer, then you can do this.

Okay, thanks for clarifying this. I was not sure what the policy is.

...

...
...
Also: I think you will still need to provide bjam files, because that's still the way boost is built (though there has been lively discussion about cmake).

I've been following that, but I didn't notice a conclusion that the use of CMake is prohibited. According to the Boost Incubator, I am allowed to use CMake. I am not eager to write the build again using bjam, but I will do it if that is absolutely required.

In order to take advantage of the Boost Build system of test running machines, you will need to provide jamfile(s) to run under b2. Ask for help.

But there are many hurdles to jump first ;-)

Alright, I will keep it in mind and focus on the other issues for now.

Best regards, Hans

Hans Dembinski

6 May

7:13 p.m.

Hi Klemens,

I added support for move semantics using boost::move.

Best regards,
Hans

Thijs van den Berg

5 May

8:36 p.m.

On 5 May 2016 at 00:21, Hans Dembinski <hans.dembinski@gmail.com> wrote:

...

Hi Hans,

Interesting ideas. I have some algorithmic questions: I'd like to learn about the details behind the "just works" friendly objective so that I can decide whether it will work for me, and under what circumstances. One reason I sometimes pick C++ instead of Python is performance, especially when I need to handle large datasets. In those cases the details often matter. So, if I were going to consider using it, it would be helpful to see performance metrics, e.g. compared to some naive alternative.

I've read that you compute the variance: can that computation be switched on and off (e.g. I might not need it)? Also, there are various online (single pass, weighted) variance algorithms: some are stable, others are not. Which one have you implemented? Does it use std::accumulate? It would be nice to reassure numerically focused users about the level of quality of the internals.

I would also like to see information about the computational and memory complexity of two other internal algorithms I think I saw mentioned:

1) automatic re-binning: when you modify bins, do you split a single bin, or do you readjust *all* bin boundaries? Do you keep a sorted list inside each bin?

2) sparse storage: I know this is a complex field where lots of trade-offs can be made. E.g. suppose I fill a 10-dimensional histogram with samples that (only) have elements on a diagonal, a potential worst-case scenario for some methods:

for(int i: {1, 2, 3, 4, 5}) h.fill([i,i,i,i,i,i,i,i,i,i])

Would this result in 5 sparse bins (the bins on the diagonal), or 5^10 bins (the outer product of ten axes, each with 5 bins)?

Thanks,
Thijs
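The two outcomes in the diagonal scenario can be made concrete with a minimal sketch. It assumes a hypothetical map-based sparse store (`sparse_sketch`) and uses bin indices 0 to 4 rather than the values 1 to 5; it does not show which strategy the histogram library actually uses.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>

const std::size_t kDims = 10;  // number of axes
const std::size_t kBins = 5;   // bins per axis

typedef std::array<std::uint8_t, kDims> bin_index;

// Dense storage needs one cell per combination of axis bins: bins^dims.
inline std::uint64_t dense_cell_count(std::size_t dims, std::size_t bins) {
    assert(bins > 0);
    std::uint64_t n = 1;
    for (std::size_t d = 0; d < dims; ++d) n *= bins;
    return n;
}

// Sparse storage keyed by the full bin index allocates only cells that were hit.
class sparse_sketch {
public:
    void fill(const bin_index& idx) { ++cells_[idx]; }
    std::size_t allocated_cells() const { return cells_.size(); }

private:
    std::map<bin_index, std::uint64_t> cells_;
};

// Fill only the diagonal (i, i, ..., i) and report how many cells exist.
inline std::size_t fill_diagonal() {
    sparse_sketch h;
    for (std::uint8_t i = 0; i < kBins; ++i) {
        bin_index idx;
        idx.fill(i);
        h.fill(idx);
    }
    return h.allocated_cells();
}
```

With these assumptions, `fill_diagonal()` allocates 5 cells, whereas `dense_cell_count(10, 5)` is 5^10 = 9765625 cells. The map costs a logarithmic lookup per fill, which is the usual price of this kind of sparse layout.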

Hans Dembinski

10:47 p.m.

Hi Thijs,

On 5/5/16 4:36 PM, Thijs van den Berg wrote:

...

Hi Hans,

Interesting ideas. I have some algorithmic questions: I'd like to learn about the details behind the "just works" friendly objective so that I can decide whether it will work for me, and under what circumstances. One reason I sometimes pick C++ instead of Python is performance, especially when I need to handle large datasets. In those cases the details often matter. So, if I were going to consider using it, it would be helpful to see performance metrics, e.g. compared to some naive alternative.

I have a simple benchmark comparison against three classes in the ROOT framework, comparing the performance on 1-dimensional, 3-dimensional, and 6-dimensional data. I will add the performance results to the documentation. I tested the benchmark on two different computers and got qualitatively different results, so I cannot draw a general conclusion. The speed is roughly similar. If you would like to try for yourself, you could check out the code and activate the CMake option BUILD_CHECKS, then run the executable "nhistogram_speed" generated in the build directory.

I could and probably should also do a comparison against numpy.histogram. If you have a particular kind of benchmark in mind, let me know, perhaps I can implement it.

It is probably impossible to beat a specialized histogram type made exclusively for 1d-data and a particular binning strategy, because a dynamic solution like mine has additional overhead related to the ability to define the binning algorithm and the number of dimensions at run-time. Maybe an expert in this community sees a way to make it faster. That being said, the performance gap is not big, and keeping it small was explicitly one of the design goals. I think that in most cases more CPU cycles are spent generating or reading the data that is to be filled into the histogram than are used by the histogram itself during the binning and counting.

> I've read that you compute variance: can that computation be switched on/off (e.g. I might not need it)? Also, there are various online (single pass, weighted) variance algorithms: some are stable, others not. Which one have you implemented? Does it use std::accumulate? It would be nice to reassure numerically focused users about the level of quality of the internals.

I don't use std::accumulate, since it does not fit into my scheme.

If you do not fill the histogram with weighted events, there is no overhead involved for the variance estimate. If you fill the histogram with normal data, without using weights, then the variance is computed on-the-fly when the user requests it via histogram::variance(...). When no weights are involved, the variance estimate per bin is taken to be equal to the count in that bin, a common estimate based on Poisson theory. When weights were used during the filling, the variance estimate is the sum of squared weights. Storing the sum of squared weights for each bin requires twice the memory. I did not implement an option to switch that off, since in a statistical analysis, the variance estimate is as important as the actual count. I think it is safe to assume that if you have a special case with weighted data, you also want a variance estimate. I will put more details on these things into the Notes section of the documentation.

> I would also like to see information about the computational and memory complexity of two other internal algorithms I think I saw mentioned:
>
> 1) automatic re-binning: when you modify bins do you split a single bin, or do you readjust *all* bin boundaries? Do you keep a sorted list inside each bin?

I think there is a misunderstanding. There is no automatic re-binning in the sense that the number of bins along each axis of the histogram is changed. The number of bins is always the same, bins are not split. What grows is the size of the integer used to store a bin count. I start with 1 byte for each bin, which covers counts from 0 to 255. Once you try to fill a bin that already has 255 counts, the internal memory for *all* bins is re-allocated and replaced by a memory block that uses 2 bytes for each bin. In addition to the memory reallocation this involves an O(N) copy that is done in place (N is the number of bins). This procedure is repeated if any of the bins exceeds its new storage maximum of 65535, and so on. Since the reallocation is done for all bins at once, this overhead does not occur very often and does not introduce a significant performance hit.

I considered more elaborate storage strategies where the number of bytes could differ for each bin, but all those that I could think of would significantly decrease performance and might actually increase the memory footprint for really large histograms. I can store an integer in 1 byte, but I already need 8 bytes for a pointer, even if the pointer points nowhere.

> 2) sparse storage: I know this is a complex field where lots of trade-offs can be made. E.g. suppose I fill a 10-dimensional histogram with samples that (only) have elements on a diagonal -a potential worst case scenario for some methods-: for(int i: {1, 2, 3, 4, 5}) h.fill([i,i,i,i,i,i,i,i,i,i])
>
> Would this result in 5 sparse bins -the bins on the diagonal-, or 5^10 bins -the outer product of ten axes, each with 5 bins-?

This histogram implements a dense storage strategy, for the sake of performance. Writing a histogram with sparse storage is not my design goal. However, there is a base class which handles the binning and the axis types which a sparse histogram could re-use.

Best regards,
Hans
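The counter-widening scheme described in the reply above can be illustrated with a minimal C++ sketch. It is not the library's code: the class name `GrowingCounts` and its members `increment`, `value`, and `bytes_per_bin` are hypothetical, and for simplicity the sketch copies the counts into a freshly allocated block instead of doing the copy in place as the message describes. Overflow of the final 64-bit counters is not handled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical illustration of dense bin counts whose integer width
// grows for ALL bins at once when any single bin would overflow.
class GrowingCounts {
public:
    explicit GrowingCounts(std::size_t n_bins)
        : n_bins_(n_bins), width_(1), c8_(n_bins, 0) {}

    // Add one count to bin i, widening every bin first if bin i is full.
    void increment(std::size_t i) {
        assert(i < n_bins_);
        if (width_ == 1 && c8_[i] == std::numeric_limits<std::uint8_t>::max()) {
            widen(c8_, c16_);
            width_ = 2;
        } else if (width_ == 2 && c16_[i] == std::numeric_limits<std::uint16_t>::max()) {
            widen(c16_, c32_);
            width_ = 4;
        } else if (width_ == 4 && c32_[i] == std::numeric_limits<std::uint32_t>::max()) {
            widen(c32_, c64_);
            width_ = 8;
        }
        switch (width_) {
            case 1: ++c8_[i]; break;
            case 2: ++c16_[i]; break;
            case 4: ++c32_[i]; break;
            default: ++c64_[i]; break;
        }
    }

    // Current count of bin i, independent of the internal width.
    std::uint64_t value(std::size_t i) const {
        assert(i < n_bins_);
        switch (width_) {
            case 1: return c8_[i];
            case 2: return c16_[i];
            case 4: return c32_[i];
            default: return c64_[i];
        }
    }

    // Number of bytes currently used per bin (1, 2, 4 or 8).
    std::size_t bytes_per_bin() const { return width_; }

private:
    // O(N) copy of all bins into the wider block, then release the old one.
    template <typename Small, typename Big>
    static void widen(std::vector<Small>& from, std::vector<Big>& to) {
        to.assign(from.begin(), from.end());
        std::vector<Small>().swap(from);
    }

    std::size_t n_bins_;
    std::size_t width_;
    std::vector<std::uint8_t> c8_;    // only one of these four
    std::vector<std::uint16_t> c16_;  // vectors is non-empty
    std::vector<std::uint32_t> c32_;  // at any given time
    std::vector<std::uint64_t> c64_;
};
```

With this sketch, a `GrowingCounts` of 3 bins uses 3 bytes until some bin receives its 256th count, at which point all bins switch to 2 bytes each. The widening happens at most three times over the lifetime of the object, which matches the argument that the O(N) copy is rare.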

Hans Dembinski

6 May 6 May

7:10 p.m.

Some performance metrics, as requested. For more information, please have a look at the updated docs.

Test system: Intel Core i7-4500U CPU clocked at 1.8 GHz, 8 GB of DDR3 RAM

=================  =======  =======  =======  =======  =======  =======
distribution                uniform                    normal
-----------------  -------------------------  -------------------------
dimension          1D       3D       6D       1D       3D       6D
=================  =======  =======  =======  =======  =======  =======
No. of fills       12M      4M       2M       12M      4M       2M
C++: ROOT  [t/s]   0.127    0.199    0.185    0.168    0.143    0.179
C++: boost [t/s]   0.172    0.177    0.155    0.172    0.171    0.150
Py: numpy  [t/s]   0.825    0.727    0.436    0.824    0.426    0.401
Py: boost  [t/s]   0.209    0.229    0.192    0.207    0.194    0.168
=================  =======  =======  =======  =======  =======  =======

Using boost::histogram in Python is considerably faster than using numpy.histogram.

On 05/05/2016 04:36 PM, Thijs van den Berg wrote:

...



15 comments, 6 participants:

Hans Dembinski
Jason Rhinelander
Klemens Morgenstern
Mathias Gaunard
Paul A. Bristow
Thijs van den Berg