
Boost.Histogram contains useful and well-designed features, like excess (over/underflow) bins, and axis transforms to name a few. As this is a review, I am going to spent most time below on critical scrutiny. I. DESIGN --------- Iterators are const. This prevents us from bootstrapping the histogram with a prior distribution using STL algorithms like std::fill(), std::generate(), or std::sample(). The adaptive_storage does not work well with STL algorithms, because weight_counter is not an arithmetic type. See the Implementation section below for more detail. I therefore propose that the default storage policy is changed to something without weight_counter. The adaptive_storage has two responsibilities: data compaction and weights. Would it be possible to split this into two separate storage policies? I have no real use for arrival variance. I am not too fond of using operator() for insertion. The code looks like a function call with side-effect. I prefer an explicitly named member function. The axis::ouflow_type is oddly named. I suggest that this is renamed to something like axis::excess_type. II. IMPLEMENTATION ------------------ The implementation is generally of high quality. However, I did encounter the three problem areas listed below. (A) Integration with STL algorithms can be improved. Here are some examples: First, std::distance() does not work on a histogram with array_storage. This means that other STL algorithms, like std::any_of(), fails to compile. I solved this problem by copying the distance_to() function from the axis iterator to the histogram iterator. Second, std::max_element (and brethren) cannot be used on a histogram. Consider the following example: auto element = std::max_element(h.begin(), h.end()); Compilation fails for adaptive_storage because weight_counter has no less-than operator. Furthermore, compilation fails for array_storage because iterator_over<H>::operator= fails. It attempts to re-assign a reference, but accidentally triggers a copy-constructor instead. Letting the iterator store a pointer instead of a reference solves the problem. Apropos, iterator_over<H>::operator= does not return *this. Consider adding the -Werror=return-type compiler flag and a unit-test to that calls this operator. Third, std::inner_product fails to compile for adaptive_storage because weight_counter does not have a binary operator*. The inner product of two histograms is useful for calculating the cosine similarity, which can be used to compare two distributions. std::inner_product works for array_storage though. (B) Indexing and size use different types (int versus std::size_t.) I assume that this is because of the underflow bin, which is indexed by -1. I am not certain whether or not this is a real problem, but it does cause some oddities when stressed to the limit. For example, if we need a histogram that counts each unsigned int, then we get an "lower < upper" exception during construction: using range_type = unsigned int; auto h = make_static_histogram_with( array_storage<int>(), axis::integer<>(0, std::numeric_limits<range_type>::max())); It works if we reduce the end by 2, but then we are not collecting all values (unless we shift the range left by 1 and misuse the excess bins for the two missing values.) A possible solution could be to replace the -1 and N excess indices with an "enum struct excess { lower, upper }" and let operator[] et al use overloading on this enum. (C) Sometimes the compilation errors are nonsensical. For example: // Forgot to use make_static_histogram_with() auto h = make_static_histogram(array_storage<int>, axis::regular<>(10, 0, 1)); triggers an incomprehensible static_assert plus an error about a missing .shape() function. III. DOCUMENTATION ------------------ User guide is clear and pedagogical. I would like to see more examples that uses STL algorithms, such as calculating the CDF using std::partial_sum(), or calculating the cosine similarity of two histograms using std::inner_product(). Most examples use std::cout plus a comment to document the results of operations. Consider using assert() instead. Consider using a tabular layout similar to that of the C++ standard (or cppreference.com for that matter) on the Concepts page. Reference documentation is rather meager and shows implementation details. The documentation contains both a "Reference" and a "References" chapter which are completely different. Consider renaming the latter to something like "External references" , "Literature references", or "Bibliography". IV. MISC -------- I have spent around 15 hours on the review, mainly writing small examples that uses STL algorithms on Boost.Histogram. I am well-versed in the topic. I work with statistical distributions for real-time analysis of data, although I mainly dabble around in one dimension. I have written a library for online/streaming statistical algorithms. V. VERDICT ---------- While Boost.Histogram is a small niche library, it has a wide range of practical applications to warrant the inclusion into Boost. Boost.Histogram should be ACCEPTED into Boost, provided: 1. Reference documentation is finalized. I furthermore strongly recommend that the default storage policy is changed to something without weight_counter because of the various problems with STL algorithms. This recommendation is not a prerequisite for acceptance.

Dear Bjørn,
On 23. Sep 2018, at 16:26, Bjorn Reese via Boost <boost@lists.boost.org> wrote:
Boost.Histogram contains useful and well-designed features, like excess (over/underflow) bins, and axis transforms to name a few.
As this is a review, I am going to spent most time below on critical scrutiny. […]
thank you very much for the great review and the time spend with the library. Some discussion below.
I. DESIGN ---------
Iterators are const. This prevents us from bootstrapping the histogram with a prior distribution using STL algorithms like std::fill(), std::generate(), or std::sample().
Writable iterators could be added, but I am not convinced that is a good idea. The histogram iterators are const, because I cannot imagine a use-case where the histogram has to be bootstrapped with a prior distribution. Only providing const iterators avoids the misconception that a histogram is a container. Histogram can make stronger guarantees about its internal consistency, if users cannot set bin counters arbitrarily.
The adaptive_storage does not work well with STL algorithms, because weight_counter is not an arithmetic type. See the Implementation section below for more detail. I therefore propose that the default storage policy is changed to something without weight_counter.
There are two solutions, right? Turing weight_counter into a full arithmetic type or removing support for weight_counter in the default storage.
The adaptive_storage has two responsibilities: data compaction and weights. Would it be possible to split this into two separate storage policies? I have no real use for arrival variance.
Agreed, it should be a compile-time option for adaptive_storage. https://github.com/HDembinski/histogram/issues/75
I am not too fond of using operator() for insertion. The code looks like a function call with side-effect. I prefer an explicitly named member function.
Originally, I had a special method called fill(), then I switched to operator(). There are two arguments in favour of operator(): - a 1D histogram can be filled with std::for_each or std::accumulate - consistency with Boost.Accumulators, which also use operator()
The axis::ouflow_type is oddly named. I suggest that this is renamed to something like axis::excess_type.
I am ok with renaming, but I think we haven't found the best name yet. What do you think about axis::tails_type?
II. IMPLEMENTATION ————————— […]
Agreed, the missing bits will be implemented and unit-tested.
III. DOCUMENTATION ————————— […]
Agreed, documentation will be expanded in this way.
Boost.Histogram should be ACCEPTED into Boost, provided:
1. Reference documentation is finalized.
I furthermore strongly recommend that the default storage policy is changed to something without weight_counter because of the various problems with STL algorithms. This recommendation is not a prerequisite for acceptance.
Thanks!

On 09/24/18 23:52, Hans Dembinski wrote:
The histogram iterators are const, because I cannot imagine a use-case where the histogram has to be bootstrapped with a prior distribution. Only providing const iterators avoids the misconception that a histogram is a container. Histogram can make stronger guarantees about its internal consistency, if users cannot set bin counters arbitrarily.
If you are using histograms to collect data for later analysis, then there may indeed not be much use for a prior distribution. However, if you use histograms to make decisions on-the-fly, then you cannot always wait for the histogram to be filled up with data in order to make reliable decisions. In such cases a prior distribution can help. Admittedly, I have not done this with histograms, but I do use CDFs (well, quantile approximations) quite a lot for on-the-fly decision- making. The other use case is to restore the histogram with data from an earlier run, but serialization already covers that.
There are two solutions, right? Turing weight_counter into a full arithmetic type or removing support for weight_counter in the default storage.
True, but then I am going to play the zero-overhead card :)
Originally, I had a special method called fill(), then I switched to operator(). There are two arguments in favour of operator(): - a 1D histogram can be filled with std::for_each or std::accumulate - consistency with Boost.Accumulators, which also use operator()
Inserting data into the histogram with std::accumulate is odd. An alternative is to use std::copy (or std::transform) with std::inserter. That does requires a histogram::insert(iterator hint, value_type value) where the hint is completely useless... but that is what std::set is using.
I am ok with renaming, but I think we haven't found the best name yet. What do you think about axis::tails_type?
Seems like you settled on options in another thread. I am fine with that. Related to this, I also like the extent() over shape(). This name is also used by Boost.MultiArray and the std::mdspan proposal (P0009R7). In the name of consistency, you may also consider renaming dim() to rank().

On 30. Sep 2018, at 17:29, Bjorn Reese via Boost <boost@lists.boost.org> wrote:
On 09/24/18 23:52, Hans Dembinski wrote:
The histogram iterators are const, because I cannot imagine a use-case where the histogram has to be bootstrapped with a prior distribution. Only providing const iterators avoids the misconception that a histogram is a container. Histogram can make stronger guarantees about its internal consistency, if users cannot set bin counters arbitrarily.
If you are using histograms to collect data for later analysis, then there may indeed not be much use for a prior distribution. However, if you use histograms to make decisions on-the-fly, then you cannot always wait for the histogram to be filled up with data in order to make reliable decisions. In such cases a prior distribution can help.
Admittedly, I have not done this with histograms, but I do use CDFs (well, quantile approximations) quite a lot for on-the-fly decision- making.
The other use case is to restore the histogram with data from an earlier run, but serialization already covers that.
I plan to add a struct `unsafe_access` which allows unsafe raw access to the bin counters, like so auto& storage = unsafe_access::storage(some_histogram); This will be part of the public interface, but is targeted towards advanced users. It will be used by algorithms that transform histograms and by serialization code. I also want to support other serialization libraries than Boost.Serialization, and unsafe_access makes this easier. With `unsafe_access`, advanced users get full control, but the standard interface still protects from trivial usage errors. The docs of `unsafe_access` then basically say: if you use this, you are on your own. It could be used to set the histogram with a prior, which seems like an advanced use case.
There are two solutions, right? Turing weight_counter into a full arithmetic type or removing support for weight_counter in the default storage.
True, but then I am going to play the zero-overhead card :)
Fair enough. https://github.com/HDembinski/histogram/issues/75 The default value would then be without weight support.
Originally, I had a special method called fill(), then I switched to operator(). There are two arguments in favour of operator(): - a 1D histogram can be filled with std::for_each or std::accumulate - consistency with Boost.Accumulators, which also use operator()
Inserting data into the histogram with std::accumulate is odd.
I was wrong, std::accumulate uses += to fill, which is not supported.
An alternative is to use std::copy (or std::transform) with std::inserter. That does requires a histogram::insert(iterator hint, value_type value) where the hint is completely useless... but that is what std::set is using.
histogram is not a container, so it should not support std::inserter. histogram is a accumulator and accumulators have a functor design, as established by Boost.Accumulator. They eat samples and update their internal state, there is an implied transformative aspect. Inserting a value implies that you can exactly retrieve it later, which is not the case here.
I am ok with renaming, but I think we haven't found the best name yet. What do you think about axis::tails_type?
Seems like you settled on options in another thread. I am fine with that.
Related to this, I also like the extent() over shape(). This name is also used by Boost.MultiArray and the std::mdspan proposal (P0009R7). In the name of consistency, you may also consider renaming dim() to rank().
Good suggestion, thank you! https://github.com/HDembinski/histogram/issues/112 Best regards, Hans
participants (2)
-
Bjorn Reese
-
Hans Dembinski