Re: [boost] [proposed][histogram]

12 Apr 2017

On 2017-04-12 12:34, Bjorn Reese via Boost wrote:
...
On 04/12/2017 11:37 AM, Hans Dembinski via Boost wrote:
...
The library implements a histogram class (a highly configurable 
policy-based template) for C++ and Python in C++11 code. Histograms 
are a standard tool to explore Big Data. They allow one to visualise 
and analyse distributions of random variables. A histogram provides a 
lossy compression of input data. GBytes of input can be put in a 
compact form which requires only a small fraction of the original 
memory. This makes histograms convenient for interactive data analysis 
and further processing.
Given that the compression is lossy, I am wondering how it compares with
a distribution estimator like:
https://arxiv.org/abs/1507.05073v2
A common use-case when collecting numerical data is to determine the
quantiles. Boost.Accumulators contains an estimator (extended_p_square)
for that.
The advantage of such estimators is that they execute in constant time
and with constant memory usage, where the constant depends only on the
required precision.
PS: I am aware that this is a non-trivial question, so I do not expect
    an answer.
Hi,

Simple answer: Histograms are designed to estimate the pdf, not the 
quantile function.

While it is true that a sufficiently good estimate of the pdf will give 
you an estimate of the quantiles via the inverse of the cdf, the 
obtainable precision depends on the size of the bins chosen for the 
histogram.
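
To make the dependence on bin size concrete, here is a minimal sketch in 
plain Python. It does not use the proposed library's API; the helper 
names histogram and quantile_from_histogram, the range [-5, 5) and the 
bin counts are assumptions made up for this illustration. It inverts the 
piecewise-linear cdf implied by equal-width bins and compares the 
resulting median with the exact sample median.

```python
import math
import random


def histogram(data, lo, hi, nbins):
    """Count data into nbins equal-width bins on [lo, hi)."""
    width = (hi - lo) / nbins
    counts = [0] * nbins
    for x in data:
        i = math.floor((x - lo) / width)
        if 0 <= i < nbins:  # values outside the range are dropped
            counts[i] += 1
    return counts, width


def quantile_from_histogram(counts, lo, width, q):
    """Invert the cdf, assuming data is uniform inside each bin."""
    target = q * sum(counts)
    cum = 0
    for i, c in enumerate(counts):
        if c > 0 and cum + c >= target:
            # Linear interpolation inside the bin that crosses the target.
            return lo + (i + (target - cum) / c) * width
        cum += c
    return lo + len(counts) * width


if __name__ == "__main__":
    random.seed(1)
    data = sorted(random.gauss(0.0, 1.0) for _ in range(100_000))
    exact_median = data[len(data) // 2]

    for nbins in (4, 16, 64, 256):
        counts, width = histogram(data, -5.0, 5.0, nbins)
        est = quantile_from_histogram(counts, -5.0, width, 0.5)
        err = abs(est - exact_median)
        print(f"bins={nbins:4d}  width={width:.4f}  error={err:.5f}")
```

The estimate always falls in the bin where the cumulative count crosses 
the target, so for densely filled bins its error is bounded by roughly 
one bin width. A finer binning tightens the bound at the cost of more 
memory.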

On the other hand, if your data is multivariate or your pdf is 
multimodal, you will have a hard time using quantiles, while you could 
still do, for example, outlier detection using histograms.
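
A rough sketch of the multimodal case, again in plain Python rather than 
the proposed library: the two-cluster sample, the binning and the 
min_count threshold are all assumptions chosen for illustration. A point 
between the two modes sits near the sample median, so a quantile-based 
rule would not consider it extreme, yet it lands in an empty bin.

```python
import math
import random

random.seed(2)
# Bimodal sample: two well-separated clusters (illustrative data).
data = [random.gauss(-4.0, 0.5) for _ in range(5000)]
data += [random.gauss(4.0, 0.5) for _ in range(5000)]

lo, hi, nbins = -8.0, 8.0, 32
width = (hi - lo) / nbins


def bin_index(x):
    """Index of the equal-width bin that x falls into (may be out of range)."""
    return math.floor((x - lo) / width)


counts = [0] * nbins
for x in data:
    i = bin_index(x)
    if 0 <= i < nbins:
        counts[i] += 1


def is_outlier(x, min_count=5):
    """Flag x if it is outside the range or in a sparsely filled bin."""
    i = bin_index(x)
    return not (0 <= i < nbins) or counts[i] < min_count


if __name__ == "__main__":
    print(is_outlier(0.0))  # between the modes, close to the sample median
    print(is_outlier(4.1))  # inside the right-hand cluster
```

The same idea carries over to several dimensions by binning each 
coordinate, which is where quantiles stop being well defined.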

Best,
Oswin

Oswin Krause