On 2017-04-12 12:34, Bjorn Reese via Boost wrote:
On 04/12/2017 11:37 AM, Hans Dembinski via Boost wrote:
The library implements a histogram class (a highly configurable policy-based template) for C++ and Python in C++11 code. Histograms are a standard tool to explore Big Data. They allow one to visualise and analyse distributions of random variables. A histogram provides a lossy compression of input data. GBytes of input can be put in a compact form which requires only a small fraction of the original memory. This makes histograms convenient for interactive data analysis and further processing.
Given that the compression is lossy, I am wondering how it compares with a distribution estimator like:
https://arxiv.org/abs/1507.05073v2
A common use-case when collecting numerical data is to determine the quantiles. Boost.Accumulators contains an estimator (extended_p_square) for that.
The advantage of such estimators is that they execute in constant time and with constant memory usage, where the constant depends only on the required precision.
PS: I am aware that this is a non-trivial question, so I do not expect an answer.
Hi,

Simple answer: Histograms are not designed for estimating the quantile function, but for estimating the pdf. While it is true that a sufficiently good estimate of the pdf will give you an estimate of the quantiles via the inverse of the cdf, the obtainable precision depends on the size of the bins chosen for the histogram.

On the other hand, if your data is multivariate or your pdf is multimodal, you will have a hard time using quantiles, while you could still do, for example, outlier detection using histograms.

Best,
Oswin
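To illustrate the point about bin size, here is a minimal sketch in plain Python. It does not use the proposed library's API; the names `fill_histogram`, `histogram_quantile` and `exact_quantile` are made up for this example, and the linear interpolation inside a bin is just one simple choice. It estimates a quantile by inverting the cumulative bin counts and compares the result with the quantile computed from the raw samples. With equal-width bins, and as long as all samples fall inside the histogram range, the estimate and the sample quantile land in the same bin, so they differ by at most one bin width.

```python
import math
import random


def fill_histogram(data, lo, hi, nbins):
    """Count samples into nbins equal-width bins covering [lo, hi)."""
    width = (hi - lo) / nbins
    counts = [0] * nbins
    for x in data:
        if lo <= x < hi:
            # Guard against floating-point rounding at the upper edge.
            i = min(int((x - lo) / width), nbins - 1)
            counts[i] += 1
    return counts


def histogram_quantile(counts, lo, hi, q):
    """Estimate the q-quantile by inverting the cumulative bin counts.

    Assumption for this sketch: samples are spread uniformly inside a bin,
    so we interpolate linearly between the bin edges.
    """
    width = (hi - lo) / len(counts)
    target = q * sum(counts)
    cum = 0
    for i, c in enumerate(counts):
        if c > 0 and cum + c >= target:
            frac = (target - cum) / c
            return lo + (i + frac) * width
        cum += c
    return hi  # only reached for an empty histogram


def exact_quantile(data, q):
    """Sample quantile from the full data set: needs all samples in memory."""
    ordered = sorted(data)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


rng = random.Random(0)
lo, hi = -6.0, 6.0
# Keep only in-range samples so histogram and raw data describe the same set.
samples = [x for x in (rng.gauss(0.0, 1.0) for _ in range(100_000))
           if lo <= x < hi]

for nbins in (10, 100, 1000):
    counts = fill_histogram(samples, lo, hi, nbins)
    est = histogram_quantile(counts, lo, hi, 0.9)
    err = abs(est - exact_quantile(samples, 0.9))
    print(f"{nbins:5d} bins: width={(hi - lo) / nbins:.3f}  |error|={err:.5f}")
```

Note that the histogram's memory footprint is fixed by the number of bins, not by the number of samples, whereas `exact_quantile` has to keep and sort every sample.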