Re: [boost] [Review] Review of the Accumulators library begins today Jan 29

From: Matthias Troyer <troyer@phys.ethz.ch>
These were easy, trickier are robust estimators of variance and moments, even trickier will be median estimators where I currently even do not see how this should be done without storing most of the time series.
Isn't this already a problem with sequences? Suppose you're storing the current median, and a new value comes along. What's the new median - without checking all the previous values? There are statistical estimates for this, but I don't know any exact way other than essentially resorting the data and checking the new median.

- James Jones, Webmaster / Data Architect
Administrative Data Mgmt.
375 Raritan Center Pkwy, Suite A
Edison, NJ 08837

On 1/31/07, james.jones@firstinvestors.com <james.jones@firstinvestors.com> wrote:
From: Matthias Troyer <troyer@phys.ethz.ch>
These were easy, trickier are robust estimators of variance and moments, even trickier will be median estimators where I currently even do not see how this should be done without storing most of the time series.
Isn't this already a problem with sequences? Suppose you're storing the current median, and a new value comes along. What's the new median - without checking all the previous values? There are statistical estimates for this, but I don't know any exact way other than essentially resorting the data and checking the new median.
Good point. Combining partial results may not be possible for all accumulated statistics or there may be a cost to making a statistic "combinable". Perhaps a refinement of the Accumulator concept would be needed. The idea of the Accumulator library is to implement online algorithms; is there a corresponding word for online algorithms for which partial results can be combined?

—Ben

PS For reference, these two pages may be of use:
http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Algorithm_I...
http://en.wikipedia.org/wiki/Sum_of_normal_distributions
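For reference, here is a minimal, self-contained sketch of the kind of combinable partial result being discussed, using the pairwise combination formula for count/mean/variance from the first page linked above. The names (running_stats, combine) are invented for illustration and are not part of the proposed library.

-----------------------------------------------------------------------
#include <cstddef>
#include <cstdio>

// Hypothetical partial result for count/mean/variance, kept in the
// (count, mean, M2) form used by Welford-style online algorithms.
struct running_stats {
    std::size_t n;      // number of samples seen
    double mean;        // running mean
    double m2;          // sum of squared deviations from the running mean

    running_stats() : n(0), mean(0.0), m2(0.0) {}

    void operator()(double x) {               // accumulate one sample
        ++n;
        double delta = x - mean;
        mean += delta / n;
        m2   += delta * (x - mean);
    }
    double variance() const { return n > 1 ? m2 / (n - 1) : 0.0; }
};

// Combine two partial results; exact for count, mean and variance.
running_stats combine(running_stats a, const running_stats& b) {
    if (b.n == 0) return a;
    if (a.n == 0) return b;
    running_stats r;
    r.n = a.n + b.n;
    double delta = b.mean - a.mean;
    r.mean = a.mean + delta * b.n / r.n;
    r.m2   = a.m2 + b.m2 + delta * delta * (double(a.n) * b.n / r.n);
    return r;
}

int main() {
    running_stats left, right;
    for (int i = 1; i <= 3; ++i) left(i);      // samples 1, 2, 3
    for (int i = 4; i <= 6; ++i) right(i);     // samples 4, 5, 6
    running_stats all = combine(left, right);
    std::printf("mean=%g var=%g\n", all.mean, all.variance());  // mean=3.5 var=3.5
    return 0;
}
-----------------------------------------------------------------------

Min, max, count and sum combine even more simply; presumably that is what is meant by "trivial" combine methods elsewhere in this thread.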

Hi!

On Wednesday, 31. January 2007 16:19, Ben FrantzDale wrote:
Isn't this already a problem with sequences? Suppose you're storing the current median, and a new value comes along. What's the new median - without checking all the previous values? There are statistical estimates for this, but I don't know any exact way other than essentially resorting the data and checking the new median.
Good point. Combining partial results may not be possible for all accumulated statistics or there may be a cost to making a statistic "combinable". Perhaps a refinement of the Accumulator concept would be needed.
The idea of the Accumulator library is to implement online algorithms; is there a corresponding word for online algorithms for which partial results can be combined?
Actually, I don't see any "online algorithms" whose partial results cannot be combined. IIUC, what James pointed out is exactly that the median is not an online algorithm. AFAICS, the "partial result" (in terms of inner representation) of a median accumulator consists of the (possibly sorted) complete input seen so far. (I would love to be corrected, but I cannot imagine a solution.)

IMO, combining accumulators should be an integral part of the accumulators library.

On Wednesday, 31. January 2007 08:03, Matthias Troyer wrote:
Yes, this is definitely needed but I think it will be lots of work to add it, based on my own experience in implementing something like this in our ALPS library for parallel Monte Carlo simulations. I am planning to think about this sometime this year, but believe it is a substantial extension to the library and will not be easy to design right.
Is it really such a "substantial extension"? I fully agree with John here:

On Wednesday, 31. January 2007 10:23, John Maddock wrote:
But can't we make this a conceptual requirement: if you try to combine two accumulator sets, then each accumulator must have a "combine" method (otherwise you get a compile-time error). The "combine" methods for min/max/sum/count etc. are all trivial, aren't they - and could be added fairly easily?
Which accumulators would then have a non-trivial "combine" method?

Greetings,
Hans

On Jan 31, 2007, at 4:38 PM, Hans Meine wrote:
Hi!
On Wednesday, 31. January 2007 16:19, Ben FrantzDale wrote:
Isn't this already a problem with sequences? Suppose you're storing the current median, and a new value comes along. What's the new median - without checking all the previous values? There are statistical estimates for this, but I don't know any exact way other than essentially resorting the data and checking the new median.
Good point. Combining partial results may not be possible for all accumulated statistics or there may be a cost to making a statistic "combinable". Perhaps a refinement of the Accumulator concept would be needed.
The idea of the Accumulator library is to implement online algorithms; is there a corresponding word for online algorithms for which partial results can be combined?
Actually, I don't see any "online algorithms" whose partial results cannot be combined. ...
Is it really such a "substantial extension"? I fully agree with John here:
On Wednesday, 31. January 2007 10:23, John Maddock wrote:
But can't we make this a conceptual requirement: if you try to combine two accumulator sets, then each accumulator must have a "combine" method (otherwise you get a compile-time error). The "combine" methods for min/max/sum/count etc. are all trivial, aren't they - and could be added fairly easily?
Which accumulators would then have a non-trivial "combine" method?
There are two types of accumulators that I can think of immediately:

1. Approximate median and quantile estimators, such as the p^2 methods, cannot be combined but are very important for some applications.

2. Accumulators for correlated samples, e.g. autocorrelation estimators or jackknife bins, cannot easily be combined. Instead of just merging the results one has to store the results from the individual accumulator_sets separately and combine them only in the final evaluations. Please keep in mind that correlated samples are very common.

For 1. there is simply no way, and that is just it - one might either have to make this cause a compile-time error or drop the median and quantile estimators when combining. For 2. the problem is that combined accumulator sets need a different data structure than individual ones.

Matthias

Matthias: Forgive me, I haven't looked at your code yet. I wrote a few accumulators in my day, but I don't quite understand what you mean by "combine". One may combine online algorithms, e.g. pass the result of a filter to a summation or average accumulator. But I don't quite grasp what you might do by combining a median into another algorithm. Median is not really an online algorithm: it maintains some quantities (e.g. a sample), but only at certain times (periodically, or at the end) is the result of the computation (the median) queried. There is no continued output. Or you could imagine feeding the median continuously to another accumulator (average median of a stream?), but that stretches the imagination.

For sampling, which has been a focus of my research recently, there is of course reservoir sampling, but a few other special-purpose deterministic sampling algorithms have been devised. On my publication web page (http://photon.poly.edu/~hbr/publis-themes.html, section 5 - shameless plug, sorry) I give a few sampling algorithms. I could program them, if I have the time. The so-called Biased-L2 and DRS are much better at preserving correlations / association rules for transactional data sets.

I also meant to note in passing that it is well known in the database community that sampling is bad for estimating sizes of joins, and other quantities dependent on correlation. There are sophisticated ways (not too hard to program, though) to estimate them: the celebrated Alon, Gibbons, Matias, and Szegedy trick of projecting onto a number of random vectors (if you're interested / inclined: The Space Complexity of Approximating the Frequency Moments. J. Comput. Syst. Sci. 58(1): 137-147 (1999), portal.acm.org/citation.cfm?id=237823). They also fall within the area of data streams I mentioned earlier, and it is an extremely powerful method for keeping track of those quantities with a small memory footprint over a very large stream.

Not sure if everyone is interested in these kinds of things, I realize, but I hope I piqued the interest of a few.

Cheers,
--
Hervé Brönnimann
hervebronnimann@mac.com

On Jan 31, 2007, at 4:54 PM, Matthias Troyer wrote:
There are two types of accumulators that I can think of immediately:
1. approximate median and quantile estimators such as the p^2 methods cannot be combined but are very important for some applications
2. accumulators for correlated samples, e.g. autocorrelation estimators, or jackknife-bins cannot easily be combined. Instead of just merging the results one has to store the results from the individual accumulator_sets separately and combine them only in the final evaluations. Please keep in mind that correlated samples are very common.
For 1. there is simply no way, and that is just it - one might either have to make this cause a compile-time error or drop the median and quantile estimators when combining. For 2. the problem is that combined accumulator sets need a different data structure than individual ones.
Matthias

Hi!

On Thursday, 01. February 2007 02:39, Hervé Brönnimann wrote:
Matthias: Forgive me, I haven't looked at your code yet. I wrote a few accumulators, in my days, but I don't quite understand what you mean by "combine".
If you look back in the thread, "combine" here means that you have two accumulators of the same type containing partial results for two sets of samples and want to combine the accumulators to get the result for the combined sample set.

--
Ciao, / / /--/ / / ANS

On Wednesday, 31. January 2007 22:54, Matthias Troyer wrote:
Which accumulators would then have a non-trivial "combine" method?
There are two types of accumulators that I can think of immediately:
Thanks for your answer, Matthias; I already suspected that you had some accumulators in mind which I don't know about. Thanks for giving examples. (I will have to look them up somewhere.)
1. approximate median and quantile estimators such as the p^2 methods cannot be combined but are very important for some applications
2. accumulators for correlated samples, e.g. autocorrelation estimators, or jackknife-bins cannot easily be combined. Instead of just merging the results one has to store the results from the individual accumulator_sets separately and combine them only in the final evaluations. Please keep in mind that correlated samples are very common.
For 1. there is simply no way, and that is just it - one might either have to make this cause a compile-time error or drop the median and quantile estimators when combining.
I'd say that a compile-time error would be perfectly OK. It's just that having a method for combining accumulators would open your library up to many more applications. (E.g., in my case I would be interested in using it for image analysis, where I am using a sort of accumulator for measuring properties of image regions, which are eventually merged.)
For 2. the problem is that combined accumulator sets need a different data structure than individual ones.
I would be very happy if you found a way to integrate a simple way of combining soon, and postpone a solution of that latter problem. Right now, I don't think it would be impractical to require the user of your library to specify in advance that additional data structures shall be maintained in case he/she wants to combine accumulators later.

Ciao, / / /--/ / / ANS

On 1 Feb 2007, at 04:36, Hans Meine wrote:
On Wednesday, 31. January 2007 22:54, Matthias Troyer wrote:
1. approximate median and quantile estimators such as the p^2 methods cannot be combined but are very important for some applications
2. accumulators for correlated samples, e.g. autocorrelation estimators, or jackknife-bins cannot easily be combined. Instead of just merging the results one has to store the results from the individual accumulator_sets separately and combine them only in the final evaluations. Please keep in mind that correlated samples are very common.
For 1. there is simply no way, and that is just it - one might either have to make this cause a compile-time error or drop the median and quantile estimators when combining.
I'd say that a compile-time error would be perfectly OK. It's just that having a method for combining accumulators would open your library up to many more applications. (E.g., in my case I would be interested in using it for image analysis, where I am using a sort of accumulator for measuring properties of image regions, which are eventually merged.)
For 2. the problem is that combined accumulator sets need a different data structure than individual ones.
I would be very happy if you found a way to integrate a simple way for combining soon, and postpone a solution of that latter problem.
Right now, I don't think it would be impractical to require the user of your library to specify in advance that additional data structures shall be maintained in case he/she wants to combine accumulators later.
A possible solution would indeed be to have a simple combine method for those accumulators where this is possible, and to work on a general solution later. This would immediately give the combine functionality to a part of the user community, while for the more demanding applications we have to wait.

Matthias

On 2/8/07, Matthias Troyer <troyer@phys.ethz.ch> wrote:
... A possible solution would indeed be to have a simple combine method for those accumulators where this is possible, and to work on a general solution later. This would immediately give the combine functionality to a part of the user community, while for the more demanding applications we have to wait.
Matthias
That sounds reasonable. Two things:

1. It might be a good idea to decide now on a concept for Combinable Accumulator so people don't run off and make ad-hoc ones. The syntax acc1(acc2) seems nice and has the most obvious semantics of the options I can think of. On the other hand, acc1 = combine(acc1, acc2); might be more generic.

2. Since some accumulators would have to be implemented differently to be combinable, would it make sense to use the dependency mechanism to request a combinable accumulator? That is, a syntax like

    accumulator_set<double, stats<tag::mean, tag::moment<2>, tag::combinable> > acc;

If possible, this would be a nice syntax to request that every accumulator be combinable.
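To make the two proposed spellings concrete, here is a small hand-rolled sketch. The min_count_acc type is invented for illustration and is unrelated to the library's accumulator_set, and tag::combinable itself is only a proposal at this point.

-----------------------------------------------------------------------
#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <limits>

// Invented stand-in for an accumulator set holding min and count, just to
// make the two proposed spellings concrete. This is not accumulator_set.
struct min_count_acc {
    double      min_;
    std::size_t count;

    min_count_acc()
      : min_(std::numeric_limits<double>::infinity()), count(0) {}

    void operator()(double x) { min_ = std::min(min_, x); ++count; }

    // Spelling 1: acc1(acc2) -- a merging overload of the call operator.
    void operator()(const min_count_acc& other) {
        min_  = std::min(min_, other.min_);
        count += other.count;
    }
};

// Spelling 2: acc1 = combine(acc1, acc2) -- a non-member function.
min_count_acc combine(min_count_acc a, const min_count_acc& b) {
    a(b);
    return a;
}

int main() {
    min_count_acc a, b;
    a(1.0); a(5.0);
    b(-2.0); b(7.0);

    min_count_acc m1 = a; m1(b);          // spelling 1
    min_count_acc m2 = combine(a, b);     // spelling 2
    std::printf("%g/%u  %g/%u\n", m1.min_, unsigned(m1.count),
                                  m2.min_, unsigned(m2.count));  // -2/4  -2/4
    return 0;
}
-----------------------------------------------------------------------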
—Ben

On Jan 31, 2007, at 10:38 AM, Hans Meine wrote:
On Wednesday, 31. January 2007 16:19, Ben FrantzDale wrote:
Isn't this already a problem with sequences? Suppose you're storing the current median, and a new value comes along. What's the new median - without checking all the previous values? There are statistical estimates for this, but I don't know any exact way other than essentially resorting the data and checking the new median.
Good point. Combining partial results may not be possible for all accumulated statistics or there may be a cost to making a statistic "combinable". Perhaps a refinement of the Accumulator concept would be needed.
The idea of the Accumulator library is to implement online algorithms; is there a corresponding word for online algorithms for which partial results can be combined?
Actually, I don't see any "online algorithms" whose partial results cannot be combined.
IIUC, what James pointed out is exactly that the median is not an online algorithm. AFAICS, the "partial result" (in terms of inner representation) of a median accumulator consists of the (possibly sorted) complete input seen so far. (I would love to be corrected, but I cannot imagine a solution.)
Just to contribute to the discussion: the keyword for this sort of algorithm is Data Streams (where your memory storage must be small, e.g. bounded or very slowly growing, no matter how many elements the stream has), and they're now part of any theoretical computer science conference (database conferences also sport a few of these). Theoretical computer science tells you that indeed, it is not mathematically possible to maintain the median of a stream of N values without Omega(N) storage, essentially remembering all the values. BTW, there are faster algorithms for finding the median than sorting, using the quickselect algorithm for instance (essentially std::nth_element). It's also possible in linear storage to maintain the median at logarithmic cost per new element.

Online algorithms is not a well-defined term in computer science, but online or competitive analysis of algorithms refers to something else: the online competitiveness ratio of an algorithm that processes a stream online is the ratio between its solution and the solution of the best offline algorithm (which knows all the values in advance). I.e. it measures what you lose by not knowing the future.

Examples of data stream algorithms are of course the average (you only need to maintain sum and count), or higher moments (sum of squares, etc.), with all the pitfalls that have been detailed with John's torture tests earlier in this stream. But if you don't insist on the exact median, you can get approximate medians (of rank within (1/2 +/- epsilon)) for a data stream cheaply, using a random sample of size roughly 1/epsilon (a bit larger actually, to take care of the random deviations). There are deterministic algorithms as well (by Manku and Motwani, or Greenwald and Khanna).

Another simple problem is: maintain the count of *unique* items (i.e. eliminating the redundant ones) over a stream of N elements. At first it seems you need to maintain a hash-set, or some representation, of all the distinct elements. But it's possible to do this approximately as well, with very small storage. Another fascinating sort of problem happens when you want to maintain e.g. the average of the last M elements over a stream of N elements, with M fixed, referred to as "sliding window" problems. Or when elements can be cancelled from the stream.

A good but somewhat dated survey on data stream algorithms, for those who want/have time to delve deeper, is given by S. Muthukrishnan: http://athos.rutgers.edu/~muthu/stream-1-1.ps (in Postscript). Much food for thought. Muthu used to be at AT&T as well as Rutgers and now works at Google (hint, hint...)

Cheers,
--
Hervé Brönnimann
hervebronnimann@mac.com
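As a concrete illustration of the random-sample approach to approximate medians mentioned above, here is a sketch of classic reservoir sampling (Algorithm R). The class name and the use of std::rand are illustrative choices, not anything from the library under review.

-----------------------------------------------------------------------
#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <vector>

// Classic reservoir sampling (Algorithm R): keep a uniform random sample of
// fixed size k over a stream of unknown length; the sorted sample then
// yields approximate quantiles (here, an approximate median).
class reservoir_sampler {
public:
    explicit reservoir_sampler(std::size_t k) : k_(k), seen_(0) {}

    void operator()(double x) {
        ++seen_;
        if (sample_.size() < k_) {
            sample_.push_back(x);                    // fill the reservoir first
        } else {
            std::size_t j = std::rand() % seen_;     // crude RNG, fine for a sketch
            if (j < k_) sample_[j] = x;              // keep x with probability k/seen
        }
    }

    double approximate_median() const {
        std::vector<double> s(sample_);
        std::sort(s.begin(), s.end());
        return s.empty() ? 0.0 : s[s.size() / 2];    // middle of the sorted sample
    }

private:
    std::size_t k_, seen_;
    std::vector<double> sample_;
};

int main() {
    std::srand(42);                                  // fixed seed, reproducible sketch
    reservoir_sampler rs(101);                       // sample size ~ 1/epsilon
    for (int i = 1; i <= 100000; ++i) rs(double(i));
    std::printf("approximate median: %g\n", rs.approximate_median());  // near 50000
    return 0;
}
-----------------------------------------------------------------------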

james.jones@firstinvestors.com wrote:
From: Matthias Troyer <troyer@phys.ethz.ch>
These were easy, trickier are robust estimators of variance and moments, even trickier will be median estimators where I currently even do not see how this should be done without storing most of the time series.
Isn't this already a problem with sequences? Suppose you're storing the current median, and a new value comes along. What's the new median - without checking all the previous values? There are statistical estimates for this, but I don't know any exact way other than essentially resorting the data and checking the new median.
To calculate an exact median, you're correct. Of course, nothing stops you from writing an accumulator that calculates the exact median by storing all the samples seen so far. There are approximate median algorithms that don't need to do that, though, and that's what Matthias is referring to. Those might be tricky to combine.

--
Eric Niebler
Boost Consulting
www.boost-consulting.com

On Jan 31, 2007, at 7:47 PM, Eric Niebler wrote:
james.jones@firstinvestors.com wrote:
From: Matthias Troyer <troyer@phys.ethz.ch>
These were easy, trickier are robust estimators of variance and moments, even trickier will be median estimators where I currently even do not see how this should be done without storing most of the time series.
Isn't this already a problem with sequences? Suppose you're storing the current median, and a new value comes along. What's the new median - without checking all the previous values? There are statistical estimates for this, but I don't know any exact way other than essentially resorting the data and checking the new median.
To calculate an exact median, you're correct. Of course, nothing stops you from writing an accumulator that calculates the exact median by storing all the samples seen so far. There are approximate median algorithms that don't need to do that, though, and that's what Matthias is referring to. Those might be tricky to combine.
Correct. These approximate algorithms (e.g. the p_square methods) are impossible to combine. Also many of the algorithms for correlated sequences of data are hard or impossible to combine.

Matthias

From: Matthias Troyer <troyer@phys.ethz.ch>
[..] even trickier will be median estimators where I currently even do not see how this should be done without storing most of the time series.
james.jones@firstinvestors.com wrote:
Isn't this already a problem with sequences? Suppose you're storing the current median, and a new value comes along. What's the new median - without checking all the previous values? There are statistical estimates for this, but I don't know any exact way other than essentially resorting the data and checking the new median.
On 1/31/07, Eric Niebler <eric@boost-consulting.com> wrote:
To calculate an exact median, you're correct. Of course, nothing stops you from writing an accumulator that calculates the exact median by storing all the samples seen so far. [..]
As has been commented around here, you just need to store unique values, and it can be done with O(log N) time per sample and O(N) extra storage (N is not the number of samples, but the number of unique values).

IMHO, the best way to do it is using a tree that is:
- balanced
- sorted (for binary search)
- augmented with counters that are added bottom-up, allowing random access (rank tree or order statistic tree)

I am attaching a sample program that solves this problem using a Rank List (proposed for inclusion in Boost some months ago):
http://boost.cvs.sourceforge.net/*checkout*/boost-sandbox/boost-sandbox/libs...

The most interesting part is:

-----------------------------------------------------------------------
samples_container A;                        // Unique values only
samples_container::iterator p;

// [...]

sample = rand ();                           // Get next sample
found = A.binary_search (sample, p);        // Search for it in the container

if (found)                                  // Already there?
  A.npsv_set_width (p, A.npsv_width(p)+1);  // count ++
else                                        // Not there?
  A.insert (p, sample);                     // count = 1

total = A.npsv_width ();                    // Go to the middle (50%) of the
p = A.npsv_at_pos (total/2);                // Non Proportional Sequence View
median = *p;                                // (the median is there)
-----------------------------------------------------------------------

An accumulator storing a tree of this kind would be able to calculate not only the median, but any desired percentile in O(log N) time. The drawback is that:
- accumulating a sample would take O(log N) too
- O(N) extra storage

Combining two accumulators of this kind (containing all unique values in a tree) would require O(N+M) time, where N and M are the numbers of unique values stored in the respective accumulators. Though, I guess that the median (or any percentile) might be calculated in O(log^2 N + log^2 M) without actually merging... I'm not sure...

The extra storage is a very important drawback. If an algorithm like this one was ever included in the accumulators, it should be optional and disabled by default. Though, storing the values in a Rank List would allow very precise calculations. Note that a sequential sum is prone to cumulative errors, since the first samples pass through O(N) operations. In a Rank List every value would pass through about O(log N) operations.

Regards,
Martin

Martin:

On Feb 1, 2007, at 1:29 PM, Martin Knoblauch Revuelta wrote:
IMHO, the best way to do it is using a tree that is:
- balanced
- sorted (for binary search)
- augmented with counters that are added bottom-up, allowing random access (rank tree or order statistic tree)
I am attaching a sample program that solves this problem using a Rank List (proposed for inclusion in Boost some months ago): http://boost.cvs.sourceforge.net/*checkout*/boost-sandbox/boost-sandbox/libs/rank_list/doc/index.html
If the purpose is calculating the median, forgive me for being the academic professor. But that is overkill. You should maintain two binary heaps of equal size (plus or minus one), with the lower half being a max-heap and the upper half being a min-heap. Upon adding an element, you insert it into the appropriate heap, and if one heap grows larger than the other by more than one, you rebalance the two heaps by transferring the min of the second heap into the first heap, or the max of the first heap into the second. The runtime is still O(log N) per element and the storage O(N), but all storage is now sequential, using two std::vector, all the algorithms are provided by std:: (you can even use an std::priority_queue out of the box), and being much more compact, the constants will be much lower.

Note that there are also min-max heaps that work in O(log N) time per insertion or extract-min or extract-max, and I should really provide one in my boost.Minmax library. They also work by using std::vector storage and an implicit binary tree. That enables exact quantiles in O(t + log N) time per insertion. Maintaining t quantiles means t separate vectors, so the overhead would be up to double the storage (assuming the doubling rule for expanding vectors), and you'd have to compare that to an r/b tree node. For elements such as integers, double the storage still compares favorably to binary tree nodes, which have an overhead of three pointers and a rank per integer.
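A minimal sketch of the two-heap scheme just described, using std::priority_queue out of the box; the class name and interface are invented for illustration.

-----------------------------------------------------------------------
#include <cstdio>
#include <functional>
#include <queue>
#include <vector>

// Running exact median with two heaps: 'low' is a max-heap holding the
// lower half, 'high' a min-heap holding the upper half; their sizes
// differ by at most one.
class running_median {
public:
    void operator()(double x) {
        if (low.empty() || x <= low.top()) low.push(x);
        else                               high.push(x);
        // Rebalance so that low.size() is high.size() or high.size() + 1.
        if (low.size() > high.size() + 1)  { high.push(low.top()); low.pop(); }
        else if (high.size() > low.size()) { low.push(high.top()); high.pop(); }
    }

    // Precondition: at least one sample has been accumulated.
    double median() const {
        if (low.size() == high.size())
            return (low.top() + high.top()) / 2.0;   // even count: average the two middles
        return low.top();                            // odd count: the middle lives in 'low'
    }

private:
    std::priority_queue<double> low;                 // max-heap (lower half)
    std::priority_queue<double, std::vector<double>,
                        std::greater<double> > high; // min-heap (upper half)
};

int main() {
    running_median m;
    double data[] = { 5.0, 1.0, 3.0, 2.0, 4.0 };
    for (int i = 0; i < 5; ++i) m(data[i]);
    std::printf("median = %g\n", m.median());        // prints 3
    return 0;
}
-----------------------------------------------------------------------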
An accumulator storing a tree of this kind would be able to calculate not only the median, but any desired percentile in O(log N) time. The drawback is that:
- accumulating a sample would take O(log N) too
- O(N) extra storage
Agreed. If you don't want just the median, but need the rank selection operation for any rank, a tree would also be my way to go. However, in that case, I wouldn't use the accumulator library to start with. Even for applying this to a tail of a distribution, I humbly submit a sorted vector would be better (assuming the tail is small, i.e. constant size) due to compactness in the code, locality of reference, etc. Then there is the issue of distinct numbers: you'd have to decide for your application if N is much smaller than the total number of multiplicities...
Combining two accumulators of this kind (containing all unique values in a tree) would require O(N+M) time, where N and M are the numbers of unique values stored in the respective accumulators. Though, I guess that the median (or any percentile) might be calculated in O(square(log N)+square(log M)) without actually merging... I'm not sure...
You're correct. Given two sorted ranges (which your trees simulate with an overhead for random access to log(N) or log(M)), by applying binary search you can find the median of the union in an extra multiplicative logarithmic factor.
The extra storage is a very important drawback. If an algorithm like this one was ever included in the accumulators, it should be optional and disabled by default.
I agree with this. Once you provide it, people will use it. Better make sure they darn know what they're getting into...
Though, storing the values in a Rank List would allow very precise calculations. Note that a sequential sum is prone to cumulative errors, since the first samples pass through O(N) operations. In a Rank List every value would pass through about O(log N) operations.

Regards,
Martin
On a side note, maintaining the elements in order allows one to recompute the sum (from smaller to larger in absolute values) with the same precision as the underlying floating point, i.e. 2^-52 for IEEE doubles. If precision is a real issue, that can be an incentive. On the other hand, if N is small, I'd much rather maintain the sum in exact representation (using for instance the fact that IEEE doubles can be added exactly using a trick due to Dekker, using seven IEEE additions and maintaining the result as two non-overlapping doubles).

Cheers,
--
Hervé Brönnimann
hervebronnimann@mac.com
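For the curious, here is a sketch of the error-free addition alluded to above (the Knuth/Dekker two-sum), keeping a running sum as a high part plus the accumulated low-order error. This is a simplified compensated sum rather than full double-double arithmetic, it assumes strict IEEE arithmetic (no fast-math), and the names are invented for illustration.

-----------------------------------------------------------------------
#include <cstdio>

// Error-free transformation: after two_sum(a, b, s, e), s + e == a + b
// exactly, with s = fl(a + b) and e the rounding error (Knuth's TwoSum,
// valid for any ordering of a and b).
void two_sum(double a, double b, double& s, double& e) {
    s = a + b;
    double bv = s - a;
    double av = s - bv;
    e = (a - av) + (b - bv);
}

// Running sum kept as a high part plus accumulated low-order error.
struct compensated_sum {
    double hi, lo;
    compensated_sum() : hi(0.0), lo(0.0) {}

    void operator()(double x) {
        double s, e;
        two_sum(hi, x, s, e);   // add x to the high part, capture the rounding error
        hi = s;
        lo += e;                // keep the lost low-order bits separately
    }
    double value() const { return hi + lo; }
};

int main() {
    compensated_sum sum;
    double naive = 0.0;
    for (int i = 0; i < 10; ++i) { sum(0.1); naive += 0.1; }
    std::printf("compensated: %.17g\nnaive:       %.17g\n", sum.value(), naive);
    return 0;
}
-----------------------------------------------------------------------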

On Feb 1, 2007, at 1:29 PM, Martin Knoblauch Revuelta wrote:
IMHO, the best way to do it is using a tree [..]
On 2/2/07, Hervé Brönnimann <hervebronnimann@mac.com> wrote:
If the purpose is calculating the median, forgive me for being the academic professor.
Oh, please, be.
But that is an overkill. You should maintain two binary heaps of equal size (plus or minus one) with the lower half being a max heap, and the upper half being a min-heap. [..] being much more compact, the constants will be much lower.
You are right.
Note that there are also min-max heaps that work in O(log N) time per insertion or extract-min or extract-max, [..] That enables to have exact quantiles in O(t+log N) time per insertion. Maintaining t quantiles [..]
Sounds really good.
[..] the overhead would be up to double the storage (assuming doubling rule for expanding vectors)
Does it depend on t (the number of quantiles)?
and you'd have to compare that to a r/b tree node. For elements such as integers, double still compares favorably to binary tree nodes which has an overhead of three pointers and a rank per integer.
Oh yes, and in my solution you'd have to add five fields more (two pointers and three integers).
[..] not only the median, but any desired percentile in O(log N) time[..]
Agreed. If you don't want just the median, but have the rank selection operation for any rank, a tree would be also my way to go.
That's why I posted my message. It can be slightly off topic, since it requires O(N) storage, but it was mentioned several times and I couldn't resist the temptation... :)
However, in that case, I wouldn't use the accumulator library to start with.
Yes, storing samples (or just unique values) sounds to me contrary to what accumulators are intended to be.
[..] Then there is the issue of distinct numbers, you'd have to decide for your application if N is much smaller than the total number of multiplicities...
Of course, this depends on the number of possible values, the nature of the distribution... By the way, you could do the same thing with the heaps (or with the sorted vector). I mean, storing just unique values, together with counters.
[..] precise calculations. [..]
On a side note, maintaining the elements in order allows to recompute the sum (from smaller to larger in absolute values) with the same precision as the underlying floating point, i.e. 2^-52 for IEEE doubles.
Good point. Though, "recompute the sum" takes O(N) time. If the sum is required after every insertion...
If precision is a real issue, that can be an incentive. On the other hand, if N is small, I'd much rather maintain the sum in exact representation (using for instance the fact that IEEE doubles can be added exactly using a trick due to Dekker, using seven IEEE additions and maintaining the result as two non-overlapping doubles).
Sounds interesting. I'll try to read something about it. Thanks for your detailed and illustrative response.

Regards,
Martin

On 01/02/07, james.jones@firstinvestors.com <james.jones@firstinvestors.com> wrote:
From: Matthias Troyer <troyer@phys.ethz.ch>
<snip>
Isn't this already a problem with sequences? Suppose you're storing the current median, and a new value comes along. What's the new median - without checking all the previous values? There are statistical estimates for this, but I don't know any exact way other than essentially resorting the data and checking the new median.
You can get an improvement to an on-line median by storing some of the central state. If the distribution is kind and you get lucky, you can avoid a lot of checking of previous values, giving nice improvements. Similar stuff works for min/max, quartiles and percentiles. For windowed estimates you need to account for values falling off as well as those coming in, but the same approach works. Done wisely you can get an order of magnitude improvement for appropriate data.

$AUD 0.0002

Matt.
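To make the windowed min/max case concrete, here is the standard monotonic-deque technique, which maintains the minimum of the last M samples in amortized O(1) per value. The class name and interface are invented for illustration and are not part of the library.

-----------------------------------------------------------------------
#include <cstddef>
#include <cstdio>
#include <deque>
#include <utility>

// Sliding-window minimum over the last 'window' samples, amortized O(1)
// per value: the deque holds (index, value) pairs with increasing values,
// so the front is always the minimum of the current window.
class windowed_min {
public:
    explicit windowed_min(std::size_t window) : window_(window), index_(0) {}

    void operator()(double x) {
        // Values >= x at the back can never be the window minimum again.
        while (!q_.empty() && q_.back().second >= x) q_.pop_back();
        q_.push_back(std::make_pair(index_, x));
        ++index_;
        // Evict the front once it falls outside the last 'window_' samples.
        while (q_.front().first + window_ < index_) q_.pop_front();
    }

    // Precondition: at least one sample has been accumulated.
    double min() const { return q_.front().second; }

private:
    std::size_t window_, index_;
    std::deque<std::pair<std::size_t, double> > q_;
};

int main() {
    windowed_min wm(3);
    double data[] = { 5.0, 3.0, 4.0, 1.0, 7.0, 8.0, 9.0 };
    for (int i = 0; i < 7; ++i) { wm(data[i]); std::printf("%g ", wm.min()); }
    std::printf("\n");   // prints: 5 3 3 1 1 1 7
    return 0;
}
-----------------------------------------------------------------------

The same structure with the comparison reversed gives a windowed maximum; windowed quantiles need more bookkeeping, as noted above.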
participants (8):
- Ben FrantzDale
- Eric Niebler
- Hans Meine
- Hervé Brönnimann
- james.jones@firstinvestors.com
- Martin Knoblauch Revuelta
- Matt Hurd
- Matthias Troyer