
On Feb 1, 2007, at 1:29 PM, Martin Knoblauch Revuelta wrote:
IMHO, the best way to do it is using a tree that is:
- balanced
- sorted (for binary search)
- augmented with counters that are added bottom-up, allowing random access (a rank tree or order statistic tree)
I'm sending attached a sample program that solves this problem using a Rank List (proposed for inclusion in Boost some months ago): http://boost.cvs.sourceforge.net/*checkout*/boost-sandbox/boost-sandbox/libs/rank_list/doc/index.html
If the purpose is calculating the median, forgive me for being the academic professor, but that is overkill. You should maintain two binary heaps of equal size (plus or minus one), with the lower half in a max-heap and the upper half in a min-heap. Upon adding an element, you insert it into the appropriate heap, and if one heap grows larger than the other by more than one, you rebalance the two heaps by transferring the min of the second heap into the first, or the max of the first into the second. The runtime is still O(log N) per element and the storage O(N), but all storage is now sequential, using two std::vector, all the algorithms are provided by std:: (you can even use an std::priority_queue out of the box), and being much more compact, the constants will be much lower.

Note that there are also min-max heaps that support insertion, extract-min and extract-max in O(log N) time each, and I should really provide one in my Boost.Minmax library. They also use std::vector storage with an implicit binary tree. That makes it possible to maintain exact quantiles in O(t + log N) time per insertion. Maintaining t quantiles means t separate vectors, so the overhead would be up to double the storage (assuming the doubling rule for expanding vectors), and you'd have to compare that to a red-black tree node. For elements such as integers, double the storage still compares favorably to binary tree nodes, which carry an overhead of three pointers and a rank per integer.
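To make that concrete, here is a minimal sketch of the two-heap scheme using std::priority_queue out of the box; the class name and interface are made up for illustration and are not part of any existing library:

#include <functional>
#include <queue>
#include <vector>

// Running median kept in two heaps: a max-heap holding the lower half of the
// samples and a min-heap holding the upper half.  Insertion is O(log N);
// the median is read off the heap tops in O(1).
class running_median
{
    std::priority_queue<double> lower_;  // max-heap (lower half)
    std::priority_queue<double, std::vector<double>,
                        std::greater<double> > upper_;  // min-heap (upper half)

public:
    void insert(double x)
    {
        // Put the sample into the half it belongs to.
        if (lower_.empty() || x <= lower_.top())
            lower_.push(x);
        else
            upper_.push(x);

        // Rebalance so the sizes never differ by more than one.
        if (lower_.size() > upper_.size() + 1)
        {
            upper_.push(lower_.top());
            lower_.pop();
        }
        else if (upper_.size() > lower_.size() + 1)
        {
            lower_.push(upper_.top());
            upper_.pop();
        }
    }

    // Assumes at least one sample has been inserted.
    double median() const
    {
        if (lower_.size() > upper_.size()) return lower_.top();
        if (upper_.size() > lower_.size()) return upper_.top();
        return (lower_.top() + upper_.top()) / 2.0;  // even count: midpoint
    }
};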
An accumulator storing a tree of this kind would be able to calculate not only the median, but any desired percentile in O(log N) time. The drawbacks are that:
- accumulating a sample would take O(log N) too
- O(N) extra storage is required
Agreed. If you don't want just the median, but the rank selection operation for any rank, a tree would also be my way to go. However, in that case, I wouldn't use the accumulator library to start with. Even for applying this to a tail of a distribution, I humbly submit that a sorted vector would be better (assuming the tail is small, i.e. of constant size) due to compactness of the code, locality of reference, etc. Then there is the issue of distinct numbers: you'd have to decide for your application whether N is much smaller than the total number of multiplicities...
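For concreteness, a rough sketch of that sorted-vector idea for a constant-size tail (the class name and interface are invented for illustration):

#include <algorithm>
#include <cstddef>
#include <vector>

// Keeps the t largest samples seen so far in a sorted std::vector.
// For constant t, each insertion is O(log t) to locate the position plus
// O(t) to shift, which in practice beats a tree thanks to compactness
// and locality of reference.
class largest_tail
{
    std::size_t t_;
    std::vector<double> tail_;  // kept sorted in increasing order

public:
    explicit largest_tail(std::size_t t) : t_(t) { tail_.reserve(t + 1); }

    void insert(double x)
    {
        std::vector<double>::iterator pos =
            std::lower_bound(tail_.begin(), tail_.end(), x);
        tail_.insert(pos, x);
        if (tail_.size() > t_)
            tail_.erase(tail_.begin());  // drop the smallest of the t+1
    }

    const std::vector<double>& values() const { return tail_; }
};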
Combining two accumulators of this kind (containing all unique values in a tree) would require O(N+M) time, where N and M are the numbers of unique values stored in the respective accumulators. Though, I guess that the median (or any percentile) might be calculated in O(log^2 N + log^2 M) without actually merging... I'm not sure...
You're correct. Given two sorted ranges (which your trees simulate, with a log(N) or log(M) overhead for random access), applying binary search lets you find the median of the union at the cost of an extra multiplicative logarithmic factor.
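To sketch the binary search (names and signatures invented for illustration; with the trees, the indexed accesses a[i-1] and b[j-1] cost O(log N) and O(log M) instead of O(1)):

#include <algorithm>
#include <cassert>
#include <cstddef>

// k-th smallest (1-based) element of the union of two sorted ranges,
// using O(log N + log M) steps: each step discards about k/2 elements
// from one of the two ranges.
double kth_of_union(const double* a, std::size_t na,
                    const double* b, std::size_t nb,
                    std::size_t k)
{
    assert(k >= 1 && k <= na + nb);
    if (na == 0) return b[k - 1];
    if (nb == 0) return a[k - 1];
    if (k == 1)  return std::min(a[0], b[0]);

    std::size_t i = std::min(na, k / 2);  // probe roughly k/2 deep in each range
    std::size_t j = std::min(nb, k / 2);

    if (a[i - 1] <= b[j - 1])
        // a[i-1] has rank at most i+j-1 < k in the union,
        // so a[0..i-1] cannot contain the answer.
        return kth_of_union(a + i, na - i, b, nb, k - i);
    else
        return kth_of_union(a, na, b + j, nb - j, k - j);
}

// The median of the union is kth_of_union(a, na, b, nb, (na + nb + 1) / 2).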
The extra storage is a very important drawback. If an algorithm like this one were ever included in the accumulators, it should be optional and disabled by default.
I agree with this. Once you provide it, people will use it. Better make sure they darn well know what they're getting into...
Though, storing the values in a Rank List would allow very precise calculations. Note that a sequential sum is prone to cumulative errors, since the first samples pass through O(N) operations. In a Rank List every value would pass through about O(log N) operations.

Regards,
Martin
On a side note, maintaining the elements in order makes it possible to recompute the sum (from smaller to larger in absolute value) with the same precision as the underlying floating point, i.e. 2^-52 for IEEE doubles. If precision is a real issue, that can be an incentive. On the other hand, if N is small, I'd much rather maintain the sum in exact representation (using, for instance, the fact that IEEE doubles can be added exactly using a trick due to Dekker, using seven IEEE additions and maintaining the result as two non-overlapping doubles).

Cheers,
--
Hervé Brönnimann
hervebronnimann@mac.com
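PS: In case it helps, a minimal sketch in that spirit: two_sum below is the error-free addition of two doubles (the branch-free variant, which needs no ordering of the operands), and two_fold_sum uses it for compensated accumulation. Both names are invented for illustration and are not part of any existing library.

#include <cstdio>

// Error-free transformation of a + b: on return s is the rounded sum and
// err the exact rounding error, so that s + err == a + b exactly.
void two_sum(double a, double b, double& s, double& err)
{
    s = a + b;
    double bv = s - a;   // the part of b actually absorbed into s
    double av = s - bv;  // the part of a actually absorbed into s
    err = (a - av) + (b - bv);
}

// Running sum maintained as a pair (hi, lo) of doubles;
// the rounding errors are themselves accumulated in lo.
struct two_fold_sum
{
    double hi, lo;
    two_fold_sum() : hi(0.0), lo(0.0) {}

    void add(double x)
    {
        double s, e;
        two_sum(hi, x, s, e);
        hi = s;
        lo += e;
    }

    double value() const { return hi + lo; }
};

int main()
{
    two_fold_sum acc;
    for (int i = 0; i < 10000000; ++i)
        acc.add(0.1);  // 0.1 is not exactly representable in binary
    std::printf("%.17g\n", acc.value());  // much more accurate than a plain loop sum
    return 0;
}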