
Matthias: Forgive me, I haven't looked at your code yet. I wrote a few accumulators, in my days, but I don't quite understand what you mean by "combine". One may combine online algorithms, e.g. pass the result of a filter to a summation or average accumulator. But I don't quite grasp what you might do by combining a median into another algorithm. Median is not really an online algorithms: It maintains some quantities (e.g. a sample), but only at certain times (periodically, or at the end) the result of the computation (median) is queried. There is no continued output. Or you could imagine feeding the median continuously to another accumulator (average median of stream?) but it stretches the imagination? For sampling, which has been a focus of my research recently, there is of course the reservoir sampling, but a few other special-purpose deterministic sampling algorithms have been devised. On my publication web page, (http://photon.poly.edu/~hbr/publis- themes.html, section 5 - shameless plug, sorry)) I give a few sampling algorithms. I could program them, if I have the time. The so-called Biased-L2 and DRS are much better at preserving correlations / association rules for transactional data sets. I also meant to note in passing that it is well known in database community that sampling is bad for estimating sizes of join, and other quantities dependent on correlation. There are sophisticated ways (not too hard to program, though) to estimate them: the celebrated Alon, Gibbons, Mattias, and Szegedy trick of projecting onto a number of random vectors (if you're interested / inclined: The Space Complexity of Approximating the Frequency Moments. J. Comput. Syst. Sci. 58(1): 137-147 (1999), portal.acm.org/citation.cfm? id=237823). They also fall within the area of data streams I mentioned earlier, and it is an extremely powerful method for keeping track of those quantities with a small memory footprint over a very large stream. Not sure if everyone is interested in these kinds of things, I realize, but I hope I piqued the interest of a few. Cheers, -- Hervé Brönnimann hervebronnimann@mac.com On Jan 31, 2007, at 4:54 PM, Matthias Troyer wrote:
There are two types of accumulators that I can think of immediately:
1. approximate median and quantile estimators such as the p^2 methods cannot be combined but are very important for some applications
2. accumulators for correlates samples, e.g. autocorrelation estimators, or jackknife-bins cannot easily be combined. Instead of just merging the results one has to store the results from the individual accumulator_sets separately and combine them only in the final evaluations. Please keep in mind that correlated samples are very common.
For 1. there is simply no way, and that is just it - one might either have to make this cause a compile-time error or drop the median and quantile estimators when combining. For 2. the problem is that combined accumulator sets need a different data structure than individual ones.
Matthias
_______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/ listinfo.cgi/boost