Re: [boost] [Review] Review of the Accumulators library begins today Jan 29

1 Feb 2007

      Matthias:  Forgive me, I haven't looked at your code yet.  I wrote a  
few accumulators, in my days, but I don't quite understand what you  
mean by "combine".  One may combine online algorithms, e.g. pass the  
result of a filter to a summation or average accumulator.  But I  
don't quite grasp what you might do by combining a median into  
another algorithm.  Median is not really an online algorithms:  It  
maintains some quantities (e.g. a sample), but only at certain times  
(periodically, or at the end) the result of the computation (median)  
is queried.   There is no continued output.  Or you could imagine  
feeding the median continuously to another accumulator (average  
median of stream?) but it stretches the imagination?

For sampling, which has been a focus of my research recently, there  
is of course the reservoir sampling, but a few other special-purpose  
deterministic sampling algorithms have been devised.  On my  
publication web page, (http://photon.poly.edu/~hbr/publis- 
themes.html, section 5 - shameless plug, sorry))  I give a few  
sampling algorithms.  I could program them, if I have the time.  The  
so-called Biased-L2 and DRS are much better at preserving  
correlations / association rules for transactional data sets.

I also meant to note in passing that it is well known in database  
community that sampling is bad for estimating sizes of join, and  
other quantities dependent on correlation.  There are sophisticated  
ways (not too hard to program, though) to estimate them:  the  
celebrated Alon, Gibbons, Mattias, and Szegedy trick of projecting  
onto a number of random vectors (if you're interested / inclined:   
The Space Complexity of Approximating the Frequency Moments. J.  
Comput. Syst. Sci. 58(1): 137-147 (1999), portal.acm.org/citation.cfm? 
id=237823).  They also fall within the area of data streams I  
mentioned earlier, and it is an extremely powerful method for keeping  
track of those quantities with a small memory footprint over a very  
large stream.

Not sure if everyone is interested in these kinds of things, I  
realize, but I hope I piqued the interest of a few.
Cheers,
--
Hervé Brönnimann
hervebronnimann@mac.com

On Jan 31, 2007, at 4:54 PM, Matthias Troyer wrote:
...
There are two types of accumulators that I can think of immediately:
1. approximate median and quantile estimators such as the p^2 methods
cannot be combined but are very important for some applications
2. accumulators for correlates samples, e.g. autocorrelation
estimators, or jackknife-bins cannot easily be combined. Instead of
just merging the results one has to store the results from the
individual accumulator_sets separately and combine them only in the
final evaluations. Please keep in mind that correlated samples are
very common.
For 1. there is simply no way, and that is just it - one might either
have to make this cause a compile-time error or drop the median and
quantile estimators when combining. For 2. the problem is that
combined accumulator sets need a different data structure than
individual ones.
Matthias
_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/ 
listinfo.cgi/boost