Re: [math][accumulators] Empirical distribution function

Hi,

On Sun, June 19, 2011 22:36, er wrote:
> Hope this can serve as a basis for a conversation:

I'm assisting Eric with the maintenance of Accumulators. I've had a look through the code at the above link, and would like to offer the following comments (if I have misunderstood anything, please let me know).

My basic concern with the code is that a map is used to store the counts of the data-points that have been added (the map keys are the data-points, the map values are the counts). With real-world floating-point data it is rare for two data-points to be exactly equal, so in practice the map would hold a separate key-value pair for each data-point q_i, of the form (key = q_i, value = 1). This is inefficient, because all the counts will be 1. Also, memory usage will grow linearly with the number of data-points accumulated, which doesn't seem to be in keeping with the spirit of the Accumulators library.

For these reasons, I'm not convinced that the code should be added to the library in its current state.

Simon.
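Concretely, the pattern being critiqued looks something like the sketch below (hypothetical code for illustration; the actual code at the link is not reproduced here, and empirical_cdf is a made-up name). With continuous data nearly every count stays at 1, and the map grows by one node per sample:

#include <cstddef>
#include <iostream>
#include <map>

// Empirical CDF kept as a map from data-point to count.
class empirical_cdf
{
public:
    void add(double x) { ++counts_[x]; ++n_; }

    // F(x) = (number of samples <= x) / n
    double cdf(double x) const
    {
        std::size_t below = 0;
        for (auto it = counts_.begin();
             it != counts_.end() && it->first <= x; ++it)
            below += it->second;
        return n_ ? static_cast<double>(below) / n_ : 0.0;
    }

private:
    std::map<double, std::size_t> counts_; // keys: data-points, values: counts
    std::size_t n_ = 0;
};

int main()
{
    empirical_cdf F;
    for (double x : {0.3, 1.7, 1.7, 2.9})
        F.add(x);
    std::cout << F.cdf(1.7) << '\n'; // prints 0.75
}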

On 8/8/11 7:06 PM, Simon West wrote:
> Hi,
>
> On Sun, June 19, 2011 22:36, er wrote:
>> Hope this can serve as a basis for a conversation:
>
> I'm assisting Eric with the maintenance of Accumulators. I've had a look through the code at the above link, and would like to offer the following comments (if I have misunderstood anything, please let me know).
>
> My basic concern with the code is that a map is used to store the counts of the data-points that have been added (the map keys are the data-points, the map values are the counts). With real-world floating-point data it is rare for two data-points to be exactly equal, so in practice the map would hold a separate key-value pair for each data-point q_i, of the form (key = q_i, value = 1). This is inefficient, because all the counts will be 1. Also, memory usage will grow linearly with the number of data-points accumulated, which doesn't seem to be in keeping with the spirit of the Accumulators library.
>
> For these reasons, I'm not convinced that the code should be added to the library in its current state.
>
> Simon.

Thanks for following up. But it was just, I quote, a "basis for a conversation" at the request of a user (Denis Arnaud), and it failed to go anywhere at the time. Yes, I realize this is not in the spirit of Accumulators, which is to compute statistics iteratively, so that memory usage is fixed given the number of features. This approach would therefore only be suitable for a distribution whose domain is finite. I had some thought of merging this idea with another that I gave a shot at a while back (look for "chi-square table" in boost.users), but which in hindsight I'd do a bit differently. So for now, not much to add, and thanks.
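For the finite-domain case just mentioned, here is a minimal sketch of a fixed-memory variant, assuming the domain is carved into a known number of equal-width bins (binned_cdf is an illustrative name, not Boost.Accumulators API). The counts vector is sized once up front, so memory stays constant however many samples are accumulated:

#include <cstddef>
#include <iostream>
#include <vector>

// Counts over a fixed set of bins covering [lo, hi); memory is O(bins),
// independent of the number of samples.
class binned_cdf
{
public:
    binned_cdf(double lo, double hi, std::size_t bins)
        : lo_(lo), width_((hi - lo) / bins), counts_(bins, 0), n_(0) {}

    void add(double x)
    {
        double t = (x - lo_) / width_;
        std::size_t i = t <= 0.0 ? 0 : static_cast<std::size_t>(t);
        if (i >= counts_.size()) i = counts_.size() - 1; // clamp out-of-range
        ++counts_[i];
        ++n_;
    }

    // Empirical CDF evaluated at the right edge of bin i.
    double cdf(std::size_t i) const
    {
        std::size_t below = 0;
        for (std::size_t j = 0; j <= i && j < counts_.size(); ++j)
            below += counts_[j];
        return n_ ? static_cast<double>(below) / n_ : 0.0;
    }

private:
    double lo_, width_;
    std::vector<std::size_t> counts_;
    std::size_t n_;
};

int main()
{
    binned_cdf F(0.0, 10.0, 10);
    for (double x : {0.3, 1.7, 1.7, 2.9})
        F.add(x);
    std::cout << F.cdf(1) << '\n'; // mass of bins [0,1) and [1,2): 0.75
}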

Hi all, just one comment:

> My basic concern with the code is that a map is used to store the counts of the data-points that have been added (the map keys are the data-points, the map values are the counts). With real-world floating-point data it is rare for two data-points to be exactly equal, so in practice the map would hold a separate key-value pair for each data-point q_i, of the form (key = q_i, value = 1). This is inefficient, because all the counts will be 1. Also, memory usage will grow linearly with the number of data-points accumulated, which doesn't seem to be in keeping with the spirit of the Accumulators library.

The application looks like an analysis of data stored in a histogram. In that case it is possible to use a map with floating-point keys: std::map takes a comparison predicate as a template parameter, and that predicate can define equivalence of floating-point keys through a tolerance specific to the problem domain. In theory, another option for such applications is a multimap.

Regards,
Vadim Stadnik
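A sketch of the predicate-based map Vadim describes (tolerant_less and the eps value are illustrative choices): keys that differ by less than the tolerance compare equivalent and land in the same entry. One caveat: such a comparator only satisfies std::map's strict-weak-ordering requirement when the accumulated keys are well separated relative to the tolerance, since tolerance "equality" is not transitive, so the tolerance really does have to come from the problem domain, as Vadim says.

#include <cstddef>
#include <iostream>
#include <map>

// Less-than with a tolerance: a and b compare equivalent when |a - b| <= eps.
struct tolerant_less
{
    double eps;
    bool operator()(double a, double b) const { return a < b - eps; }
};

int main()
{
    std::map<double, std::size_t, tolerant_less> counts(tolerant_less{0.01});

    // 1.0, 1.001 and 1.002 are within tolerance of the first key inserted,
    // so they all increment the same entry.
    for (double x : {1.0, 1.001, 1.002, 5.0})
        ++counts[x];

    for (const auto& kv : counts)
        std::cout << kv.first << " -> " << kv.second << '\n';
    // prints: 1 -> 3
    //         5 -> 1
}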
participants (3)

- er
- Simon West
- Vadim Stadnik