Re: [boost] [time_series] Review

10 Aug 2007

      Zach Laine wrote:
...
You misunderstand me.  I'm all for DiscretizationType (the template
parameter -- yay typesafety), and I'm all for discretization (the
value and associated fuctions) for dense series.  What I'm against is
using discretization (the value) for non-dense series.  (Note that we
have run into the Discretization/discretization naming ambiguity again
here.)  I find it a confusing value to keep around, especially since
it can be simply ignored, as you pointed out in a previous email.  A
data value that you can access and that is put forward as a
first-class element of the design -- that is also ignored -- suggests
a better design is possible.  Here's what I suggest:
The Discretization template parameter becomes OffsetType, which is IMO
a more accurate name, and follows the RangeRun concepts.  I will be
using that name below.
The "OffsetType discretization" ctor parameter, the "OffsetType
discretization()" accessors, and "void discretization(OffsetType d)"
mutators should only be applied to the dense_series<> type.
The user should be able to access any data in any type exclusively by
using offsets.  This means that dense_series<> seamlessly handles the
mapping between (possibly floating point) offset and sample index I
requested initially.
This has these advantages:
The notion of discretization is not introduced into types for which it
has questionable meaning, e.g. piecewise_constant_series<>.
Offsets can be used exclusively, both on input (as before) and output
(as before, but now including dense_series<> with floating point
offsets).
Floating point and integral offsets are now treated much more uniformly.
Is this reasonable?  Have I perhaps missed something fundamental?
Reasonable, but problematic for reasons of unclear semantics and sub-par 
performance. A similar design was considered early on.

Semantically, it makes it unclear what indexing into a dense series 
means. Consider:

   dense_series<int> d( discretization = 5 );

In your design as I understand, d[0], d[1], d[2], d[3] and d[4] all 
refer to the same element in the dense series. To me, this interface 
changes the meaning of a dense series. It is no longer a dense sampling 
of points at regular intervals. Now it looks piecewise constant.

To illustrate the sorts of confusion this can cause, what happens now if 
I use set_at() to try to change the value at offset 2? It's nonsensical.

As another example, say I have a dense series with discretization of 5ms 
and samples at 0ms, 5ms, and 10ms. Now I also have a sparse series (no 
discretization) with samples at 0, 5 and 10. Now I add the two. What 
does that mean? Does the dense series have a value at offset 3? If I 
understand your design, yes. Does the sparse series? No. Will the result 
of adding the two? It seems the answer is yes, but it should be no.

Discretization for the other series types is not useless. For sparse, 
for example, it enforces that samples may only exist at offsets that are 
a multiple of the discretization.

There are performance implications too. Indexing into a dense series is 
currently not much more expensive than indexing into an array. With this 
change, every access to the dense series would incur an integer 
division. And traversing a dense series would incur an integer 
multiplication at each iteration.

But the real performance problems creep in when a dense series has a 
discretization, but other series do not. In the current design, 
algorithms check the discretization of two series once for 
compatibility, and then it can be safely ignored for the rest of the 
computation. Traversing two series in parallel involves conditionally 
bumping one or both cursors depending on if and how the current runs 
overlap. Dense and sparse both currently have "indivisible" runs. If 
they start at the same offset, I know they overlap perfectly and can 
bump both cursors. If I don't know that both series are using the same 
discretization, I can't do this, because the notion of indivisibility is 
intricately tied to discretization. What constitutes an indivisible run 
for the dense series [0ms,5ms) is different than an indivisible run for 
sparse series [0ms,1ms).

The only way to fix these problems would be to add back a discretization 
to all the series types. Then the indexing thing (whether it indices are 
dense and logically multiplied by the discretization, or sparse and 
actually multiplied) becomes a matter of convention, convenience and 
efficiency. And on those counts I think the current design is a winner.

-- 
Eric Niebler
Boost Consulting
www.boost-consulting.com

The Astoria Seminar ==> http://www.astoriaseminar.com