
On 8/8/07, Eric Niebler <eric@boost-consulting.com> wrote:
I'm not really sure why dense_series<> does not allow floating point offsets. I understand that it is supposed to resemble std::vector. However, there is implicitly a relationship between the indices of the underlying vector and the times represented by the indices. It is possible (and quite convenient) for dense_series<> to hold a discretization_type that represents the start offset, and an interval, and perform the mapping for me. The lack of this mapping means that for dense time series of arbitrary discretization (say 0.5), I have to multiply all my times by 0.5 in user code when using the time series. I feel the library should take care of this for me; the fact that the underlying storage is vector-like should not force me to treat dense_series<> discretizations differently from all other series' discretizations.
How does it force you to treat discretizations differently? Whether your discretization is 1 or 0.5 or whether your underlying storage is dense or sparse, it doesn't affect how you index into the series, does it? I'm afraid I've missed your point.
I wrote "discretization" when I meant to say "offset". Perhaps it's even that I'm merely confused, but this is actually making my point -- I see a need for clarifying the relationships among disctretization, offset, and run. My whole point was that (as I understand it) in order to represent an offset of 3.14 I need to keep an extrinsic value somewhere that tells me how to convert between 3.14 and the integer offset used in dense_series<>' runs. Is that accurate? If so, isn't this at odds with the other series types, which let me specify double values for offsets directly?
Nonetheless, it would be best if it were possible to specify that a sample exists at offset X, where X is double, int, or units::seconds, without worrying about any other details, including discretization. That is, discretization seems useful to me only for regularly-spaced time series, and seems like noise for arbitrarily-spaced time series.
Discretizations are useful for coarse- and fine-graining operations that resample the data at different intervals. This can be useful even for time series that are initially arbitrarily-spaced.
Sometimes you don't care to resample your data at a different discretization, or to call the integrate() algorithm. In those cases, the discretization parameter can be completely ignored. It does tend to clutter up the docs, but no more than, say, the allocator parameter clutters up std::vector's docs.
Is discretization then properly a property of the series itself? If the offsets of each sample are not related to the discretization, why have both in the same container? I find this very confusing. To accommodate the algorithms you mention above, would it be possible to simply say that I want to resample using a scale factor instead? What I'm getting at here is that discretization and offset seem to have a very muddy relationship. Doing everything in terms of offset seems clearer to me, and I don't yet see how this simplification loses anything useful.
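If it helps, here is a crude sketch of what I mean by resampling with a scale factor, using a plain vector as a stand-in for a series (resample() is hypothetical, not the library's API):

    #include <cstddef>
    #include <vector>

    // Keep every k-th sample: coarse-graining driven by a single
    // factor, with no discretization stored in the container.
    std::vector<double>
    resample(std::vector<double> const& s, std::size_t k)
    {
        std::vector<double> out;
        for (std::size_t i = 0; i < s.size(); i += k)
            out.push_back(s[i]);
        return out;
    }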
In addition, a sample should be representable as a point like 3.14 or a run like [3.14, 4.2).
A zero-width point, like [3.14, 3.14)? What that would mean in the context of the time_series library is admittedly still an outstanding design issue.
Fair enough.
The rest of the algorithm detailed docs have concept requirements, but it would be much easier to use them if the concepts were links to the relevant concept docs; as it is now, I have to do a bit of searching to find each one listed. This applies generally to all references to concepts throughout the docs -- even in the concepts docs, I find names of concepts that I must then look up by going back to the TOC, since they are not links.
Yeah, it's a limitation of our BoostBook tool chain. Doxygen actually emits this documentation with cross-links, but our doxygen2boostbook XSL transform ignores them. Very frustrating.
That's too bad.
* What is your evaluation of the potential usefulness of the library?
I think it is potentially quite useful. However, I think its usefulness is not primarily as a financial time series library; yet, as I mentioned earlier, its current docs make it sound as though that is its main use. In addition, I am forced to ask how a time series library is more useful for signal processing than a std::vector and an extrinsic discretization value.
It's for when you want many options for the in-memory representation of a series, and efficient, reusable algorithms that work equally well on all those different representations.
This is very true, and that's what I was alluding to below, if a bit unclearly.
The answer I came up with is that Boost.TimeSeries is really only advantageous when you have arbitrary spacing between elements, or when you want to use two representations of time series in an algorithm. That is, using Boost.TimeSeries' two-series for_each() is almost certainly better than a custom -- and probably complicated -- loop everywhere I need to operate on two time series. However, these cases are relatively rare in signal processing; it is much more common to simply loop over all the samples and do some operation on each element. This can be accomplished just as well with std::for_each or std::transform.
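For example, the common one-series case needs nothing beyond the standard library (a trivial sketch):

    #include <algorithm>
    #include <vector>

    double scale(double x) { return 2.0 * x; }   // per-sample operation

    void scale_all(std::vector<double>& samples)
    {
        // Apply the operation to every sample in place.
        std::transform(samples.begin(), samples.end(),
                       samples.begin(), &scale);
    }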
If std::vector and std::for_each meet your needs, then yes I agree Time_series is overkill for you. That's not the case for everyone.
The question then becomes, "Does using Boost.TimeSeries introduce clarifying abstractions, or conceptual noise?" The consensus among my colleagues is that the latter is the case.
Sorry you feel that way.
I think this feeling would change rapidly if there were more features directly applicable to signal processing, as mentioned below.
Some specific signal-processing usability concerns:

- For many signal processing tasks, the time series used is too large to fit in memory. The solution is usually to use a circular buffer or similar structure to keep around just the part you need at the moment. The Boost.TimeSeries series types seem unable to accommodate this mode of operation.
Not "unable to accommodate" -- making a circular buffer model the time series concept would be fairly straightforward, and then all the existing algorithms would work for it. But no, there is no such type in the library at present.
I'm glad to hear that this would be straightforward to do, and I think it's a must-have for signal processing folks.
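The usage pattern I have in mind is roughly this (a sketch reading samples from stdin; boost::circular_buffer is just the storage here, not a model of the time series concept):

    #include <boost/circular_buffer.hpp>
    #include <iostream>

    int main()
    {
        boost::circular_buffer<double> window(1024); // fixed capacity
        double x;
        while (std::cin >> x)      // samples arrive one at a time
            window.push_back(x);   // the oldest sample is overwritten
        // 'window' never holds more than the most recent 1024 samples.
    }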
- It might be instructive to both the Boost.TimeSeries developers and some of its potential users if certain common signal-processing algorithms were implemented with the library, even if just in the documentation. For example, how might one implement a sliding-window normalizer over densely populated, millisecond-resolution data? What if this normalization used more than two time series to do its work? It may well be possible with the current framework, but a) it's not really clear how to do it based on the documentation and b) the documentation almost seems to have a bias against that kind of processing.
I wonder why you say that. The library provides a 2-series transform() algorithm that is for just this purpose.
That's why I asked about "more than two time series". Such convolutions of multiple time series can be done in one pass, and Boost.TimeSeries does this admirably for N=2, but rewriting transform() for N>2 is a lot for most users to bite off.
As for the rolling window calculations, I have code that does that, and sent it around on this list just a few weeks ago. I hope to add the rolling average algorithm soon. It uses a circular buffer, and would make a good example for the docs.
I agree. This would be a great addition to the docs.
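In the meantime, here is roughly the shape I would expect such an example to take -- my own sketch of a rolling mean over a circular buffer, not Eric's implementation:

    #include <boost/circular_buffer.hpp>
    #include <cstddef>
    #include <vector>

    // Rolling mean over a window of at most n samples.
    std::vector<double>
    rolling_mean(std::vector<double> const& xs, std::size_t n)
    {
        boost::circular_buffer<double> win(n);
        std::vector<double> out;
        double sum = 0.0;
        for (std::size_t i = 0; i != xs.size(); ++i)
        {
            if (win.full())
                sum -= win.front();   // retire the oldest sample
            win.push_back(xs[i]);
            sum += xs[i];
            out.push_back(sum / win.size());
        }
        return out;
    }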
As it stands, no. If there were clearly-defined relationships between samples and their extents and offsets; better support for large and/or piecewise-mutable time series; a rolling-window algorithm; and better customizability of coarse_grain() and fine_grain(), I would probably change my vote.
I'm still not clear on what you mean by "clearly-defined relationships between samples and their extents and offsets." The rest is all fair. Rolling-window is already implemented, but not yet included.
I was alluding to my issue with the relationships among discretization, offset, and run that I mentioned earlier.

Zach Laine