
I want to say at the outset that I work with a lot of people who do signal processing, and so that is the main use case my colleagues and I would have for Boost.TimeSeries. That has definitely colored my view of the library, for what that's worth.

* What is your evaluation of the design?

Why is a series with only one nonzero element called a delta series? I would expect a delta series to be the elementwise differences between two series, or some such. In fact, a little Googling revealed many references to delta series that were the differences of two series, but none like delta_series<>. Is this just my ignorance of the financial uses of time series?

I'd like to see an overload of the coarse_grain() function allowing the specification of a DownSampler. I have a need for coarse resampling that preserves the largest value in the original set of finer samples, and I'm sure many others would like to use their favorite convolution-of-oversamples technique to move from high to low sample frequency (see the sketch at the end of this section). In fact, the ability to specify the method used when moving up or down in sample frequency should be a first-class feature of the coarse_grain() and fine_grain() functions; the default behavior is a gross simplification for many users' purposes. Please see my note elsewhere on the lack of UpSampler concept docs, which makes writing a custom sampler more difficult than it needs to be.

The TimeSeries concept docs state that range_run_storage::pre_value( s ) returns "The value associated with the pre_run, if there is one." Does it return TimeSeries< S >::zero_type() otherwise? Is there any way for the user to tell the difference between a return value that happens to be equal to zero_type() and one that was never set? The same questions apply to post_value.

I'm not really sure why dense_series<> does not allow floating-point offsets. I understand that it is supposed to resemble std::vector, but there is implicitly a relationship between the indices of the underlying vector and the times those indices represent. It would be possible (and quite convenient) for dense_series<> to hold a discretization_type that represents the start offset, and an interval, and perform the mapping for me. The lack of this mapping means that for a dense time series of arbitrary discretization (say 0.5), I have to multiply all my times by 0.5 in user code when using the series. The library should take care of this for me; the fact that the underlying storage is vector-like should not force me to treat dense_series<> discretizations differently from those of all other series.

I find it a little confusing that discretization is both a template parameter and a ctor parameter. It would perhaps be clearer if the template parameter Discretization were named DiscretizationType or something similar.

In general, I think this library would be far more usable if the relationships among discretization, offsets, and runs/intervals were clarified. What I'm getting at is that the current structure makes it really unclear what DiscretizationType=double, discretization=2.1, offset=3.2 means for a run. Where is the run? At 3.2, or at 2.1 * 3.2? The fact that I have to think about this at all is probably just a failure of documentation. Nonetheless, it would be best if I could specify that a sample exists at offset X, where X is double, int, or units::seconds, without worrying about any other details, including discretization. That is, discretization seems useful to me only for regularly-spaced time series, and seems like noise for arbitrarily-spaced time series. In addition, a sample should be representable either as a point like 3.14 or as a run like [3.14, 4.2).
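To make the mapping I have in mind concrete, here is a minimal sketch of the bookkeeping I would like dense_series<> to own. Every name in it is hypothetical, invented for this review; none of it is library API:

    #include <cstddef>

    // A plain-C++ sketch of an index-to-time mapping.  start_ and
    // interval_ are hypothetical members, not Boost.TimeSeries API.
    struct dense_time_mapping
    {
        double start_;     // time of sample 0, e.g. 0.0
        double interval_;  // sample spacing, e.g. 0.5

        // The mapping I currently have to do in user code: with
        // start_ = 0.0 and interval_ = 0.5, index 7 means time 3.5.
        double time_at(std::size_t index) const
        {
            return start_ + index * interval_;
        }
    };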
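And returning to the coarse_grain() point above, this is the kind of downsampling functor I would like to be able to supply -- again a sketch with an invented interface, since no such customization point exists in the library today:

    #include <algorithm>
    #include <cassert>

    // Hypothetical: reduce one window of finer samples to a single
    // coarser value by keeping the largest sample in the window.
    struct max_downsampler
    {
        template<typename Iter>
        double operator()(Iter first, Iter last) const
        {
            assert(first != last);  // assume a window is never empty
            return *std::max_element(first, last);
        }
    };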
* What is your evaluation of the implementation?

I didn't look too closely at the implementation, due to lack of time, but I would like to know why the functors in boost::range_run_storage are const references to singletons in an anonymous namespace. What is that supposed to accomplish that other techniques do not? One of the great things about Boost is learning new advanced techniques and discovering best practices for sticky coding situations; without the why, this technique is lost on me, and probably on others. Note that I'm not raising any issue with this implementation choice -- I'm just looking to understand why it was done.

* What is your evaluation of the documentation?

The docs could use a gentler intro. What I mean is that they appear to be somewhat inverted: moving from foundations to the things that build on them, and from the general to the specific, would be better. For example, delta_series<> has the following description on the first page of the user manual: "A series which has exactly one run of unit length." At that point I have no precise idea what a run is, and following the link gives a much more comprehensible definition in the reference section: "boost::time_series::delta_series -- A Mutable_TimeSeries that has a distinct value at some unique offset, and zero elsewhere". Having now read the Range-Run Abstraction section, I think it belongs before any mention of runs, specific time series models, etc. It was only on the second pass that I fully understood the material presented up front in Series Containers.

In general, the documentation is skewed towards financial use, but doesn't need to be -- the library is useful for other purposes as well. For instance, when the predefined resolution types are presented, it seems that they are somehow necessary, or that the lack of a "seconds" or "milliseconds" resolution typedef might be a concern. Further investigation indicates that the Discretization can in fact be virtually any numeric type (is that correct?). Placing these at the beginning of the documentation made several of my signal processing colleagues think this library would only be good for financial users. I suggest a more general emphasis, plus a specific subsection that says something like "this was originally written for financial use, and here are some features that facilitate such use ...". For example, here is what I wrote on my first pass through the docs; I realize what the situation is now, but it shows how the docs led me astray a bit: "Discretization is only done on days on up. What about seconds, minutes, hours, milliseconds (the one I need) discretization? Are series with different discretizations usable together without explicit user conversion? Also, if mpl::int_<360> is used as a conversion factor for yearly, it's wrong for many users' purposes. If it's an arbitrary number, how about using mpl::int_<0>, mpl::int_<1>, etc., and adding some other time increments?"

In the period_sums() description in the manual section, "summed" is misspelled "summer".

I wanted to figure out whether I could supply my own sampler functor to fine_grain(), but it is unclear from the documentation what an UpSampler template parameter is supposed to look like. Looking at the fine_grain() reference is no help, due to the lack of concept documentation, and following the samplers::sampler_base<> link was no help either, since it appears not to be documented beyond a listing of its API. I finally found out what I needed to do to write a sampler only by looking at the piecewise_upsample implementation.
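For illustration, this is the general shape I went looking for and never found spelled out in the docs. To be clear, the interface below is invented for this review -- it is not the library's actual UpSampler concept:

    #include <cstddef>

    // Hypothetical upsampler: write one coarse sample's value to each
    // of the n finer offsets covered by its run (piecewise-constant).
    struct repeat_upsampler
    {
        template<typename Value, typename OutIter>
        void operator()(Value const& value, std::size_t n, OutIter out) const
        {
            for (std::size_t i = 0; i != n; ++i)
                *out++ = value;
        }
    };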
The coarse_grain() function says that it requires the coarser grain to be a multiple of the finer grain. I assume that should read "integer multiple"; is that right?

After finding the above doc issues, I did a quick survey of the functions in the manual's Algorithms section. Most of them appear to have complete concept docs, but there are some deficiencies.

Partial concept docs:

- integrate (missing Series requirements)

Low-to-no concept docs:

- coarse_grain (missing all requirements except that the discretizations must be integer multiples of one another -- and it never uses the word "integer")
- fine_grain (the same omissions as coarse_grain)
- invert_heaviside (no requirements at all)

The rest of the algorithm detailed docs have concept requirements, but they would be much easier to use if the concepts were links to the relevant concept docs; as it is now, I have to do a bit of searching to find each one listed. This applies generally to all references to concepts throughout the docs -- even in the concept docs themselves, I find names of concepts that I must then look up by going back to the TOC, since they are not links.

The fact that "The only series type that does not support floating-point offsets is dense_series<>" should be mentioned not just in the manual, but also in the Discretization requirements of the dense_series<> reference docs.

There appears to be no rationale section. Even listing the usage cases alluded to in the Acknowledgements section might help illuminate the choices that were made. Specific things I'd like to see in a rationale:

- Why was commit() chosen, instead of simpler self-contained insertions?
- Why were ordered_inserters chosen instead of more STL-like insertion via iterators and member functions?
- Why do ordered_inserters wipe out the previous contents of a time series, instead of appending/inserting?
- Why can't elements and/or ranges of elements be removed from or added to an existing time series?
- Why is the TimeSeries concept interface split between accessors in the boost::sequence namespace and those in the boost::range_run_storage namespace?

* What is your evaluation of the potential usefulness of the library?

I think it is potentially quite useful. However, I think its primary usefulness is not as a financial time series library, yet as I mentioned earlier, its current docs make it sound as though that is nearly all it is good for. In addition, I am forced to ask how a time series library is more useful for signal processing than a std::vector plus an extrinsic discretization value. The answer I came up with is that Boost.TimeSeries is really only advantageous when you have arbitrary spacing between elements, or when you want to use two representations of time series in one algorithm. That is, using Boost.TimeSeries' two-series for_each() is almost certainly better than a custom -- and probably complicated -- loop everywhere I need to operate on two time series. However, these cases are relatively rare in signal processing; it is much more common to simply loop over all the samples and do some operation on each element. That can be accomplished just as well with std::for_each or std::transform.
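For instance, the everyday case looks like this in plain standard C++, with the discretization kept alongside the data rather than inside it (apply_gain is just an example name):

    #include <algorithm>
    #include <vector>

    // Apply one operation to every sample -- the most common pattern
    // in our signal processing code.
    void apply_gain(std::vector<double>& samples, double gain)
    {
        std::transform(samples.begin(), samples.end(), samples.begin(),
                       [gain](double s) { return gain * s; });
    }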
The question then becomes, "Does using Boost.TimeSeries introduce clarifying abstractions, or conceptual noise?" The consensus among my colleagues is that it is the latter.

Some specific signal-processing usability concerns:

- For many signal processing tasks, the time series used is too large to fit in memory. The usual solution is a circular buffer or similar structure that keeps around just the part you need at the moment. The Boost.TimeSeries series types seem unable to accommodate this mode of operation.

- Two of the most potentially useful parts of Boost.TimeSeries for certain kinds of signal processing are coarse_grain() and fine_grain(). In the case of coarse_grain(), the library does not allow the use of an arbitrary functor for downsampling, and in the case of fine_grain(), it makes the equivalent upsampling difficult.

- It might be instructive to both the Boost.TimeSeries developers and some of its potential users if certain common signal-processing algorithms were implemented with the library, even if just in the documentation. For example, how might one implement a sliding-window normalizer over densely populated, millisecond-resolution data? What if this normalization used more than two time series to do its work? It may well be possible within the current framework, but a) it's not really clear from the documentation how to do it, and b) the documentation almost seems to have a bias against that kind of processing. If problems like that were examined by the developers, it might well inform the design by helping them find new generalizations and so forth. Likewise, if methods for using Boost.TimeSeries to do this kind of work were present in the documentation, the library would have immediate appeal to a whole new range of folks in the scientific computing field.

* Did you try to use the library? With what compiler? Did you have any problems?

No.

* How much effort did you put into your evaluation? A glance? A quick reading? In-depth study?

In-depth reading of the docs, API, and some of the implementation. I ran out of time to evaluate much code before the review ended.

* Are you knowledgeable about the problem domain?

Somewhat. I work at a place that does a lot of signal processing using time series, and though I work with these time series from time to time, it's not my core area.

* Do you think the library should be accepted as a Boost library?

As it stands, no. If there were clearly-defined relationships between samples and their extents and offsets; better support for large and/or piecewise-mutable time series; a rolling-window algorithm; and better customizability of coarse_grain() and fine_grain(), I would probably change my vote.

Zach Laine