Re: [boost] [time_series] Review

13 Aug 2007

      On 8/12/07, Eric Niebler <eric@boost-consulting.com> wrote:
...
Stjepan Rajko wrote:
...
On a less nit-picky note though, I still can't find a single outside
reference in which something that assigns a value to a whole real set
interval is called a time series.  Eric, you indicated that your
choice for using range runs (as opposed to just points, I assume) was
that this yielded superior generic algorithms.  But in the floating
point case, this is causing most of your structures to represent
something that is not really a time series.  In rethinking the
floating point case, are any of the strategies you are considering
looking to put all of your structures in line with the
mathematical notion of what a time series is?
Would you agree that the time series types that use integral offsets are
isomorphic to what a time series is, in the mathematical sense? Are
Yes
...
there any time series (in the math sense) that are not representable
using integral offsets?
I think that most time-series, and especially those time series used
in practice, could be approximated fairly well using an
integral-offset series with the appropriate discretization.  The
problems are:

1) if you don't know much about the series a priori, you might not
know what to set the discretization to initially, and might never get
a good idea of what the discretization really should be.

2) say you have a series coming at you, and the time intervals between
the samples keep getting smaller and smaller, repeatedly beating
your discretization no matter how small it is.  You would have to keep
fine_graining, which doesn't seem efficient.

3) a user simply might not want to use discretization.  But this is
not really a problem when it comes to your lib, because the sparse
series with floating point offsets would do the trick.
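
To make problems 1) and 2) a bit more concrete, here is a rough sketch
in plain C++.  The names to_offset and count_collisions are made up for
illustration and are not part of the library.  It maps floating point
sample times onto integral offsets for a given discretization and
counts how many samples land on an offset that is already taken:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <set>
#include <vector>

// Hypothetical helper (not library API): the integral offset that a
// floating point time falls into for a given discretization.
inline long to_offset(double time, double discretization)
{
    assert(discretization > 0.0);
    return static_cast<long>(std::floor(time / discretization));
}

// Count the samples that would share an integral offset with an
// earlier sample, i.e. the samples the discretization is too coarse for.
inline std::size_t count_collisions(const std::vector<double>& times,
                                    double discretization)
{
    std::set<long> used;
    std::size_t collisions = 0;
    for (double t : times)
    {
        // insert() reports via .second whether the offset was new
        if (!used.insert(to_offset(t, discretization)).second)
            ++collisions;
    }
    return collisions;
}
```

With sample times 0, 0.5, 0.75 and 0.875, a discretization of 0.5
loses two samples to collisions, while 0.125 keeps them all apart.  A
series whose gaps keep shrinking would force that step to shrink again
and again.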
...
In one of your posts, you mentioned something along the lines of
making Point concept a first class citizen of the library - IIUTC,
that would be a good approach.  Furthermore, I think that the RangeRun
should be rethought, so that a RangeRun is in effect equivalent to a
countable set of Points even in the floating point case (where by "a
countable set", I mean "a countable set significantly smaller than the
one including every Point indexable by a floating point number between
the start offset and end offset").  If not, then I see this as a
Time_series+something else library, which is fine.  But with time
series, I think a continuous interval is much less useful than a way
to specify a number of discrete points in an interval.
I'm having a hard time seeing how this is any different than using a
series with integral offsets and a floating point discretization. The
time series library provides this functionality already. Can you clarify
what you're suggesting?
I don't think that the integral offsets + floating point
discretization approach always works (mostly given the problems I list
above).  But again, sparse_series with floating point offsets can be
used instead.  I gave a slightly more specific example of what I am
suggesting in http://tinyurl.com/ywu53v, but also see below.
...
I agree that the time series types that use floating point offsets are
not very time series-ish in the math sense. But some have expressed the
strong opinion during this review that the functionality they provide is
useful.
Please don't get me wrong - I also think that the floating point
offset series are useful as they are. For example, I think it's really
useful to be able to multiply a sparse_series with a
piecewise_constant_series, to accomplish something like "multiply all
the samples in [0, 100) by 10, and all samples in [100, 200) by 20".
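
Roughly, and only as a sketch of the idea, that operation amounts to
the following.  The types sample and run and the function
scale_by_runs are hypothetical stand-ins, not the library's
sparse_series or piecewise_constant_series.  I am also assuming that a
sample not covered by any run is multiplied by an implicit zero and so
drops out of the sparse result:

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-ins, not the library's types.
struct sample { double time; double value; };               // one discrete point
struct run    { double begin; double end; double value; };  // value on [begin, end)

// Multiply every sample by the value of the run containing its time.
// Assumption: runs do not overlap, and samples outside all runs are
// dropped (treated as multiplied by zero).
inline std::vector<sample> scale_by_runs(const std::vector<sample>& samples,
                                         const std::vector<run>& runs)
{
    std::vector<sample> result;
    for (const sample& s : samples)
    {
        for (const run& r : runs)
        {
            assert(r.begin <= r.end);
            if (r.begin <= s.time && s.time < r.end)  // half-open interval
            {
                result.push_back(sample{s.time, s.value * r.value});
                break;
            }
        }
    }
    return result;
}
```

Note that the result is still just a list of discrete (value, time)
pairs; the runs only act as a function applied to them.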

Also, I agree with Steven in that the floating point offset series can
be divided into two categories - sparse/dense (and delta), which are
pretty consistent with the mathematical concept of a time series as
they are, with the exception of their pre_runs and post_runs, and the
rest, which are closer to modeling a piecewise constant function.

What keeps nagging me is that all these "others" are not time series.
I wouldn't even call them "series", although they are a series of
tuples, because they so much better reflect a piecewise constant
function.

What I do see coming out of the RangeRun concept is a potentially
wonderful foundation for Boost.MathFunction - but in order to get
there, it would need to grow (for example, somehow supporting all
flavors of open/closed/half-open intervals).  So all these floating
point "others", I see in this limbo - they are
not time series, but they are useful to have with time series, and
they are almost really nice implementations of piecewise constant
functions (and with the potential to implement any function I think,
using the RangeRun concept) but not quite there either.

So what I'm mostly suggesting is:
* whatever is supposed to be a time series - make it a true time
series.  At the end of the day, anything that is a time series should
be convertible to a sequence of discrete time points with values
attached, and nothing more.  Integral offset versions of the series
are there.  Floating point versions of dense, sparse, and delta series
are also there, except for their pre_runs and their post_runs.
* whatever is not a time-series - call it something else, or make it
clear in the documentation that it behaves as something else in
certain circumstances (like floating point offsets).  I am not
disputing the fact that they are useful, and not suggesting they be
removed from the library - they are definitely useful in conjunction
with time-series.
* alternatively - make everything a time series, which would require
you to revisit the RangeRun concept so that it is always convertible
to a countable set of discrete (value, time) pairs.  The utility I see
here is the following:  from a time series perspective, I can't just
say "All samples in [0, 100) have value 10".  I have to specify
exactly where all these samples lie.  Allowing me to do this concisely
using a modified RangeRun would be very useful, since I wouldn't have
to specify each of the possibly numerous samples separately, nor would
they have to be stored separately.
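
As a sketch of what I mean by the last option (sampled_run, point and
expand are invented names, and the fixed step between samples is just
one possible way of saying where the samples lie, not something the
library offers today):

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct point { double time; double value; };

// Hypothetical "sampled run": the same value at begin, begin + step,
// begin + 2*step, ... for every such time that is still below end.
struct sampled_run
{
    double begin, end, step, value;

    // Number of discrete samples the run stands for.
    std::size_t size() const
    {
        assert(step > 0.0 && begin <= end);
        return static_cast<std::size_t>(std::ceil((end - begin) / step));
    }

    // The i-th sample, computed on demand instead of being stored.
    point operator[](std::size_t i) const
    {
        return point{begin + static_cast<double>(i) * step, value};
    }
};

// Convert a run into the explicit set of (value, time) pairs it denotes.
inline std::vector<point> expand(const sampled_run& r)
{
    std::vector<point> points;
    points.reserve(r.size());
    for (std::size_t i = 0; i < r.size(); ++i)
        points.push_back(r[i]);
    return points;
}
```

Here "all samples in [0, 100), spaced 0.5 apart, have value 10" takes
four numbers to store but is still convertible to exactly 200 discrete
points.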
...
I guess what I'm missing is a use case for non-integral offsets. Any
reanalysis of what floating point offsets mean has to start there. Is it
I hope the above makes a case for some of that.  I think in a lot of
cases, users will not want to deal with discretization or any involved
transforms and just want to use their (value, time) pairs as they are.
...
simply the desire to index into a sequence of points and interpolate
between them in some way? If that's the case, then support for floating
point offsets can be dropped in favor of a flexible interpolating
facade. (IMO, something like that is needed anyway.)
I have to think about that...  I wasn't thinking about that case.
...
If someone really
needs a way to say, "This signal really has the value of X in the time
interval [Y,Z)" where Y and Z are floating point values, then continuous
floating point runs are the way to go. That seems like a reasonable
thing to want, even if it doesn't fit the mathematical definition of
"time series".
It is a *very* reasonable thing to want.  And it fits the mathematical
definition of a function with a domain in the real numbers very well
;-)

Best regards,

Stjepan