Time Series review - 7/30-8/8

My apologies for the delay in this posting, but the review period for the Time Series library submitted by Eric Niebler runs from Monday, July 30 until Wednesday, August 8. From the documentation:

The purpose of the Boost.Time_series library is to provide data structures, numerical operators and algorithms to operate on time series. A time series is a series of data points, sampled at regular intervals. The library provides numerous time series containers, each with different time/space trade-offs, and a hierarchy of concepts which allow the time series to be manipulated generically. The library also provides operators and algorithms which use the generic interfaces to perform calculations on time series and accumulate various statistics about them.

Boost.Time_series does not yet contain all the algorithms one might want in order to perform full time series analysis. However, the key contribution of Boost.Time_series is the framework and the rich hierarchy of concepts with which such algorithms can be written to efficiently and generically process series of data with widely divergent in-memory representations and performance characteristics. Boost.Time_series provides several such series containers, as well as mechanisms for defining additional series types and algorithms that fit within the framework. Some examples of the series types provided are: dense, sparse, piecewise constant, Heaviside and others, as well as adaptors for providing shifted, scaled and clipped series views.

Please notice that the Time Series library uses some features of Boost that will be part of the 1.35 release, but are not in the 1.34.1 release. For testing, you will need to either test against CVS HEAD (which will not be available for much of Tuesday) or use the backports provided in the Time Series download files to update your 1.34.x installation.

The library is available from

http://boost-consulting.com/vault/index.php?directory=Math%20-%20Numerics

or by following the link provided on the review schedule.

Thanks to all for your time and effort on this review.

John Phillips
Review Manager

Here are the review guidelines from http://boost.org/more/formal_review_process.htm :

What to include in Review Comments

Your comments may be brief or lengthy, but basically the Review Manager needs your evaluation of the library. If you identify problems along the way, please note if they are minor, serious, or showstoppers. Here are some questions you might want to answer in your review:

* What is your evaluation of the design?
* What is your evaluation of the implementation?
* What is your evaluation of the documentation?
* What is your evaluation of the potential usefulness of the library?
* Did you try to use the library? With what compiler? Did you have any problems?
* How much effort did you put into your evaluation? A glance? A quick reading? In-depth study?
* Are you knowledgeable about the problem domain?

And finally, every review should answer this question:

* Do you think the library should be accepted as a Boost library? Be sure to say this explicitly so that your other comments don't obscure your overall opinion.

Zach Laine

On 7/31/07, John Phillips <phillips@delos.mps.ohio-state.edu> wrote:
My apologies for the delay in this posting, but the review period for the Time Series library submitted by Eric Niebler runs from Monday, July 30 until Wednesday, August 8. From the documentation:
The purpose of the Boost.Time_series library is to provide data structures, numerical operators and algorithms to operate on time series. A time series is a series of data points, sampled at regular intervals. The library provides numerous time series containers, each with different time/space trade-offs, and a hierarchy of concepts which allow the time series to be manipulated generically. The library also provides operators and algorithms which use the generic interfaces to perform calculations on time series and accumulate various statistics about them.
Boost.Time_series does not yet contain all the algorithms one might want in order to perform full time series analysis. However, the key contribution of Boost.Time_series is the framework and the rich hierarchy of concepts with which such algorithms can be written to efficiently and generically process series of data with widely divergent in-memory representations and performance characteristics. Boost.Time_series provides several such series containers, as well as mechanisms for defining additional series types and algorithms that fit within the framework. Some examples of the series types provided are: dense, sparse, piecewise constant, Heaviside and others, as well as adaptors for providing shifted, scaled and clipped series views.
Please notice that the Time Series library uses some features of Boost that will be part of the 1.35 release, but are not in the 1.34.1 release. For testing, you will need to either test against CVS HEAD (which will not be available for much of Tuesday) or use the backports provided in the Time Series download files to update your 1.34.x installation.
The library is available from
http://boost-consulting.com/vault/index.php?directory=Math%20-%20Numerics
or by following the link provided on the review schedule.
Thanks to all for your time and effort on this review.
John Phillips
Review Manager

On 7/31/07, John Phillips <phillips@delos.mps.ohio-state.edu> wrote:
My apologies for the delay in this posting, but the review period for the Time Series library submitted by Eric Niebler runs from Monday, July 30 until Wednesday, August 8. From the documentation:
I have started to review the library, and at this point I am confused :-) So I thought I would share my thoughts in case someone can de-confuse me and I can produce a more useful review.

To preface, I would be *very* interested in having a time series library in Boost. I use time series frequently in my work, although I wouldn't call myself an expert.

That said, it took me a while to understand the language of the documentation. The whole "runs, discretization, coarseness" business didn't readily make sense to me. The lightbulb finally went off when I saw:

"The integral of a series is calculated by multiplying the value of each run in the series by the run's length, summing all the results, and multiplying the sum by the series' discretization."

Aha! Now I knew exactly what run and discretization meant - it is as if you are modeling a piecewise constant function with something like a scalable "x" axis. (A worked instance of this formula appears in the sketch after this message.)

I went on to experiment with some dense, sparse, and piecewise constant series to see how they behave. The first thing that struck me as odd was:

    dense_series<double, daily> dense(start = 0, stop = 2);
    sparse_series< double, daily, double > sparse;
    sparse = dense; // Compilation error! (GCC 4.0.1/darwin)

However, if I change the discretization type to int, all is good. Is there a reason?

Before assigning to sparse, I populated the dense series with:

    ordered_inserter< dense_series< double > > in_d( dense );
    in_d(10,0)(11,1)(12,2).commit();

After assigning to sparse, I was a little surprised at what sparse was spitting back out via the [] operator. Basically, sparse[0] == 10, sparse[1] == 11, sparse[2] == 12, and seemingly 0 everywhere else, but the docs say:

    sparse_series<>  A sparse series where runs have unit length.

So how come I get sparse[0.1] == sparse[-0.1] == 0? Shouldn't there be a unit length interval around 0 where sparse is 10? I think you may have commented on this recently, but I found it rather unexpected, given what the docs say (otherwise, perfectly good behavior).

I moved on to:

    piecewise_constant_series< double, int, double > pc;
    pc = dense;

And now more weirdness... pc[x] == 10 for x \in [0, 1] (inclusive)!!! I.e., pc[0] == pc[1] == 10, but pc[1.01] == 11.

The first reason why this strikes me as odd is: if I start with a discrete time series (which dense_series models very nicely) and put it into something that expands it to a piecewise constant function, the choice of translating "value 10 at time 0" to "value 10 at interval [0, 1]" is not obvious. If I wanted approximation, I would be more likely to assign to a particular pc[x] the value of the closest point in dense - i.e., I'd be more likely to set pc[0.9] == 11 because 0.9 is closer to 1 than to 0. That strategy would make it difficult to deal with the infinite pre-run and post-run, though, and you could argue that assignment is not about this kind of approximation but about something else - so you may want to specify in the docs what the semantics of conversion through assignment are.

The second reason why I find this odd is, I would at least expect that for an explicitly specified point such as dense[1], pc[1] would have the same value after assignment.

Anyway, at this point I stopped experimenting, as it seemed to me that there was something fundamental that I wasn't getting.
From reading the docs, I do have some other comments:
"It may seem odd that even the unit series have a value parameter. This is because 1 is not always convertible to the series' value type. Consider a delta_unit_series< std::vector< int > >. You may decide that for this series, the unit value should be std::vector< int >(3, 1). For unit series with scalar value types such as int, double and std::complex<>, the value parameter is ignored." Question: if I was dealing with something that needs to have the "1" specified, couldn't I just use delta_series instead of delta_unit_series? About discretization: your prenamed discretizations start with with daily as mpl::int<1>. That's great if that's how the integrals need to be calculated. However, in a lot of cases one would want to have "secondly" as mpl::int<1>. Could you provide another set of prenamed discretizations that provide that option? Although, if someone used both of them in the same app, then compile-time checking would consider them equivalent, no? Could the units lib be used for this instead? About the docs: time series can be expressed very clearly visually - I think that the docs would benefit greatly from some visual examples. I guess that's it for now. Sorry about my confusion :-) I can try to give a little more feedback if I start to understand the fundamentals of the library better. Back to bed for me... I actually tried to go to bed earlier but couldn't sleep because I was thinking about this lib. So I woke up to write this. Weird. Stjepan

On 8/7/07, Stjepan Rajko <stipe@asu.edu> wrote:
On 7/31/07, John Phillips <phillips@delos.mps.ohio-state.edu> wrote:
My apologies for the delay in this posting, but the review period for the Time Series library submitted by Eric Niebler runs from Monday, July 30 until Wednesday, August 8. From the documentation:
I have started to review the library, and at this point I am confused :-) So I thought I would share my thoughts in case someone can de-confuse me and I can produce a more useful review.
OK, I think I have tracked down many of the sources of my confusion.

As you say in the beginning of the docs, "A time series is a series of data points, sampled at regular intervals". That I agree with, except for the fact that it's not necessarily regular intervals.

dense_series and sparse_series implement this definition of time series well - they behave as containers that have specified values only at the specified time points. Everywhere else, they are zero (although I would argue that for a time series, if you had to examine a time point that has not been specified, I would be more likely to call it "unspecified" or "unknown" rather than "zero").

The other containers, though, don't really implement what I would consider a time series. They more or less implement piecewise constant functions. In some cases they do so well, and in some cases it's tricky, as with piecewise_constant_series. One of the problems there is that runs might have an intersection - that should be handled carefully. Even if it is up to the user to make sure there are no overlaps, there would have to at least be support for open/half-open intervals, so that I can specify that, say, f has a value of 10 on [0, 1), a value of 11 on [1, 2), etc.

This is where the concept of a "run" breaks down for me - it gives the illusion/assumption that you are dealing with a continuous function (rather than a strict time series), but it is not handled carefully enough to provide (IMO) mathematically intuitive behavior. Examples are the fact that sparse_series claims to have runs of unit length but is still zero at any unspecified value, and the surprising handling of overlapping ranges in piecewise_constant_series.

IMO, this library deals with two different problem domains - time series and piecewise constant functions. I also think that these two domains are too different to be stuck in the same bucket. I think that a general time series does not need to address the "run" concept - perhaps there can be a notion of a "weight" assigned to each sample, which can represent time duration or other things. That would make it behave equivalently to the run for the "integral" (which should really just be a sum for time series, IMO). Also, I think that for the piecewise constant functions, runs should at least have the option of being open/half-open intervals. With all this in mind, to some extent I believe that sparse_series and dense_series (which I see as time series) should be treated differently than the rest of the containers (which I see as piecewise constant functions).

All in all, I think that this library is useful, but to me it is something different than what it claims to be. It attempts to address a mathematical/numerical concept, but I find that it does so in ways that are unintuitive to me - maybe to people using time series in other contexts this makes perfect sense, but it left me very confused. If the library made a separation between time series (values at discrete specified times only) and piecewise constant functions (values everywhere), and handled both with some of the changes suggested above, I'd say "Yes! Accept!". As it stands, I'm not sure.

Oh, and kudos to Zürcher Kantonalbank. And Eric for yet another impressive implementation, of course :-)

Stjepan

Stjepan Rajko wrote:
On 8/7/07, Stjepan Rajko <stipe@asu.edu> wrote:
On 7/31/07, John Phillips <phillips@delos.mps.ohio-state.edu> wrote:
My apologies for the delay in this posting, but the review period for the Time Series library submitted by Eric Niebler runs from Monday, July 30 until Wednesday, August 8. From the documentation:
I have started to review the library, and at this point I am confused :-) So I thought I would share my thoughts in case someone can de-confuse me and I can produce a more useful review.
OK, I think I have tracked down many of the sources of my confusion.
As you say in the beginning of the docs, "A time series is a series of data points, sampled at regular intervals". That I agree with, except for the fact that it's not necessarily regular intervals.
dense_series and sparse_series implement this definition of time series well - they behave as containers that have specified values only at the specified time points. Everywhere else, they are zero (although I would argue that for a time series, if you had to examine a time point that has not been specified, I would be more likely to call it "unspecified" or "unknown" rather than "zero").
You could think of it that way. When multiplying two series, for example, the library assumes that where two series do not overlap, the result is "zero" or "undefined". You can see that as a manifestation of "0 x Y == 0" or as "<undefined> x Y == <undefined>". I prefer to think of them as zeros because then I don't need to define arithmetic with undefined values.
The other containers, though, don't really implement what I would consider a time series. They more or less implement piecewise constant functions. In some cases they do so well, and in some cases it's tricky, as with piecewise_constant_series. One of the problems there is that runs might have an intersection - that should be handled carefully. Even if it is up to the user to make sure there are no overlaps, there would have to at least be support for open/half-open intervals, so that I can specify that, say, f has a value of 10 on [0, 1), a value of 11 on [1, 2), etc.
Integral runs are half-open. Floating-point runs are problematic in this regard. Some extra thought needs to go into this. One possibility would be to disallow sparse and delta series with floating point offsets, and require floating-point runs for piecewise constant series to be half-open like their integral brethren. That may result in more intuitive behavior.
This is where the concept of a "run" breaks down for me - it gives the illusion/assumption that you are dealing with a continuous function (rather than a strict time series), but it is not handled carefully enough to provide (IMO) mathematically intuitive behavior. Examples are the fact that sparse_series claims to have runs of unit length but is still zero at any unspecified value, and the surprising handling of overlapping ranges in piecewise_constant_series.
Only for floating-point offsets. And this behavior can be changed.
IMO, this library deals with two different problem domains - time series and piecewise constant functions. I also think that these two domains are too different to be stuck in the same bucket. I think that a general time series does not need to address the "run" concept - perhaps there can be a notion of a "weight" assigned to each sample, which can represent time duration or other things. That would make it behave equivalently to the run for the "integral" (which should really just be a sum for time series, IMO). Also, I think that for the piecewise constant functions, runs should at least have the option of being open/half-open intervals. With all this in mind, to some extent I believe that sparse_series and dense_series (which I see as time series) should be treated differently than the rest of the containers (which I see as piecewise constant functions).
You've hit on something important -- I agree time_series currently has a split personality, but I don't agree that it's the sparse/dense vs. piecewise constant thing. It's the integral vs. floating-point offset thing. And I think those problems are fixable.
All in all, I think that this library is useful, but to me it is something different than what it claims to be. It attempts to address a mathematical/numerical concept, but I find that it does so in ways that are unintuitive to me - maybe to people using time series in other contexts this makes perfect sense, but it left me very confused. If the library made a separation between time series (values at discrete specified times only) and piecewise constant functions (values everywhere), and handled both with some of the changes suggested above, I'd say "Yes! Accept!". As it stands, I'm not sure.
Thanks for your very valuable feedback.
Oh, and kudos to Zürcher Kantonalbank. And Eric for yet another impressive implementation, of course :-)
--
Eric Niebler
Boost Consulting
www.boost-consulting.com

The Astoria Seminar ==> http://www.astoriaseminar.com

Eric, thanks for all of your responses - they clarify things quite a bit, and your rationale now seems much better justified. I think most importantly:

On 8/7/07, Eric Niebler <eric@boost-consulting.com> wrote:
Stjepan Rajko wrote:
IMO, this library deals with two different problem domains - time series and piecewise constant functions. I also think that these two domains are too different to be stuck in the same bucket. I think that a general time series does not need to address the "run" concept - perhaps there can be a notion of a "weight" assigned to each sample, which can represent time duration or other things. That would make it behave equivalently to the run for the "integral" (which should really just be a sum for time series, IMO). Also, I think that for the piecewise constant functions, runs should at least have the option of being open/half-open intervals. With all this in mind, to some extent I believe that sparse_series and dense_series (which I see as time series) should be treated differently than the rest of the containers (which I see as piecewise constant functions).
You've hit on something important -- I agree time_series currently has a split personality, but I don't agree that it's the sparse/dense vs. piecewise constant thing. It's the integral vs. floating-point offset thing. And I think those problems are fixable.
Ah! That is a valuable perspective indeed - with integral offsets, the word "run" makes more sense (it is the repetition of the same value), and the library now looks like a perfectly well behaved time series library. But you definitely need floating point offsets for a lot of applications :-( So, what do you do?

One strategy that would seem valid to me (conceptually - I don't know what sort of chaos this would bring to the implementation) would be to modify the concept of a run to mean, instead of "a value sample of certain duration", "a value regularly sampled between two offsets at a certain period". So, to specify a non-trivial run, you'd need a beginning offset, an end offset, and the period at which the actual samples are taken. A run (10, 0, 1, 0.5) would then mean there are three samples - (10, 0), (10, 0.5), and (10, 1). The samples would then still be discrete rather than sometimes continuous (which makes me happy :-)), and you no longer run into the open/closed problem because you are dealing with discrete time offsets - if you don't want the (10, 1) sample you can just specify (10, 0, 0.5, 0.5). You could also express the same thing as (10, 0, 0.9, 0.5), which is kind of ugly, so it may even be better to have the user provide the number of samples in a run rather than the period - so (10, 0, 1, 3) would mean "3 evenly spaced samples between 0 and 1 of value 10". Just ideas. Something like the above, and I'm sold.

In any case, I am now definitely in agreement with the integer-offset parts of the library.

Best regards,

Stjepan
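
(Editorial note: a minimal sketch of the sample-count variant of Stjepan's proposal above - expanding a run of "n evenly spaced samples between two offsets" into discrete (value, offset) samples. Everything here is hypothetical illustration, not library code.)

    #include <cstddef>
    #include <iostream>
    #include <utility>
    #include <vector>

    // Hypothetical run: "count" evenly spaced samples of "value"
    // between offsets "begin" and "end", inclusive.
    struct sampled_run { double value, begin, end; int count; };

    // Expand a run into its discrete (value, offset) samples.
    std::vector< std::pair<double, double> > expand(sampled_run const &r)
    {
        std::vector< std::pair<double, double> > samples;
        double step = r.count > 1 ? (r.end - r.begin) / (r.count - 1) : 0.0;
        for (int i = 0; i != r.count; ++i)
            samples.push_back(std::make_pair(r.value, r.begin + i * step));
        return samples;
    }

    int main()
    {
        // "3 evenly spaced samples between 0 and 1 of value 10"
        sampled_run r = {10.0, 0.0, 1.0, 3};
        std::vector< std::pair<double, double> > s = expand(r);
        for (std::size_t i = 0; i != s.size(); ++i)
            std::cout << '(' << s[i].first << ", " << s[i].second << ")\n";
        // prints (10, 0), (10, 0.5), (10, 1)
    }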

On 7 Aug 2007, at 12:19, Eric Niebler wrote:
Stjepan Rajko wrote:
IMO, this library deals with two different problem domains - time series and piecewise constant functions. I also think that these two domains are too different to be stuck in the same bucket. I think that a general time series does not need to address the "run" concept - perhaps there can be a notion of a "weight" assigned to each sample, which can represent time duration or other things. That would make it behave equivalently to the run for the "integral" (which should really just be a sum for time series, IMO). Also, I think that for the piecewise constant functions, runs should at least have the option of being open/half-open intervals. With all this in mind, to some extent I believe that sparse_series and dense_series (which I see as time series) should be treated differently than the rest of the containers (which I see as piecewise constant functions).
You've hit on something important -- I agree time_series currently has a split personality, but I don't agree that it's the sparse/dense vs. piecewise constant thing. It's the integral vs. floating-point offset thing. And I think those problems are fixable.
The sparse/dense versus piecewise constant split is fine. The piecewise constants are the running sums (integrals) of the sparse/dense time series and thus well defined.

Floating point offsets: for sparse/dense series there is no problem, since data points just exist at certain time points, and whether they are integer or floating point does not matter. For piecewise constant, one just needs to define a convention: is the interval left-closed right-open? Or right-closed left-open? Or should this be flexible and defined by an (optional) template parameter?

Matthias
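
(Editorial note: a sketch of the flexible-convention idea Matthias floats above - interval closedness chosen by a template parameter. Hypothetical illustration, not library code.)

    #include <iostream>

    // Hypothetical interval whose right-closedness is a compile-time choice.
    template<bool RightClosed>
    struct interval
    {
        double lo, hi;
        bool contains(double x) const
        {
            return lo <= x && (RightClosed ? x <= hi : x < hi);
        }
    };

    int main()
    {
        interval<false> half_open = {0.0, 1.0};  // [0, 1)
        interval<true>  closed    = {0.0, 1.0};  // [0, 1]
        std::cout << half_open.contains(1.0)      // 0
                  << ' ' << closed.contains(1.0)  // 1
                  << '\n';
    }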

Stjepan Rajko wrote:
On 7/31/07, John Phillips <phillips@delos.mps.ohio-state.edu> wrote:
My apologies for the delay in this posting, but the review period for the Time Series library submitted by Eric Niebler runs from Monday, July 30 until Wednesday, August 8. From the documentation:
I have started to review the library, and at this point I am confused :-) So I thought I would share my thoughts in case someone can de-confuse me and I can produce a more useful review.
To preface, I would be *very* interested in having a time series library in Boost. I use time series frequently in my work, although I wouldn't call myself an expert.
That said, it took me a while to understand the language of the documentation. The whole "runs, discretization, coarseness" business didn't readily make sense to me. The lightbulb finally went off when I saw:
"The integral of a series is calculated by multiplying the value of each run in the series by the run's length, summing all the results, and multiplying the sum by the series' discretization."
Aha! Now I knew exactly what run and discretization meant - it is as if you are modeling a piecewise constant function with something like a scalable "x" axis.
I went on to experiment with some dense, sparse, and piecewise constant series to see how they behave. The first thing that struck me as odd was:
dense_series<double, daily> dense(start = 0, stop = 2);
sparse_series< double, daily, double > sparse;
sparse = dense; // Compilation error! (GCC 4.0.1/darwin)
When assigning one series type to another, or using two series in arithmetic expressions, the offset types must be compatible. For the above, the offset type (not specified) is "std::ptrdiff_t". For the sparse series, it is a double. It wasn't clear to me what the right behavior should be, so to be conservative, I disallowed mixing series with different offset types.
However, if I change the discretization type to int, all is good. Is there a reason?
Before assigning to sparse, I populated the dense series with:
ordered_inserter< dense_series< double > > in_d( dense );
in_d(10,0)(11,1)(12,2).commit();
After assigning to sparse, I was a little surprised at what sparse was spitting back out via the [] operator. Basically, sparse[0] == 10, sparse[1] == 11, sparse[2] == 12, and seemingly 0 everywhere else, but the docs say:
sparse_series<> A sparse series where runs have unit length.
So how come I get sparse[0.1] == sparse[-0.1] == 0? Shouldn't there be a unit length interval around 0 where sparse is 10? I think you may have commented on this recently, but I found it rather unexpected, given what the docs say (otherwise, perfectly good behavior).
The docs are admittedly unclear on this point, and part of the reason for the confusion is that time series with floating-point offsets were added after this part of the docs were written. Sparse and delta are more properly thought of as "point" data structures, where the coordinate of each datum can be represented with a single offset. So a sparse series with a 42 at 3.14 has a 42 *exactly* there and nowhere else. Does that help?
I moved on to:
piecewise_constant_series< double, int, double > pc;
pc = dense;
If that compiles, it's a bug. The offset types should be required to match.
And now more weirdness... pc[x] == 10 for x \in [0, 1] (inclusive)!!! i.e., pc[0] == pc[1] == 10, but pc[1.01] == 11.
Yuk, that's awful. It seems this is a good reason to disallow mixing series with floating-point and integral offsets in the same expression. See below for more on this.
The first reason why this strikes me as odd is: if I start with a discrete time series (which dense_series models very nicely) and put it into something that expands it to a piecewise constant function, the choice of translating "value 10 at time 0" to "value 10 at interval [0, 1]" is not obvious. If I wanted approximation, I would be more likely to assign to a particular pc[x] the value of the closest point in dense - i.e., I'd be more likely to set pc[0.9] == 11 because 0.9 is closer to 1 than to 0. That strategy would make it difficult to deal with the infinite pre-run and post-run, though, and you could argue that assignment is not about this kind of approximation but about something else - so you may want to specify in the docs what the semantics of conversion through assignment are.
The second reason why I find this odd is, I would at least expect that for an explicitly specified point such as dense[1], pc[1] would have the same value after assignment.
Anyway, at this point I stopped experimenting as it seemed to me that there was something fundamental that I wasn't getting.
I'm sorry you ran into trouble with floating-point offsets. To be honest, they were added to the time_series library as a bit of an after-thought, and they are admittedly not a seamless addition. With integral offsets, runs are in fact half-open. Floating-point runs are not -- it's not clear what that would mean. (E.g., a delta series D has a 42 at [3.14,3.14) -- that's a zero-width half open range! What should D[3.14] return?) Some of the algorithms won't work with floating-point offsets. If you feel this part of the library needs more thought, that's a fair assessment. I'm certain that had you used integral offsets your experience would have been less frustrating. I think with better docs and more strict type checking, integral and floating point offsets can both be supported without confusion.
From reading the docs, I do have some other comments:
"It may seem odd that even the unit series have a value parameter. This is because 1 is not always convertible to the series' value type. Consider a delta_unit_series< std::vector< int > >. You may decide that for this series, the unit value should be std::vector< int >(3, 1). For unit series with scalar value types such as int, double and std::complex<>, the value parameter is ignored."
Question: if I was dealing with something that needs to have the "1" specified, couldn't I just use delta_series instead of delta_unit_series?
You certainly could, but there are extra optimization opportunities if all runs are known at compile time to have a "unit" value. The library may short-cut multiplication, for instance, if it knows that one operand will always be 1.
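
(Editorial note: a sketch of the kind of shortcut Eric describes above - when one operand is known at compile time to be the unit value, multiplication can collapse to the identity and no runtime multiply is emitted. The names here are hypothetical, not the library's actual mechanism.)

    #include <iostream>

    // Hypothetical tag for a value known statically to be 1.
    struct unit_value {};

    // General case: an actual runtime multiplication.
    template<class T>
    T multiply(T lhs, T rhs) { return lhs * rhs; }

    // Shortcut: multiplying by the unit value is the identity.
    template<class T>
    T multiply(T lhs, unit_value) { return lhs; }

    int main()
    {
        std::cout << multiply(3.0, 4.0) << '\n';           // 12
        std::cout << multiply(3.0, unit_value()) << '\n';  // 3, no multiply
    }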
About discretization: your prenamed discretizations start with daily as mpl::int_<1>. That's great if that's how the integrals need to be calculated. However, in a lot of cases one would want to have "secondly" as mpl::int_<1>. Could you provide another set of prenamed discretizations that provide that option? Although, if someone used both of them in the same app, then compile-time checking would consider them equivalent, no? Could the units lib be used for this instead?
"Daily" is its own type, not a typedef for mpl::int_<1>. struct daily : mpl::int_<1> {}; So if your your application you define your own discretizations like: struct secondly : mpl::int_<1> {}; compile-time time checking would still consider them different. It's the type that matters for the compile-time checks, not the value. The value is only used by the integrate() algorithm. I really don't want to add an alternate set of named discretizations. One could argue that I shouldn't even be providing any. I haven't looked into using the units lib for this, but it seems like a reasonable idea.
About the docs: time series can be expressed very clearly visually - I think that the docs would benefit greatly from some visual examples.
True.
I guess that's it for now. Sorry about my confusion :-) I can try to give a little more feedback if I start to understand the fundamentals of the library better. Back to bed for me... I actually tried to go to bed earlier but couldn't sleep because I was thinking about this lib. So I woke up to write this. Weird.
Sorry to keep you up! ;-)

--
Eric Niebler
Boost Consulting
www.boost-consulting.com

The Astoria Seminar ==> http://www.astoriaseminar.com

Eric Niebler wrote:
Stjepan Rajko wrote:
I moved on to:
piecewise_constant_series< double, int, double > pc; pc = dense;
If that compiles, it's a bug. The offset types should be required to match.
I've confirmed this bug, and fixed it locally. This error will be caught at compile time for now. If and when the semantics of floating point offsets are made consistent with integral offsets, I can revisit the issue of offset type convertibility.

Thanks,

--
Eric Niebler
Boost Consulting
www.boost-consulting.com

The Astoria Seminar ==> http://www.astoriaseminar.com

AMDG Eric Niebler <eric <at> boost-consulting.com> writes:
I really don't want to add an alternate set of named discretizations. One could argue that I shouldn't even be providing any. I haven't looked into using the units lib for this, but it seems like a reasonable idea.
dense_series<double, boost::units::SI::time> will work fine as is. In Christ, Steven Watanabe

Steven Watanabe wrote:
AMDG
Steven Watanabe <steven <at> providere-consulting.com> writes:
dense_series<double, boost::units::SI::time> will work fine as is.
As long as you don't use any of the algorithms that treat the discretization as an integer...
If any do, it's a bug. The only algorithm that does anything interesting with the discretization is integrate(), and its only requirement is that the value_type can be multiplied by the discretization and the result assigned back to the value_type. That is probably not the case with boost.units, is it?

--
Eric Niebler
Boost Consulting
www.boost-consulting.com

The Astoria Seminar ==> http://www.astoriaseminar.com

Eric Niebler <eric <at> boost-consulting.com> writes:
Steven Watanabe wrote:
AMDG
Steven Watanabe <steven <at> providere-consulting.com> writes:
dense_series<double, boost::units::SI::time> will work fine as is.
As long as you don't use any of the algorithms that treat the discretization as an integer...
If any do, it's a bug. The only algorithm that does anything interesting with the discretization is integrate(), and its only requirement is that the value_type can be multiplied by the discretization and the result assigned back to the value_type. That is probably not the case with boost.units, is it?
What about fine_grain and coarse_grain? If you use piecewise_constant_series<double, SI::time> you should in theory be able to return a quantity<SI::time, double> from integrate if you use Typeof, right? In Christ, Steven Watanabe
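
(Editorial note: a sketch of the deduced-return-type idea behind Steven's remark - the result type of an integrate-like function is whatever value_type * discretization yields, so a unit-typed discretization produces a unit-typed integral. Written with C++11 decltype, which is what the Boost.Typeof machinery approximated; the function and run representation are hypothetical, not the library's API.)

    #include <cstddef>
    #include <iostream>
    #include <utility>
    #include <vector>

    // Runs as (value, length) pairs; the return type is deduced from
    // value_type * discretization_type.
    template<class Value, class Disc>
    auto integrate(std::vector< std::pair<Value, int> > const &runs, Disc d)
        -> decltype(Value() * d)
    {
        Value sum = Value();
        for (std::size_t i = 0; i != runs.size(); ++i)
            sum += runs[i].first * runs[i].second;  // value * run length
        return sum * d;  // scaling by the discretization fixes the type
    }

    int main()
    {
        std::vector< std::pair<double, int> > runs;
        runs.push_back(std::make_pair(10.0, 2));
        runs.push_back(std::make_pair(5.0, 3));
        std::cout << integrate(runs, 1) << '\n';  // 35
        // With d of a Boost.Units quantity type, the result would
        // carry that unit as well.
    }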

On 7 Aug 2007, at 12:06, Eric Niebler wrote:
I'm sorry you ran into trouble with floating-point offsets. To be honest, they were added to the time_series library as a bit of an after-thought, and they are admittedly not a seamless addition. With integral offsets, runs are in fact half-open. Floating-point runs are not -- it's not clear what that would mean. (E.g., a delta series D has a 42 at [3.14,3.14) -- that's a zero-width half open range! What should D[3.14] return?) Some of the algorithms won't work with floating-point offsets. If you feel this part of the library needs more thought, that's a fair assessment. I'm certain that had you used integral offsets your experience would have been less frustrating.
I think with better docs and more strict type checking, integral and floating point offsets can both be supported without confusion.
Don't confuse the sparse/dense time series with piecewise constant functions. A delta series D has 42 at 3.14, not at any interval [3.14,3.14) - intervals should be used only for the piecewise constant functions. Matthias

Matthias Troyer wrote:
On 7 Aug 2007, at 12:06, Eric Niebler wrote:
I'm sorry you ran into trouble with floating-point offsets. To be honest, they were added to the time_series library as a bit of an after-thought, and they are admittedly not a seamless addition. With integral offsets, runs are in fact half-open. Floating-point runs are not -- it's not clear what that would mean. (E.g., a delta series D has a 42 at [3.14,3.14) -- that's a zero-width half open range! What should D[3.14] return?) Some of the algorithms won't work with floating-point offsets. If you feel this part of the library needs more thought, that's a fair assessment. I'm certain that had you used integral offsets your experience would have been less frustrating.
I think with better docs and more strict type checking, integral and floating point offsets can both be supported without confusion.
Don't confuse the sparse/dense time series with piecewise constant functions. A delta series D has 42 at 3.14, not at any interval [3.14,3.14) - intervals should be used only for the piecewise constant functions.
Ah, but the library is built on top of lower-level abstractions that assume intervals. An interval (a run) is how algorithms on time series are expressed. This design was chosen because it makes it possible to write generic algorithms for lots of different types of series, and it works very well for integral offsets. The question is whether the abstractions upon which time series is built are compatible with a sparse series with floating point offsets and if so, what convention can be used so that the algorithms naturally give the correct results both with points and with runs. The way the library currently handles floating point offsets is /almost/ right, but not quite. -- Eric Niebler Boost Consulting www.boost-consulting.com The Astoria Seminar ==> http://www.astoriaseminar.com

On 8 Aug 2007, at 22:26, Eric Niebler wrote:
Matthias Troyer wrote:
On 7 Aug 2007, at 12:06, Eric Niebler wrote:
I'm sorry you ran into trouble with floating-point offsets. To be honest, they were added to the time_series library as a bit of an after-thought, and they are admittedly not a seamless addition. With integral offsets, runs are in fact half-open. Floating-point runs are not -- it's not clear what that would mean. (E.g., a delta series D has a 42 at [3.14,3.14) -- that's a zero-width half open range! What should D[3.14] return?) Some of the algorithms won't work with floating-point offsets. If you feel this part of the library needs more thought, that's a fair assessment. I'm certain that had you used integral offsets your experience would have been less frustrating.
I think with better docs and more strict type checking, integral and floating point offsets can both be supported without confusion.
Don't confuse the sparse/dense time series with piecewise constant functions. A delta series D has 42 at 3.14, not at any interval [3.14,3.14) - intervals should be used only for the piecewise constant functions.
Ah, but the library is built on top of lower-level abstractions that assume intervals. An interval (a run) is how algorithms on time series are expressed. This design was chosen because it makes it possible to write generic algorithms for lots of different types of series, and it works very well for integral offsets. The question is whether the abstractions upon which time series is built are compatible with a sparse series with floating point offsets and if so, what convention can be used so that the algorithms naturally give the correct results both with points and with runs. The way the library currently handles floating point offsets is /almost/ right, but not quite.
How do you express the delta series as a run then? length 1? Matthias

Matthias Troyer wrote:
On 8 Aug 2007, at 22:26, Eric Niebler wrote:
Matthias Troyer wrote:
On 7 Aug 2007, at 12:06, Eric Niebler wrote:
I'm sorry you ran into trouble with floating-point offsets. To be honest, they were added to the time_series library as a bit of an after-thought, and they are admittedly not a seamless addition. With integral offsets, runs are in fact half-open. Floating-point runs are not -- it's not clear what that would mean. (E.g., a delta series D has a 42 at [3.14,3.14) -- that's a zero-width half open range! What should D[3.14] return?) Some of the algorithms won't work with floating-point offsets. If you feel this part of the library needs more thought, that's a fair assessment. I'm certain that had you used integral offsets your experience would have been less frustrating.
I think with better docs and more strict type checking, integral and floating point offsets can both be supported without confusion.

Don't confuse the sparse/dense time series with piecewise constant functions. A delta series D has 42 at 3.14, not at any interval [3.14,3.14) - intervals should be used only for the piecewise constant functions.
Ah, but the library is built on top of lower-level abstractions that assume intervals. An interval (a run) is how algorithms on time series are expressed. This design was chosen because it makes it possible to write generic algorithms for lots of different types of series, and it works very well for integral offsets. The question is whether the abstractions upon which time series is built are compatible with a sparse series with floating point offsets and if so, what convention can be used so that the algorithms naturally give the correct results both with points and with runs. The way the library currently handles floating point offsets is /almost/ right, but not quite.
How do you express the delta series as a run then? length 1?
For integer offsets, yes. For floating point, no. The time series currently has a notion of an "indivisible" run -- one that cannot be divided into smaller time slices. For float, that is essentially a run like [3.14,3.14] -- a closed range. That mostly works, but it leads to some inconsistent handling of termination conditions, since all other runs are half-open. The solution may involve nothing more than establishing a convention, or it may involve promoting the concept of Point to the same importance as Run and specializing algorithms appropriately. It'll take some thought.

--
Eric Niebler
Boost Consulting
www.boost-consulting.com

The Astoria Seminar ==> http://www.astoriaseminar.com

On 9 Aug 2007, at 17:27, Eric Niebler wrote:
Matthias Troyer wrote:
How do you express the delta series as a run then? length 1?
For integer offsets, yes. For floating point, no. The time series currently has a notion of an "indivisible" run -- one that cannot be divided into smaller time slices. For float, that is essentially a run like [3.14,3.14] -- a closed range. That mostly works, but it leads to some inconsistent handling of termination conditions, since all other runs are half-open. The solution may involve nothing more than establishing a convention, or it may involve promoting the concept of Point to the same importance as Run and specializing algorithms appropriately. It'll take some thought.
I think that this is indeed the main and only problem with floating point offsets and it will take some thought. I am not yet sure which solution would be best, but nothing more than what you mentioned should be needed. Matthias

-----Original Message-----
From: boost-bounces@lists.boost.org [mailto:boost-bounces@lists.boost.org] On Behalf Of John Phillips
Sent: 31 July 2007 13:31
To: boost@lists.boost.org
Cc: boost-users@lists.boost.org
Subject: [boost] Time Series review - 7/30-8/8
* What is your evaluation of the design?

Fundamentally sound and potentially useful in many domains, not just the finance one for which it was designed.

* What is your evaluation of the implementation?

Looks OK. Some details of difficult areas like FP can almost certainly be improved. (See also some detailed notes and questions below.)

* What is your evaluation of the documentation?

Good, but far too steep to tempt most potential users. Needs an outline, a tutorial, and more examples, especially several outside finance.

* What is your evaluation of the potential usefulness of the library?

Although there are many problems for which a plain C array of doubles will suffice, it is surprising how soon more complicated data structures emerge, and that is where I see this package making the difficult much easier.

* Did you try to use the library? With what compiler? Did you have any problems?

No, time/priorities did not permit, but I hope to.

* How much effort did you put into your evaluation?

A quicker reading than I would wish.

* Are you knowledgeable about the problem domain?

Somewhat, but with physical measurements, not finance.

* Am I confident that the library will be maintained?

Yes, definitely.

* Do you think the library should be accepted as a Boost library?

Definitely.

I note that a number of critical comments have been made of the library. I don't think any of these are showstoppers, and I think it would be unfortunate if they prevented acceptance of the library. I am confident that these criticisms can be addressed.

(As an aside, I think it would also set an unfortunate example to reject a high quality library that has been kindly 'donated' by a commercial organisation. No doubt it took some effort to persuade them to do this, and that it took a *lot of effort for them to make a decision to do so*! I believe we want to encourage other organisations to make freely available, through Boost, work that they have funded.)

Some detailed observations:

1 Is the time series restricted to time? The series time unit could be anything - voltage, age - so not an ideal name? I am tempted to suggest Data_series or even Boost.Series. But a rose by any other name would smell as sweet ;-) (Can I claim a brownie point for the first Shakespeare quote in a Boost review? ;-) However, there is plenty of precedent, as http://en.wikipedia.org/wiki/Time_series shows, and time is the most common 'variable'.

2 How would one signify missing data??? Use NaN? As zero?

3 The series types are rather tersely defined. The learning curve would be flattened by more explanation and examples at the point of definition, aiming for lucidity as well as accuracy.

4 The delta_series should at least mention Dirac! Why does Heaviside get his name on a series and Dirac not? A Wikipedia link would be good.

5 Despite the warning, I am sure many will make the mistake of not committing. A stronger admonishment might be useful? "You must .commit() the returned inserter when you are done with it."

6 I'd like to see examples with each definition, and preferably not all from finance, lest people imagine that it is only suitable for finance when it is actually at least as useful for all things physical and even sociological.

7 It may astonish Boosters to know that there are very many potential users who will not understand "This operation is amortized O(1)". Links like [@http://en.wikipedia.org/wiki/Amortized_analysis amortized O(1)] would help them.

8 A proof of application using circular buffer would be most valuable. There are very many people with an endless torrent of data hitting them, without an infinite supply of memory. Showing how to get information from the data flood could sell time_series to them.

9 A single example, in a single finance arena (nice and commented), is really not enough.

10 The docs really need a much more gentle tutorial introduction, so that users can see the wood for the metaprogramming trees.

Typos - I only spotted one, in Zeros and Sparse Data: accomlish should be accomplish.

Paul

---
Paul A Bristow
Prizet Farmhouse, Kendal, Cumbria UK LA8 8AB
+44 1539561830 & SMS, Mobile +44 7714 330204 & SMS
pbristow@hetp.u-net.com

Paul A Bristow wrote:
-----Original Message----- From: boost-bounces@lists.boost.org [mailto:boost-bounces@lists.boost.org] On Behalf Of John Phillips Sent: 31 July 2007 13:31 To: boost@lists.boost.org Cc: boost-users@lists.boost.org Subject: [boost] Time Series review - 7/30-8/8
* What is your evaluation of the design?
Fundamentally sound and potentially useful in many domains, not just the finance one for which it was designed.
* What is your evaluation of the implementation?
Looks OK. Some details of difficult areas like FP can almost certainly be improved. (See also some detailed notes and questions below.)
Agreed.
* What is your evaluation of the documentation?
Good, but far too steep to tempt most potential users. Needs an outline, a tutorial, and more examples, especially several outside finance.
Yes, I'm getting that message loud and clear. :-)
* What is your evaluation of the potential usefulness of the library?
Although there are many problems for which a plain C array of doubles will suffice, it is surprising how soon more complicated data structures emerge, and that is where I see this package making the difficult much easier.
* Did you try to use the library? With what compiler? Did you have any problems?
No, time/ priorities did not permit, but I hope to.
* How much effort did you put into your evaluation?
A quicker reading than I would wish.
* Are you knowledgeable about the problem domain?
Somewhat, but with physical measurements, not finance.
* Am I confident that the library will be maintained?
Yes, definitely.
* Do you think the library should be accepted as a Boost library?
Definitely.
I note that a number of critical comments have been made of the library. I don't think any of these are showstoppers, and I think it would be unfortunate if they prevented acceptance of the library. I am confident that these criticisms can be addressed.
Thank you. I also am confident the FP problem can be dealt with.
(As an aside, I think it would also set an unfortunate example to reject a high quality library that has been kindly 'donated' by a commercial organisation. No doubt it took some effort to persuade them to do this, and that it took a *lot of effort for them to make a decision to do so*! I believe we want to encourage other organisations to make freely available, through Boost, work that they have funded.)
IMHO, if time_series is accepted, I hope it's for its merits and not because it was donated by anyone.
Some detailed observations:
1 Is the time series restricted to time? The series time unit could be anything - voltage, age - so not an ideal name?
The data structures are fairly general, that's true, which is why in another message I suggested moving them into the range_run_storage namespace so they can be reused. A "TimeSeries" is an InfiniteRangeRunStorage with a discretization, and the Time_series library includes a bunch of algorithms that are specific to the time series domain.
So I am tempted to suggest Data_series or even Boost.Series?
But a rose by any other name would smell as sweet ;-)
(Can I claim a brownie point for the first Shakespeare quote in a Boost review? ;-)
+1 brownie point ;-)
However there is plenty of precedent as you give http://en.wikipedia.org/wiki/Time_series and time is the most common 'variable'.
2 How would one signify missing data??? Use NaN? As zero?
NaN works if your data is floating point, some magic number otherwise. You could even use optional<T> as your value type if you were so inclined, so long as you define the appropriate operations on that type.
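
(Editorial note: a sketch of the optional<T> idea Eric mentions above - the "appropriate operations" make missing data propagate through arithmetic. The operators here are hypothetical user code, not part of Boost.Optional.)

    #include <iostream>
    #include <boost/optional.hpp>

    typedef boost::optional<double> sample;

    // Missing data propagates: if either operand is "none", so is the result.
    sample operator+(sample a, sample b)
    { return a && b ? sample(*a + *b) : sample(); }

    sample operator*(sample a, sample b)
    { return a && b ? sample(*a * *b) : sample(); }

    int main()
    {
        sample x = 3.0, missing;
        sample y = x * missing;
        std::cout << (y ? "value" : "missing") << '\n';  // prints "missing"
    }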
3 The series types are rather tersely defined. The learning curve would be flattened by more explanation and examples at the point of definition, aiming for lucidity as well as accuracy.
Agreed.
4 The delta_series should at least mention Dirac! Why does Heaviside get his name on a series and Dirac not? Wikipedia link would be good.
A link to wikipedia would be good. I wouldn't want Dirac to get jealous of Heaviside.
5 Despite the warning, I am sure many will make the mistake of not committing. A stronger admonishment might be useful?
"You must .commit() the returned inserter when you are done with it."
I could do that, but I've been getting lots of negative feedback about .commit(), and I think it might make sense to provide a higher-level wrapper. I could imagine something like:

    sparse_series<int> s;
    assign_ordered(s)
        (1, 2)
        (2, 4)
        (3, 6)
        (4, 8);

Here, assign_ordered() returns a type that is movable and automatically .commit()'s in its destructor. I might also rip out the inserter-as-an-output-iterator duality, as it may end up leading people astray.
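
(Editorial note: one possible shape for the wrapper Eric sketches above - an assigner that buffers (value, offset) pairs and commits them in its destructor, so a forgotten .commit() becomes impossible. Series::set() and every name other than assign_ordered are hypothetical stand-ins; written with C++11 move semantics.)

    #include <map>
    #include <utility>

    // Buffers (value, offset) pairs; commits to the series on destruction.
    template<class Series>
    class ordered_assigner
    {
        Series *s_;                      // null once moved-from
        std::map<int, double> pending_;  // offset -> value

    public:
        explicit ordered_assigner(Series &s) : s_(&s) {}

        ordered_assigner(ordered_assigner &&o)
          : s_(o.s_), pending_(std::move(o.pending_)) { o.s_ = nullptr; }

        ordered_assigner &operator()(double value, int offset)
        {
            pending_[offset] = value;
            return *this;
        }

        ~ordered_assigner()  // the automatic "commit"
        {
            if (s_)
                for (std::map<int, double>::iterator i = pending_.begin();
                     i != pending_.end(); ++i)
                    s_->set(i->first, i->second);  // set(offset, value):
                                                   // hypothetical setter
        }
    };

    template<class Series>
    ordered_assigner<Series> assign_ordered(Series &s)
    {
        return ordered_assigner<Series>(s);
    }

    // Usage, mirroring Eric's sketch; the commit happens at the end of
    // the full expression, when the temporary assigner is destroyed:
    //   sparse_series<int> s;
    //   assign_ordered(s)(1, 2)(2, 4)(3, 6)(4, 8);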
6 I'd like to see examples with each definition, and preferably not all from finance, lest people imagine that it is only suitable for finance when it is actually at least as useful for all things physical and even sociological.
7 It may astonish Boosters to know that there are very many potential users who will not understand "This operation is amortized O(1)". Links like [@http://en.wikipedia.org/wiki/Amortized_analysis amortized O(1)] would help them.
Good point.
8 A proof of application using circular buffer would be most valuable. There are very many people with an endless torrent of data hitting them, without an infinite supply of memory. Showing how to get information from the data flood could sell time_series to them.
9 A single example, in a single finance arena (nice and commented), is really not enough.
10 The docs really need a much more gentle tutorial introduction, so that users can see the wood for the metaprogramming trees.
Right.
Typos - I only spotted one in Zeros and Sparse Data
accomlish should be accomplish.
Thanks for your feedback.

--
Eric Niebler
Boost Consulting
www.boost-consulting.com

The Astoria Seminar ==> http://www.astoriaseminar.com
participants (7)
- Eric Niebler
- John Phillips
- Matthias Troyer
- Paul A Bristow
- Steven Watanabe
- Stjepan Rajko
- Zach Laine