
I'm pleased to announce the availability of a new library for computing with time series (http://en.wikipedia.org/wiki/Time_series). From the documentation:
The purpose of the Boost.Time_series library is to provide data structures, numerical operators and algorithms to operate on time series. A time series is a series of data points, sampled at regular intervals. The library provides numerous time series containers, each with different time/space trade-offs, and a hierarchy of concepts which allow the time series to be manipulated generically. The library also provides operators and algorithms which use the generic interfaces to perform calculations on time series and accumulate various statistics about them.
Boost.Time_series does not yet contain all the algorithms one might want in order to perform full time series analysis. However, the key contribution of Boost.Time_series is the framework and the rich hierarchy of concepts with which such algorithms can be written to efficiently and generically process series of data with widely divergent in-memory representations and performance characteristics. Boost.Time_series provides several such series containers, as well as mechanisms for defining additional series types and algorithms that fit within the framework. Some examples of series types that are provided are: dense, sparse, piecewise constant, heaviside and others, as well as adaptors for providing shifted, scaled and clipped series views.
The documentation can be found here: http://boost-sandbox.sf.net/libs/time_series/doc/html/index.html
The code is in time_series.zip, which can be downloaded here: http://boost-consulting.com/vault/index.php?directory=Math%20-%20Numerics
Any suggestions for improvements are most welcome.
-- Eric Niebler Boost Consulting www.boost-consulting.com

Eric Niebler wrote:
I'm pleased to announce the availability of a new library for computing with time series (http://en.wikipedia.org/wiki/Time_series). From the documentation:
The purpose of the Boost.Time_series library is to provide data structures, numerical operators and algorithms to operate on time series. A time series is a series of data points, sampled at regular intervals. The library provides numerous time series containers, each with different time/space trade-offs, and a hierarchy of concepts which allow the time series to be manipulated generically. The library also provides operators and algorithms which use the generic interfaces to perform calculations on time series and accumulate various statistics about them.
Boost.Time_series does not yet contain all the algorithms one might want in order to perform full time series analysis. However, the key contribution of Boost.Time_series is the framework and the rich hierarchy of concepts with which such algorithms can be written to efficiently and generically process series of data with widely divergent in-memory representations and performance characteristics. Boost.Time_series provides several such series containers, as well as mechanisms for defining additional series types and algorithms that fit within the framework. Some examples of series types that are provided are: dense, sparse, piecewise constant, heaviside and others, as well as adaptors for providing shifted, scaled and clipped series views.
The documentation can be found here: http://boost-sandbox.sf.net/libs/time_series/doc/html/index.html
The code is in time_series.zip, which can be downloaded here: http://boost-consulting.com/vault/index.php?directory=Math%20-%20Numerics
Any suggestions for improvements are most welcome.
Hi Eric -

Looks very interesting. Feedback here is based on an extremely cursory look at the docs. And I'll say that I'm not a huge domain expert on the mathematics, but I've run into some pretty simple time series in a couple of apps over the years.

In particular, I'm sorta surprised there isn't a moving average algorithm, as that's a time series algorithm with really wide applicability. Classic example -- calculate a 10 day rolling average of the closing price of some stock. Or calculate a 20 minute rolling average of the trading prices or quotes. And so on... there are all sorts of other applications for rolling averages.

So now for the *other* question. Am I correct in seeing that with the time series facade I might be able to plug in a different offset_type? Of course, what I'm thinking is that for the stock price rolling average example above you might want to use gregorian::date or posix_time::ptime to represent the point of measurement of the stock price. I wasn't finding the requirements for the offset_type... which is what I think would need to be replaced.

I also find the Discretization intervals interesting. Is that a standard way of handling the breakdown for statistical purposes? The thing being that only days and weeks are actually fixed lengths in the *messy world*... months, quarters, etc. are all different depending on leap years and such.

Anyway, nice job as usual :-)

Jeff

Jeff Garland wrote:
Eric Niebler wrote: <snip>
Any suggestions for improvements are most welcome.
Hi Eric -
Looks very interesting. Feedback here is based on an extremely cursory look at the docs. And I'll say that I'm not a huge domain expert on the mathematics, but I've run into some pretty simple time series in a couple of apps over the years.
In particular, I'm sorta surprised there isn't a moving average algorithm, as that's a time series algorithm with really wide applicability. Classic example -- calculate a 10 day rolling average of the closing price of some stock. Or calculate a 20 minute rolling average of the trading prices or quotes. And so on... there are all sorts of other applications for rolling averages.
Yep, that'd be a nice one to have. As I say in the description, the library is light on algorithms at the moment; the key is the infrastructure that makes the algorithms possible.
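For illustration, here is a minimal sketch of the kind of rolling average Jeff describes, written against a plain std::vector rather than a series container. This is not Boost.Time_series code, just the algorithm itself; the function name is made up for the example.

#include <cstddef>
#include <vector>

std::vector<double> rolling_average(std::vector<double> const & prices, std::size_t window)
{
    std::vector<double> result;
    if (window == 0 || prices.size() < window)
        return result;                        // not enough data for a single window
    double sum = 0.0;
    for (std::size_t i = 0; i < prices.size(); ++i)
    {
        sum += prices[i];                     // newest point enters the window
        if (i >= window)
            sum -= prices[i - window];        // oldest point drops out
        if (i + 1 >= window)
            result.push_back(sum / window);   // first output once the window is full
    }
    return result;
}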
So now for the *other* question. Am I correct in seeing that with the time series facade I might be able to plug in a different offset_type?
Yes.
Of course, what I'm thinking is that for the stock price rolling average example above you might want to use gregorian::date or posix_time::ptime to represent the point of measurement of the stock price. I wasn't finding the requirements for the offset_type... which is what I think would need to be replaced.
Right, I should document the requirements on the offset_type. The code has been tested with offset_types of int and double. I think the right concept for offset_type is LessThanComparable.
I also find the Discretization intervals interesting. Is that a standard way of handling the breakdown for statistical purposes? The thing being that only days and weeks are actually fixed lengths in the *messy world*... months, quarters, etc. are all different depending on leap years and such.
The Discretization template parameter is mostly there just for type checking your operations. You don't want to be adding a daily series with a yearly one, because that doesn't make sense. The fact that some years are leap years doesn't matter for the purpose of type checking.
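As an illustration of that kind of type checking (not the library's actual mechanism -- the tag and series names below are hypothetical), a discretization carried as a distinct type makes mixing granularities a compile error:

#include <boost/static_assert.hpp>
#include <boost/type_traits/is_same.hpp>

struct daily {};    // hypothetical discretization tags
struct yearly {};

template<typename Value, typename Discretization>
struct toy_series
{
    // storage omitted -- only the type checking matters here
};

template<typename Value, typename D1, typename D2>
toy_series<Value, D1> operator+(toy_series<Value, D1> const &, toy_series<Value, D2> const &)
{
    // adding a daily series to a yearly one fails to compile
    BOOST_STATIC_ASSERT((boost::is_same<D1, D2>::value));
    return toy_series<Value, D1>();
}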
Anyway, nice job as usual :-)
Thanks! :-) -- Eric Niebler Boost Consulting www.boost-consulting.com

Eric Niebler wrote:
Jeff Garland wrote:
In particular, I'm sorta surprised there isn't a moving average algorithm, as that's a time series algorithm with really wide applicability. Classic example -- calculate a 10 day rolling average of the closing price of some stock. Or calculate a 20 minute rolling average of the trading prices or quotes. And so on... there are all sorts of other applications for rolling averages.
Yep, that'd be a nice one to have. As I say in the description, the library is light on algorithms at the moment; the key is the infrastructure that makes the algorithms possible.
Yep...just a suggestion for future expansion :)
Of course, what I'm thinking is that for the stock price rolling average example above you might want to use gregorian::date or posix_time::ptime to represent the point of measurement of the stock price. I wasn't finding the requirements for the offset_type... which is what I think would need to be replaced.
Right, I should document the requirements on the offset_type. The code has been tested with offset_types of int and double. I think the right concept for offset_type is LessThanComparable.
Well, here's the thing. In date-time the 'duration_type' and the 'point_in_time_type' are different. For example, with dates you have 'date' as the 'point_in_time' which is conceptually dimensionless and the duration_type 'day'. If you subtract 2 points in time you get a duration. So it's possible there would need to be a different type in the mix to make things work. Anyway, this will be a good experiment at some point. I'd do it myself if I only had more time.... ;-)
I also find the Discretization intervals interesting. Is that a standard way of handling the breakdown for statistical purposes? The thing being that only days and weeks are actually fixed lengths in the *messy world*... months, quarters, etc. are all different depending on leap years and such.
The Discretization template parameter is mostly there just for type checking your operations. You don't want to be adding a daily series with a yearly one, because that doesn't make sense. The fact that some years are leap years doesn't matter for the purpose of type checking.
Got it. Jeff

Jeff Garland wrote:
Eric Niebler wrote:
Jeff Garland wrote:
Of course, what I'm thinking is that for the stock price rolling average example above you might want to use gregorian::date or posix_time::ptime to represent the point of measurement of the stock price. I wasn't finding the requirements for the offset_type... which is what I think would need to be replaced.
Right, I should document the requirements on the offset_type. The code has been tested with offset_types of int and double. I think the right concept for offset_type is LessThanComparable.
Well, here's the thing. In date-time the 'duration_type' and the 'point_in_time_type' are different. For example, with dates you have 'date' as the 'point_in_time' which is conceptually dimensionless and the duration_type 'day'. If you subtract 2 points in time you get a duration. So it's possible there would need to be a different type in the mix to make things work. Anyway, this will be a good experiment at some point. I'd do it myself if I only had more time.... ;-)
Ah. So the type of (offset - offset) is not the same as the type of offset. That's not a problem in theory, but certainly the code doesn't handle that situation right now. Thanks for the feedback. -- Eric Niebler Boost Consulting www.boost-consulting.com

Eric Niebler wrote:
Jeff Garland wrote:
Jeff Garland wrote:
Of course, what I'm thinking is that for the stock price rolling average example above you might want to use gregorian::date or posix_time::ptime to represent the point of measurement of the stock price. I wasn't finding the requirements for the offset_type... which is what I think would need to be replaced.
Right, I should document the requirements on the offset_type. The code has been tested with offset_types of int and double. I think the right concept for offset_type is LessThanComparable.
Well, here's the thing. In date-time the 'duration_type' and the 'point_in_time_type' are different. For example, with dates you have 'date' as the 'point_in_time' which is conceptually dimensionless and the duration_type 'day'. If you subtract 2 points in time you get a duration. So it's possible there would need to be a different type in the mix to make things work. Anyway, this will be a good experiment at some point. I'd do it myself if I only had more time.... ;-)
Eric Niebler wrote:
Ah. So the type of (offset - offset) is not the same as the type of offset.
Well, it depends on what we mean by an offset. From a concept perspective in date-time it goes something like this:

TimePoint - TimePoint --> Duration
TimePoint - Duration  --> TimePoint
Duration  - Duration  --> Duration

Concretely realized, this means:

date - date --> days
date - days --> date
days - days --> days

Same for the time types, only at a higher resolution.
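For reference, the same concept arithmetic realized with the existing Boost.Date_Time types (a small compilable sketch; the particular dates are arbitrary):

#include <boost/date_time/gregorian/gregorian.hpp>

int main()
{
    using namespace boost::gregorian;

    date d1(2006, Dec, 11);
    date d2(2006, Dec, 21);

    days dur = d2 - d1;        // TimePoint - TimePoint --> Duration (10 days)
    date d3  = d2 - days(3);   // TimePoint - Duration  --> TimePoint
    days rem = dur - days(4);  // Duration  - Duration  --> Duration

    return 0;
}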
That's not a problem in theory, but certainly the code doesn't handle that situation right now.
Yep, that's kinda what I was seeing. Jeff

On 12/12/06, Jeff Garland <jeff@crystalclearsoftware.com> wrote:
Eric Niebler wrote:
I'm pleased to announce the availability of a new library for computing with time series (http://en.wikipedia.org/wiki/Time_series). From the documentation:
Looks very nicely thought out. I'm continually re-inventing series containers and would like to stop ;-)

I can see how the discretization type could help many types of applications, though it doesn't suit most of the styles I'm used to dealing with. This is due to many time series having missing points at similar periods of time. For example, your library seems to have been developed with financial daily and above time frames in mind, though it is obviously not limited to this. Holidays in different markets come into play in the sequences even though the discretization would be the same.

To solve this I like the concept of "clocks" from intensional programming. Basically, if two series use the same clock then indexed offsets into the sequence make sense; otherwise a matching procedure has to be used, of which the most typical is: matched time = most recent time <= reference time. Some input clock has to be the reference, which is also used for the output. It is not the only way; for example, sometimes only what I call correlated matching makes sense, that is, the time exists in both (or all, if there are more than two) inputs.

This way you get the benefit of direct sequencing when clocks are the same and fast lookups when they are not. Fast lookups are based on the discretization. Like looking up a name in the phone book, if it is a V you go near the back. Calculate the density (num points / period), make an educated guess as to the location, and binary search from there. This scheme mixes microsecond data and annual data quite freely.

Algorithms may chew up degrees of freedom and shorten series, but the clocks will remain the same. For example, a simple moving average over 10 days will not be relevant on the first 9 points. You've chewed up 9 points and your output may reflect this. This is just a simple case. Windowing functions can chew up forward and backwards. Some algorithms may have accuracy requirements that impose minimum input requirements. A simple case is determining the number of points you need to get a certain accuracy for an exponential moving average, which deals with a weighted sum of infinite points.

Where this puppy ends up being quite different is that you want times, real times, associated with the series. The obvious thing to do is tuple them, but this messes up passing blocks of data around efficiently to things that only want to deal with sequences and don't care about the time, but sometimes timed tuples make more efficient sense. So you need flexible mappings and alternative types of containers per situation.

So, when I look at this lib, it looks like a neat way to capture singly clocked series, but it also appears that perhaps it is meant to handle multiple clocks, given it may be handling daily financial data where holidays come into play, judging by the acknowledgements section crediting Zürcher Kantonalbank. That is, it seems the discretization is a proxy for a clock that suits a particular use. I'm not sure if you can see a way to consider a more flexibly "clocked" POV rather than your current discretization scheme. It seems quite different but tantalisingly close to what you have.

Thus I'd think of this library as more of a series library than a time series library, if such a distinction wasn't just made up by me ;-)

Regards, Matt.

Hi, Matt.
Matt Hurd wrote:
On 12/12/06, Jeff Garland <jeff@crystalclearsoftware.com> wrote:
Eric Niebler wrote:
I'm pleased to announce the availability of a new library for computing with time series (http://en.wikipedia.org/wiki/Time_series). From the documentation:
Looks very nicely thought out.
I'm continually re-inventing series containers and would like to stop ;-)
I can see how the discretization type could help many types of applications, though it doesn't suit most of the styles I'm used to dealing with.
This is due to many time series having missing points at similar periods of time. For example, your library seems to have been developed with financial daily and above time frames in mind, though
Yes.
it is obviously not limited to this. Holidays in different markets come into play in the sequences even though the discretization would be the same. To solve this I like the concept of "clocks" from intensional programming. Basically, if two series use the same clock then indexed offsets into the sequence make sense; otherwise a matching procedure has to be used, of which the most typical is: matched time = most recent time <= reference time. Some input clock has to be the reference, which is also used for the output. It is not the only way; for example, sometimes only what I call correlated matching makes sense, that is, the time exists in both (or all, if there are more than two) inputs.
So in this case, a "clock" is a discretization *and* a set of "holiday" offsets for which there is no data?
This way you get the benefit of direct sequencing when clocks are the same and fast lookups when they are not. Fast lookups are based on the discretization. Like looking up a name in the phone book, if it is a V you go near the back. Calculate the density (num points/ period) and make an educated guess as to the location and binary search from there. This scheme mixes microsecond data and annual data quite freely.
I'm having a hard time trying to imagine what an interface that uses clocks instead of discretizations would look like. Could you mock up an example (pseudo-code is fine)? English is a poor substitute for code.
Algorithms may chew up degrees of freedom and shorten series, but the clocks will remain the same. For example, a simple moving average over 10 days will not be relevant on the first 9 points. You've chewed up 9 points and your output may reflect this. This is just a simple case. Windowing functions can chew up forward and backwards. Some algorithms may have accuracy requirements that impose minimum input requirements. A simple case is determining the number of points you need to get a certain accuracy for an exponential moving average, which deals with a weighted sum of infinite points.
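As an aside on that EMA point: under the usual definition where each new sample carries weight alpha, the weight ignored by truncating after N samples is (1 - alpha)^N, so a rough minimum-length estimate is just a logarithm. A small sketch (plain C++, not library code; the function name is made up):

#include <cmath>
#include <cstddef>

// smallest N with (1 - alpha)^N <= eps, i.e. the truncation error bound
std::size_t ema_min_length(double alpha, double eps)
{
    return static_cast<std::size_t>(std::ceil(std::log(eps) / std::log(1.0 - alpha)));
}
// e.g. alpha = 0.1, eps = 0.01  -->  ceil(log(0.01) / log(0.9)) = 44 points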
Where this puppy ends up being quite different is that you want times, real times, associated with the series. The obvious thing to do is tuple them, but this messes up passing blocks of data around efficiently to things that only want to deal with sequences and don't care about the time, but sometimes timed tuples make more efficient sense. So you need flexible mappings and alternative types of containers per situation.
I think the current framework can handle this situation quite naturally. The offsets need not be integral multiples of the discretization. The offsets can be floating point, which can be used to represent exact times. Or as Jeff Garland suggested, the offset could in theory be a posix ptime (not tested). That way you wouldn't have to pass around a separate vector representing the times, or make your data tuple with times. Apologies if I misunderstood your suggestion.
So, when I look at this lib, it looks as a neat way to capture singly clocked series, but it also appears that perhaps it is meant to handle multiple clocks given it may be handling daily financial data where holidays come into play based on the acknowledgements section crediting Zürcher Kantonalbank. That is, it seems the discretization is a proxy for clock that suits a particular use.
I'm not sure if you can see a way to consider a more flexibly "clocked" POV rather than your current discretization scheme. It seems quite different but tantalisingly close to what you have.
I'd be interested in hearing more about clocks, but I don't feel I understand well enough to say whether the current design can accommodate them.
Thus I'd think of this library more of a series library than a time series library if such a distinction wasn't just made up by me ;-)
Thanks for your feedback! -- Eric Niebler Boost Consulting www.boost-consulting.com

Hi Eric,
So in this case, a "clock" is a discretization *and* a set of "holiday" offsets for which there is no data?
Not really. A clock is a sequence of times, a ptime for example. Each value point in the series has a corresponding time point, sometimes referred to as the "valid start time", as this is the time the data becomes the truth from. Logically it is simply a sequence (Vs, Data).

If series have the same underlying clock you can just use index operations freely, which are fast. If they don't have the same underlying clock you have to match based on time points. Your discretization, say daily, for stock data from 20 different countries would invite trouble, as the different clocks, due to different holiday schedules, could not be mixed by index offsets even though they were all daily. Similarly, for stocks at one exchange, data points are missing due to suspensions and other reasons, meaning clocks are often a little different. Without going to clocks you could simply call them a different discretization in your scheme by assigning a different integer and thus ensuring they don't mix. However, you often wish to mix such things.

The neat thing about using the clock mechanism is that you can freely mix sub-second data, annual data and everything in between without much fuss, and if the data does derive from the same clock you simply use direct offsets. Last time I did an implementation the clocks were reference-counted attributes of the series.

Perhaps having ptime offsets is part of the answer. It seems you have something very close to the solution lurking in there.
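The matching rule described here ("most recent time <= reference time") is easy to sketch against a sorted sequence of time points; this is plain C++ for illustration, not part of either library:

#include <algorithm>
#include <vector>

// given a sorted "clock" (sequence of time points), find the most recent
// point that is <= the reference time; returns clock.end() if none exists
template<typename Time>
typename std::vector<Time>::const_iterator
match(std::vector<Time> const & clock, Time const & reference)
{
    typename std::vector<Time>::const_iterator it =
        std::upper_bound(clock.begin(), clock.end(), reference);
    if (it == clock.begin())
        return clock.end();   // every point is later than the reference
    return --it;              // step back to the most recent point <= reference
}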
I'm having a hard time trying to imagine what an interface that uses clocks instead of discretizations would look like. Could you mock up an example (pseudo-code is fine)? English is a poor substitute for code.
Especially my English ;-) I'll look to dig up some old code. I think your code looks very nice for its intended domain. Regards, Matt.

Matt Hurd wrote:
Hi Eric,
So in this case, a "clock" is a discretization *and* a set of "holiday" offsets for which there is no data?
Not really. A clock is a sequence of times, a ptime for example. Each value point in the series has a corresponding time point, sometimes referred to as the "valid start time", as this is the time the data becomes the truth from. Logically it is simply a sequence (Vs, Data).
<snip>
Perhaps having ptime offsets is part of the answer. It seems you have something very close to the solution lurking in there.
Yes, I think what is there already can be pressed into service to achieve the same effect. Conceptually, a TimeSeries (as defined by Boost.Time_series) is two sequences: the elements and the runs. The elements are the values of the time series, and the runs are a sequence of (offset, end offset) pairs [*]. And the offsets can be integral or floating point (or ptimes, I guess). So if I'm understanding correctly, all we need is a flexible way to view the elements of one time series given the runs of another. Does that make sense?
I'm having a hard time trying to imagine what an interface that uses clocks instead of discretizations would look like. Could you mock up an example (pseudo-code is fine)? English is a poor substitute for code.
Especially my English ;-)
I'll look to dig up some old code.
That would be great.
I think your code looks very nice for its intended domain.
Thanks!
[*] Obviously, a time series doesn't need to be stored in memory that way, so long as it's able to present that interface. A sparse series need not store end offsets, and a dense series need not store any offsets at all -- the runs sequence can be generated on the fly.
-- Eric Niebler Boost Consulting www.boost-consulting.com

I (tried to) send this yesterday but had some problem with my mailer. I see the idea of nulls has been discussed a bit by Matt Hurd and James Jones.
Eric Niebler wrote:
Hi, Matt.
Matt Hurd wrote:
On 12/12/06, Jeff Garland <jeff@crystalclearsoftware.com> wrote:
Eric Niebler wrote:
I'm pleased to announce the availability of a new library for computing with time series (http://en.wikipedia.org/wiki/Time_series). From the documentation:
Looks very nicely thought out.
I'm continually re-inventing series containers and would like to stop ;-)
I can see how the discretization type could help many types of applications, though it doesn't suit most of the styles I'm used to dealing with.
[snip]
it is obviously not limited to this. Holidays in different markets come into play in the sequences even though the discretization would be the same. To solve this I like the concept of "clocks" from intensional programming. Basically, if two series use the same clock then indexed offsets into the sequence make sense; otherwise a matching procedure has to be used, of which the most typical is: matched time = most recent time <= reference time. Some input clock has to be the reference, which is also used for the output. It is not the only way; for example, sometimes only what I call correlated matching makes sense, that is, the time exists in both (or all, if there are more than two) inputs.
So in this case, a "clock" is a discretization *and* a set of
"holiday" offsets for which there is no data?
This way you get the benefit of direct sequencing when clocks are the same and fast lookups when they are not. Fast lookups are based on the discretization. Like looking up a name in the phone book, if it is a V you go near the back. Calculate the density (num points/ period) and make an educated guess as to the location and binary search from there. This scheme mixes microsecond data and annual data quite freely.
I'm having a hard time trying to imagine what an interface that uses clocks instead of discretizations would look like. Could you mock up an example (pseudo-code is fine)? English is a poor substitute for code.
I'm not sure what this application domain calls for in these circumstances, but surely this is non-uniform sampling, which is a fairly different beast algorithmically compared to uniform sampling. I can see that in some cases the missing "holidays" should just be ignored (those instants don't really exist) but in others (especially if considering international markets) I would imagine "the market never sleeps" and events occurring at any time could be significant. Would an efficient time/value pair -> uniform sampling conversion (implemented as an adapter) be enough? Or would it need to preserve the discrete samples/time information somehow? In the latter case it would seem that a full non-uniform sampling model would be needed? Would a unique representation for a "null" (missing) sample help?
Algorithms may chew up degrees of freedom and shorten series, but the clocks will remain the same. For example, a simple moving average over 10 days will not be relevant on the first 9 points. You've chewed up 9 points and your output may reflect this. This is just a simple case. Windowing functions can chew up forward and backwards. Some algorithms may have accuracy requirements that impose minimum input requirements. A simple case is determining the number of points you need to get a certain accuracy for an exponential moving average, which deals with a weighted sum of infinite points.
I understand this bit.
Where this puppy ends up being quite different is that you want times, real times, associated with the series. The obvious thing to do is tuple them, but this messes up passing blocks of data around efficiently to things that only want to deal with sequences and don't care about the time, but sometimes timed tuples make more efficient sense. So you need flexible mappings and alternative types of containers per situation.
I'm less clear on this, though it does sound interesting.
I think the current framework can handle this situation quite
naturally. The offsets need not be integral multiples of the discretization. The offsets can be floating point, which can be used to represent exact times. Or as Jeff Garland suggested, the offset could in theory be a posix ptime (not tested). That way you wouldn't have to pass around a separate vector representing the times, or make your data tuple with times.
It was only when I read this that I took another look at the docs and realized that your library is using a non-uniform sampling model. This is interesting - but could make some algorithms hard to write and/or much less efficient? I have only a little experience with non-uniform sampling, but I guess this comes back to just how broad a domain you are trying to cover with this library. Is it intended to be usable where performance is very important, or is it trading off performance for generality of representation? Of course, if your data is really sparse and inherently sampled in a non-uniform way, efficiency may well be increased by non-uniform sampling approaches, not reduced.

I'd be interested in knowing just where you think the library's strengths/applications are vs. more general purpose maths libs. That's not to say I don't think it has any :-) I'm just not sure of the intended scope/emphasis. I guess the uniform sampling case is well catered for by standard vector and matrix libs, and the non-uniform aspect is the main distinguishing feature?

Regards Darryl.

Darryl Green wrote:
I (tried to) send this yesterday but had some problem with my mailer. I see the idea of nulls has been discussed a bit by Matt Hurd and James Jones.
Eric Niebler wrote:
I think the current framework can handle this situation quite
naturally. The offsets need not be integral multiples of the discretization. The offsets can be floating point, which can be used to represent exact times. Or as Jeff Garland suggested, the offset could in theory be a posix ptime (not tested). That way you wouldn't have to pass around a separate vector representing the times, or make your data tuple with times.
It was only when I read this that I took another look at the docs and realized that your library is using a non-uniform sampling model. This is interesting - but could make some algorithms hard to write and/or much less efficient?
<snip> I think you misunderstood, or I was unclear. The library accommodates both uniform and non-uniform sampling. In another reply, I wrote:
Yes, I think what is there already can be pressed into service to achieve the same effect. Conceptually, a TimeSeries (as defined by Boost.Time_series) is two sequences: the elements and the runs. The elements are the values of the time series, and the runs are a sequence of (offset, end offset) pairs [*]. And the offsets can be integral or floating point (or ptimes, I guess). So if I'm understanding correctly, all we need is a flexible way to view the elements of one time series given the runs of another. Does that make sense?
[*] Obviously, a time series doesn't need to be stored in memory that way, so long as it's able to present that interface. A sparse series need not store end offsets, and a dense series need not store any offsets at all -- the runs sequence can be generated on the fly.
Take the case of a dense_series. It has uniform sampling. Its "runs" sequence has the property of being "dense", meaning that each run is of unit length, and the run offsets are monotonically increasing. Obviously, finding a sample with a particular offset in a dense series is O(1), but not in a sparse series. Algorithms can test for density using a trait and select an optimal implementation. The default implementation would handle series with non-uniform sampling. It's all done using compile time traits, so there is zero runtime overhead. HTH, -- Eric Niebler Boost Consulting www.boost-consulting.com
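To illustrate the dispatch being described (the trait and function names below are hypothetical, not the library's actual interface), a compile-time trait can route dense series to an O(1) lookup and everything else to a search, with no runtime cost:

#include <boost/mpl/bool.hpp>

template<typename Series>
struct is_dense : boost::mpl::false_ {};       // default: assume non-uniform storage
// ...specialized as boost::mpl::true_ for dense series types...

template<typename Series>
double value_at(Series const & s, int offset, boost::mpl::true_)    // dense: O(1) lookup
{
    return s[offset];
}

template<typename Series>
double value_at(Series const & s, int offset, boost::mpl::false_)   // otherwise: search the runs
{
    // ...binary-search the run sequence for the offset...
    return 0.0;   // placeholder for the sketch
}

template<typename Series>
double value_at(Series const & s, int offset)
{
    return value_at(s, offset, typename is_dense<Series>::type());
}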

Eric Niebler wrote:
Darryl Green wrote:
It was only when I read this that I took another look at the docs and realized that your library is using a non-uniform sampling model. This is interesting - but could make some algorithms hard to write and/or much less efficient?
I think you misunderstood, or I was unclear. The library accommodates both uniform and non-uniform sampling. In another reply, I wrote:
I was unclear - I (mostly) understood this, I should have said "also accommodates non-uniform sampling" not "is using a non-uniform...".
Take the case of a dense_series. It has uniform sampling. Its "runs" sequence has the property of being "dense", meaning that each run is of unit length, and the run offsets are monotonically increasing. Obviously, finding a sample with a particular offset in a dense series is O(1), but not in a sparse series. Algorithms can test for density using a trait and select an optimal implementation. The default implementation would handle series with non-uniform sampling.
Makes sense.
It's all done using compile time traits, so there is zero runtime overhead.
Yes, that's true. I was imprecise, and the efficiency (of non-uniform vs uniform algs) is going to be dependent on the type of data (sparseness) and the type of processing being done. The ability to use the same lib for both seems like a great feature. I need to review my non-uniform sampling theory (not something I've ever done much of, and I'm pretty rusty on all this stuff anyway) *and* actually try using your library before I comment any further. Thanks Darryl

Eric Niebler said: (by the date of Mon, 11 Dec 2006 15:51:20 -0800)
The purpose of the Boost.Time_series library is to provide data structures, numerical operators and algorithms to operate on time series. A time series is a series of data points, sampled at regular intervals.
I probably won't be able to look at this now, but I was recently working with data obtained from experiments, and I faced a particular problem. I'm curious if your library solves it.

I had two distinct gauges which produced data over time. One gauge had a sample rate of once every 15 seconds, and the other had a sample rate of once every 2 seconds. Like this:

A) F - force value every 15 seconds
B) d - displacement value every 2 seconds.

So I had (F,t) and (d,t), where 't' is time. I wanted to make a plot (F,d).

I had to write a simple linear interpolation between two points. Then I was able to resample the data to have a value once every 1 second. Then I was able to plot (F,d).

Can I use your library to do this? I.e. store data in a container using a time interval t1, then read the data using another time interval t2, and the values will be automatically interpolated when reading.

-- Janek Kozicki |
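For reference, the resampling step Janek describes can be sketched with plain (time, value) pairs and linear interpolation; this is not Boost.Time_series code, just the idea, and it assumes at least two samples sorted by time:

#include <cstddef>
#include <utility>
#include <vector>

// samples: (time, value) pairs sorted by time; returns the value at time t,
// linearly interpolated, clamped to the last value past the end
double interpolate(std::vector<std::pair<double, double> > const & samples, double t)
{
    for (std::size_t i = 1; i < samples.size(); ++i)
    {
        if (t <= samples[i].first)
        {
            double t0 = samples[i-1].first, v0 = samples[i-1].second;
            double t1 = samples[i].first,   v1 = samples[i].second;
            return v0 + (v1 - v0) * (t - t0) / (t1 - t0);
        }
    }
    return samples.back().second;
}
// Evaluating both gauges at t = 0, 1, 2, ... seconds gives the (F, d) pairs to plot.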

Hi, Janek.
Janek Kozicki wrote:
Eric Niebler said: (by the date of Mon, 11 Dec 2006 15:51:20 -0800)
The purpose of the Boost.Time_series library is to provide data structures, numerical operators and algorithms to operate on time series. A time series is a series of data points, sampled at regular intervals.
I probably won't be able to look at this now, but I was recently working with data obtained from experiments, and I faced a particular problem. I'm curious if your library solves it.
I had two distinct gauges which produced data over time. One gauge had a sample rate of once every 15 seconds, and the other had a sample rate of once every 2 seconds. Like this:
A) F - force value every 15 seconds B) d - displacement value every 2 seconds.
So I had (F,t) and (d,t), where 't' is time. I wanted to make a plot (F,d).
I had to write a simple linear interpolation between two points. Then I was able to resample the data to have a value once every 1 second. Then I was able to plot (F,d).
Can I use your library to do this? I.e. store data in a container using a time interval t1, then read the data using another time interval t2, and the values will be automatically interpolated when reading.
This is on the ToDo list. There are coarse_grain() and fine_grain() algorithms for converting series from one discretization to another, but they are quite limited at the moment. (One discretization must be an integral multiple of the other, no interpolation is done.) What we need is an interpolating adapter. It would wrap a TimeSeries of one discretization and make it look like it has a different discretization, interpolating on the fly. Probably wouldn't be too hard. -- Eric Niebler Boost Consulting www.boost-consulting.com

Eric Niebler said: (by the date of Tue, 12 Dec 2006 09:59:19 -0800)
What we need is an interpolating adapter. It would wrap a TimeSeries of one discretization and make it look like it has a different discretization, interpolating on the fly. Probably wouldn't be too hard.
Apart from linear interpolation there are several others available. In my research I needed to find "the best" interpolation function. If you want to know what the criterion of "the best" is, I can send you a .pdf of my article that will soon appear in some journal.

The best interpolation function in my research was sinc256, given at the very bottom of this webpage: http://www.path.unimelb.edu.au/%7Edersch/interpolator/interpolator.html In Firefox this formula is displayed wrongly; the '1/4' symbols should be replaced with PI. That webpage is about image interpolation in 2D, but it's quite simple to use it in 1D: simply assume y=0. Sinc256 will require 16 data points around the interpolated value (8 on the left, 8 on the right). Another, slightly worse, interpolation is spline36 (also on that webpage); it needs 6 points around the value (3 on the left, 3 on the right).

This is for you, just for future reference, for when you decide that you need an interpolation function for your library :)

-- Janek Kozicki |
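For context, sinc256 is a windowed variant of the basic truncated-sinc reconstruction; the sketch below shows only the underlying truncated-sinc idea for a unit-spaced series (16 nearest samples), not the specific kernel from the linked page:

#include <cmath>
#include <vector>

double sinc(double x)
{
    if (x == 0.0)
        return 1.0;
    double px = 3.14159265358979 * x;
    return std::sin(px) / px;
}

// value at (fractional) position t in a unit-spaced series, using the 16
// nearest samples (8 on each side), each weighted by the sinc kernel
double sinc_interpolate(std::vector<double> const & samples, double t)
{
    int centre = static_cast<int>(std::floor(t));
    double sum = 0.0;
    for (int k = centre - 7; k <= centre + 8; ++k)
    {
        if (k >= 0 && k < static_cast<int>(samples.size()))
            sum += samples[k] * sinc(t - k);
    }
    return sum;
}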

Janek Kozicki wrote:
Eric Niebler said: (by the date of Tue, 12 Dec 2006 09:59:19 -0800)
What we need is an interpolating adapter. It would wrap a TimeSeries of one discretization and make it look like it has a different discretization, interpolating on the fly. Probably wouldn't be too hard.
Apart from linear interpolation there are several others available. In my research I needed to find "the best" interpolation function. If you want to know what the criterion of "the best" is, I can send you a .pdf of my article that will soon appear in some journal.
<snip> Yes, there are all kinds of ways to do the interpolation. Clearly an interpolation TimeSeries adapter should be parameterized on the interpolation policy, perhaps defaulting to linear. Thanks for the reference! -- Eric Niebler Boost Consulting www.boost-consulting.com

Eric Niebler <eric <at> boost-consulting.com> writes:
Janek Kozicki wrote:
Eric Niebler said: (by the date of Tue, 12 Dec 2006 09:59:19 -0800)
What we need is an interpolating adapter. It would wrap a TimeSeries of one discretization and make it look like it has a different discretization, interpolating on the fly. Probably wouldn't be too hard.
Apart from linear interpolation there are several others available.
Hi Eric, I've only taken a quick look at the docs; it looks pretty good so far. But before you worry about specific multirate filtering algorithms (interpolation/decimation) and kernels, how about including convolution (as an adapter)? Or did I just miss it in my skim of the docs? Optimization (polyphase filters) can wait? Regards Darryl.

Darryl Green wrote:
Eric Niebler <eric <at> boost-consulting.com> writes:
Janek Kozicki wrote:
Eric Niebler said: (by the date of Tue, 12 Dec 2006 09:59:19 -0800)
What we need is an interpolating adapter. It would wrap a TimeSeries of one discretization and make it look like it has a different discretization, interpolating on the fly. Probably wouldn't be too hard.
Apart from linear interpolation there are several others available.
Hi Eric, I've only taken a quick look at the docs; it looks pretty good so far. But before you worry about specific multirate filtering algorithms (interpolation/decimation) and kernels, how about including convolution (as an adapter)? Or did I just miss it in my skim of the docs? Optimization (polyphase filters) can wait?
Am I allowed to plead the 5th? :-) I confess I don't know what convolution is. I learned enough to satisfy the library's requirements, as specified by Zürcher Kantonalbank who sponsored the work, and they apparently didn't need that. Anyway, from Wikipedia, I see that convolution is "the integral of the product of the two functions after one is reversed and shifted." Seems straightforward. There already is a shift adapter. I could add a reverse adapter. (Reversed relative to what time t? Should it be a parameter to the adapter?) Series multiplication is already implemented, as is an integrate() function. So maybe it's almost already there. -- Eric Niebler Boost Consulting www.boost-consulting.com

Eric Niebler wrote:
Anyway, from Wikipedia, I see that convolution is "the integral of the product of the two functions after one is reversed and shifted." Seems straightforward. There already is a shift adapter. I could add a reverse adapter.
If you add a reversed adapter then you can implement convolution and correlation (correlation is the same algorithm without the reversal) in terms of the same primitives. To apply an arbitrary filter (such as the sinc filter mentioned by Janek) you simply convolve a time series consisting of the values of the function (truncated) with the original series. When decimating you obviously only need to evaluate the result at the output rate, not for every sample point at the input rate. The sinc filter is a good choice for decimation because it is the (truncated) time series representation of a "brick wall" filter (i.e. a hard cutoff in the frequency domain). As many filters of interest are symmetrical, you don't actually need the reverse step anyway.
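For illustration, the direct time-domain operation being described, on plain vectors rather than series containers (not library code):

#include <cstddef>
#include <vector>

// direct (time-domain) convolution of a series x with a kernel h;
// correlation is the same loop with the kernel indexed forwards instead
std::vector<double> convolve(std::vector<double> const & x, std::vector<double> const & h)
{
    std::vector<double> y(x.size() + h.size() - 1, 0.0);
    for (std::size_t t = 0; t < y.size(); ++t)
        for (std::size_t k = 0; k < h.size(); ++k)
            if (t >= k && t - k < x.size())
                y[t] += x[t - k] * h[k];
    return y;
}
// When decimating by a factor M, only y[0], y[M], y[2M], ... need to be evaluated.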
(Reversed relative to what time t? Should it be a parameter to the adapter?)
I don't think the origin of the reversal belongs in reverse. For the decimation or smoothing filters you want to have 0 delay so you would construct your filter sequence to be symmetrical about t=0. A way to apply a time offset to an existing sequence would seem like a reasonable thing to have (and be trivial to implement as an adapter I would think).
Series multiplication is already implemented, as is an integrate() function. So maybe it's almost already there.
The multiply and integrate can/should be combined and optimized as multiply-accumulate. It seems you have something very close to a good set of primitives already. A few examples of how to combine/use them (especially if those examples actually provided needed functionality like decimation and interpolation) would help a lot I think. Regards Darryl.
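In its simplest STL form, the combined multiply-accumulate over one window is just an inner product (plain-vector sketch, assuming the coefficient sequence is at least as long as the window):

#include <numeric>
#include <vector>

// one output point of a filter as a single multiply-accumulate over a window
double mac(std::vector<double> const & window, std::vector<double> const & coeffs)
{
    return std::inner_product(window.begin(), window.end(), coeffs.begin(), 0.0);
}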

Darryl Green wrote: <snip lots of good suggestions>
It seems you have something very close to a good set of primitives already. A few examples of how to combine/use them (especially if those examples actually provided needed functionality like decimation and interpolation) would help a lot I think.
Thanks for all your observations! They are very valuable. -- Eric Niebler Boost Consulting www.boost-consulting.com

I use time series in the signal processing domain. I think there are a couple of valid use cases there that may not apply in the financial domain, and so got left out.

The Hello, World! example says: "iterators ... traverse both the zeros and the non-zeros, even if the storage is sparse". What if I want to iterate over only the actual sparse elements? This is actually a very common use case for me. I'd like to be able to do both full and sparse iteration.

Commonly in signal processing, you have uniformly spaced sample times, but they are a multiple of some non-integral time increment. This seems to be impossible to specify using Boost.Time_series as-is. From my understanding of the docs, it seems that you cannot have uniformly-spaced floating-point offsets without resorting to dense_series<std::pair<TimeType, ValueType> >. Specifically, this note: "Some of the numeric algorithms do not work with series that have floating-point offsets. For instance, partial_sum() assumes integral offsets; in fact, the discrete nature of the algorithm prohibits its use with any series with floating-point offsets." bothers me. What if I have a time interval that is 3.7367ms, but it is completely regular? My alternatives appear to be a time series and an extrinsic timestep value, which I must multiply by the time series index to get back the actual timestamp of a series element, or the pair I mentioned above.

Typo: In http://boost-sandbox.sourceforge.net/libs/time_series/doc/html/time_series/u... "container" is misspelled "contaier".

These issues aside, I really like the formalisms introduced by the library, and think it will be very useful.

Zach Laine

Zach Laine wrote:
I use time series in the signal processing domain. I think there are a couple of valid use cases there that may not apply in the financial domain, and so got left out.
The Hello, World! example says: "iterators ... traverse both the zeros and the non-zeros, even if the storage is sparse". What if I want to iterate over only the actual sparse elements? This is actually a very common use case for me. I'd like to be able to do both full and sparse iteration.
You can. Only the clip() view provides STL-style iterators that make the storage appear sparse. But if you look deeper, you can see that all the TimeSeries types export cursors (sequence::begin(series) and sequence::end(series)) that traverse only the non-zeros. For instance, given a TimeSeries, range_run_storage::for_each() will call a user-specified predicate, passing the value, offset and end-offset of every non-zero run in the series. Most of the time series algorithms are implemented in terms of this lower-level interface to deal efficiently with sparse data.
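Based on that description, a predicate for range_run_storage::for_each() might be sketched roughly like this (the exact headers, namespaces and series declaration are guesses on my part, not verified against the library):

#include <iostream>

struct print_run
{
    template<typename Value, typename Offset>
    void operator()(Value const & value, Offset offset, Offset end_offset) const
    {
        // called once per non-zero run with its value and [offset, end_offset)
        std::cout << "[" << offset << ", " << end_offset << ") = " << value << "\n";
    }
};

// sparse_series<double> s = ...;
// range_run_storage::for_each(s, print_run());   // visits only the non-zero runs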
Commonly in signal processing, you have uniformly spaced sample times, but they are a multiple of some non-integral time increment. This seems to be impossible to specify using Boost.Time_series as-is. From my understanding of the docs, it seems that you cannot have uniformly-spaced floating-point offsets without resorting to dense_series<std::pair<TimeType, ValueType> >. Specifically, this note: "Some of the numeric algorithms do not work with series that have floating-point offsets. For instance, partial_sum() assumes integral offsets; in fact, the discrete nature of the algorithm prohibits its use with any series with floating-point offsets." bothers me. What if I have a time interval that is 3.7367ms, but it is completely regular? My alternatives appear to be a time series and an extrinsic timestep value, which I must multiply by the time series index to get back the actual timestamp of a series element, or the pair I mentioned above.
Not a problem. You can have a series with a floating point discretization and an integral offset type. Consider the following program:

#include <iostream>
#include <boost/time_series/sparse_series.hpp>
#include <boost/time_series/ordered_inserter.hpp>
#include <boost/time_series/numeric/numeric.hpp>

using namespace boost;
using namespace time_series;

int main()
{
    sparse_series< double, double > s1(discretization = 3.7367);
    make_ordered_inserter(s1)
        (1.1, 1)
        (1.1, 2)
        .commit();

    sparse_series< double, double > s2(discretization = 3.7367);
    make_ordered_inserter(s2)
        (1.1, 2)
        (1.1, 3)
        .commit();

    sparse_series< double, double > s3 = s1 + s2;
    std::cout << s3 << std::endl;

    return 0;
}

The series type "sparse_series<double, double>" has a value type of double, a discretization of double, and an offset type which defaults to int. The discretization of both series is initialized to 3.7367. When you add the series, the discretizations are compared at runtime (it would assert if they were not the same), and the resulting series has the same discretization value. The integral offsets into such a series represent integral multiples of the floating point discretization.
Typo: In http://boost-sandbox.sourceforge.net/libs/time_series/doc/html/time_series/u... "container" is misspelled "contaier".
Well spotted.
These issues aside, I really like the formalisms introduced by the library, and think it will be very useful.
Thanks for your feedback. -- Eric Niebler Boost Consulting www.boost-consulting.com

[snip]
These issues aside, I really like the formalisms introduced by the library, and think it will be very useful.
Thanks for your feedback.
-- Eric Niebler Boost Consulting www.boost-consulting.com
That's good; I'm glad the "issues" were a lack of understanding on my part. I think this library would be a welcome addition to Boost. Zach Laine

Like many other posters, I'm excited to see this work being done. Not to be shown up by Zach, I also have a few questions about the library.

First, it's not entirely clear to me whether arithmetic operations between series with different non-integer discretizations are allowed. From the documentation:

Arithmetic operations between series is only defined when the series have the same discretization. When the discretization is a plain integer, the discretizations are checked for compatibility at runtime.

If I'm reading this correctly, arithmetic between unequal non-integer series is undefined but not checked at runtime. If so, why isn't it checked?

Along those lines, I can envision a situation where I *would* want to do the kind of arithmetic described above. In some applications, the notions of time and range are effectively interchangeable. If I have two "time series", one discretized on range and one on time but both representing the same physical space, I might reasonably expect to be able to add them together. I realize I'm stretching a bit here, but I want to get your sense of how things like this might be dealt with in this library.

Does the clip() function necessarily copy the source data? It's often nice to be able to generate a sliced view of some source data without actually copying it. I assume clip() does copy, so I was wondering if there is a facility for producing non-copying views of a time series.

Again, I really like the work you've done so far, and I'd love to have a robust, high-feature time-series library. I'll keep looking things over and let you know if I've got any more questions.

-- Austin Bingham Signal & Information Sciences Laboratory Applied Research Laboratories, University of Texas at Austin 10000 Burnet Rd., Austin, TX 78758 email: abingham@arlut.utexas.edu cell: (512) 799-2444 office: (512) 835-3832

Austin Bingham wrote:
Like many other posters, I'm excited to see this work being done. Not to be shown up by Zach, I also have a few questions about the library.
First, it's not entirely clear to me whether arithmetic operations between series with different non-integer discretizations are allowed. From the documentation:
Arithmetic operations between series is only defined when the series have the same discretization. When the discretization is a plain integer, the discretizations are checked for compatibility at runtime.
If I'm reading this correctly, arithmetic between unequal non-integer series is undefined but not checked at runtime. If so, why isn't it checked?
It is checked. That's a doc bug. It should read, "When the discretization is not a compile-time constant, the discretizations are checked for compatibility at runtime."
Along those lines, I can envision a situation where I *would* want to do the kind of arithmetic described above. In some applications, the notions of time and range are effectively interchangeable. If I have two "time series", one discretized on range and one on time but both representing the same physical space, I might reasonably expect to be able to add them together. I realize I'm stretching a bit here, but I want to get your sense of how things like this might be dealt with in this library.
I'm afraid I don't follow. What do you mean by a time series "discretized on range"?
Does the clip() function necessarily copy the source data? It's often nice to be able to generate a sliced view of some source data without actually copying it. I assume clip() does copy, so I was wondering if there is a facility for producing non-copying views of a time series.
clip() does *not* copy the data. It returns a clipped view. If you want to copy the data, you should assign the clipped view to another series.
Again, I really like the work you've done so far, and I'd love to have a robust, high-feature time-series library. I'll keep looking things over and let you know if I've got any more questions.
Thanks. Keep the questions coming. -- Eric Niebler Boost Consulting www.boost-consulting.com

I'm afraid I don't follow. What do you mean by a time series "discretized on range"?
This is a bit of me thinking out loud, so I apologize. In some time series for physical processes (e.g. radar returns) it is sometimes easier or more natural to think about things in terms of range rather than time, although both are directly correlated and effectively interchangeable. So the notion was that I could choose a discretization value that represented some distance rather than some amount of time. The more I understand your library, however, the more I see how this really would break a lot of things. I think my thinking really just comes from years of using matrix and vector classes which didn't force me to think purely in terms of time.
clip() does *not* copy the data. It returns a clipped view. If you want to copy the data, you should assign the clipped view to another series.
OK, I see that in your documentation now. I had based my assumptions on this: http://boost-sandbox.sourceforge.net/libs/time_series/doc/html/boost/time_se... which doesn't mention views or non-copying. I figured those features would be mentioned if they were being used. -- Austin Bingham Signal & Information Sciences Laboratory Applied Research Laboratories, University of Texas at Austin 10000 Burnet Rd., Austin, TX 78758
participants (7)
- Austin Bingham
- Darryl Green
- Eric Niebler
- Janek Kozicki
- Jeff Garland
- Matt Hurd
- Zach Laine