Possible class submission: Sparse Array

I'd like to canvas opinions on whether it is worth submitting a class I've developed to Boost. It is a policy-based sparse array class. I'm aware there is a sparse_vector class in the uBLAS library, but when I looked at it it didn't seem to offer the degree of flexibility I needed, so I developed my own. Like a lot of things, it grew a little during development, but I've tried to keep the interface as close to std::vector as seemed sensible. To try and make it as flexible as possible, I've implemented it using four policies, as follows: bounds: This determines what the bounds on the sparse_array are, and how values that exceed them are handled. For example, the 'unbounded' policy allows virtually any positive value of the subscript type to be used, 'bounded' only allows values within the (compile-time specified) max value and throws if a subscript exceeding this is given, while 'clamped' makes such value equal to the max value. iterator: This determines how the limits on iterators are defined. So it is possible to specify that iteration should be over the whole range of the sparse_array, or that it should only be from the lowest assigned subscript to the highest assigned subscript. value: This policy is currently used to define the default value for unassigned elements of the sparse_array, and also to determine how default values are handled in cases such as assigning from one sparse_array to another. storage: This allows the user to specify how the underlying storage for the sparse_array is handled. Currently I only have one policy defined (std_map, which unsurprisingly uses a std::map for the storage), but policies to use hash maps or something akin to Loki's AssocVector could be easily defined. I'd like to know if people would be interested in seeing such a class in Boost before I plunge into writing up the documentation (which is currently pretty sparse, and consists primarily of the comments in the code) and finishing off all the test programs prior to submission. -- <<< Eagles may soar, but weasels don't get sucked into jet engines >>> 5:43pm up 39 days 5:16, 19 users, load average: 0.06, 0.05, 0.04 Registered Linux User #232457 | LFS ID 11703

Spencer Collyer wrote:
I'd like to know if people would be interested in seeing such a class in Boost before I plunge into writing up the documentation (which is currently pretty sparse, and consists primarily of the comments in the code) and finishing off all the test programs prior to submission.
In principle, yes. You should also look at Boost.Spirit's character set parser; it has a sparse array implementation optimized, I believe, for a set of ranges. Sebastian Redl

On Wed, 26 Apr 2006 20:34:43 +0200, Sebastian Redl wrote:
... You should also look at Boost.Spirit's character set parser; it has a sparse array implementation optimized, I believe, for a set of ranges.
Sebastian, I must be being dense, because I've had a look through the Spirit code and can't find this sparse array implementation. Can you point me at the file? Thanks. Spencer -- <<< Eagles may soar, but weasels don't get sucked into jet engines >>> 7:45am up 39 days 19:18, 21 users, load average: 0.05, 0.12, 0.26 Registered Linux User #232457 | LFS ID 11703

Sebastian, On Wed, 26 Apr 2006 20:34:43 +0200, Sebastian Redl wrote:
Spencer Collyer wrote:
I'd like to know if people would be interested in seeing such a class in Boost before I plunge into writing up the documentation (which is currently pretty sparse, and consists primarily of the comments in the code) and finishing off all the test programs prior to submission.
In principle, yes. You should also look at Boost.Spirit's character set parser; it has a sparse array implementation optimized, I believe, for a set of ranges.
Thanks for the pointer to the appropriate place. Now I've looked at the range_run class it looks to me like it is not actually a sparse array in the same sense as my class, but is rather a specialized set implementation. Spencer -- <<< Eagles may soar, but weasels don't get sucked into jet engines >>> 2:12pm up 40 days 1:45, 22 users, load average: 0.01, 0.04, 0.14 Registered Linux User #232457 | LFS ID 11703

"Spencer Collyer" wrote:
It is a policy-based sparse array class.
To try and make it as flexible as possible, I've implemented it using four policies, as follows:
storage: This allows the user to specify how the underlying storage for the sparse_array is handled. Currently I only have one policy defined (std_map, which unsurprisingly uses a std::map for the storage), but policies to use hash maps or something akin to Loki's AssocVector could be easily defined.
The storage can get better: - double-array structure provides O(1) retrieving time, requires more-less O(N) space where N = number of nonzero elements (with /no/ overhead due to pointers), deletion and insertions are bounded. It is described in IEEE Transaction of SW Engineering, vol 15, No9, Sep 1989 by Jun-Ichi Aoe. An implementation seems to exist on http://linux.thai.net/~thep/datrie/datrie.html It is also possible to come up with quite a few ad-hoc storages fitting this or that domain and faring better than std::map. BitMagic library http://bmagic.sourceforge.net/ has ability to store sparse bit arrays (and this library would really fit into Boost). /Pavel

On Wed, 26 Apr 2006 22:02:17 +0200, Pavel Vozenilek wrote:
The storage can get better:
- double-array structure provides O(1) retrieving time, requires more-less O(N) space where N = number of nonzero elements (with /no/ overhead due to pointers), deletion and insertions are bounded.
It is described in IEEE Transaction of SW Engineering, vol 15, No9, Sep 1989 by Jun-Ichi Aoe. An implementation seems to exist on http://linux.thai.net/~thep/datrie/datrie.html
It is also possible to come up with quite a few ad-hoc storages fitting this or that domain and faring better than std::map.
BitMagic library http://bmagic.sourceforge.net/ has ability to store sparse bit arrays (and this library would really fit into Boost).
/Pavel
Pavel, Thanks for the pointers to these. Because the storage is specified using a policy there is nothing to stop a user specifying a different storage policy if they have specific needs. I've only developed a policy for std::map because it is available to everyone, and it is quick enough for my needs at present. I'll need to look closer at the double-array structure. Do you think it would be a good idea to develop a Boost implementation of it to submit with my sparse_array class, as another storage policy? I could try and do that if you think it would be a good idea. Mind you, it looks (at a first quick glance) like it could be a useful class on its own, so maybe a separate submission would be appropriate? Thanks for the pointer to the BitMagic stuff. I'll have a look at it later. Spencer -- <<< Eagles may soar, but weasels don't get sucked into jet engines >>> 8:12am up 39 days 19:45, 21 users, load average: 0.00, 0.00, 0.07 Registered Linux User #232457 | LFS ID 11703

On Thu, 27 Apr 2006 08:21:29 +0100, Spencer Collyer wrote:
On Wed, 26 Apr 2006 22:02:17 +0200, Pavel Vozenilek wrote:
The storage can get better:
- double-array structure provides O(1) retrieving time, requires more-less O(N) space where N = number of nonzero elements (with /no/ overhead due to pointers), deletion and insertions are bounded.
It is described in IEEE Transaction of SW Engineering, vol 15, No9, Sep 1989 by Jun-Ichi Aoe. An implementation seems to exist on http://linux.thai.net/~thep/datrie/datrie.html
I'll need to look closer at the double-array structure. Do you think it would be a good idea to develop a Boost implementation of it to submit with my sparse_array class, as another storage policy? I could try and do that if you think it would be a good idea. Mind you, it looks (at a first quick glance) like it could be a useful class on its own, so maybe a separate submission would be appropriate?
Pavel, Now I've had a bit of time to look at the article on double-array you mentioned, I think my comment above about it being useful on its own could prove to be correct. However, developing such a class is more than I want to take on at the moment, especially to make it flexible enough to be able to handle keys that are not strings (the example implementation looks to be string based, whereas I would hope any Boost-ified implementation would be able to handle more generalised keys, possibly using a policy to decompose the keys into values for the trie nodes). Spencer -- <<< Eagles may soar, but weasels don't get sucked into jet engines >>> 4:01pm up 40 days 3:34, 23 users, load average: 0.04, 0.10, 0.17 Registered Linux User #232457 | LFS ID 11703

Spencer Collyer wrote:
I'd like to canvas opinions on whether it is worth submitting a class I've developed to Boost.
It is a policy-based sparse array class. I'm aware there is a sparse_vector class in the uBLAS library, but when I looked at it it didn't seem to offer the degree of flexibility I needed, so I developed my own. Like a lot of things, it grew a little during development, but I've tried to keep the interface as close to std::vector as seemed sensible.
To try and make it as flexible as possible, I've implemented it using four policies, as follows:
bounds: This determines what the bounds on the sparse_array are, and how values that exceed them are handled. For example, the 'unbounded' policy allows virtually any positive value of the subscript type to be used, 'bounded'only allows values within the (compile-time specified) max value and throws if a subscript exceeding this is given, while 'clamped' makes such value equal to the max value.
I'm interested.
iterator: This determines how the limits on iterators are defined. So it is possible to specify that iteration should be over the whole range of the sparse_array, or that it should only be from the lowest assigned subscript to the highest assigned subscript.
what syntax do you use exactly for these different iterators?
value: This policy is currently used to define the default value for unassigned elements of the sparse_array, and also to determine how default values are handled in cases such as assigning from one sparse_array to another.
interesting
storage: This allows the user to specify how the underlying storage for the sparse_array is handled. Currently I only have one policy defined (std_map, which unsurprisingly uses a std::map for the storage), but policies to use hash maps or something akin to Loki's AssocVector could be easily defined.
can you also separate the structure from the values. I.e is it possible to first define the structure and then afterwards assign the values to the corresponding positions? Does the interface also allow me to share the structure-information between multiple sparse vectors?
I'd like to know if people would be interested in seeing such a class in Boost before I plunge into writing up the documentation (which is currently pretty sparse, and consists primarily of the comments in the code) and finishing off all the test programs prior to submission.
What a coincidence, I just started discussing the design of a sparse-vector on the glas mailing list (http://lists.boost.org/mailman/listinfo.cgi/glas). One of the questions that pops up: what should be the return-type of operator[](size_type) ? toon PS: I also cross-posted this to the glas-ml.

On Wed, 26 Apr 2006 22:05:58 +0200, Toon Knapen wrote:
iterator: This determines how the limits on iterators are defined. So it is possible to specify that iteration should be over the whole range of the sparse_array, or that it should only be from the lowest assigned subscript to the highest assigned subscript.
what syntax do you use exactly for these different iterators?
I think you may have misunderstood me (my fault, I probably didn't express myself adequately). The iterators behave exactly the same as you would find on a std::vector, except that the policy defines where the start and end subscripts are obtained from. Thus, one of the polices I've defined (whole_range) will always iterate from the lowest allowed subscript to the highest allowed subscript without caring whether values have been assigned to those elements or not, while another (filled_range) will iterate from the lowest subscript that has had an element assigned (the low-water-mark) to the highest such subscript (the high-water-mark). As you define the iterator policy when you create the sparse_array, these are the limits that all iterators over that sparse_array will have. Note, however, that it is always possible to get the minimum and maximum allowed subscripts and the low- and high-water-marks from a sparse_array, so if you need to process a different range of subscripts you can do so.
storage: This allows the user to specify how the underlying storage for the sparse_array is handled. Currently I only have one policy defined (std_map, which unsurprisingly uses a std::map for the storage), but policies to use hash maps or something akin to Loki's AssocVector could be easily defined.
can you also separate the structure from the values. I.e is it possible to first define the structure and then afterwards assign the values to the corresponding positions? Does the interface also allow me to share the structure-information between multiple sparse vectors?
In principle, your storage policy class can do anything you want it to as long as it supplies the interface the sparse_array is looking for. Bear in mind that the storage policy is inherited by sparse_array, not contained by it.
What a coincidence, I just started discussing the design of a sparse-vector on the glas mailing list (http://lists.boost.org/mailman/listinfo.cgi/glas). One of the questions that pops up: what should be the return-type of operator[](size_type) ?
I have two versions of this function in my sparse_array class: reference operator[](size_type sub); const_reference operator[](size_type sub) const; The one that returns a non-const reference has to handle the case where there is no element existing at the given subscript. Because it can be used in expressions like sa[sub]++; I opted to always create a default-valued element if one did not exist. After reading Meyers' _More Effective C++_ item 30 on proxy classes, I decided not to go down the proxy root as we can't know in advance what our stored values are. The function that returns a const_reference has an easier time if there is no element at the subscript given, as it can just return a const& to the default value, without having to store anything in the sparse_array. Because we are sometimes forced to create default values in the array there is a 'normalise()' function which goes through and removes such values. Note also that assignment and merging to sparse_arrays does not copy default values if the values policy says not to. Spencer -- <<< Eagles may soar, but weasels don't get sucked into jet engines >>> 8:22am up 39 days 19:56, 21 users, load average: 0.07, 0.02, 0.02 Registered Linux User #232457 | LFS ID 11703

Spencer Collyer wrote: [snip]
I have two versions of this function in my sparse_array class:
reference operator[](size_type sub); const_reference operator[](size_type sub) const;
The one that returns a non-const reference has to handle the case where there is no element existing at the given subscript. Because it can be used in expressions like
sa[sub]++;
I opted to always create a default-valued element if one did not exist. After reading Meyers' _More Effective C++_ item 30 on proxy classes, I decided not to go down the proxy root as we can't know in advance what our stored values are.
The function that returns a const_reference has an easier time if there is no element at the subscript given, as it can just return a const& to the default value, without having to store anything in the sparse_array.
Because we are sometimes forced to create default values in the array there is a 'normalise()' function which goes through and removes such values. Note also that assignment and merging to sparse_arrays does not copy default values if the values policy says not to.
but this means also that if I have a sparse-vector and I loop over all indices to print the corresponding values on the screen (doing <code> for( i=0 ; i < size ; ++i ) std::cout << my_spare_vector[i] </code>), I will totally fill up the sparse structure! toon

On Thu, 27 Apr 2006 22:05:44 +0200, Toon Knapen wrote:
Spencer Collyer wrote:
[snip]
I have two versions of this function in my sparse_array class:
reference operator[](size_type sub); const_reference operator[](size_type sub) const;
The one that returns a non-const reference has to handle the case where there is no element existing at the given subscript. Because it can be used in expressions like
sa[sub]++;
I opted to always create a default-valued element if one did not exist. After reading Meyers' _More Effective C++_ item 30 on proxy classes, I decided not to go down the proxy root as we can't know in advance what our stored values are.
The function that returns a const_reference has an easier time if there is no element at the subscript given, as it can just return a const& to the default value, without having to store anything in the sparse_array.
Because we are sometimes forced to create default values in the array there is a 'normalise()' function which goes through and removes such values. Note also that assignment and merging to sparse_arrays does not copy default values if the values policy says not to.
but this means also that if I have a sparse-vector and I loop over all indices to print the corresponding values on the screen (doing <code> for( i=0 ; i < size ; ++i ) std::cout << my_spare_vector[i] </code>), I will totally fill up the sparse structure!
If it is a non-const sparse_array then that is true. If it is a const sparse_array then we don't add entries to it when using operator[](), because you cannot treat such a return value as an lvalue, so we can just return a const_reference to a default value. The problem arises when using the non-const version of operator[]() (which will be used if the sparse_array is non-const). As explained in the Meyers item I reference above, it is possible to distinguish lvalue and rvalue references using proxy classes, so operator[]() could return a proxy. However, as that item further goes on to explain, proxies have limitations in cases such as calling member functions of the proxied type, or even just the increment example I gave above. I would love to hear any ideas that can allow me to change my class so I _don't_ have to always add an element with the default type when using non-const operator[](), but so far I've not come across anything that lets me do it for any general data type whose interface cannot be known at coding time. My class includes a couple of functions to help get around this problem. First, as mentioned above, there is a normalise() function that goes through and deletes any elements that have the default value. There is also an active_subscripts() function that returns to you just the subscripts of the elements that have been set. Spencer -- <<< Eagles may soar, but weasels don't get sucked into jet engines >>> 7:48am up 40 days 19:21, 23 users, load average: 0.00, 0.01, 0.08 Registered Linux User #232457 | LFS ID 11703

Hi,
From: Spencer Collyer
bounds: This determines what the bounds on the sparse_array are, and how values that exceed them are handled. For example, the 'unbounded' policy allows virtually any positive value of the subscript type to be used, 'bounded' only allows values within the (compile-time specified) max value and throws if a subscript exceeding this is given, while 'clamped' makes such value equal to the max value.
I'm working (actually, I'm having a long pause with the work :P) on a Boost library that will actually implement this piece of functionality, i.e. constrained types. In short, it aims to provide a set of flexible, policy-based template classes allowing to create constrained types, like bounded types with compile-type bounds, run-time specified bounds etc. The code is almost ready, but the documentation and tests need work. If you'd be interested in reusing my library in your project, just let me know. Unfortunately I can't tell when the library would be 100% ready, although I try to finish it as soon as possible. Below is an excerpt from the motivation section of the documentation. Best regards, Robert =================== The Boost Constrained Types library contains class templates useful for creating constrained types. The simplest example of a constrained type is hour. The only valid values for an hour within a day are integers from the range [0, 23]. Unfortunately there's no C++ type which exactly matches this set of valid values, so a larger integral type (e.g. int) must be used to represent hours. But this can lead to errors -- for instance, a programmer may accidentally assign an invalid value, say 26, to a variable holding an hour. The Boost Constrained Types library provides a mechanism protecting against such mistakes: bounded_int<int, 0, 23>::type hour; hour = 20; // OK hour = 26; // error -- throws an exception Another feature of the library is the ability to specify what happens in case of assignment of an invalid value. For instance, an exception may be thrown as in the example above, or the value may be adjusted to meet the constraint criterions: wrapping_int<int, 0, 255>::type buffer_index; buffer_index = 257; // OK -- adjusts the value to fit in the range by wrapping it assert(buffer_index == 1); The library doesn't focus only on constrained types that hold a value belonging to a specified range (i.e., bounded types). Virtually any constraint can be imposed by defining an appropriate predicate: // A constraint predicate struct is_odd { bool operator () (int i) const { return (i % 2) != 0; } }; // And the usage is as simple as: constrained<int, is_odd>::type odd_int(1); odd_int = 11; // OK odd_int++; // error -- throws an exception The library is designed to give as much flexibility as possible in defining constraints and behavior in cases when they are crossed, yet to be efficient and allow compilers to apply their optimizations. It uses policies to customize behavior and allows users to supply their custom policy classes. A user has a lot of freedom: a bounded type, like hours or buffer_index above, may have bounds which are defined at compile-time and do not occupy space at all, or bounds that can be adjusted at runtime. Or one compile-time bound and the other adjustable. The behavior in case of crossing the bounds may also differ for each of the bounds -- e.g. an exception may be thrown if the assigned value is too small, but it may be clipped to the upper bound if it is too big. Note that this library doesn't provide a protection against integer operations overflows -- it's simply not its task. For more discussion on this and other issues see the rationale.

On Wed, 26 Apr 2006 22:21:44 +0200, Robert Kawulak wrote:
Hi,
From: Spencer Collyer
bounds: This determines what the bounds on the sparse_array are, and how values that exceed them are handled. For example, the 'unbounded' policy allows virtually any positive value of the subscript type to be used, 'bounded' only allows values within the (compile-time specified) max value and throws if a subscript exceeding this is given, while 'clamped' makes such value equal to the max value.
I'm working (actually, I'm having a long pause with the work :P) on a Boost library that will actually implement this piece of functionality, i.e. constrained types. In short, it aims to provide a set of flexible, policy-based template classes allowing to create constrained types, like bounded types with compile-type bounds, run-time specified bounds etc. The code is almost ready, but the documentation and tests need work.
If you'd be interested in reusing my library in your project, just let me know. Unfortunately I can't tell when the library would be 100% ready, although I try to finish it as soon as possible.
Below is an excerpt from the motivation section of the documentation.
Robert, Yep, I've seen some of the discussions on your Constrained Types library - probably the only thing I miss from my days being forced to program in Pascal many years ago :-) I'd love to see this library make it into Boost. The way my sparse_array class is designed there should in principle be nothing to stop a user using one of your constrained types as a subscript. However, I do have one concern - the way the iterators are handled, I need to be able to specify a value that is one-past the maximum allowed subscript (which is why the 'unbounded' policy doesn't allow to use _all_ positive values - it has to reserve the maximum value for its end() iterator). If I read your suggestion right, you are thinking that we could replace the bounds policy with one of your constrained types? Because of the need to reserve the one-past-the-end value for the end() iterator, I'm not sure that would be feasible. We need to be able to store the one-past-the-end value but not allow the user to specify it as a valid subscript. Does that make sense? I'm beginning to think I need to get something posted on the web so people have something concrete to look at... Spencer -- <<< Eagles may soar, but weasels don't get sucked into jet engines >>> 8:53am up 39 days 20:26, 21 users, load average: 0.15, 0.04, 0.02 Registered Linux User #232457 | LFS ID 11703

Hi,
From: Spencer Collyer
If I read your suggestion right, you are thinking that we could replace the bounds policy with one of your constrained types? Because of the need to reserve the one-past-the-end value for the end() iterator, I'm not sure that would be feasible. We need to be able to store the one-past-the-end value but not allow the user to specify it as a valid subscript.
Does that make sense?
If I understand you correctly, you want the lower bound to be included in the allowed range, but the upper bound to be excluded. If so, then the Constrained Types library allows for such one-side-opened ranges (I hope I express myself precisely enough, I don't know math english that well ;-). However, there may be a problem - the library is designed to work only with random access iterators. Does your library use random access iterators only or any kind of iterator is allowed? Best regards, Robert

Robert, On Thu, 27 Apr 2006 11:18:24 +0200, Robert Kawulak wrote:
If I understand you correctly, you want the lower bound to be included in the allowed range, but the upper bound to be excluded. If so, then the Constrained Types library allows for such one-side-opened ranges (I hope I express myself precisely enough, I don't know math english that well ;-).
The upper bound has to be included as far as the class is concerned, so that it can construct an end() iterator. However, the user cannot specify the upper bound as a subscript.
However, there may be a problem - the library is designed to work only with random access iterators. Does your library use random access iterators only or any kind of iterator is allowed?
Now I'm confused. My sparse_array class allows you to generate iterators for accessing the elements in the sparse_array. I don't have iterators over the subscript values, but over the array elements. I'm confused as to why I would worry about the type of iterators your library is designed to work with. For what it's worth, the sparse_array iterators are random-access. Spencer -- <<< Eagles may soar, but weasels don't get sucked into jet engines >>> 11:21am up 39 days 22:54, 22 users, load average: 0.22, 0.26, 0.45 Registered Linux User #232457 | LFS ID 11703

Hi,
From: Spencer Collyer
However, there may be a problem - the library is designed to work only with random access iterators. Does your library use random access iterators only or any kind of iterator is allowed?
Now I'm confused. My sparse_array class allows you to generate iterators for accessing the elements in the sparse_array. I don't have iterators over the subscript values, but over the array elements. I'm confused as to why I would worry about the type of iterators your library is designed to work with.
Sorry, my misunderstanding ;-) Now I think I see what you mean. My point was that non-random access iterators cannot be used as the underlying type for a bounded type, which is not a case here anyway. BTW, as far as I understand your class allows indexes only of the form [0, max) or [0, max]. How about allowing [min, max] (and all the other opened-range combinations like (min, max])? Maybe this could be achieved easily if the bounded types from my library would be used as the indexer types? Sorry if my question shows my misunderstanding of your class' design again - indeed, I haven't been going deep into it ;-) Best regards, Robert

Robert, On Sun, 30 Apr 2006 01:00:56 +0200, Robert Kawulak wrote:
BTW, as far as I understand your class allows indexes only of the form [0, max) or [0, max]. How about allowing [min, max] (and all the other opened-range combinations like (min, max])? Maybe this could be achieved easily if the bounded types from my library would be used as the indexer types? Sorry if my question shows my misunderstanding of your class' design again - indeed, I haven't been going deep into it ;-)
Actually, I have bounds policies that do indeed allow for [min, max] operations. And of course, the user can always create their own bounds policy to handle the lower and upper limits however they wish. I am working on putting together some webpages that describe the proposed class, although I'm not sure how soon I'll be able to get them uploaded as I am pretty busy most of today and tomorrow, and then I start a new job on Tuesday. Unfortunately, Real Life(tm) is starting to intrude on the things I love to do :-) Spencer -- <<< Eagles may soar, but weasels don't get sucked into jet engines >>> 10:24am up 42 days 21:58, 23 users, load average: 0.08, 0.37, 0.65 Registered Linux User #232457 | LFS ID 11703

Spencer Collyer <spencer@lasermount.plus.com> writes:
I'd like to know if people would be interested in seeing such a class in Boost before I plunge into writing up the documentation (which is currently pretty sparse, and consists primarily of the comments in the code) and finishing off all the test programs prior to submission.
I'm going to be building sparse arrays in the next few days, but using the cursors and property map abstraction: http://www.boost-consulting.com/projects/mtl4/libs/sequence/doc/html/cursors... -- Dave Abrahams Boost Consulting www.boost-consulting.com
participants (6)
-
David Abrahams
-
Pavel Vozenilek
-
Robert Kawulak
-
Sebastian Redl
-
Spencer Collyer
-
Toon Knapen