[interval container (itl)] A pretty large bitset

Dear list! In thread http://lists.boost.org/Archives/boost/2009/06/152786.php Zachary Turner posted an article about a compressed bitset, that uses run-length defined intervals to achieve a compact representation of large bitsets that are useful for managing huge file allocation tables. As I mentioned in a reply to this article, Zach's compressed bitsets, in their basic ideas, are quite similar to itl::interval_sets: They allow for compression of large quantities of elements as long as they occur in (large) uninterrupted chunks. As Zach pointed out, this nice behavior is completely spoiled, if we would want to handle a set like {1,3,5,7,9,...,2(k-1)} or as bitset '0101010101...01' Here, the capability of usual bitsets to compress elements that are distributed over only small ranges can neither be applied to Zach's compressed bitset nor to an itl::interval_set of the Interval Template Library. As a new use case of interval containers I have developed a *large_bitset* class template on top of itl::interval_map that is able to combine interval compression *and* bitset compression. It is, for instance, capable of representing the bitset mentioned above '01010101...010' a b a: bit 0 b: bit 2^64-1 containing 2^63 bits (or elements) with just *one* internal interval associated to only *one* value of 64bit length ;) Curious?? Find the documented example code in my updated library docs at http://www.herold-faulhaber.de/boost_itl/doc/libs/itl/doc/html/boost_itl/pro... While the documented class template 'large_bitset', for reasons of a short presentation, is only a minimal one, you can find a more elaborated version as 'interval_bitset' in the sandbox at https://svn.boost.org/svn/boost/sandbox/itl/boost/itl_xt/interval_bitset.hpp I'd like to ask Zachary Turner to comment on the use case. Can this be useful in your field? Moreover I'd like to encourage everyone to give feedback on my interval container library, report problems, bugs and share suggestions. If you find this stuff useful please volunteer as review manager for the submitted library Boost Interval Containers. Note, that the new example on 'large_bitsets' is not yet included in the library submitted for review in the vault but it is contained in the sandbox. I am currently preparing an update of the review version of the itl, that is adjusted and corrected for the feedback, suggestions and patches that I have received during the last weeks. I will integrate further feedback and account for bug reports if you have any. Best regards Joachim

On Mon, Nov 2, 2009 at 10:25 AM, Joachim Faulhaber <afojgo@googlemail.com> wrote:
Dear list!
In thread
http://lists.boost.org/Archives/boost/2009/06/152786.php
Zachary Turner posted an article about a compressed bitset, that uses run-length defined intervals to achieve a compact representation of large bitsets that are useful for managing huge file allocation tables.
As I mentioned in a reply to this article, Zach's compressed bitsets, in their basic ideas, are quite similar to itl::interval_sets: They allow for compression of large quantities of elements as long as they occur in (large) uninterrupted chunks.
As Zach pointed out, this nice behavior is completely spoiled, if we would want to handle a set like
{1,3,5,7,9,...,2(k-1)} or as bitset '0101010101...01'
Here, the capability of usual bitsets to compress elements that are distributed over only small ranges can neither be applied to Zach's compressed bitset nor to an itl::interval_set of the Interval Template Library.
As a new use case of interval containers I have developed a *large_bitset* class template on top of itl::interval_map that is able to combine interval compression *and* bitset compression. It is, for instance, capable of representing the bitset mentioned above
'01010101...010' a b
a: bit 0 b: bit 2^64-1
containing 2^63 bits (or elements) with just *one* internal interval associated to only *one* value of 64bit length ;)
Curious??
Find the documented example code in my updated library docs at
http://www.herold-faulhaber.de/boost_itl/doc/libs/itl/doc/html/boost_itl/pro...
While the documented class template 'large_bitset', for reasons of a short presentation, is only a minimal one, you can find a more elaborated version as 'interval_bitset' in the sandbox at
https://svn.boost.org/svn/boost/sandbox/itl/boost/itl_xt/interval_bitset.hpp
I'd like to ask Zachary Turner to comment on the use case. Can this be useful in your field?
Moreover I'd like to encourage everyone to give feedback on my interval container library, report problems, bugs and share suggestions. If you find this stuff useful please volunteer as review manager for the submitted library Boost Interval Containers.
Note, that the new example on 'large_bitsets' is not yet included in the library submitted for review in the vault but it is contained in the sandbox. I am currently preparing an update of the review version of the itl, that is adjusted and corrected for the feedback, suggestions and patches that I have received during the last weeks. I will integrate further feedback and account for bug reports if you have any.
Hmm, this looks quite fascinating, I have very random bitset and normal sparse methods do not work, might see if this does.

2009/11/3 OvermindDL1 <overminddl1@gmail.com>:
On Mon, Nov 2, 2009 at 10:25 AM, Joachim Faulhaber <afojgo@googlemail.com> wrote:
Find the documented example code in my updated library docs at
http://www.herold-faulhaber.de/boost_itl/doc/libs/itl/doc/html/boost_itl/pro...
Hmm, this looks quite fascinating, I have very random bitset and normal sparse methods do not work, might see if this does.
I hope I does :) But I have to point out, that a randomly distributed *and* sparse set is the worst case scenario for the 'large_bitset' as well. (1) If you have long runs of bits set to '1' (or repeating bit patterns), interval compression kicks in. (2) If you have regions with randomly but relatively densely distributed bits, bit compression will help. For a randomly distributed and sparse set non of these compression techniques can be applied. In the worst case you are holding one tree node of the underlying rb-tree for every bit. Joachim

2009/11/2 Joachim Faulhaber <afojgo@googlemail.com>:
As a new use case of interval containers I have developed a *large_bitset* class template on top of itl::interval_map that is able to combine interval compression *and* bitset compression. It is, for instance, capable of representing the bitset mentioned above
'01010101...010' a b
a: bit 0 b: bit 2^64-1
containing 2^63 bits (or elements) with just *one* internal interval associated to only *one* value of 64bit length ;)
Curious??
Quite. How does it handle a repeat of 128 1s then 128 0s then 128 1s etc? (Or any repeating signal with a period longer than one storage unit?) ~ Scott

2009/11/3 Scott McMurray <me22.ca+boost@gmail.com>:
2009/11/2 Joachim Faulhaber <afojgo@googlemail.com>:
As a new use case of interval containers I have developed a *large_bitset* class template on top of itl::interval_map that is able to combine interval compression *and* bitset compression. It is, for instance, capable of representing the bitset mentioned above
'01010101...010' a b
a: bit 0 b: bit 2^64-1
containing 2^63 bits (or elements) with just *one* internal interval associated to only *one* value of 64bit length ;)
Curious??
Quite.
The compact storage of repeated bit patterns is only the lead story to baffle the reader a bit ;) In the end I do not think that this is the most important property of large_bitset. In fact it is just a byproduct of the basic idea to combine interval and bitset compression via an itl::interval_map of bitsets.
How does it handle a repeat of 128 1s then 128 0s then 128 1s etc? (Or any repeating signal with a period longer than one storage unit?)
So a large_bitset<uint64_t, mini::bits<uint64_t> > would handle them this way: [0,2)->"111...1" (all 64 bits set) [2,4)->"111...1" ... [k,k+2)->"111...1" which is obviously much less compact than the initial example but still a good compression. But, to be honest, the "periodic bit compression magic" has more limitations than that. For instance it does not manage to represent a bit pattern "100100100...100" with a single node like this can be done for patterns with an even period length <= 64 (or word_length). As I said, it's just a nice byproduct of the design. More important IMO is the combination of interval and bit compression as such and also the fact that large_bitset can be implemented with *very little* code http://www.herold-faulhaber.de/boost_itl/doc/libs/itl/doc/html/boost_itl/pro... on top of itl::interval_map, which demonstrates the potential of interval containers. Cheers Joachim

2009/11/3 Joachim Faulhaber <afojgo@googlemail.com>:
2009/11/3 Scott McMurray <me22.ca+boost@gmail.com>:
2009/11/2 Joachim Faulhaber <afojgo@googlemail.com>:
How does it handle a repeat of 128 1s then 128 0s then 128 1s etc? (Or any repeating signal with a period longer than one storage unit?)
So a large_bitset<uint64_t, mini::bits<uint64_t> > would handle them this way:
[0,2)->"111...1" (all 64 bits set) [2,4)->"111...1" ... [k,k+2)->"111...1"
which is obviously much less compact than the initial example but still a good compression.
Correction: [0,2)->"111...1" (all 128 bits set from bit 0 to bit 127) (none of 128 bits set from bit 128 to bit 255) [4,6)->"111...1" (all 128 bits set from bit 256 to bit 383) and so on ... [4*k, 4*k+2)->"111...1" -- Joachim

On Mon, Nov 2, 2009 at 11:25 AM, Joachim Faulhaber <afojgo@googlemail.com>wrote:
I'd like to ask Zachary Turner to comment on the use case. Can this be useful in your field?
Moreover I'd like to encourage everyone to give feedback on my interval container library, report problems, bugs and share suggestions. If you find this stuff useful please volunteer as review manager for the submitted library Boost Interval Containers.
Note, that the new example on 'large_bitsets' is not yet included in the library submitted for review in the vault but it is contained in the sandbox. I am currently preparing an update of the review version of the itl, that is adjusted and corrected for the feedback, suggestions and patches that I have received during the last weeks. I will integrate further feedback and account for bug reports if you have any.
This is interesting and at some point I might download the library and play with it. Is there any easy way for me to determine the memory usage of an interval_map / interval_set? It's hard to say whether this will offer an advantage in my use case because being that I'm interested in primarily a) which blocks on a filesystem are allocated, and b) which blocks on a filesystem have changed since the last time I checked, it depends heavily on the usage patterns of the user. The which-blocks-are-allocated bitset would be more likely to *not* benefit from this type of additional compression, or at least not significantly, because it would be very rare on a disk to have a significantly high concentration of alternating bits, or extremely short runs. The which-blocks-have-changed-since-last-time bitset is a little more uncertain. Obviously file access patterns aren't truly random, but let's just say in the worst case scenario of completely random disk access, and hence completely random bits being flipped from 0 to 1, how would this affect the memory usage of this template? I feel like this is kind of a difficult situation to analyze so I'm not looking for an exact O() storage complexity or anything, but just any insight you have to offer. My current solution compresses only intervals and thus also suffers in the case of random access, but I wonder if this solution would allow me to save space. What kind of iteration support is provided? I often need to iterate through each bit that is set to 1, where the value_type of the iterator is the 0-based index of the bit in the bitset. So for example if I have a bitset representing the bit string 0110110101100 I would want, or at least I would want to be able to easily create, an iterator that returns 1, 2, 4, 5, 7, 9, 10. The other main important thing would be the ability to do in-place bitwise |, &, ~. Advancing an iterator also needs to have constant-time complexity for me, and the bitwise operators should probably be no greater than n log(n). O(n) would be ideal on the bitwiswe operators, but I don't think my current implementation supports that due to the need to do interval merging and balancing.

Hi Zach, 2009/11/3 Zachary Turner <divisortheory@gmail.com>:
On Mon, Nov 2, 2009 at 11:25 AM, Joachim Faulhaber <afojgo@googlemail.com>wrote:
I'd like to ask Zachary Turner to comment on the use case. Can this be useful in your field?
This is interesting and at some point I might download the library and play with it. Is there any easy way for me to determine the memory usage of an interval_map / interval_set?
interval_map / interval_set are implemented via std::map / std::set. So for every node carrying an interval / interval-value-pair you have three pointers (left,right,parent) and a color bit. The interval itself needs 2*sizeof(domain_type) plus one byte for the border information. Finally a map node has to provide space for the value of it's codomain type. The number of interval or interval-value nodes is returned by the function interval_count() or iterative_size() on all interval containers.
It's hard to say whether this will offer an advantage in my use case because being that I'm interested in primarily a) which blocks on a filesystem are allocated, and b) which blocks on a filesystem have changed since the last time I checked, it depends heavily on the usage patterns of the user.
The which-blocks-are-allocated bitset would be more likely to *not* benefit from this type of additional compression, or at least not significantly, because it would be very rare on a disk to have a significantly high concentration of alternating bits,
as I have already mentioned in this thread, the periodic bit compression is a nice byproduct but should not be overemphasized. Clearly we won't find periodic patterns very often in file allocation tables ;)
or extremely short runs.
... hmm, I thought, due to fragmentation there could be more narrow sequences of changing block patterns that could profit from bitcompression ...
The which-blocks-have-changed-since-last-time bitset is a little more uncertain. Obviously file access patterns aren't truly random, but let's just say in the worst case scenario of completely random disk access, and hence completely random bits being flipped from 0 to 1, how would this affect the memory usage of this template? I feel like this is kind of a difficult situation to analyze so I'm not looking for an exact O() storage complexity or anything, but just any insight you have to offer. My current solution compresses only intervals and thus also suffers in the case of random access, but I wonder if this solution would allow me to save space.
It would allow to save space only, if in addition to interval compression your blocks-have-changed-patterns allow for additional bit compression: Clusters of changed-blocks within not too big ranges, that can be compressed into 'local' bitsets. In the worst case scenario of randomly and sparsely distributed bits 'large_bitset' has no advantage over an itl::interval_set or your compressed_bitset implementation (if I understood it right).
What kind of iteration support is provided? I often need to iterate through each bit that is set to 1, where the value_type of the iterator is the 0-based index of the bit in the bitset. So for example if I have a bitset representing the bit string 0110110101100 I would want, or at least I would want to be able to easily create, an iterator that returns 1, 2, 4, 5, 7, 9, 10.
mini::large_bitset started as a non trivial example on the usefulness of itl::interval_maps. So this implementation http://www.herold-faulhaber.de/boost_itl/doc/libs/itl/doc/html/boost_itl/pro... aims to be very *mini*mal. Also the more elaborated version 'interval_bitset' https://svn.boost.org/svn/boost/sandbox/itl/boost/itl_xt/interval_bitset.hpp is a first shot and not yet very much field tested (therefore not part of the core library submitted for review). With itl::interval_bitset you can currently iterate over intervals only. I am generally reluctant to implement element iterators for interval containers, because this would be IMO a permanent invitation for inefficient usage. Apart from that, it wouldn't be difficult to provide such an iterator for itl::interval_bitset.
The other main important thing would be the ability to do in-place bitwise |, &, ~.
There are, even in mini::large_bitset, three different overloads for bitwise operators (except for ~) that are performing inplace, if the inserted range is already in the container. As a useful addition I think about adding operators o= to combine whole (unit-) bitsets at a given offset interval_bitset& interval_bitset::operator o= (const std::pair<domain_type, bitset_type>& bits_at);
Advancing an iterator also needs to have constant-time complexity for me, ...
that would be no problem (apart from the reluctance problem ;)
... and the bitwise operators should probably be no greater than n log(n).
currently O(log(n)) for bits (elements) O(log(n)) for unit bitsets O(n) for intervals of elements (interval defined bitset) O(m log(n+m)) for interval_bitsets of iterative_size n and m Thank you for all your comments and suggestions! Cheers Joachim
participants (4)
-
Joachim Faulhaber
-
OvermindDL1
-
Scott McMurray
-
Zachary Turner