boost::compressed_bitset

I know there are over a zillion bitset implementations, but I have often found myself in need of a bitset that is efficient (space-wise) at storing millions, billions, or even trillions (no joke) of bits. For this I have developed an in-house class which stores bits in a structure vaguely reminiscent of a "T-tree" (http://en.wikipedia.org/wiki/T_tree). You can think of it basically as a std::map<boost::uint64_t, boost::uint64_t> where <key, value> represents the *interval* of bits [key, key+value-1]. So it's basically run-length encoded, with logarithmic lookups instead of linear lookups.

Obviously the worst case scenario is where bits alternate 10101... throughout the entire bitset, and this structure will be terrible in that case. But where there are frequent runs of 0 and 1 bits, it gives orders of magnitude savings in space. Since a map is sorted on key, it's still a valid binary search tree: one can find the node covering the k'th bit by taking the predecessor of map::upper_bound(k) (lower_bound(k) would skip an interval that starts before k) and quickly check whether the bit is set by testing whether k is in [iter->first, iter->first + iter->second). The most difficult part is insertion / deletion, since it involves splitting (or merging) intervals and rebalancing, but the complexity is still no worse than that of a normal rebalancing AVL tree.

I almost exclusively use this structure in a way that interval nodes represent sequences of 1 bits, but it can just as easily be used the other way around, where intervals represent sequences of 0 bits. In fact, now that I think about it, it probably doesn't even matter.

Performing bitwise operations on such a structure is very fast: all you need to do is expand / reduce the intervals appropriately. Lookup / assignment is obviously not constant time, but that is the tradeoff. Forward / reverse iteration is still very fast, however, as you only take the non-constant-time hit once, on begin(). It is also very fast to iterate only the 1 bits or only the 0 bits for the same reason, which can be quite slow with normal bitsets.

The implementation I currently have is very unsuitable for boost for many reasons, but if there is enough interest I will (slowly) work on adapting it to be more generic / usable by a wider audience, of course with suggestions and input from the community. Zach

I almost exclusively use this structure in a way that interval nodes represent sequences of 1 bits, but it can just as easily be used the other way around, where intervals represent sequences of 0 bits. In fact now that I think about it probably doesn't even matter.
Since it will use less space if the bitset is sparse with occasional 0 or 1 bits, this sounds like a good choice for a non-type template parameter. Obviously the user could just invert the sense of the bits themselves (instead of representing presence in a set, it could represent "lack of absence"), but a template parameter would be more user-friendly. A converting constructor could be provided for converting between the two. --Jeffrey Bosboom

2009/6/10 Jeffrey Bosboom <jbosboom@uci.edu>:
I almost exclusively use this structure in a way that interval nodes represent sequences of 1 bits, but it can just as easily be used the other way around, where intervals represent sequences of 0 bits. In fact now that I think about it probably doesn't even matter.
Since it will use less space if the bitset is sparse with occasional 0 or 1 bits, this sounds like a good choice for a nontype template parameter. Obviously the user could just invert the sense of the bits themselves (instead of representing presence in a set, it could represent "lack of absence"), but a template parameter would be more user-friendly. A converting constructor could be provided for converting between the two.
I'm not sure it's worth the complexity. It's usually not that hard to invert a condition in user code, since renaming the variable from visited to pending, free to allocated, or similar makes it just about as readable.

I remember hearing about an interval template library; is there any possible useful overlap here?

Also, at first glance I thought you were talking about what I'd call a sparse_bitset, where the storage is an optimization of a map<offset, bitset<128> >. (Specifically I'm thinking of the LLVM class described here: http://llvm.org/docs/ProgrammersManual.html#dss_sparsebitvector) Obviously the interval method is more efficient for long runs of ones, but the offset one is much better for short runs. Can you elaborate more on your usage? Should boost maybe provide both?

And out of curiosity, did you compare the T-tree to other trees? Perhaps there's also an intrusive::t_tree hiding in here =)

Scott McMurray wrote:
2009/6/10 Jeffrey Bosboom <jbosboom@uci.edu>:
I almost exclusively use this structure in a way that interval nodes represent sequences of 1 bits, but it can just as easily be used the other way around, where intervals represent sequences of 0 bits. In fact now that I think about it probably doesn't even matter.

Since it will use less space if the bitset is sparse with occasional 0 or 1 bits, this sounds like a good choice for a nontype template parameter. Obviously the user could just invert the sense of the bits themselves (instead of representing presence in a set, it could represent "lack of absence"), but a template parameter would be more user-friendly. A converting constructor could be provided for converting between the two.
I'm not sure it's worth the complexity. It's usually not that hard to invert a condition in user code, since renaming the variable from visited to pending, free to allocated, or similar makes it just about as readable.
You may be right. I was thinking it would help with type-checking if two modules were passing bitsets to one another but didn't agree on the sense of the bits, but that doesn't seem to be a problem with std::bitset either. --Jeffrey Bosboom

On Thu, Jun 11, 2009 at 1:52 AM, Scott McMurray <me22.ca+boost@gmail.com>wrote:
2009/6/10 Jeffrey Bosboom <jbosboom@uci.edu>:
I almost exclusively use this structure in a way that interval nodes represent sequences of 1 bits, but it can just as easily be used the other way around, where intervals represent sequences of 0 bits. In fact now that I think about it probably doesn't even matter.
Since it will use less space if the bitset is sparse with occasional 0 or 1 bits, this sounds like a good choice for a nontype template parameter. Obviously the user could just invert the sense of the bits themselves (instead of representing presence in a set, it could represent "lack of absence"), but a template parameter would be more user-friendly. A converting constructor could be provided for converting between the two.
I'm not sure it's worth the complexity. It's usually not that hard to invert a condition in user code, since renaming the variable from visited to pending, free to allocated, or similar makes it just about as readable.
I remember hearing about an interval template library; Is there any possible useful overlap here? Also, at first glace I thought you were talking about what I'd call a sparse_bitset, where the storage is an optimization of a map<offset, bitset<128> >. (Specifically I'm thinking of the LLVM class described here: http://llvm.org/docs/ProgrammersManual.html#dss_sparsebitvector) Obviously the interval method is more efficient for long runs of ones, but the offset one is much better for short runs. Can you elaborate more on your usage? Should boost maybe provide both?
And out of curiosity, did you compare the T-tree to other trees? Perhaps there's also an intrusive::t_tree hiding in here =)
To be honest I don't even know a whole lot about T-trees. I had just come up with this idea for a compressed bitset and then started searching around to see whether similar data structures had ever been invented, and the T-tree seemed reminiscent of it.

I will take a look at the SparseBitVector. It does seem quite similar just from how you described it as a map<offset, bitset<128> >. The first thing I wonder, though (without having read the link yet), is why the value needs to be an entire bitset. Perhaps they are also trying to make it efficient in the case where the bitset is "almost" sparse, but there are occasional sequences of interleaved 0s and 1s. That's an interesting idea I hadn't thought of. In any case it does seem like there is some potential for overlap, so I'll read the link in a little bit.

The two usage scenarios I can give you off the top of my head are one that I'm using it for, and one that the Windows kernel uses the same data structure for. I use it to store a filesystem allocation bitmap. For every block on the filesystem (where the block size can be 1KB, 4KB, etc., depending on user settings), there is 1 bit to say whether the block is allocated or unallocated. If you work out the math, for a 1TB drive with 4KB blocks you can end up with a bitset taking up 33MB of data if you use an uncompressed bitset. On the other hand, massive storage capacity often means a RAID array, which is likely to be defragmented. And most people tend to do some sort of defragmentation pretty often anyway. Even if that were not the case, typical filesystem allocation patterns, even for heavily fragmented drives, tend to have lots of runs of 1s or 0s.

The usage in the Windows kernel is similar, although they use it to store what they call the "VAD tree" (virtual address descriptor). For every process running on the system, there is a 4GB or larger address space (depending on architecture). They use this structure to determine which parts of this address space are allocated and which are not, so they can quickly satisfy allocation requests. It's likely that Linux uses something similar, but I don't know much about the Linux kernel.

2009/6/11 Zachary Turner <divisortheory@gmail.com>
The usage in the Windows kernel is similar, although they use it to store what they call the "VAD tree" (virtual address descriptor). For every process running on the system, there is a 4GB or larger address space (depending on architecture). They use this structure to determine which parts of this 4GB address space are allocated and which are not, so they can quickly satisfy allocation requests. It's likely that linux uses something similar but I don't know much about linux kernel.
More well-engineered filesystems have a similar approach called Extents, which is basically a run-length encoded FAT table, split across the file descriptors for quick access. Examples include ReiserFS, NTFS, ZFS and Ext4 (all IIRC). Kind regards, Peter Bindels

Since it seems like there is at least enough interest that it wouldn't be a total waste of time to work on this a little more, does anyone have any thoughts regarding implementation? I currently just store an std::map internally since it makes the implementation considerably simpler, can anyone think of any reasons why a different approach might be preferable?

2009/6/11 Zachary Turner <divisortheory@gmail.com>:
I will take a look at the sparsebitvector. It does seem quite similar just from how you described as a map<offset, bitset<128> >. The first thing I wonder though (without having read the link yet), is why does the value need to be an entire bitset?. Perhaps they are also trying to make it efficient in the case where the bitset is "almost" sparse, but there are occasional sequences of interleaved 0s and 1s. That's an interesting idea I hadn't thought of. In any case it does seem like there is some potential for overlap so I'll read the link in a little bit.
Well, by the time you have a map<intptr_t, T>, the size of a node is at least 3 pointers plus the amount for T, so for alignment reasons you might as well store at least a pointer's worth of bits in it, since it's free to do so. And then it's quite tempting to store more, since tripling the number of bits per node only costs 50% extra memory at worst, the first time, and can reduce it in some cases. On some platforms, with the way new is implemented, using the same amount of storage for bits as for the allocator and node overhead actually lets you store something like 8 pointers' worth of bits per node.
The two usage scenarios I can give you off the top of my head are one that I'm using it for, and one that the Windows Kernel uses the same data structure for. I use it to store a filesystem allocation bitmap. For every block on the filesystem (where the block size can be 1KB, 4KB, etc depending on user settings), there is 1 bit to say whether the block is allocated or unallocated. If you work out the math, for a 1TB drive you can end up with a bitset taking up 33MB of data if you use an uncompressed bitset. But on the other hand, massive storage capacity can often be reminiscent of a RAID array, which is likely to be defragmented. And of course most people in general tend to do some sort of defragmentation pretty often. Although even if that were not the case, typical filesystem allocation patterns even for heavily fragmented drives tend to have lots of runs of 1s or 0s.
Ah, I see. The block allocator will already be trying pretty hard to avoid the pathological case for the range version. Smart.

LLVM's version uses a linked list, keeping an iterator to an internal element so adjacent requests can be O(1). I wonder whether an intrusive::splay_tree might make an interesting backend for compressed_bitset for similar reasons?

Not that this helps in any way, but the whole "storing a run of one bits as one element" idea reminds me of segmented binary numbers, which I first saw in Okasaki's thesis. ~ Scott

Jeffrey Bosboom wrote:
I almost exclusively use this structure in a way that interval nodes represent sequences of 1 bits, but it can just as easily be used the other way around, where intervals represent sequences of 0 bits. In fact now that I think about it probably doesn't even matter.
Since it will use less space if the bitset is sparse with occasional 0 or 1 bits, this sounds like a good choice for a nontype template parameter. Obviously the user could just invert the sense of the bits themselves (instead of representing presence in a set, it could represent "lack of absence"), but a template parameter would be more user-friendly. A converting constructor could be provided for converting between the two.
Surely the number of runs of 0s differs from the number of runs of 1s by at most 1, so inverting the meaning will only gain you a negligible benefit? --John Bytheway

Hi Zach, 2009/6/11 Zachary Turner <divisortheory@gmail.com>
I know there's over a zillion bitset implementations, but often times I have found myself in need of a bitset that is efficient (space-wise) at storing millions, billions, or even trillions (no joke) of bits. For this I have developed an in-house class which stores bits in a structure vaguely reminiscent of a "T-Tree" (http://en.wikipedia.org/wiki/T_tree). You can think of it basically as an std::map<boost::uint64_t, boost::uint64_t> where <key, value> represents the *interval* of bits [key, key+value-1]. So it's basically run-length encoded, with logarithmic lookups instead of linear lookups.
I have developed a very similar class coming from a different background. Working on problems of date and time, I came up with a collection of interval set and map class templates that I have called the Interval Template Library (ITL). http://www.herold-faulhaber.de/boost_itl/doc/libs/itl/doc/html/

It is interesting to see that the same useful data types occur and reoccur in different problem domains. Fortunately boost is a good place to collect, discuss and develop those useful data types. The compressed_bitset that you are presenting seems to be similar to the ITL's interval_set<T>. As with your class, the idea is to represent large sets in a compact way using intervals that are efficiently organized in a balanced tree. Maybe a look at my library can be helpful for your project.

enjoy Joachim

--- The current sources of the ITL are in the boost sandbox: https://svn.boost.org/svn/boost/sandbox/itl/ The latest release (3.0.0) is in the boost vault (section containers). Release 2.0.1 is available from sourceforge: http://sourceforge.net/projects/itl Online documentation at: http://www.herold-faulhaber.de/boost_itl/doc/libs/itl/doc/html/
participants (6)
- Jeffrey Bosboom
- Joachim Faulhaber
- John Bytheway
- Peter Bindels
- Scott McMurray
- Zachary Turner