[BTREE] Get BTREE library into boost - Market Research

My organization is investigating contracting for enhancements to the boost::btree library developed by Beman Dawes, with all enhancements to remain under the Boost license and the author(s) retaining all copyright. No travel is expected to be required as part of this project. We are considering the following enhancements:

1. Develop detailed developer and tutorial documentation in customary Boost formats.
2. Enhance the interface to provide more explicit preload and caching control, including the amount of RAM to reserve, perhaps with the ability to specify POSIX alignment to maximize the benefit of Linux anonymous huge page support (a sketch of what this might look like follows this message).
3. Make the library C++11 ready.
4. Finalize the data layout and support for variable-length payloads.
5. Provide a convenient interface for storing std::string (or the contents thereof) in payloads.
6. Prepare the library for submission for formal review.

Any comments and suggestions on this are greatly appreciated. Of particular interest are discussions of technical approaches and estimates of hours required. Please do not send formal proposals. In addition to replying on this list (preferred), I can be reached at the telephone number and email below. Comments sent by 21 December will be most valuable; discussion after that is still very welcome! Please don't include any proprietary data. Also note that I am not authorized to commit the government in any way--this is simply market research.

Thanks!

Joel

Dr. JOEL D. YOUNG
Assistant Professor
Computer Science Department
Naval Postgraduate School
Glasgow East Rm 338
jdyoung@nps.edu
(831) 656-3518
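A minimal sketch of the aligned reservation mentioned in item 2, assuming Linux; the `reserve_cache` name, the 2 MiB alignment constant, and the `MADV_HUGEPAGE` hint are illustrative assumptions, not part of the library:

```cpp
// Hypothetical sketch only: reserve an aligned cache buffer so that the
// kernel can back it with transparent huge pages. reserve_cache is an
// assumed name, not an existing boost::btree interface.
#include <cstddef>
#include <cstdlib>     // posix_memalign, free
#include <new>         // std::bad_alloc
#include <sys/mman.h>  // madvise, MADV_HUGEPAGE (Linux-specific)

void* reserve_cache(std::size_t bytes)
{
    const std::size_t huge_page = 2u * 1024 * 1024;  // 2 MiB on x86-64, assumed
    void* p = 0;
    if (posix_memalign(&p, huge_page, bytes) != 0)
        throw std::bad_alloc();
    madvise(p, bytes, MADV_HUGEPAGE);  // advisory only; the kernel may ignore it
    return p;
}
```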

2012/12/20 Joel Young <jdy@cryregarder.com>:
My organization is investigating contracting for enhancements to the boost::btree library developed by Beman Dawes with all enhancements to remain under the boost license and author(s) retaining all copyright. No travel is expected to be required as part of this project.
Great, I'd love to see it!
Any comments and suggestions on this are greatly appreciated. Of particular interest are discussions of technical approaches and estimates of hours required. Please do not send formal proposals. In addition to replying on this list (preferred) I can be reached at the telephone numbers and email below.
I did not look through the Boost.BTREE implementation, but I have worked with some proprietary btree implementations, and the following features are quite useful or give good performance:

1) CRC for pages (extremely helpful for detecting data corruption and for debugging)
2) Data compression for pages (smaller files, less memory usage; ordered data compresses very well)
3) Ability for the user to provide custom read/write mutexes (fake mutex, interprocess mutex, std::mutex)
4) Ability for the user to provide stateful allocators (might be useful for those who use Boost.Interprocess)
5) Ability for the user to provide comparators
6) Search methods can be sped up if a `position hint` iterator is provided (just like `iterator std::set::insert(iterator position, const value_type& val);`)
-- Best regards, Antony Polukhin

On Thu, Dec 20, 2012 at 6:30 AM, Antony Polukhin <antoshkka@gmail.com> wrote: ...
I did not look through Boost.BTREE implementation, but I worked with some proprietary btree implementations and following features are quite useful or provide good performance:
1) CRC for pages (extremely helpful for detecting data corruption and for debugging)
That would be a useful option. The tricky part is how to specify it so that there is zero cost if you don't want to invoke the option. The implementation already allows recovery of non-corrupt pages in the event of a failure.
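One way to get that zero cost is a compile-time policy parameter, sketched below; the names and the template are illustrative, not the library's actual interface:

```cpp
// Sketch of a compile-time CRC policy: with no_crc the checks are empty
// inline functions the compiler removes, so unchecked pages pay nothing.
#include <boost/crc.hpp>
#include <cstddef>

struct no_crc {
    static unsigned int compute(const void*, std::size_t)      { return 0; }
    static bool verify(const void*, std::size_t, unsigned int) { return true; }
};

struct crc32_check {
    static unsigned int compute(const void* p, std::size_t n)
    {
        boost::crc_32_type crc;
        crc.process_bytes(p, n);
        return crc.checksum();
    }
    static bool verify(const void* p, std::size_t n, unsigned int expected)
        { return compute(p, n) == expected; }
};

// Hypothetical container taking the policy; btree code would call
// CrcPolicy::verify after reading a page and CrcPolicy::compute before writing.
template <class Key, class CrcPolicy = no_crc>
class btree_set_sketch { /* ... */ };
```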
2) Data compression for pages (smaller files, less memory usage; ordered data compresses very well)
I'm strongly against that. The full rationale for not doing compression is a lengthy research paper, but the bottom line is what Rudolf Bayer said so many years ago with regard to prefix and suffix key compression: the increased complexity and reduced reliability make compression very unattractive.

The problems of compression are closely related to the problems of variable-length data. If the application is willing to tolerate sequential (rather than binary) search at the page level, or is willing to tolerate an indexed organization on pages, or even (gasp!) an additional disk access per comparison, these problems aren't necessarily showstoppers. But if some applications won't tolerate even a dispatch (either a virtual function or a hand-rolled dispatch) to select the approach being employed, then the only choice I can see is to provide essentially multiple sets of classes, and that gets complex and messy.
3) Ability for the user to provide custom read/write mutexes (fake mutex, interprocess mutex, std::mutex)
There is a spectrum of needs. I've seen various designs that are optimal for various points in that spectrum. Can you point to any design that is optimal across the spectrum from single thread, single process, single machine, on up through multi-thread, multi-process, multi-machine?
4) Ability for the user to provide stateful allocators (might be useful for those who use Boost.Interprocess)
Agreed, and I'm hoping that C++11 allocators will ease that problem. I haven't looked at it yet, however.
5) Ability for the user to provide comparators
Yes. That's a requirement for an STL-like approach, so was built in right from the start.
6) Search methods can be sped up if a `position hint` iterator is provided (just like `iterator std::set::insert(iterator position, const value_type& val);`)
The implementation already does a lot of page caching; if there is an outstanding iterator the page is already in memory, as are all pages above it in the Btree. Thus it isn't obvious that a hint would actually have much effect. But it would be interesting to add such a search function and run some timings. Thanks for the suggestions! --Beman

Beman, Antony, thanks for the feedback.

Beman Dawes <bdawes@acm.org> writes:
On Thu, Dec 20, 2012 at 6:30 AM, Antony Polukhin <antoshkka@gmail.com> wrote:
6) Search methods can be sped up if a `position hint` iterator is provided (just like `iterator std::set::insert(iterator position, const value_type& val);`)
The implementation already does a lot of page caching; if there is an outstanding iterator the page is already in memory, as are all pages above it in the Btree. Thus it isn't obvious that a hint would actually have much effect. But it would be interesting to add such a search function and run some timings.
One query mode we often need is retrieving the payloads for an ordered list of keys. It would be advantageous if the search could take advantage of knowing that the next answer is guaranteed to come after the previous one (and is probably closer to the previous one than to the last record in the btree).

Joel
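For concreteness, the access pattern described above might look like the sketch below, assuming a hypothetical find-with-hint overload that the library does not currently provide:

```cpp
// Hypothetical ordered-batch retrieval: since the keys arrive sorted,
// each search could resume from the previous hit instead of the root.
// BTree::find(iterator hint, const Key&) is an assumed overload.
#include <cstddef>
#include <vector>

template <class BTree, class Key>
std::vector<typename BTree::mapped_type>
lookup_ordered(BTree& db, const std::vector<Key>& sorted_keys)
{
    std::vector<typename BTree::mapped_type> out;
    typename BTree::iterator hint = db.begin();
    for (std::size_t i = 0; i < sorted_keys.size(); ++i) {
        typename BTree::iterator it = db.find(hint, sorted_keys[i]);  // assumed overload
        if (it != db.end()) {
            out.push_back(it->second);
            hint = it;  // the next key is guaranteed to be at or after this position
        }
    }
    return out;
}
```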

On Thu, Dec 20, 2012 at 8:41 PM, Joel <jdy@cryregarder.com> wrote:
Beman, Antony, thanks for the feedback.
One query mode we often need is retrieving the payloads for an ordered list of keys. It would be advantageous if the search could take advantage of knowing that the next answer is guaranteed to come after the previous one (and is probably closer to the previous one than to the last record in the btree).
The cache keeps the most recently used tree in memory, so the basic mechanisms for optimizing such retrievals are already in place. What might be helpful would be to add some profiling or even automatic tuning to take advantage of particular workloads. Automatic, rather than manual, packing might also be worthwhile, particularly for applications with sufficient idle time. --Beman

On 2012-12-20 16:45, Beman Dawes wrote:
2) Data compression for pages (smaller files, less memory usage; ordered data compresses very well)
I'm strongly against that. The full rationale for not doing compression is a lengthy research paper, but the bottom line is what Rudolf Bayer said so many years ago with regard to prefix and suffix key compression: the increased complexity and reduced reliability make compression very unattractive.
The problems of compression are closely related to the problems of variable-length data. If the application is willing to tolerate sequential (rather than binary) search at the page level, or is willing to tolerate an indexed organization on pages, or even (gasp!) an additional disk access per comparison, these problems aren't necessarily showstoppers. But if some applications won't tolerate even a dispatch (either a virtual function or a hand-rolled dispatch) to select the approach being employed, then the only choice I can see is to provide essentially multiple sets of classes, and that gets complex and messy.
At what point do you expect serialization to be executed? Do pages need to be kept in a serialized state in memory? Is it wrong to, say, represent a page with a std::map in memory, and serialize to a binary page when writing? In the latter case, compression seems to be less difficult.
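To make the question concrete, a sketch of that scheme with hypothetical fixed-size records and an assumed 4 KiB on-disk page (none of this is the library's layout):

```cpp
// Sketch of the map-in-memory / binary-page-on-disk idea being asked about.
// The page layout and record sizes are made-up assumptions.
#include <cstdint>
#include <cstring>
#include <map>

struct disk_page {
    std::uint32_t count;
    char payload[4096 - sizeof(std::uint32_t)];  // assumed 4 KiB page
};

typedef std::map<std::uint64_t, std::uint64_t> mem_page;  // key -> payload

void flush_page(const mem_page& m, disk_page& out)  // assumes m fits in one page
{
    out.count = static_cast<std::uint32_t>(m.size());
    char* p = out.payload;
    for (mem_page::const_iterator it = m.begin(); it != m.end(); ++it) {
        std::memcpy(p, &it->first,  sizeof(it->first));   p += sizeof(it->first);
        std::memcpy(p, &it->second, sizeof(it->second));  p += sizeof(it->second);
    }
    // A general-purpose compressor could be run over out.payload at this point.
}
```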
3) Ability for the user to provide custom read/write mutexes (fake mutex, interprocess mutex, std::mutex)
There is a spectrum of needs. I've seen various designs that are optimal for various points in that spectrum. Can you point to any design that is optimal across the spectrum from single thread, single process, single machine, on up through multi-thread, multi-process, multi-machine?
MVCC for b-trees comes close. See, e.g., http://guide.couchdb.org/draft/btree.html

Cheers, Rutger

On Fri, Dec 21, 2012 at 2:26 AM, Rutger ter Borg <rutger@terborg.net> wrote:
At what point do you expect serialization to be executed? Do pages need to be kept in a serialized state in memory? Is it wrong to, say, represent a page with a std::map in memory, and serialize to a binary page when writing? In the latter case, compression seems to be less difficult.
It's been some time since I've taken a look at Beman's btree code, so if I'm wrong about something please correct me.

There is no support for variable-length keys at the moment. The code is structured to always have the same number of keys in one page. Compression would throw this off, adding a good bit of complexity.

Also, I believe the pages are kept in a blittable format (i.e. you can mmap a page and use it right away without any deserializing; a minimal illustration follows this message), so there is also the relatively minor complexity of changing that.

I do see a use for compression, and I'd definitely like the option. Not necessarily prefix/suffix folding, but general (i.e. zlib) page compression could come in handy on e.g. phones and SD cards, or over a network -- scenarios where I/O is particularly expensive and CPU is plentiful.

--
Cory Nelson
http://int64.org
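A minimal illustration of the blittable idea, with a hypothetical page header rather than the library's actual layout:

```cpp
// Illustration only: a plain-old-data page header that is usable directly
// from an mmap'ed file, with no deserialization. The fields are assumptions.
#include <cstddef>
#include <cstdint>
#include <sys/mman.h>
#include <sys/types.h>

struct page_header {
    std::uint32_t page_id;
    std::uint32_t key_count;
    std::uint32_t level;      // 0 = leaf, >0 = branch
    std::uint32_t reserved;
    // keys and values follow in the same buffer
};

const page_header* map_page(int fd, std::size_t page_size, std::size_t page_no)
{
    // page_size is assumed to be a multiple of the OS page size,
    // so the offset meets mmap's alignment requirement.
    void* p = mmap(0, page_size, PROT_READ, MAP_SHARED,
                   fd, static_cast<off_t>(page_no * page_size));
    return p == MAP_FAILED ? 0 : static_cast<const page_header*>(p);
}
```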

On Fri, Dec 21, 2012 at 11:56 AM, Cory Nelson <phrosty@gmail.com> wrote:
It's been some time since I've taken a look at Beman's btree code, so if I'm wrong about something please correct me.
There is no support for variable-length keys at the moment. The code is structured to always have the same number of keys in one page. Compression would throw this off, adding a good bit of complexity.
Yes. To be a bit more specific, the same *maximum* number of keys on a page. My implementation uses the same page size for both branch and leaf pages, so the maximum number may differ between branch and leaf pages.
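As a worked example with made-up sizes (4096-byte pages, 8-byte keys, 8-byte child page IDs, 32-byte leaf payloads, a 16-byte header), the maxima come out quite different:

```cpp
// Back-of-envelope page capacities; every size here is an assumption.
#include <cstddef>
#include <iostream>

int main()
{
    const std::size_t page_size = 4096, header = 16;
    const std::size_t key = 8, child_id = 8, payload = 32;

    // A branch with n keys carries n+1 child page IDs.
    const std::size_t branch_max = (page_size - header - child_id) / (key + child_id);
    const std::size_t leaf_max   = (page_size - header) / (key + payload);

    std::cout << "branch max keys: " << branch_max   // 254
              << ", leaf max keys: " << leaf_max     // 102
              << '\n';
}
```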
Also, I believe the pages are kept in a blittable format (i.e. you can mmap a page and use it right away without any deserializing), so there is also the relatively minor complexity of changing that.
Yes.
I do see a use for compression, and I'd definitely like the option. Not necessarily prefix/suffix folding, but general (i.e. zlib) page compression could come in handy on e.g. phones and SD cards, or over a network -- scenarios where I/O is particularly expensive and CPU is plentiful.
That's an interesting point!

If the cost of transmitting a compressed page plus the cost of decompressing it were less than the cost of transmitting an uncompressed page, compression would appear to make sense. But if transmission is that slow, an even better approach might be to just send the search key, do the search on the server, and return only the found data.

Thanks,

--Beman
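A back-of-envelope check of that break-even, with entirely made-up figures:

```cpp
// Compression pays off when (compressed transfer + decompression) beats
// raw transfer. All numbers below are illustrative assumptions.
#include <iostream>

int main()
{
    const double page_bytes = 4096.0;
    const double ratio      = 4.0;     // assumed compression ratio
    const double bandwidth  = 10e6;    // bytes/sec, e.g. a slow link or SD card
    const double decompress = 50e-6;   // seconds per page, assumed

    const double raw_cost  = page_bytes / bandwidth;                      // ~410 us
    const double comp_cost = page_bytes / ratio / bandwidth + decompress; // ~152 us

    std::cout << (comp_cost < raw_cost ? "compress" : "leave uncompressed") << '\n';
}
```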

2012/12/22 Beman Dawes <bdawes@acm.org>:
On Fri, Dec 21, 2012 at 11:56 AM, Cory Nelson <phrosty@gmail.com> wrote:
I do see a use for compression, and I'd definitely like the option. Not necessarily prefix/suffix folding, but general (i.e. zlib) page compression could come in handy on e.g. phones and SD cards, or over a network -- scenarios where I/O is particularly expensive and CPU is plentiful.
That's an interesting point!
If the cost of transmitting a compressed page plus the cost of decompressing it were less than the cost of transmitting an uncompressed page, compression would appear to make sense. But if transmission is that slow, an even better approach might be to just send the search key, do the search on the server, and return only the found data.
It is also faster to read multiple compressed pages from disk at once and decompress them than to read multiple uncompressed pages (as I remember, the UPX guys measured that). Also, a page with ordered data can be compressed extremely effectively, leading to files up to 20 times smaller (remember that big parts of a btree can be filled with zeros because they are empty).

2012/12/20 Beman Dawes <bdawes@acm.org>:
On Thu, Dec 20, 2012 at 6:30 AM, Antony Polukhin <antoshkka@gmail.com> wrote:
6) Search methods can be sped up if a `position hint` iterator is provided (just like `iterator std::set::insert(iterator position, const value_type& val);`)
The implementation already does a lot of page caching; if there is an outstanding iterator the page is already in memory, as are all pages above it in the Btree. Thus it isn't obvious that a hint would actually have much effect. But it would be interesting to add such a search function and run some timings.
For big enough trees there is a chance that lots of iterators iterate through different leaf nodes, so the pages above are no longer in memory. In that case searches with a hint can give some advantage; however, it is a rare use case. For cases when all the pages are already in memory, this optimization is doubtful.

-- Best regards, Antony Polukhin

On Fri, Dec 21, 2012 at 3:26 AM, Rutger ter Borg <rutger@terborg.net> wrote:
At what point do you expect serialization to be executed?
It is a btree. That's a very specific data structure that is the same in memory and on disk. There is no serialization.
Do pages need to be kept in a serialized state in memory?
It makes no sense to talk about serializing a btree. If you serialized it, it would no longer be a btree.

Is it wrong to, say, represent a page with a std::map in memory, and serialize to a binary page when writing?
That would be very, very slow because of the serialization. Since Bayer invented both the btree and the red-black tree used to implement maps, presumably he would have combined them if there were any advantage.
In the latter case, compression seems to be less difficult.
One commonly used approach to dealing with large volumes of variable-length data, such as compressed data, is to put the data in a sequential file that can be accessed randomly, and to put the key into a btree whose payload holds only the position (and possibly the length) of the actual data in the sequential file. In other words, use the btree as an index to the data rather than as a container for the data itself.
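In code, that arrangement might look like the sketch below; the blob-file handling and the `blob_ref` payload type are illustrative assumptions (the btree itself would store fixed-size `(key, blob_ref)` pairs):

```cpp
// Sketch of using the btree as an index: variable-length data lives in a
// sequential blob file; the btree payload is just a fixed-size reference.
#include <cstdint>
#include <fstream>
#include <string>

struct blob_ref {              // fixed-size payload suitable for a btree leaf
    std::uint64_t offset;      // position in the blob file
    std::uint32_t length;      // byte count of the stored data
};

blob_ref append_blob(std::ofstream& blobs, const std::string& data)
{
    blob_ref r;
    r.offset = static_cast<std::uint64_t>(blobs.tellp());
    r.length = static_cast<std::uint32_t>(data.size());
    blobs.write(data.data(), static_cast<std::streamsize>(data.size()));
    return r;
}

std::string read_blob(std::ifstream& blobs, const blob_ref& r)
{
    std::string s(r.length, '\0');
    blobs.seekg(static_cast<std::streamoff>(r.offset));
    blobs.read(&s[0], r.length);
    return s;
}
// Lookups then go key -> blob_ref (via the btree) -> read_blob.
```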
3) Ability for the user to provide custom read/write mutexes (fake mutex, interprocess mutex, std::mutex)
There is a spectrum of needs. I've seen various designs that are optimal for various points in that spectrum. Can you point to any design that is optimal across the spectrum from single thread, single process, single machine, on up through multi-thread, multi-process, multi-machine?
MVCC for b-trees comes close. See, e.g., http://guide.couchdb.org/draft/btree.html
That looks like the usual design for lock avoidance, but it exacts a performance penalty from apps that don't need locking at all. Also, lock-avoidance designs often require that writes be atomic, so they don't work in apps that must rely on non-atomic writes unless some additional mechanism is added. In other words, that design works well in part of the middle of the spectrum, but not across the whole spectrum.

Thanks,

--Beman

On 21-12-2012 23:13, Beman Dawes wrote:
It is a btree. That's a very specific data structure that is the same in memory and on disk. There is no serialization.
I always thought a B-tree is an abstract data structure. As such, implementation details, such as what is on disk and what is in memory, are irrelevant. Could you please point to the wording where it says that it has to be identical on disk and in memory? Maybe I misunderstood Wikipedia and several other in-depth research papers discussing them.
Do pages need to be kept in a serialized state in memory?
It makes no sense to talk about serializing a btree. If you serialized it, it would no longer be a btree.
I was talking about serialization of the nodes, or pages, of a B-tree. I am claiming it will still be a B-tree if you do so: store a bunch of keys and { data, references } in nodes. No matter what the storage scheme of a node is (e.g., serialization), IMHO it will still be a B-tree.
Is it wrong to, say, represent a page with a std::map in memory, and serialize to a binary page when writing?
That would be very, very slow because of the serialization. Since Bayer invented both the btree and the red-black tree used to implement maps, presumably he would have combined them if there were any advantage.
I guess it will be slower for writes, but couldn't it be faster for doing (many) lookups? What operation happens more often?
In the latter case, compression seems to be less difficult.
One commonly used approach to dealing with large volumes of variable-length data, such as compressed data, is to put the data in a sequential file that can be accessed randomly, and to put the key into a btree whose payload holds only the position (and possibly the length) of the actual data in the sequential file. In other words, use the btree as an index to the data rather than as a container for the data itself.
The problem of variable-length keys still has to be solved, then.

Cheers, Rutger

Rutger ter Borg <rutger@terborg.net> writes:
I guess it will be slower for writes, but couldn't it be faster for doing (many) lookups? What operation happens more often?
For my primary application, the DB is written once on an arbitrarily beefy machine or set of machines and read many times from very small machines (say, SSD laptops with 8 GB of RAM).

Joel
participants (6)
- Antony Polukhin
- Beman Dawes
- Cory Nelson
- Joel
- Joel Young
- Rutger ter Borg