Interest in B-tree library for Boost?

A Boost B-tree library would provide disk-based associative containers that scale all the way from really, really, small to really, really, large. B-trees perform well on hardware ranging from ancient floppy disk drives all the way up to humongous disk arrays. They are the technology behind most high-performance disk file systems and databases. If you don't know what a B-tree is, or think that "B-tree" is an abbreviation for "Binary tree", you might want to read the Wikipedia B-Tree article: http://en.wikipedia.org/wiki/B-tree. Knuth and other computer science texts also supply descriptions. See http://mysite.verizon.net/beman/btree/index.html for a more complete description of the proposed library. A preliminary implementation is available. Download http://mysite.verizon.net/beman/btree/btree-preview-1.zip, and unpack into a current Boost distribution. A few existing Jamfiles will be overwritten. The only compilers tested and working so far are VC++ 10 and GCC/MinGW 4.5. You can view the library online at http://github.com/Beman/Boost-Btree There is no detailed documentation yet, but if you know how to use std::map and other standard library associative containers, you already know most of what you need to know. Critical differences are documented in the Library Characteristics section of the index.html page referenced above. Initial timing tests have been very encouraging. The time test compares the btree_map implementation to std::map. As expected, a btree_map is many times slower than a std::map, given a cold operating system file cache and constraining the btree_map's caching to a minimum. But at the other extreme, when the O/S file cache is warm and the btree_map's cache is as large as the file, the btree_map is very similar in speed to std::map.The btree library's code is still very new, so timings and everything else may be subject to lots of revision. Any interest? --Beman

So you plan to code B-Tree, B*Tree, B+tree. Interesting. I give it a look! -- Quiero ser el rayo de sol que cada día te despierta para hacerte respirar y vivir en me. "Favola -Moda".

On Wed, Sep 15, 2010 at 12:15 PM, Giorgio Zoppi <giorgio.zoppi@gmail.com> wrote:
So you plan to code B-Tree, B*Tree, B+tree. ...
I use the term "B-tree" as the generic name, as is the usual practice in the literature. The particular implementation is a B+ tree with leaf pages in a doubly linked list. But I consider that an implementation detail subject to change. Single link lists have concurrency advantages. My C implementation used no linked lists at all to advantage. A pure B-tree has advantages for sets, although perhaps not for multisets. I'm not much interested in B*trees, for the usual reasons. --Beman

On Wed, Sep 15, 2010 at 12:15 PM, Giorgio Zoppi <giorgio.zoppi@gmail.com> wrote:
So you plan to code B-Tree, B*Tree, B+tree. ...
I use the term "B-tree" as the generic name, as is the usual practice in the literature.
The particular implementation is a B+ tree with leaf pages in a doubly linked list. But I consider that an implementation detail subject to change. Single link lists have concurrency advantages. My C implementation used no linked lists at all to advantage. It would be interesting see how in this case could work skip-lists in term of time and concurrency. Best Regards -- Quiero ser el rayo de sol que cada día te despierta
2010/9/15 Beman Dawes <bdawes@acm.org>: para hacerte respirar y vivir en me. "Favola -Moda".

On Wed, Sep 15, 2010 at 7:23 PM, Giorgio Zoppi <giorgio.zoppi@gmail.com> wrote:
It would be interesting see how in this case could work skip-lists in term of time and concurrency. Best Regards
Skip-list are a quite bad "on disk" data structure, since they tend to be sparse. There are canonically two approaches in implementing Skip-lists: - the one with the "next nodes" is implemented as a list, which consumes a reasonable amount of memory but is (very) cache unfriendly (== time consuming) - the one with the "next nodes" is implemented as an array, which is more cache friendly (just the list, not the pointed nodes) but uses too much memory (unless you incrementally increase the size of the array... which is slow) I always had bad performance experiences with Skip-lists and, for this reason, I am biased in my judgment. Cheers, Francesco

On Wed, Sep 15, 2010 at 2:23 PM, Giorgio Zoppi <giorgio.zoppi@gmail.com> wrote:
It would be interesting see how in this case could work skip-lists in term of time and concurrency.
Disk-based skip lists where you keep the top N levels of skips in memory work well. Or at least did 15-20 years ago. It's been a while. Tony

Zitat von Beman Dawes <bdawes@acm.org>:
A Boost B-tree library would provide disk-based associative containers that scale all the way from really, really, small to really, really, large. B-trees perform well on hardware ranging from ancient floppy disk drives all the way up to humongous disk arrays. They are the technology behind most high-performance disk file systems and databases.
If you don't know what a B-tree is, or think that "B-tree" is an abbreviation for "Binary tree", you might want to read the Wikipedia B-Tree article: http://en.wikipedia.org/wiki/B-tree. Knuth and other computer science texts also supply descriptions.
here`s a proposed library by bob walters that deals with large containers on disk: http://stldb.sourceforge.net/ I don`t think it implements a B-trees at this point (but some kind of tree, see trans_map), but aims to provide an infrastructure for any kind of tree or container.

On Wed, Sep 15, 2010 at 12:17 PM, Stefan Strasser <strasser@uni-bremen.de> wrote:
Zitat von Beman Dawes <bdawes@acm.org>: ...
here`s a proposed library by bob walters that deals with large containers on disk:
I don`t think it implements a B-trees at this point (but some kind of tree, see trans_map), but aims to provide an infrastructure for any kind of tree or container.
That library seems to deal with a higher, more database -like, level. Interesting in its own right, and with some overlap in applications, but not really competitive with a b-tree associative container AFAICS. Thanks for the link, --Beman

Zitat von Beman Dawes <bdawes@acm.org>:
On Wed, Sep 15, 2010 at 12:17 PM, Stefan Strasser <strasser@uni-bremen.de> wrote:
Zitat von Beman Dawes <bdawes@acm.org>: ...
here`s a proposed library by bob walters that deals with large containers on disk:
I don`t think it implements a B-trees at this point (but some kind of tree, see trans_map), but aims to provide an infrastructure for any kind of tree or container.
That library seems to deal with a higher, more database -like, level.
I`m not sure if that`s the case. if your goal was just to implement a STL "map" interface with a B-tree data structure instead of a rb-tree I would see the difference, but you stated that your implementation is aimed towards disk-based B-trees. then I don`t see how you don`t end up with some of the things STLdb tries to do, like logging and crash recovery. in case it is the transaction and concurrency stuff of STLdb that reminded you of "database-like" features, I don`t think that makes it an entirely different problem domain. for example, Berkeley DB can be used as a simple associative container on disk, without any transactions (and is even sold as such seperately), but can optionally be used with transactions.
Interesting in its own right, and with some overlap in applications, but not really competitive with a b-tree associative container AFAICS.

On Wed, Sep 15, 2010 at 8:02 PM, Stefan Strasser <strasser@uni-bremen.de> wrote:
Zitat von Beman Dawes <bdawes@acm.org>:
I don`t think it implements a B-trees at this point (but some kind of tree, see trans_map), but aims to provide an infrastructure for any kind of tree or container.
That library seems to deal with a higher, more database -like, level.
I`m not sure if that`s the case. if your goal was just to implement a STL "map" interface with a B-tree data structure instead of a rb-tree I would see the difference, but you stated that your implementation is aimed towards disk-based B-trees.
then I don`t see how you don`t end up with some of the things STLdb tries to do, like logging and crash recovery.
Applications are free to do things like logging or crash recovery if they care too. But uses that have no need for such features shouldn't have to pay for them. That said, the current implementation does mark pages in such a why that undamaged portions of a damaged file can be recovered. That has proven useful in practice, although as systems have gotten more reliable people don't seem to be doing that sort of damage recovery as often as 20 years ago. Providing damaged file recovery tools is on my do list.
in case it is the transaction and concurrency stuff of STLdb that reminded you of "database-like" features, I don`t think that makes it an entirely different problem domain. for example, Berkeley DB can be used as a simple associative container on disk, without any transactions (and is even sold as such seperately), but can optionally be used with transactions.
The first step is having a reliable B-tree implementation. My own experience has been in applications where subsequent interest is more in areas like multidimensional search, but that's just because my background is in applications where fully featured database software was traditionally way too slow. Thanks, --Beman

On Wed, 15 Sep 2010 12:04:08 -0400 Beman Dawes <bdawes@acm.org> wrote:
A Boost B-tree library would provide disk-based associative containers that scale all the way from really, really, small to really, really, large. B-trees perform well on hardware ranging from ancient floppy disk drives all the way up to humongous disk arrays. They are the technology behind most high-performance disk file systems and databases. [...] Any interest?
Definitely. I've written one before, for one project, but I utterly *detest* that code. I did it to learn how databases were put together, but it took me forever to get all the bugs out, so I'm not keen to rewrite it properly now that I know what I'm doing. I've got another project coming up that will demand the same ability. I've been looking into the Tokyo Cabinet database as an alternative to that horrible B-tree code, but I don't think it's portable to non-*NIX operating systems, and the project needs that. A Boost implementation would be very welcome. I'll take a look at your preview code when I have some time, though it might be a couple weeks from now. -- Chad Nelson Oak Circle Software, Inc. * * *

Beman Dawes wrote:
A Boost B-tree library would provide disk-based associative containers that scale all the way from really, really, small to really, really, large. B-trees perform well on hardware ranging from ancient floppy disk drives all the way up to humongous disk arrays. They are the technology behind most high-performance disk file systems and databases. ... Initial timing tests have been very encouraging. The time test compares the btree_map implementation to std::map. As expected, a btree_map is many times slower than a std::map, given a cold operating system file cache and constraining the btree_map's caching to a minimum. But at the other extreme, when the O/S file cache is warm and the btree_map's cache is as large as the file, the btree_map is very similar in speed to std::map.The btree library's code is still very new, so timings and everything else may be subject to lots of revision.
Was this post prompted by the Judy tree question raised yesterday? Is the code in the sandbox? Now that we have broken free of the 32bit addressable limit for commodity hardware there is less need for self paging data structures than there was a few years ago. Unless there is a way to configure it that gives a performance advantage over std::map we might have to wait until 64 bits (or 48 as the case may be) becomes too confining. People will generally invest in more memory than better code since memory can be mass produced in a factory while smart programmers are not yet completely commoditized. I looked into the details a little and I found it funny that you followed the advice I gave on the geometry list a couple days ago in the discussion of the design of a "generic" tree: "Node could even be declared as a private sub-class of tree to keep people's hands off of it and make it easier to implement by allowing it to share the template parameters of tree." I notice you made your private branch and leaf types classes instead of structs. Since they are tightly coupled to the tree by virtue of being private member classes I'm not sure they need data hiding to protect their members from the tree. I would have made them structs. You could be right, however, that protecting oneself from oneself is worthwhile. Certainly there is rarely anyone else writing bugs and violating design intent in the details of most libraries. Regards, Luke

On Wed, 15 Sep 2010 09:44:42 -0700 "Simonson, Lucanus J" <lucanus.j.simonson@intel.com> wrote:
[...] Now that we have broken free of the 32bit addressable limit for commodity hardware there is less need for self paging data structures than there was a few years ago. Unless there is a way to configure it that gives a performance advantage over std::map we might have to wait until 64 bits (or 48 as the case may be) becomes too confining. [...]
Not necessarily. There's quite a lot of need for an automatically *persistent* data structure like this -- i.e. one that survives between different runs of a program -- right now. Full-blown databases are overkill for many more-modest purposes, especially SQL databases. I realize that this could be done with serialization and such, and that would make sense if you needed to use the data on different architectures or operating systems. But if you just need to store, for example, the unfinished results of a long calculation while the user shuts down the program, and pick them up when the program is resumed, this would be ideal. -- Chad Nelson Oak Circle Software, Inc. * * *

On Wed, Sep 15, 2010 at 2:19 PM, Chad Nelson <chad.thecomfychair@gmail.com> wrote:
On Wed, 15 Sep 2010 09:44:42 -0700 "Simonson, Lucanus J" <lucanus.j.simonson@intel.com> wrote:
[...] Now that we have broken free of the 32bit addressable limit for commodity hardware there is less need for self paging data structures than there was a few years ago. Unless there is a way to configure it that gives a performance advantage over std::map we might have to wait until 64 bits (or 48 as the case may be) becomes too confining. [...]
Not necessarily. There's quite a lot of need for an automatically *persistent* data structure like this -- i.e. one that survives between different runs of a program -- right now. Full-blown databases are overkill for many more-modest purposes, especially SQL databases.
I realize that this could be done with serialization and such, and that would make sense if you needed to use the data on different architectures or operating systems...
The proposed B-tree library supplies traits classes for endian implementation types, so that the data is portable between architectures or operating systems if desired. The needs of the B-tree library were what motivated my work on the proposed Endian library. That's already part of the B-tree code, and is passing the test suite. I've also been hand inspecting dumps of disk data just to be sure. Cheers, --Beman

On Wed, Sep 15, 2010 at 12:44 PM, Simonson, Lucanus J <lucanus.j.simonson@intel.com> wrote:
Beman Dawes wrote:
A Boost B-tree library would provide disk-based associative containers that scale all the way from really, really, small to really, really, large. B-trees perform well on hardware ranging from ancient floppy disk drives all the way up to humongous disk arrays. They are the technology behind most high-performance disk file systems and databases. ... Initial timing tests have been very encouraging. The time test compares the btree_map implementation to std::map. As expected, a btree_map is many times slower than a std::map, given a cold operating system file cache and constraining the btree_map's caching to a minimum. But at the other extreme, when the O/S file cache is warm and the btree_map's cache is as large as the file, the btree_map is very similar in speed to std::map.The btree library's code is still very new, so timings and everything else may be subject to lots of revision.
Was this post prompted by the Judy tree question raised yesterday?
No, I haven't been reading the list recently. I started work on the library in 2000, but became convinced that the C++ language needed changes to really support B-tree's well. That was the motivation for my POD proposals, now part of C++0x. I tried again in 2006, but didn't separate binary file, buffer management, and core B-tree needs sufficiently, so abandoned that work. Then in May of this year, I started again, and have been much happier with the results.
Is the code in the sandbox?
It is available as a .zip file from http://mysite.verizon.net/beman/btree/btree-preview-1.zip, and as a git repository from http://github.com/Beman/Boost-Btree
Now that we have broken free of the 32bit addressable limit for commodity hardware there is less need for self paging data structures than there was a few years ago.
That may be true, but it has little to do with many of the uses of B-trees. Even when you have enough memory to read an entire container into memory, why would you want to do that if you just want to access a tiny subset in any given time interval? Preloading is useful, and an option in the proposed library, when there is a lot of memory available and you may need access a high proportion of the entries. But it shouldn't be a requirement as that limits use to environments where the amount of memory is large in relationship to the needs of the application. One of the nice characteristics of a B-tree is that they work very well in constrained memory situation, but then can take advantage of lots of memory when available.
Unless there is a way to configure it that gives a performance advantage over std::map...
If the data lives on disk, a B-tree will have much better performance than loading an entire std::map into memory, and then performing sparse operations on it.
I looked into the details a little and I found it funny that you followed the advice I gave on the geometry list a couple days ago in the discussion of the design of a "generic" tree:
"Node could even be declared as a private sub-class of tree to keep people's hands off of it and make it easier to implement by allowing it to share the template parameters of tree."
The implementation of nodes is a messy detail. No the sort of low-level stuff users should have access to.
I notice you made your private branch and leaf types classes instead of structs. Since they are tightly coupled to the tree by virtue of being private member classes I'm not sure they need data hiding to protect their members from the tree. I would have made them structs. You could be right, however, that protecting oneself from oneself is worthwhile. Certainly there is rarely anyone else writing bugs and violating design intent in the details of most libraries.
It will get even messier when there is an alternative note representation for variable length data. So hiding the details, even from other parts of the implementation will be a necessity. Cheers, --Beman

On Wed, Sep 15, 2010 at 9:04 AM, Beman Dawes <bdawes@acm.org> wrote:
A Boost B-tree library would provide disk-based associative containers that scale all the way from really, really, small to really, really, large. B-trees perform well on hardware ranging from ancient floppy disk drives all the way up to humongous disk arrays. They are the technology behind most high-performance disk file systems and databases.
...
Any interest?
Absolutely! Some questions: - Is concurrent access okay if it's opened as read-only? - Right now it's using synchronous I/O. Can async support (via asio) be added? How about memory-mapped I/O for small B-trees or large address spaces? - Will this support multiple B-trees in one file? - Will this implement ACID properties? - Can this be adapted for in-memory use as well, with full non-POD support? Thanks! -- Cory Nelson http://int64.org

On Wed, Sep 15, 2010 at 7:26 PM, Cory Nelson <phrosty@gmail.com> wrote:
- Is concurrent access okay if it's opened as read-only?
Yes, as long as the file isn't being written by a different instance.
- Right now it's using synchronous I/O. Can async support (via asio) be added?
I don't know the answer to that. I would need help from someone who understands async I/O far better than I do.
How about memory-mapped I/O for small B-trees or large address spaces?
I've did some preliminary investigation about five years ago, and memory-mapped I/O looked like it might be cost effective. But I'd give it a lower priority than adding support for variable length keys/mapped_types or concurrency.
- Will this support multiple B-trees in one file?
Not currently. That would be relatively easy to support. What is the motivating use case?
- Will this implement ACID properties?
No. The model is a standard library associative container, not a database. ACID properties can be built on top of B-trees, but the B-trees themselves don't have ACID properties, other than a few aspects like atomic writes.
- Can this be adapted for in-memory use as well, with full non-POD support?
No current plans for that. Why wouldn't you just a standard library associative container for that? Cheers, --Beman

On Wed, Sep 15, 2010 at 6:08 PM, Beman Dawes <bdawes@acm.org> wrote:
On Wed, Sep 15, 2010 at 7:26 PM, Cory Nelson <phrosty@gmail.com> wrote:
- Will this support multiple B-trees in one file?
Not currently. That would be relatively easy to support. What is the motivating use case?
Just ease of distribution, really. I'm currently using a SQLite database to distribute a few tables of read-only data that doesn't need SQL, relational, or ACID properties -- it's just the best thing available in terms of ease of distribution and portability. I'd love to remove all the unneeded layers and just use a B-tree directly.
- Will this implement ACID properties?
No. The model is a standard library associative container, not a database. ACID properties can be built on top of B-trees, but the B-trees themselves don't have ACID properties, other than a few aspects like atomic writes.
Alright, that clears up my understanding about the scope of this. Still sounds great!
- Can this be adapted for in-memory use as well, with full non-POD support?
No current plans for that. Why wouldn't you just a standard library associative container for that?
Storing keys together improves cache usage a lot for small keys, and allows the CPU to detect ordered iteration to prefetch the next cache line. It also reduces fragmentation and the maximum number of pages that must be touched to find a result, which in turn reduces TLB misses and page faults. The tests here indicate that this can be very significant with large collections: http://idlebox.net/2007/stx-btree/stx-btree-0.8.3/doxygen-html/speedtest.htm... -- Cory Nelson http://int64.org

On Wed, Sep 15, 2010 at 10:40 PM, Cory Nelson <phrosty@gmail.com> wrote:
On Wed, Sep 15, 2010 at 6:08 PM, Beman Dawes <bdawes@acm.org> wrote:
On Wed, Sep 15, 2010 at 7:26 PM, Cory Nelson <phrosty@gmail.com> wrote:
- Will this support multiple B-trees in one file?
Not currently. That would be relatively easy to support. What is the motivating use case?
Just ease of distribution, really.
I'm currently using a SQLite database to distribute a few tables of read-only data that doesn't need SQL, relational, or ACID properties -- it's just the best thing available in terms of ease of distribution and portability. I'd love to remove all the unneeded layers and just use a B-tree directly.
Sounds quite useful. I can think of several past apps that would have benefited. I've added this to the Wish List: Multiple B-trees in one file. Requested by Cory Nelson. Possible design: Initial btree_map in the file is an index of its children. Key: name, Data: page id of the child's header page. Recursion allowed; the header page pointed could itself be an index header. Current interface might gain additional constructor and open() overload taking a ref to the parent index and the desired name.
- Will this implement ACID properties?
No. The model is a standard library associative container, not a database. ACID properties can be built on top of B-trees, but the B-trees themselves don't have ACID properties, other than a few aspects like atomic writes.
Alright, that clears up my understanding about the scope of this. Still sounds great!
- Can this be adapted for in-memory use as well, with full non-POD support?
No current plans for that. Why wouldn't you just a standard library associative container for that?
Storing keys together improves cache usage a lot for small keys, and allows the CPU to detect ordered iteration to prefetch the next cache line. It also reduces fragmentation and the maximum number of pages that must be touched to find a result, which in turn reduces TLB misses and page faults. The tests here indicate that this can be very significant with large collections:
That's also the point made by the article Thorsten referenced, http://queue.acm.org/detail.cfm?id=1814327, although it is claiming much greater speedups IIRC.
http://idlebox.net/2007/stx-btree/stx-btree-0.8.3/doxygen-html/speedtest.htm...
Interesting. So someone has already done the work for drop in STL associative container replacements. The timings only covered small trees. 16,000 was the number of elements mentioned. I'm testing with up to 100 million elements, and plan to expand that to a few billion elements. Thanks, --Beman

Zitat von Beman Dawes <bdawes@acm.org>:
http://idlebox.net/2007/stx-btree/stx-btree-0.8.3/doxygen-html/speedtest.htm...
Interesting. So someone has already done the work for drop in STL associative container replacements.
The timings only covered small trees. 16,000 was the number of elements mentioned. I'm testing with up to 100 million elements, and plan to expand that to a few billion elements.
here is a benchmark of various disk-based associative containers, including some B+Trees: http://fallabs.com/tokyocabinet/benchmark.pdf (performed by the author of tokyo cabinet)

On 9/15/2010 10:40 PM, Cory Nelson wrote:
On Wed, Sep 15, 2010 at 6:08 PM, Beman Dawes<bdawes@acm.org> wrote:
On Wed, Sep 15, 2010 at 7:26 PM, Cory Nelson<phrosty@gmail.com> wrote:
- Will this support multiple B-trees in one file? Not currently. That would be relatively easy to support. What is the motivating use case? Just ease of distribution, really.
I'm currently using a SQLite database to distribute a few tables of read-only data that doesn't need SQL, relational, or ACID properties -- it's just the best thing available in terms of ease of distribution and portability. I'd love to remove all the unneeded layers and just use a B-tree directly.
That's funny! I would like it from the opposite direction. I use SQLite but would really like to use SQLite++. Maybe this author could hurry up and get this added to boost. Then someone else could build the ACID, relational and SQL projects on top of this one. One small step for man. One leap for the C++ community. C has SQLite and Java has JavaDB. As a devout C++ programmer, I am starting to feel left out.

Zitat von Jarrad Waterloo <jwaterloo@dynamicquest.com>:
That's funny! I would like it from the opposite direction. I use SQLite but would really like to use SQLite++. Maybe this author could hurry up and get this added to boost. Then someone else could build the ACID, relational and SQL projects on top of this one. One small step for man. One leap for the C++ community. C has SQLite and Java has JavaDB. As a devout C++ programmer, I am starting to feel left out.
regarding database libraries and ACID, there is a library in the sandbox we dubbed "Boost.Transact", which is supposed to handle transactions for boost. its design is similar to the XA distributed transaction specification, but I'm only mentioning this because there is no documentation yet. its syntax is obviously very different. it is based on the input/requirements of 3 planned transactional libraries for boost at this point (transactional STL containers, transactional memory, object persistence). https://svn.boost.org/svn/boost/sandbox/transaction/boost/

El 16/09/2010 3:08, Beman Dawes escribió:
- Can this be adapted for in-memory use as well, with full non-POD support?
No current plans for that. Why wouldn't you just a standard library associative container for that?
I think it's about performance/node overhead (less rebalances, you allocate arrays and not individual nodes). But for memory, T-Trees are the way to go. They are used by many in-memory DBs. http://en.wikipedia.org/wiki/T-tree "In computer science a T-tree is a type of binary tree data structure that is used by main-memory databases, such as Datablitz, eXtremeDB, MySQL Cluster, Oracle TimesTen and KairosMobileLite" Best, Ion

Ion Gaztañaga wrote:
El 16/09/2010 3:08, Beman Dawes escribió:
- Can this be adapted for in-memory use as well, with full non-POD support?
No current plans for that. Why wouldn't you just a standard library associative container for that?
I think it's about performance/node overhead (less rebalances, you allocate arrays and not individual nodes). But for memory, T-Trees are the way to go. They are used by many in-memory DBs.
http://en.wikipedia.org/wiki/T-tree
"In computer science a T-tree is a type of binary tree data structure that is used by main-memory databases, such as Datablitz, eXtremeDB, MySQL Cluster, Oracle TimesTen and KairosMobileLite"
In the same article, however, one reads: "Although T-trees seem to be widely used for main-memory databases, recent research indicates that they actually do not perform better than B-trees on modern hardware." Followed by links to research to justify the claim. Then, again, there's the recent thread on Judy arrays to consider, too. _____ Rob Stewart robert.stewart@sig.com Software Engineer, Core Software using std::disclaimer; Susquehanna International Group, LLP http://www.sig.com IMPORTANT: The information contained in this email and/or its attachments is confidential. If you are not the intended recipient, please notify the sender immediately by reply and immediately delete this message and all its attachments. Any review, use, reproduction, disclosure or dissemination of this message or any attachment by an unintended recipient is strictly prohibited. Neither this message nor any attachment is intended as or should be construed as an offer, solicitation or recommendation to buy or sell any security or other financial instrument. Neither the sender, his or her employer nor any of their respective affiliates makes any warranties as to the completeness or accuracy of any of the information contained herein or that this message or any of its attachments is free of viruses.

2010/9/16 Ion Gaztañaga <igaztanaga@gmail.com>:
El 16/09/2010 3:08, Beman Dawes escribió:
- Can this be adapted for in-memory use as well, with full non-POD support?
No current plans for that. Why wouldn't you just a standard library associative container for that?
I think it's about performance/node overhead (less rebalances, you allocate arrays and not individual nodes). But for memory, T-Trees are the way to go. They are used by many in-memory DBs.
http://en.wikipedia.org/wiki/T-tree
"In computer science a T-tree is a type of binary tree data structure that is used by main-memory databases, such as Datablitz, eXtremeDB, MySQL Cluster, Oracle TimesTen and KairosMobileLite"
Correct me if I'm wrong, but it looks like a T-tree is just a plain binary tree where the data is referenced with pointers and not stored in the tree -- so it's designed to be memory-efficient when the key is larger than a pointer and is already stored elsewhere. Useful for databases that want low-overhead (in terms of key size, not tree implementation) indexes into tables. If that's the case, then it is not applicable here. -- Cory Nelson http://int64.org

El 16/09/2010 13:21, Cory Nelson escribió:
Correct me if I'm wrong, but it looks like a T-tree is just a plain binary tree where the data is referenced with pointers and not stored in the tree -- so it's designed to be memory-efficient when the key is larger than a pointer and is already stored elsewhere. Useful for databases that want low-overhead (in terms of key size, not tree implementation) indexes into tables.
If that's the case, then it is not applicable here.
Yes, you are right, but you could just store values and move semantics would make data movement quite fast (after all, std containers dont' store pointers, but values). Node overhead would be better than with std containers and searches faster (I think you need less comparisons). Insertion and erasure would be slower, I guess. Ion

2010/9/16 Ion Gaztañaga <igaztanaga@gmail.com>:
El 16/09/2010 3:08, Beman Dawes escribió:
- Can this be adapted for in-memory use as well, with full non-POD support?
No current plans for that. Why wouldn't you just a standard library associative container for that?
I think it's about performance/node overhead (less rebalances, you allocate arrays and not individual nodes). But for memory, T-Trees are the way to go. They are used by many in-memory DBs.
http://en.wikipedia.org/wiki/T-tree
"In computer science a T-tree is a type of binary tree data structure that is used by main-memory databases, such as Datablitz, eXtremeDB, MySQL Cluster, Oracle TimesTen and KairosMobileLite"
The point is that for memory resident data, it is better to use a data structure known to be optimal for memory resident data. A B-tree doesn't really qualify. For disk resident data, it is better to use a data structure known to be optimal for disk resident data. The B-tree has no serious competitors for general purpose disk resident associative containers. That's why I'm trying to keep the focus of the B-tree library proposal on disk resident cases, and not get sidetracked into memory resident cases. Something else is probably better for most general in memory uses. Thanks, --Beman

Den 15-09-2010 18:04, Beman Dawes skrev:
A Boost B-tree library would provide disk-based associative containers that scale all the way from really, really, small to really, really, large. B-trees perform well on hardware ranging from ancient floppy disk drives all the way up to humongous disk arrays. They are the technology behind most high-performance disk file systems and databases.
Any interest?
Yes! Also, it would be useful to have a B-heap: http://queue.acm.org/detail.cfm?id=1814327 -Thorsten

On Wed, Sep 15, 2010 at 7:39 PM, Thorsten Ottosen <nesotto@cs.aau.dk> wrote:
Den 15-09-2010 18:04, Beman Dawes skrev:
A Boost B-tree library would provide disk-based associative containers that scale all the way from really, really, small to really, really, large. B-trees perform well on hardware ranging from ancient floppy disk drives all the way up to humongous disk arrays. They are the technology behind most high-performance disk file systems and databases.
Any interest?
Yes!
Also, it would be useful to have a B-heap:
<grin> Yeah, I saw that article. Very interesting. I wonder how many other algorithms could be sped up markedly by clumping data to take advantage of cache characteristics. Once all the data is in memory, my B-tree find() function timing test actually runs a bit faster than a std::map, using the Microsoft compiler and standard library, in spite of all it's cache management overhead. I suspect that's the impact of the CPU memory cache. OTOH, the B-tree version is 20% slower than the standard library map for GCC compiler and its standard library, so it may just be a Microsoft implementation issue. Their timings are about that amount slower than GCC's. Thanks, --Beman

On Wed, Sep 15, 2010 at 6:04 PM, Beman Dawes <bdawes@acm.org> wrote:
Any interest?
Wow! This is one of the most exciting library proposals I've seen. When do you plan to support variable keys/values ? Also, do you plan to provide or have some thoughts on Hash file index or R-Tree ? thanks and regards jose

On Thu, Sep 16, 2010 at 1:37 PM, Jose <jmalv04@gmail.com> wrote:
On Wed, Sep 15, 2010 at 6:04 PM, Beman Dawes <bdawes@acm.org> wrote:
Any interest?
Wow! This is one of the most exciting library proposals I've seen.
Glad it is of interest!
When do you plan to support variable keys/values ?
That's the next major item on my do list. But I've only gotten as far as a conceptual approach, and haven't had a chance to look at the implementation experience of others. I think I read that the usual approach of others was the obvious fixed element length array on each page, pointing to the variable length storage portion of the page, but I don't know if that is still considered state-of-the-art. My old C package just kept the variable length elements ordered on each page, and then did sequential searches! I intend to do better than that this time around.
Also, do you plan to provide or have some thoughts on Hash file index
Hash index: No plans. I've done that, many years ago, and it can be effective. But it is fairly special purpose and probably better done a separate library.
or R-Tree ?
I've done an N-tree on top of a B-tree. (An N-tree is the same thing as a Z-tree, just rotate the image 90 degrees.) Slick. I've worked on a project where another developer added an R-tree on top of my C language B-tree package. Very slick, although note that different people use the term R-tree to mean somewhat different things - we were focused on spatial search - i.e. range trees on minimum bounding rectangles. Bayer is pushing UB-trees, although that may be just a different name for an N or Z tree, perhaps extended to more dimensions. So, yeah, I'm interested. But those would be version 2 add-ons. Cheers, --Beman

Hi Beman, Beman Dawes wrote:
A Boost B-tree library would provide disk-based associative containers that scale all the way from really, really, small to really, really, large.
Any interest?
Yes, of course! A couple of questions: Does anyone have any thoughts about how atomic inserts can be implemented on top of this? I'm not too concerned about recovering from random disk corruption, but I'd like to survive in the case where the program dies (e.g. another thread segfaults!) while inserting a new record. Or, is the implementation "not quite atomic but safe enough" ? I have various lumps of code where the data is either read-only or append-only and this is not a problem. I'm now faced with something where I need proper safe writes; I don't want to have to use SQLlite.... I'm also interested in spatial applications. I have successfully used Z-curve techniques to adapt 1D containers for storing 2D (or higher) values and I'm sure this could also be used here (I guess that's what you mean by "Z or N tree"). Harder is the implementation of range storage (1D or e.g. bounding rectangles); Beman, you mentioned that someone had implemented an R-tree on top of your previous B-tree; can you say something about how that is done? I have sometimes thought about how one could implement an interval tree as a layer on top of a simpler associative container, but have not found a solution. Thanks, Phil.

On Mon, Sep 20, 2010 at 7:48 PM, Phil Endecott <spam_from_boost_dev@chezphil.org> wrote:
Hi Beman,
Beman Dawes wrote:
A Boost B-tree library would provide disk-based associative containers that scale all the way from really, really, small to really, really, large.
Any interest?
Yes, of course! A couple of questions:
Does anyone have any thoughts about how atomic inserts can be implemented on top of this? I'm not too concerned about recovering from random disk corruption, but I'd like to survive in the case where the program dies (e.g. another thread segfaults!) while inserting a new record. Or, is the implementation "not quite atomic but safe enough" ?
The btree classes have a flush() function that forces writing out of internal buffers. That is intended to help provide "not quite atomic but safe enough". Beyond that, I'd like to hear ideas. Is some form of journaling the preferred approach?
I have various lumps of code where the data is either read-only or append-only and this is not a problem. I'm now faced with something where I need proper safe writes; I don't want to have to use SQLlite....
How do SQLlite or other libraries provide safety against another thread crashing? Do they actually stop the other threads until the b-tree modification completes, have some form of recovery, or some other approach? Do they have protection against another process corrupting the B-tree?
I'm also interested in spatial applications. I have successfully used Z-curve techniques to adapt 1D containers for storing 2D (or higher) values and I'm sure this could also be used here (I guess that's what you mean by "Z or N tree").
Yes, at least as I'm using the terms. But I suspect that Bayer's UB-trees may have gone further.
Harder is the implementation of range storage (1D or e.g. bounding rectangles); Beman, you mentioned that someone had implemented an R-tree on top of your previous B-tree; can you say something about how that is done?
It has been 10 or 15 years. I can remember the general principles, which are that keys represent minimum bounding rectangles (MBR's), and that as the search down the tree proceeds you may have visit multiple child nodes if the shape you are looking for spans several MBR's. Very similar to an N-tree or Z-tree. But beyond that I'd have to dredge around in my memory to come up with more details.
I have sometimes thought about how one could implement an interval tree as a layer on top of a simpler associative container, but have not found a solution.
Interesting stuff, that's for sure! --Beman

On Tue, 21 Sep 2010 18:25:16 -0400 Beman Dawes <bdawes@acm.org> wrote:
The btree classes have a flush() function that forces writing out of internal buffers. That is intended to help provide "not quite atomic but safe enough".
Beyond that, I'd like to hear ideas. Is some form of journaling the preferred approach?
That's how Tokyo Cabinet handles it, if I remember correctly. -- Chad Nelson Oak Circle Software, Inc. * * *

Beman Dawes wrote:
On Mon, Sep 20, 2010 at 7:48 PM, Phil Endecott <spam_from_boost_dev@chezphil.org> wrote:
Does anyone have any thoughts about how atomic inserts can be implemented on top of this? ?I'm not too concerned about recovering from random disk corruption, but I'd like to survive in the case where the program dies (e.g. another thread segfaults!) while inserting a new record. ?Or, is the implementation "not quite atomic but safe enough" ?
The btree classes have a flush() function that forces writing out of internal buffers. That is intended to help provide "not quite atomic but safe enough".
Beyond that, I'd like to hear ideas. Is some form of journaling the preferred approach?
How do SQLlite or other libraries provide safety against another thread crashing? Do they actually stop the other threads until the b-tree modification completes, have some form of recovery, or some other approach? Do they have protection against another process corrupting the B-tree?
There are a couple of pages here that describe how SQLite does it: http://www.sqlite.org/atomiccommit.html http://www.sqlite.org/wal.html I think those are worth reading. It's also possible to find docs about the MVCC system that PostgreSQL uses, but that seems to have a lot of functionality for multiple concurrent transactions that isn't needed here. I've tried to find something about how Berkely DB does it but without success; does anyone know anything about that? The important question for now is whether this sort of thing can be implemented as a layer on top of your library, or whether something more intrusive is needed. I think that the SQLite journals work in terms of disk blocks, which are not units that are exposed in the application API. Perhaps what is needed is the ability to insert a layer between your code and the filesystem; SQLite has a "VFS" that I think may do something like this.
Harder is the implementation of range storage (1D or e.g. bounding rectangles); Beman, you mentioned that someone had implemented an R-tree on top of your previous B-tree; can you say something about how that is done?
It has been 10 or 15 years. I can remember the general principles, which are that keys represent minimum bounding rectangles (MBR's), and that as the search down the tree proceeds you may have visit multiple child nodes if the shape you are looking for spans several MBR's. Very similar to an N-tree or Z-tree. But beyond that I'd have to dredge around in my memory to come up with more details.
As it happens I also found some R-tree material on the SQLite web site: http://www.sqlite.org/rtree.html It looks, though I am not certain, as if they have implemented R*-trees on top of their regular tables by creating a "virtual tree", i.e. there are database rows that define parent-child relationships. I think this reduces the performance from O(log N) to O((log N)^2), but I haven't worked out the details. Regards, Phil.

On Tue, Sep 21, 2010 at 3:25 PM, Beman Dawes <bdawes@acm.org> wrote:
How do SQLlite or other libraries provide safety against another thread crashing? Do they actually stop the other threads until the b-tree modification completes, have some form of recovery, or some other approach? Do they have protection against another process corrupting the B-tree?
I believe this is how it's done: 1) Put any changes into new or free pages -- don't ever overwrite any old data. 2) Write a list of "free" page indexes -- the indexes of the header and ones that (1) replaced. 3) Fsync. 4) Write a new header to the end of the file. 5) Fsync. If you start an operation and a valid header isn't found at the end of the file, just read in reverse until you find one. This necessitates extra code to encapsulate multiple operations in transactions, because fsync is expensive. You still need to maintain a reader-writer lock on top of the file. -- Cory Nelson http://int64.org

On Wed, Sep 15, 2010 at 9:04 AM, Beman Dawes <bdawes@acm.org> wrote:
A Boost B-tree library would provide disk-based associative containers that scale all the way from really, really, small to really, really, large. B-trees perform well on hardware ranging from ancient floppy disk drives all the way up to humongous disk arrays. They are the technology behind most high-performance disk file systems and databases.
Hi Beman, any updates on this? Or are you too busy with the coming Boost release? -- Cory Nelson http://int64.org

On Tue, Nov 16, 2010 at 7:04 AM, Cory Nelson <phrosty@gmail.com> wrote:
On Wed, Sep 15, 2010 at 9:04 AM, Beman Dawes <bdawes@acm.org> wrote:
A Boost B-tree library would provide disk-based associative containers that scale all the way from really, really, small to really, really, large. B-trees perform well on hardware ranging from ancient floppy disk drives all the way up to humongous disk arrays. They are the technology behind most high-performance disk file systems and databases.
Hi Beman, any updates on this? Or are you too busy with the coming Boost release?
Haven't done anything in the last month, due mostly to a major house renovation in-process, and helping with the Library Working Group organization during the C++ committee meeting last week. You can see earlier progress by taking a look at https://github.com/Beman/Boost-Btree/commits/master. A lot of effort has gone into test and timing code, and it has been awhile now since any bugs surfaced. My next work item is support for variable length data. It will probably be another month before there is much to report on that. In the meantime, see boost/btree/detail/fixstr.hpp for a little string class that has a fixed maximum length. Although it wastes space compared to true variable length I/O-able data, it would be fine for loads of applications. --Beman
participants (14)
-
Beman Dawes
-
Chad Nelson
-
Cory Nelson
-
Francesco Nidito
-
Giorgio Zoppi
-
Gottlob Frege
-
Ion Gaztañaga
-
Jarrad Waterloo
-
Jose
-
Phil Endecott
-
Simonson, Lucanus J
-
Stefan Strasser
-
Stewart, Robert
-
Thorsten Ottosen