
On Sat, Mar 25, 2017 at 4:01 PM, Lee Clagett via Boost
The other responses to this thread reiterated what I thought could occur - there should be corruption "races" from a write call to file sync completion.
NuDB makes the same assumptions about the underlying file system's capabilities as SQLite. In particular, if there are two calls to fsync in a row, it assumes that the first fsync will complete before the second one starts, and that upon return from a successful call to fsync, the data has been written. When there is a power loss or device failure, it is possible that recent insertions are lost. The library only guarantees that there will be no corruption. Specifically, any insertions which happen after a commit might be rolled back if the recovery process is invoked. Since the commit process runs every second, not much will be lost.
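To make the fsync assumption concrete, here is a minimal sketch in Python. It is not NuDB code; `durable_append` is a hypothetical helper, and it simply assumes, as described above, that a successful fsync means the data has reached the device before the next write begins.

```python
import os
import tempfile


def durable_append(path, payload):
    """Append payload and return only after fsync reports success.

    Hypothetical helper for illustration; not part of NuDB.
    """
    with open(path, "ab") as f:
        f.write(payload)
        f.flush()             # move Python's buffer into the OS
        os.fsync(f.fileno())  # ask the OS to push the data to the device


# Two appends in a row: the first fsync returns before the second write starts.
workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "example.dat")
durable_append(path, b"first")
durable_append(path, b"second")

with open(path, "rb") as f:
    contents = f.read()
```

Whether the device really honors the sync is outside the program's control, which is why this is stated as an assumption rather than a guarantee.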
Writing the blocks to the log file is superfluous because it is writing to multiple sectors and there is no mechanism to detect a partial write after power failure.
Hmm, I don't think there's anything superfluous in this library. The log file is a "rollback file." It contains blocks from the key file in the state they were in before being modified. During the commit phase, nothing in the key file is modified until all of the blocks intended to be modified are first backed up to the log file. If the power goes out while these blocks are written to the log file, there is no loss.
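As a rough illustration of the rollback idea, here is a small in-memory model in Python. All names (`commit`, `recover`, `crash_after`) are made up for the sketch, and the lists stand in for the key file and log file; it does not reflect NuDB's actual API or on-disk format, and it does not model a torn write to the log itself.

```python
# In-memory model of a rollback log. Hypothetical names throughout;
# this is not NuDB's API or file format.

def commit(key_file, log_file, changes, crash_after=None):
    """Apply changes (block index -> new bytes) to key_file.

    Phase 1 backs up every block that will be modified to log_file.
    Phase 2 overwrites the blocks in key_file.
    crash_after simulates power loss after that many phase-2 writes.
    Returns True if the commit completed, False if it "crashed".
    """
    # Phase 1: save the original contents before touching the key file.
    for index in changes:
        log_file.append((index, key_file[index]))
    # A real implementation would sync the log to disk at this point.

    # Phase 2: modify the key file in place.
    for written, (index, data) in enumerate(changes.items()):
        if crash_after is not None and written == crash_after:
            return False  # power lost mid-commit; the log is left behind
        key_file[index] = data

    # Commit finished, so the backups are no longer needed.
    log_file.clear()
    return True


def recover(key_file, log_file):
    """Restore every backed-up block, then discard the log."""
    for index, original in log_file:
        key_file[index] = original
    log_file.clear()


key_file = [b"A0", b"B0", b"C0"]
log_file = []
completed = commit(key_file, log_file, {0: b"A1", 2: b"C1"}, crash_after=1)
torn_state = list(key_file)  # block 0 updated, block 2 not
recover(key_file, log_file)
```

After the simulated crash the key file is half-updated, and `recover` puts both blocks back to their pre-commit contents. If the crash had happened during phase 1 instead, the key file would not have been touched at all.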
I jumped into the internal fetch function which was sorted within a single bucket and had a linked list of spills. Reading the README first would've made it clear that there was more to the implementation.
The documentation for NuDB needs work! I can only vouch for the maturity of the source code, not the documentation :)
So the worst-case performance is a linked list if there is a hash collision.
Right, although the creation parameters are tuned such that less than 1% of buckets have 1 spill record, and 0% of buckets have 2 or more spill records.
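To show why spills only matter in the worst case, here is a toy model in Python of a fixed-capacity bucket with a chain of spill records. The `Bucket` class, its `capacity` parameter, and the use of a dict inside each bucket are assumptions made for the sketch, not NuDB's actual layout.

```python
# Toy model of a bucket with a chain of spill records.
# Hypothetical structure for illustration; not NuDB's format.

class Bucket:
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = {}  # key -> value, at most `capacity` items
        self.spill = None  # next Bucket in the chain, if any

    def insert(self, key, value):
        bucket = self
        while True:
            if key in bucket.entries or len(bucket.entries) < bucket.capacity:
                bucket.entries[key] = value
                return
            if bucket.spill is None:
                bucket.spill = Bucket(bucket.capacity)  # overflow: add a spill
            bucket = bucket.spill

    def fetch(self, key):
        """Return (value, buckets_visited); value is None if not found."""
        bucket, visited = self, 0
        while bucket is not None:
            visited += 1
            if key in bucket.entries:
                return bucket.entries[key], visited
            bucket = bucket.spill
        return None, visited


bucket = Bucket(capacity=2)
for key, value in [("a", 1), ("b", 2), ("c", 3)]:
    bucket.insert(key, value)

hit_direct = bucket.fetch("a")   # found in the bucket itself
hit_spill = bucket.fetch("c")    # found after following one spill
miss = bucket.fetch("z")         # walked the whole chain
```

A fetch costs one extra read per spill record followed, so with almost no buckets having even one spill, the chain walk is rarely taken.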
Was the primary decision for the default hash implementation performance?
If you're talking about xxhasher, it was chosen for being the best balance of performance, good distribution properties, and decent security. NuDB was designed to handle adversarial inputs since most envisioned use-cases insert data from untrusted sources / the network.