Re: [boost] NuDB: A fast key/value insert-only database for SSD drives in C++11

26 Mar 2017

      On Sat, 25 Mar 2017 16:22:50 -0400
Vinnie Falco via Boost <boost@lists.boost.org> wrote:
...
On Sat, Mar 25, 2017 at 4:01 PM, Lee Clagett via Boost
<boost@lists.boost.org> wrote:
...
The other responses to this thread reiterated what I thought could
occur - there should be corruption "races" from a write call to file
sync completion.
NuDB makes the same assumptions regarding the underlying file system
capabilities as SQLite. In particular, if there are two calls to fsync
in a row, it assumes that the first fsync will complete before the
second one starts. And that upon return from a successful call to
fsync, the data has been written.
I think SQLite makes more stringent assumptions - that between the
write and the sync the file metadata and sectors can be written in any
order. And one further - that a single sector could be partially
written but only sequentially forwards or backwards. This last
assumption sounds like a legacy assumption from spinning disks.

I have been (poorly) suggesting that the order of the new buckets in the
log and log file header be swapped. Writing the header which fits in a
single sector is _more_ likely to be all-or-nothing than the bucket
records. At least that's my understanding of this insane situation.

The recover method in NuDB appears to be assuming that if it can read
the entire bucket in the log, that its valid. Using the header to
"post" completed changes seems like it has a smaller window for
failure. If the header was not written correctly, truncate.
...
When there is a power loss or device failure, it is possible that
recent insertions are lost. The library only guarantees that there
will be no corruption. Specifically, any insertions which happen after
a commit, might be rolled back if the recover process is invoked.
Since the commit process runs every second, not much will be lost.
...
Writing the blocks to the log file are superfluous because it is
writing to multiple sectors and there is no mechanism to detect a
partial write after power failure.
Hmm, I don't think there's anything superfluous in this library. The
log file is a "rollback file." It contains blocks from the key file in
the state they were in before being modified. During the commit phase,
nothing in the key file is modified until all of the blocks intended
to be modified are first backed up to the log file. If the power goes
out while these blocks are written to the log file, there is no loss.
I've proven twice in this thread I cannot read code quickly. Crap. I
thought the key file was append-only like the data file, which is not
possible. No idea why I thought this, other than it would make the
implementation easier. I think the only way to get rid of the log file
in this design is via a COW filesystem (but I have no experience with
those).
...
...
I jumped into the internal fetch function which was sorted within a
single bucket and had a linked list of spills. Reading the README
first would've made it clear that there was more to the
implementation.
The documentation for NuDB needs work! I can only vouch for the
maturity of the source code, not the documentation :)
...
So the worst case performance is a link-list if a hash collision.
Right, although the creation parameters are tuned such that less than
1% of buckets have 1 spill record, and 0% of buckets have 2 or more
spill records.
...
Was the primary decision for the default hash implementation
performance?
If you're talking about xxhasher, it was chosen for being the best
balance of performance, good distribution properties, and decent
security. NuDB was designed to handle adversarial inputs since most
envisioned use-cases insert data from untrusted sources / the network.
Yes, I asked because I was curious about the security analysis. The only
security analysis I could find quickly was the SMHasher test, and it
got the same score as `murmurhash 3a`. Murmurhash 3 was shown to
vulnerable to collisions, even with a random seed. I could not find the
significance of the 'a' (i.e. `3a`) listed in the table provided by
xxHash.

Lee