On Tue, 21 Mar 2017 23:08:30 -0400, Vinnie Falco via Boost wrote:
On Tue, Mar 21, 2017 at 10:50 PM, Lee Clagett via Boost wrote:
[...snip...]
The pathological case is power loss, which cannot be unit tested easily (and probably isn't possible).
I think this can be unit tested, and I believe that NuDB's unit test covers the case of power loss. I think we can agree that power loss on a read is uninteresting (since it can't corrupt data). The unit test models a power loss as a fatal error during a write. The test exercises all possible fatal errors using an incremental approach (I alluded to this in my previous message).
The database template requires a File type to instantiate:

    template<class Hasher, class File>
    class basic_store;

NuDB comes with posix_file and win32_file. The unit test uses its own fail_file which wraps the native file and generates a simulated error on the Nth write: https://github.com/vinniefalco/NuDB/blob/master/extras/nudb/test/fail_file.h...
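To make the mechanism concrete, the idea behind fail_file is roughly the following (a simplified sketch, not the actual code; the member names and the exact File-concept signatures are my approximations):

    #include <cstddef>
    #include <cstdint>
    #include <system_error>

    // Counts writes and reports failure once the Nth one is reached.
    class fail_counter
    {
        std::size_t target_ = 0; // fail on this write number (0 = never)
        std::size_t count_ = 0;  // writes seen so far
    public:
        void reset(std::size_t n) { target_ = n; count_ = 0; }
        bool fail() { return target_ != 0 && ++count_ >= target_; }
    };

    // Wraps a native File and injects a simulated fatal error on writes.
    template<class File>
    class fail_file
    {
        File f_;
        fail_counter* c_;
    public:
        explicit fail_file(fail_counter& c) : c_(&c) {}

        void write(std::uint64_t offset, void const* buffer,
                   std::size_t bytes, std::error_code& ec)
        {
            if(c_->fail())
            {
                // Model power loss: the write never happens and the
                // caller sees a fatal error.
                ec = std::make_error_code(std::errc::io_error);
                return;
            }
            f_.write(offset, buffer, bytes, ec);
        }

        // read(), sync(), trunc(), etc. forward to f_ unchanged.
    };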
The test implementation tries to insert a bunch of records into a new database using a fail_file, incrementing the counter as it goes. On a failure it tries to recover (which can also result in a failed write). After the recover fails, it verifies that the database is still consistent. You can see the test implementation: https://github.com/vinniefalco/NuDB/blob/master/test/recover.cpp#L125
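The control flow of that test, as I understand it, is a loop like the skeleton below. The three helpers are stand-ins for the real insert/recover/verify steps in recover.cpp; they are not NuDB calls:

    #include <cstddef>
    #include <stdexcept>

    bool insert_batch(std::size_t n);  // insert records, failing on the n-th write;
                                       // returns true if every write completed
    void try_recover(std::size_t n);   // run recovery, which may also hit write n
    bool database_is_consistent();     // reopen and verify all invariants

    void recover_test()
    {
        for(std::size_t n = 1; ; ++n)
        {
            bool completed = insert_batch(n);
            if(!completed)
                try_recover(n);
            if(!database_is_consistent())
                throw std::runtime_error("inconsistent after simulated power loss");
            if(completed)
                break; // failure point n is past the last write: every point covered
        }
    }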
I *think* this test proves that the implementation maintains consistency on any power loss. I would very much like anyone who is interested to tell me whether this approach works.
The other responses in this thread reiterated what I thought could occur - there can be corruption "races" in the window between a write call and completion of the file sync. Writing the blocks to the log file is superfluous, because that write spans multiple sectors and there is no mechanism to detect a partial write after a power failure, so the information cannot be read back reliably. If those writes were omitted, the presence of the log file header alone indicates whether a commit failed to complete, and that header should fit in a single sector. The key and data file sizes in the header indicate what portion at the end of the key and data files should be ignored.

If the log file header were moved to the beginning of the key file, then after the records are appended and sync'ed the header could be overwritten in place with the new file lengths. The key file would not need to be opened with the `_POSIX_SYNCHRONIZED_IO` flag, and the fsync after writing the header could be omitted (which means a commit also gets "relaxed" disk flushing). This should improve commit performance.

At this point I'm sure Niall will tell me that writing <512 bytes to the beginning of the file is also not an atomic operation - even though that is exactly what the SQLite documentation assumes. The power-loss recovery algorithm gets tougher without that assumption.
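To spell out the commit sequence I have in mind (a sketch of the proposal above, not NuDB's current format; the header layout and file-descriptor plumbing are only illustrative):

    #include <cstdint>
    #include <unistd.h>   // pwrite, fsync

    // Lives in the first sector of the key file; must stay under 512 bytes.
    struct commit_header
    {
        std::uint64_t key_file_size;  // valid bytes in the key file
        std::uint64_t dat_file_size;  // valid bytes in the data file
    };

    // kfd/dfd are POSIX descriptors for the key and data files. The new
    // records have already been appended past the old sizes.
    bool commit(int kfd, int dfd, commit_header const& new_sizes)
    {
        // 1. Make the appended records durable before publishing them.
        if(fsync(dfd) != 0 || fsync(kfd) != 0)
            return false;

        // 2. Publish them by overwriting the single-sector header in place.
        //    No fsync afterwards ("relaxed" flushing); correctness rests on
        //    the SQLite-style assumption that a <512-byte write to one
        //    sector is atomic, which is exactly the point in question.
        return pwrite(kfd, &new_sizes, sizeof(new_sizes), 0)
            == static_cast<ssize_t>(sizeof(new_sizes));
    }

On recovery, anything in the key and data files past the sizes recorded in the header is simply ignored.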
- The key-file is append-only, so NuDB has multiple blocks of sorted records. So hopefully the insertions do not create lots of blocks, or the algorithm becomes a linear search.
There's no sorting going on here; the key file is an on-disk hash table where each bucket holds a good-sized number of keys (so that a single read from the key file does the most work).
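In code terms that lookup is roughly the following (my sketch of the scheme as described, not NuDB's actual fetch; bucket, read_bucket and read_spill are hypothetical stand-ins for block-sized reads from the files):

    #include <cstdint>
    #include <vector>

    struct entry  { std::uint64_t hash, offset, size; };
    struct bucket
    {
        std::vector<entry> entries; // all keys hashing to this bucket
        std::uint64_t spill = 0;    // 0, or location of an overflow record
    };

    // One bucket read answers the common case; only an overflowing bucket
    // chases the spill chain (the linked list mentioned below).
    template<class Store>
    bool fetch(Store& store, std::uint64_t h, std::uint64_t nbuckets, entry& out)
    {
        bucket b = store.read_bucket(h % nbuckets); // single key-file read
        for(;;)
        {
            for(auto const& e : b.entries)  // scan (or binary-search) the bucket
                if(e.hash == h)             // candidate; caller confirms the key bytes
                {
                    out = e;
                    return true;
                }
            if(b.spill == 0)
                return false;               // no overflow: not found
            b = store.read_spill(b.spill);  // worst case: follow spill records
        }
    }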
I jumped into the internal fetch function, which searches sorted entries within a single bucket and follows a linked list of spills. Reading the README first would've made it clear that there was more to the implementation. So the worst-case performance is a linked-list traversal when hashes collide. Was performance the primary reason for choosing the default hash implementation?

Lee