On Sun, 26 Mar 2017 12:11:14 +0100, Niall Douglas via Boost wrote:
On 26/03/2017 05:08, Lee Clagett via Boost wrote:
On Sat, 25 Mar 2017 16:22:50 -0400, Vinnie Falco via Boost wrote:
On Sat, Mar 25, 2017 at 4:01 PM, Lee Clagett via Boost wrote:
The other responses to this thread reiterated what I thought could occur - that there is a window for corruption "races" between a write call and completion of the file sync.
NuDB makes the same assumptions regarding the underlying file system capabilities as SQLite. In particular, if there are two calls to fsync in a row, it assumes that the first fsync will complete before the second one starts. And that upon return from a successful call to fsync, the data has been written.
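Roughly, the contract being assumed is something like the following (a minimal sketch, error handling omitted; not actual NuDB code):

    // Assumed fsync contract: once fsync() returns 0, everything
    // previously written through this fd is on stable storage, and a
    // second fsync() begins only after the first has completed.
    #include <fcntl.h>
    #include <unistd.h>

    int main()
    {
        int fd = ::open("example.dat", O_WRONLY | O_CREAT, 0644);
        ::write(fd, "block-0", 7);
        ::fsync(fd);   // assumption: "block-0" is durable on return
        ::write(fd, "block-1", 7);
        ::fsync(fd);   // assumption: starts only after the first
                       // fsync completed; "block-1" durable on return
        ::close(fd);
    }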
I think SQLite makes more pessimistic assumptions - that between the write and the sync, the file metadata and sectors can be written in any order. And one further - that a single sector could be partially written, but only sequentially forwards or backwards. That last assumption sounds like a legacy of spinning disks.
If you want durability across multiple OSs and filing systems, you need to assume that fsyncs are reordered with respect to one another. All major databases assume this, and so should NuDB, or else NuDB needs to remove all claims regarding durability of any kind. In fact, you might as well remove the fsync code path entirely; from everything I've seen to date, its presence provides a false sense of assurance, which is much worse than providing no guarantees at all.
You snipped my suggestion after this for swapping the order of the header write, which assumed/implied `_POSIX_SYNCHRONIZED_IO` for the reasons you established previously. The only caveat you gave earlier was that Linux does not implement this properly for metadata flushing on some of its file systems. Which means the log file could be "empty" after a restart in the current implementation, even if both fsync checkpoints completed. At that point my hope was that the sector metadata for the log file header remained unchanged on the overwrite, and also that opening the log file on initialization with a zeroed header would provide ample time for a metadata flush. That still leaves a race condition, just with a longer window, AND of course there are the COW filesystems, which have to relocate the sector by design! And since `_POSIX_SYNCHRONIZED_IO` is (apparently) partially meaningless on Linux, BTRFS is madness. So the write-order swap suggestion _possibly_ improves durability on some subset of filesystems. But most users are going to be on ext4 Linux anyway, which sounds like one of the problematic cases.
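For reference, POSIX does expose a per-file probe for this via fpathconf() - though, as above, a positive answer on Linux does not necessarily cover metadata on every filesystem, so the sketch below is only a hint, not a guarantee:

    // Sketch: ask whether synchronized I/O may be performed on the
    // associated file. _PC_SYNC_IO is the standard POSIX query that
    // corresponds to the synchronized I/O capability discussed above.
    #include <unistd.h>

    bool claims_sync_io(int fd)
    {
        long r = ::fpathconf(fd, _PC_SYNC_IO);
        return r > 0;   // -1 means unsupported or indeterminate
    }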
Additionally, strongly consider an O_SYNC based design instead of an fsync based design. fsync() performs pathologically badly on copy-on-write filing systems; it unavoidably forces a full RCU cycle of multiple blocks. Opening the fd with O_SYNC causes COW filing systems to use an alternate caching algorithm, one without the pathological performance.
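Something along these lines (a sketch against raw POSIX, not NuDB's own file abstraction):

    // O_SYNC-based alternative: each write() completes only after the
    // data (and the metadata needed to retrieve it) has reached stable
    // storage, instead of batching writes and calling fsync() later.
    #include <fcntl.h>
    #include <unistd.h>

    int open_sync(char const* path)
    {
        return ::open(path, O_WRONLY | O_CREAT | O_SYNC, 0644);
    }

    // O_DSYNC, where available, is the data-only variant.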
Note that O_SYNC on some filing systems still has the metadata reordering problem. You should always assume that fsync/O_SYNC writes are reordered with respect to one another across inodes. They are only sequentially consistent within the same inode when performed on the same fd.
Again, I'd personally recommend you just remove all durability claims entirely, and remove the code claiming to implement them as unnecessary overhead. You need to start with a design that assumes the filing system reorders everything all the time; it can't be retrofitted.
When there is a power loss or device failure, it is possible that recent insertions are lost. The library only guarantees that there will be no corruption. Specifically, any insertions which happen after a commit might be rolled back if the recover process is invoked. Since the commit process runs every second, not much will be lost.
Writing the blocks to the log file is superfluous because it writes to multiple sectors and there is no mechanism to detect a partial write after power failure.
Hmm, I don't think there's anything superfluous in this library. The log file is a "rollback file": it contains blocks from the key file in the state they were in before being modified. During the commit phase, nothing in the key file is modified until all of the blocks intended to be modified are first backed up to the log file. If the power goes out while these blocks are being written to the log file, there is no loss.
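In outline, the commit phase looks like this (a sketch of the sequence just described; the helper names are placeholders, not the library's actual API):

    #include <unistd.h>

    void append_old_blocks_to_log(int key_fd, int log_fd); // placeholder
    void write_new_blocks(int key_fd);                      // placeholder

    void commit(int key_fd, int log_fd)
    {
        append_old_blocks_to_log(key_fd, log_fd); // back up blocks first
        ::fsync(log_fd);          // rollback data must be durable
                                  // before the key file is touched
        write_new_blocks(key_fd); // only now modify the key file
        ::fsync(key_fd);

        ::ftruncate(log_fd, 0);   // commit complete; rollback data is
        ::fsync(log_fd);          // no longer needed
    }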
You can never assume writes to one inode will reach storage before another in portable code. You can only assume in portable code that writes to the same inode via the same fd will reach storage in the order issued.
You chopped my response here too; I think this was in response to the COW + inode suggestion. If the design knew COW was available for the filesystem in use, couldn't it also know whether data + metadata are synchronized as expected? The suggestion clearly was not portable anyway.
Was performance the primary reason for the choice of default hash implementation?
If you're talking about xxhasher, it was chosen for being the best balance of performance, good distribution properties, and decent security. NuDB was designed to handle adversarial inputs since most envisioned use-cases insert data from untrusted sources / the network.
In which case you did not make a great choice.
Much, much better would be Blake2b: 2 cycles/byte, cryptographically secure, and the expected time to find a collision exceeds the life of the universe.
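As a sketch of what that could look like with libsodium's crypto_generichash() (Blake2b) - the seed-constructor-plus-operator() shape below is my guess at NuDB's Hasher requirements, so treat it as illustrative rather than a verified drop-in:

    #include <sodium.h>
    #include <cstdint>
    #include <cstring>

    class blake2b_hasher
    {
        // Zero-padded 32-byte Blake2b key derived from the 64-bit seed.
        unsigned char key_[crypto_generichash_KEYBYTES] = {};

    public:
        explicit blake2b_hasher(std::uint64_t seed) noexcept
        {
            std::memcpy(key_, &seed, sizeof(seed));
        }

        std::uint64_t operator()(void const* data, std::size_t size) const noexcept
        {
            unsigned char out[16]; // smallest documented digest size
            crypto_generichash(out, sizeof(out),
                static_cast<unsigned char const*>(data), size,
                key_, sizeof(key_));
            std::uint64_t h;
            std::memcpy(&h, out, sizeof(h)); // truncate digest to 64 bits
            return h;
        }
    };

    // sodium_init() should be called once before first use.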
Niall
Lee