
On 26/03/2017 05:08, Lee Clagett via Boost wrote:
On Sat, 25 Mar 2017 16:22:50 -0400 Vinnie Falco via Boost
wrote: On Sat, Mar 25, 2017 at 4:01 PM, Lee Clagett via Boost
wrote: The other responses to this thread reiterated what I thought could occur - there should be corruption "races" from a write call to file sync completion.
NuDB makes the same assumptions regarding the underlying file system capabilities as SQLite. In particular, if there are two calls to fsync in a row, it assumes that the first fsync will complete before the second one starts. And that upon return from a successful call to fsync, the data has been written.
I think SQLite makes more stringent assumptions - that between the write and the sync the file metadata and sectors can be written in any order. And one further - that a single sector could be partially written but only sequentially forwards or backwards. This last assumption sounds like a legacy assumption from spinning disks.
If you want durability across multiple OSs and filing systems, you need to assume that fsync's are reordered with respect to one another. All major databases assume this, and so should NuDB, or else NuDB needs to remove all claims regarding durability of any kind. In fact, you might as well remove the fsync code path entirely, from everything I've seen to date its presence provides a false sense of assurance which is much worse than providing no guarantees at all. Additionally, strongly consider a O_SYNC based design instead of an fsync based design. fsync() performs pathologically awful on copy-on-write filing systems, it unavoidably forces a full RCU cycle of multiple blocks. Opening the fd with O_SYNC causes COW filing systems to use an alternate caching algorithm, one without pathological performance. Note that O_SYNC on some filing systems still has the metadata reordering problem. You should always assume that fsync/O_SYNC writes are reordered with respect to one another across inodes. They are only sequentially consistent within the same inode when performed on the same fd. Again, I'd personally recommend you just remove all durability claims entirely, and remove the code claiming to implement it as an unnecessary overhead. You need to start with a design that assumes the filing system reorders everything all the time, it can't be retrofitted.
When there is a power loss or device failure, it is possible that recent insertions are lost. The library only guarantees that there will be no corruption. Specifically, any insertions which happen after a commit, might be rolled back if the recover process is invoked. Since the commit process runs every second, not much will be lost.
Writing the blocks to the log file are superfluous because it is writing to multiple sectors and there is no mechanism to detect a partial write after power failure.
Hmm, I don't think there's anything superfluous in this library. The log file is a "rollback file." It contains blocks from the key file in the state they were in before being modified. During the commit phase, nothing in the key file is modified until all of the blocks intended to be modified are first backed up to the log file. If the power goes out while these blocks are written to the log file, there is no loss.
You can never assume writes to one inode will reach storage before another in portable code. You can only assume in portable code that writes to the same inode via the same fd will reach storage in the order issued.
Was the primary decision for the default hash implementation performance?
If you're talking about xxhasher, it was chosen for being the best balance of performance, good distribution properties, and decent security. NuDB was designed to handle adversarial inputs since most envisioned use-cases insert data from untrusted sources / the network.
In which case you did not make a great choice. Much, much better would be Blake2b. 2 cycles/byte, cryptographically secure, collision probability exceeds life of the universe. Niall -- ned Productions Limited Consulting http://www.nedproductions.biz/ http://ie.linkedin.com/in/nialldouglas/