
On Wed, 29 Mar 2017 18:06:02 +0100
Niall Douglas via Boost wrote:

> On 29/03/2017 17:32, Lee Clagett via Boost wrote:
> > Read this [paper on crash-consistent applications][0]. Table 1 on page 5
>
> I particularly like the sentence:
>
> "However, not issuing such an fsync() is perhaps more safe in modern file systems than out-of-order persistence of directory operations. We believe the developers’ interest in fixing this problem arises from the Linux documentation explicitly recommending an fsync() after creating a file."
I think in this instance the authors were referring to the recommendation to fsync a file after creation. The paper is primarily about the properties of the filesystem, not lies by the hardware. Later on they comment that developers generally disregard fsync as unreliable, but that it's possible the root cause of their problems is an incorrect assumption about filesystem properties/behavior (based on the number of problems they found in common software).
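For reference, the recommended pattern is roughly the following. This is only a minimal sketch of the man-page advice (flush the file, then flush the containing directory so the new directory entry is persisted too), nothing NuDB-specific, and error handling is omitted:

```cpp
// Minimal sketch of the "fsync() after creating a file" advice: flush the
// file's contents, then flush the containing directory so the new directory
// entry also reaches disk. Error handling omitted for brevity.
#include <cstddef>
#include <fcntl.h>
#include <unistd.h>

void create_file_durably(const char* dir_path, const char* file_path,
                         const void* data, std::size_t size)
{
    int fd = ::open(file_path, O_CREAT | O_WRONLY | O_TRUNC, 0644);
    ::write(fd, data, size);
    ::fsync(fd);                                  // persist file data + metadata
    ::close(fd);

    int dirfd = ::open(dir_path, O_RDONLY | O_DIRECTORY);
    ::fsync(dirfd);                               // persist the directory entry
    ::close(dirfd);
}
```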
> I agree with them. fsync() gives false assurance. Better to not use it, and certainly never rely on it.
>
> > should be of particular interest. I _think_ the bucket portion of NuDB's log has no size constraint, so its algorithm is either going to be "single sector append", "single block append", or "multi-block append/writes" depending on the total size of the buckets. The algorithm is always problematic when metadata journaling is disabled. Your assumptions about fsync have not been violated to achieve those inconsistencies.
>
> One of my biggest issues with NuDB is the log file. Specifically, it's worse than useless; it actively interferes with database integrity.
>
> If you implemented NuDB as a simple data file and a memory-mapped key file, and always atomically appended transactions to the data file when inserting items, then after power loss you could check whether the key file mentions extents not possible given the size of the data file. You can then rebuild the key file simply by replaying through the data file, being careful to ignore any truncated final append.
I think this is trading an atomic `truncate(0)` assumption for an atomic multi-block overwrite assumption. So this seems like something that is more likely to have a torn write that is hard to notice.
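To make that concern concrete, here is a rough sketch of the rebuild pass as I understand the proposal, assuming a hypothetical length-prefixed record format for the data file (NuDB's actual on-disk layout differs; this is illustrative only):

```cpp
// Rough sketch of rebuilding the key index by replaying the data file.
// Assumes a hypothetical record format of [u32 key size][u32 value size]
// [key bytes][value bytes] appended sequentially.
#include <cstdint>
#include <fstream>
#include <map>
#include <string>

std::map<std::string, std::uint64_t> rebuild_key_index(const char* data_path)
{
    std::map<std::string, std::uint64_t> index;  // key -> offset of its record
    std::ifstream in(data_path, std::ios::binary | std::ios::ate);
    if (!in)
        return index;

    const std::uint64_t file_size = static_cast<std::uint64_t>(in.tellg());
    in.seekg(0);

    std::uint64_t offset = 0;
    while (offset + 2 * sizeof(std::uint32_t) <= file_size)
    {
        std::uint32_t ksize = 0, vsize = 0;
        in.read(reinterpret_cast<char*>(&ksize), sizeof(ksize));
        in.read(reinterpret_cast<char*>(&vsize), sizeof(vsize));

        const std::uint64_t record_end =
            offset + 2 * sizeof(std::uint32_t) + ksize + vsize;
        if (record_end > file_size)
            break;                               // truncated final append: ignore it

        std::string key(ksize, '\0');
        in.read(&key[0], ksize);
        in.seekg(vsize, std::ios::cur);          // skip the value bytes

        index[key] = offset;                     // a later append for the same key wins
        offset = record_end;
    }
    return index;
}
```

Note that nothing in a pass like this notices a torn multi-block overwrite inside an existing key file bucket, which is what I am getting at below.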
> That would be a reasonable power loss recovery algorithm. A little slow to do recovery for large databases, but safe, reliable, predictable, and it would only run on a badly closed database. You can also turn off fsync entirely, and let the atomic appends land on storage in an order probably close to the append order. Ought to be quicker than NuDB by a fair bit, with far fewer I/O ops and a simpler design.
How would it notice that a bucket was partially overwritten though? Wouldn't it have to _always_ inspect the entire key file?

Lee