
On Wed, 29 Mar 2017 17:18:23 -0400, Vinnie Falco via Boost wrote:
On Wed, Mar 29, 2017 at 5:04 PM, Niall Douglas via Boost wrote:
As that paper Lee linked to points out, everybody writing storage algorithms - even the professionals - consistently gets sudden power loss wrong.
That paper found power loss bugs in the OS, in filesystems, and in all the major databases and source control implementations. This is despite all of those being written and tested very carefully for power loss correctness. They ALL made mistakes.
I have to agree. For now I am withdrawing NuDB from consideration - the paper that Lee linked is very informative.
However, just to make sure I have the scenario that Lee pointed out clear in my head, here's the sequence of events:
1. NuDB makes a system call to append to the file.
2. The filesystem updates the metadata indicating the new file size.
3. The filesystem writes the new data into the correct location.
Lee, and I think Niall (hard to tell through the noise), are saying that if a crash occurs after step 2 but before or during step 3, the contents of the new portion of the file may be undefined.
Is this correct?
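For concreteness, here is roughly what that sequence looks like at the syscall level - a minimal sketch assuming a plain POSIX append (the file name and record size are made up; this is not NuDB's actual code):

    #include <cstdio>
    #include <fcntl.h>
    #include <unistd.h>

    int main()
    {
        // Hypothetical file name, for illustration only.
        int fd = ::open("nudb.log", O_WRONLY | O_APPEND | O_CREAT, 0644);
        if (fd < 0)
        {
            std::perror("open");
            return 1;
        }

        char record[4096] = {};  // the bytes being appended

        // Step 1: a single system call to append the record.
        // To satisfy it, the filesystem must do (at least) two distinct things:
        //   Step 2: update metadata (file size plus block/extent pointers), and
        //   Step 3: write the record bytes into the newly allocated blocks.
        // Those updates are not atomic with respect to sudden power loss.
        if (::write(fd, record, sizeof(record)) != static_cast<ssize_t>(sizeof(record)))
        {
            std::perror("write");
            return 1;
        }

        // fsync asks the OS (and, with luck, the drive) to make both data and
        // metadata durable; it says nothing about the on-disk state if power
        // is lost before it returns.
        if (::fsync(fd) != 0)
        {
            std::perror("fsync");
            return 1;
        }

        ::close(fd);
    }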
The best explanation of the problem I have seen is [described in a paper discussing the implementation details of the ext filesystem on Linux][0]. The paper also discusses the various issues that filesystem designers have had to face, which has been helpful to me. The thing to remember is that the filesystem metadata is not just the file size; the filesystem also has to write information about which portions of the disk are in use by the file. This is why, after a crash during an append, the log file can contain old contents on restart - the filesystem added pointers to new sectors but not the data at those pointers.
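One common mitigation for that failure mode - my illustration, not necessarily what NuDB does - is to checksum each appended record, so that stale sector contents at the tail of the log are detected and discarded on recovery. A minimal sketch, with a hypothetical record layout of [u32 size][payload][u64 checksum]:

    #include <cstdint>
    #include <istream>
    #include <vector>

    // FNV-1a, used here only as a stand-in; a real recovery path would
    // likely use a stronger (cryptographic) hash, as discussed below.
    std::uint64_t fnv1a(const void* p, std::size_t n)
    {
        auto const* b = static_cast<const unsigned char*>(p);
        std::uint64_t h = 1469598103934665603ull;
        for (std::size_t i = 0; i < n; ++i)
        {
            h ^= b[i];
            h *= 1099511628211ull;
        }
        return h;
    }

    // Returns the number of bytes at the front of the log that hold valid
    // records; anything after that point is treated as a torn or garbage
    // tail and would be truncated away by the recovery code.
    std::uint64_t valid_prefix(std::istream& log)
    {
        std::uint64_t good = 0;
        for (;;)
        {
            std::uint32_t size;
            if (!log.read(reinterpret_cast<char*>(&size), sizeof(size)))
                break;
            std::vector<char> payload(size);
            std::uint64_t sum;
            if (!log.read(payload.data(), size) ||
                !log.read(reinterpret_cast<char*>(&sum), sizeof(sum)))
                break;                       // torn write at end of file
            if (sum != fnv1a(payload.data(), size))
                break;                       // stale sectors exposed by the append
            good += sizeof(size) + size + sizeof(sum);
        }
        return good;
    }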
If so, then I do need to go back and make improvements to prevent this. While it's true that I have not seen a corrupted database despite numerous production deployments and data files over 2TB, it would seem this case is sufficiently rare (and data-center hardware sufficiently reliable) that it is simply unlikely to have come up.
Yes, it is likely pretty rare, especially on a journaled filesystem; the system has to halt at a very specific point in time. This is why I suggested swapping the order of the writes to (sketched below):

write_buckets -> fsync -> write_header -> fsync ... write_header(zeroes) -> fsync -> truncate(header_size - not `0`)

This still has implementation/system-defined behavior, but overwriting a single sector is more likely to be "atomic" from the perspective of the filesystem (though not necessarily the hard drive), and it does not require massive structural changes. Writing out a cryptographic hash of the header would leave a single assumption - that fsync is a proper write barrier in the OS/filesystem and in the hard drive.

Niall has been particularly harsh on fsync, but I do not think it's all bad. With the exception of OSX, it seems that many filesystems implement it properly (I might regret saying this), and a user can purchase an "enterprise" hard drive that is not trying to artificially boost benchmark stats. At the very least, the number of assumptions has been decreased.

FWIW, I _think_ Niall's suggestion to remove the log file might also be interesting to investigate.

Lee

[0] http://pages.cs.wisc.edu/~remzi/OSTEP/file-journaling.pdf
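To make the proposed ordering above concrete, here is a minimal POSIX sketch. The names (commit, rollback, header_size) and the single-block header layout are placeholders for illustration, not NuDB's actual API, and error handling is reduced to bool returns:

    #include <unistd.h>   // pwrite, fsync, ftruncate
    #include <vector>

    constexpr off_t header_size = 4096;  // assume the header fits in one aligned block

    // Commit: make the bucket data durable first, then write the header
    // that makes it visible.
    bool commit(int fd,
                std::vector<char> const& buckets,
                std::vector<char> const& header)  // header carries a hash of itself
    {
        // write_buckets: data goes after the header region.
        if (::pwrite(fd, buckets.data(), buckets.size(), header_size) < 0)
            return false;
        // fsync barrier: bucket data must be durable before the header points at it.
        if (::fsync(fd) != 0)
            return false;
        // write_header: a single-sector overwrite, the intended commit point.
        if (::pwrite(fd, header.data(), header.size(), 0) < 0)
            return false;
        // fsync barrier: header durable -> committed.
        return ::fsync(fd) == 0;
    }

    // Roll back / retire the log: zero the header first, then shrink the file
    // back to just the (zeroed) header - truncate to header_size, not 0.
    bool rollback(int fd)
    {
        std::vector<char> zeroes(header_size, 0);
        if (::pwrite(fd, zeroes.data(), zeroes.size(), 0) < 0)   // write_header(zeroes)
            return false;
        if (::fsync(fd) != 0)                                    // fsync barrier
            return false;
        return ::ftruncate(fd, header_size) == 0;                // truncate(header_size)
    }

The commit point here is the single-sector header overwrite; everything written before it must already be durable thanks to the preceding fsync, which is exactly the write-barrier assumption discussed above.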