On Sun, 26 Mar 2017 23:07:14 +0100
Niall Douglas via Boost wrote:
> > You snipped my suggestion after this for swapping the order of the header write, which assumed/implied `_POSIX_SYNCHRONIZED_IO` due to the reasons you established previously. The only caveat you gave earlier was that Linux did not implement this properly for the metadata flushing on some of its file systems. Which means the log file could be "empty" after a restart in the current implementation, even if both fsync checkpoints completed.
> The reason I snipped it was because the original algorithm is broken, and so is yours. You are not conceptualising the problem correctly: consider storage after sudden power loss to be exactly the same as a malicious attacker on the internet capable of manipulating bits on storage to any value in order to cause misoperation. That includes making use of collisions in weak hashes to direct your power loss recovery implementation to sabotage and destroy additional data after the power loss.
I do understand the problem - I stated three times in this thread that even writing the small header to a single sector could still be problematic. My suggestions were primarily trying to tweak the existing design to improve durability with minimal impact. If I convinced Vinnie (and perhaps even myself) that writing the log header after its contents could reduce the probability of an undetected incomplete write, the next obvious suggestion was to append a cryptographic hash of the header(s). The buckets in the log would then be valid, provided fsync can be assumed to block until both data and metadata have completed. Even if the hardware lies about immediately writing to the physical medium, this should still shrink the time window in which data loss can occur.

Hashing over the entire log file would be a more portable/optimal solution, but it adds _more_ CPU time and deviates from the current implementation a bit further. I think there is a point where handling difficult filesystems and hardware is out of scope for this library. If the library cannot assume that a completed fsync call means the hardware "stored" the data + metadata, the implementation becomes more complex and costly. Checking for `_POSIX_SYNCHRONIZED_IO` and calling the OS X `fcntl` (`F_FULLFSYNC`) instead of `fsync` is probably the limit of actions a library like NuDB should take.

NuDB already has a file concept that needs documenting and formalizing before any potential Boost review. These harder edge cases could be provided by an implementation of this concept instead of by NuDB directly. If the highly durable implementation required a noticeable amount of CPU cycles, existing and new users of the library could remain on the potentially less durable but faster direct platform versions that "steal" fewer CPU cycles from their system.
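Roughly what I have in mind, as a sketch only (this is not NuDB's actual code; `full_sync`, `log_header`, and `commit_log` are made-up names, and the FNV-1a placeholder below would have to be replaced by a real cryptographic hash such as SHA-256 to resist the attacker you describe):

    // Sketch: push data + metadata as far toward the medium as the platform
    // allows.  On OS X, fsync() only reaches the drive cache, so F_FULLFSYNC
    // is attempted first.
    #include <fcntl.h>      // fcntl, F_FULLFSYNC (OS X)
    #include <unistd.h>     // fsync, pwrite, _POSIX_SYNCHRONIZED_IO
    #include <sys/types.h>  // ssize_t
    #include <cstddef>
    #include <cstdint>

    inline int full_sync(int fd)
    {
    #if defined(__APPLE__) && defined(F_FULLFSYNC)
        if (::fcntl(fd, F_FULLFSYNC) != -1)
            return 0;
        // Fall back to fsync() if the filesystem rejects F_FULLFSYNC.
    #endif
    #if defined(_POSIX_SYNCHRONIZED_IO) && (_POSIX_SYNCHRONIZED_IO > 0)
        // POSIX guarantees fsync() reaches synchronized-I/O completion here.
        return ::fsync(fd);
    #else
        // Weaker guarantee (or needs a runtime pathconf(_PC_SYNC_IO) check);
        // fsync() is still the best portable effort.
        return ::fsync(fd);
    #endif
    }

    struct log_header
    {
        std::uint64_t contents_size;  // bytes of bucket data that follow
        std::uint64_t hash;           // placeholder field; a real design
                                      // would store a cryptographic digest
    };

    // Placeholder FNV-1a, standing in for a real cryptographic hash.
    inline std::uint64_t weak_hash(const void* p, std::size_t n)
    {
        const unsigned char* b = static_cast<const unsigned char*>(p);
        std::uint64_t h = 0xcbf29ce484222325ULL;
        for (std::size_t i = 0; i < n; ++i)
            h = (h ^ b[i]) * 0x100000001b3ULL;
        return h;
    }

    // Hypothetical commit order: contents first, sync, then the header that
    // vouches for the contents, sync again.
    bool commit_log(int fd, const void* buckets, std::size_t size)
    {
        if (::pwrite(fd, buckets, size, sizeof(log_header))
            != static_cast<ssize_t>(size))
            return false;
        if (full_sync(fd) != 0)                  // checkpoint #1: contents
            return false;

        log_header h;
        h.contents_size = size;
        h.hash = weak_hash(buckets, size);       // hash covers the contents
        if (::pwrite(fd, &h, sizeof(h), 0)
            != static_cast<ssize_t>(sizeof(h)))
            return false;
        return full_sync(fd) == 0;               // checkpoint #2: header
    }

On recovery, the header would only be trusted if re-hashing `contents_size` bytes of bucket data reproduces the stored hash; otherwise the log is treated as empty/incomplete.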
[...snip...]

> You can never assume writes to one inode will reach storage before another in portable code. You can only assume in portable code that writes to the same inode via the same fd will reach storage in the order issued.
You chopped my response here too - I think this part was in response to my COW + inode suggestion. If the design knew COW was available for the filesystem in use, couldn't it also know whether data + metadata are synchronized as expected? The suggestion clearly was not portable anyway.
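As a purely illustrative sketch of what that (non-portable) check could look like on Linux - `looks_like_cow_fs` is a made-up name, and only the btrfs magic number is handled here:

    // Sketch: detect a COW filesystem from its statfs magic number, so a
    // file-concept implementation could skip work the filesystem already
    // guarantees.
    #include <sys/vfs.h>   // statfs (Linux)

    #ifndef BTRFS_SUPER_MAGIC
    #define BTRFS_SUPER_MAGIC 0x9123683E  // value from <linux/magic.h>
    #endif

    bool looks_like_cow_fs(const char* path)
    {
        struct statfs s;
        if (::statfs(path, &s) != 0)
            return false;  // unknown: fall back to the conservative path
        // ZFS-on-Linux and friends would need their own magic values here.
        return s.f_type == BTRFS_SUPER_MAGIC;
    }

This is the kind of check that would live in an implementation of the file concept rather than in NuDB itself.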
> COW filing systems generally offer much stronger guarantees than non-COW filing systems. You are correct that if you are operating on one of those, you can skip a ton of work to implement durability. This is why AFIO v2 has "storage profiles" where ZFS, ReFS and BtrFS are all top of tree in terms of disabling work done by AFIO and its clients. FAT32, meanwhile, sits at the very bottom.
Getting the NuDB file concept to work with AFIO v2 seems like it could be very useful then. Does v2 have a method for specifying dependency order on writes? I couldn't find one. I thought v1 had this feature - does v2 drop it?

Lee