On Sun, 26 Mar 2017 23:07:14 +0100
Niall Douglas via Boost wrote:
> > You snipped my suggestion after this for swapping the order of the header write, which assumed/implied `_POSIX_SYNCHRONIZED_IO` due to the reasons you established previously. The only caveat you gave earlier was that Linux did not implement this properly for the metadata flushing on some of its file systems. Which means the log file could be "empty" after a restart in the current implementation, even if both fsync checkpoints completed.
> The reason I snipped it was because the original algorithm is broken, and so is yours. You are not conceptualising the problem correctly: consider storage after sudden power loss to be exactly the same as a malicious attacker on the internet capable of manipulating bits on storage to any value in order to cause misoperation. That includes making use of collisions in weak hashes to direct your power loss recovery implementation to sabotage and destroy additional data after the power loss.
I do understand the problem - I stated three times in this thread that even writing the small header to a single sector could still be problematic. My suggestions were primarily trying to tweak the existing design to improve durability with minimal impact. If I convinced Vinnie (and perhaps even myself) that writing the log header after its contents could reduce the probability of an undetected incomplete write, the next obvious suggestion was to append a cryptographic hash of the header(s). The buckets in the log would then be valid, provided fsync can be assumed to block until both data and metadata have completed. Even if the hardware lies about immediately writing to the physical medium, this should still shrink the time window in which data loss can occur.

Hashing over the entire log file would be a more portable/optimal solution, but it adds _more_ CPU time and deviates from the current implementation a bit further. I think there is a point where handling difficult filesystems and hardware is out of scope for this library. If the library cannot assume that a completed fsync call means the hardware "stored" the data + metadata, the implementation becomes more complex and costly. Checking for `_POSIX_SYNCHRONIZED_IO` and calling the OS X `fcntl` (`F_FULLFSYNC`) instead of `fsync` is probably the limit of actions a library like NuDB should take.

NuDB already has a file concept that needs documenting and formalizing before any potential Boost review. These harder edge cases could be provided by an implementation of this concept instead of by NuDB directly. If the highly durable implementation required a noticeable amount of CPU cycles, existing and new users of the library could remain on the potentially less durable but faster direct platform versions that "steal" fewer CPU cycles from their system.
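Roughly what I have in mind, as a sketch only (this is not NuDB's actual code; `full_sync`, `log_header`, and `commit_log` are made-up names, and the FNV-1a placeholder below would have to be replaced by a real cryptographic hash such as SHA-256 to resist the attacker you describe):

    // Sketch: push data + metadata as far toward the medium as the platform
    // allows.  On OS X, fsync() only reaches the drive cache, so F_FULLFSYNC
    // is attempted first.
    #include <fcntl.h>      // fcntl, F_FULLFSYNC (OS X)
    #include <unistd.h>     // fsync, pwrite, _POSIX_SYNCHRONIZED_IO
    #include <sys/types.h>  // ssize_t
    #include <cstddef>
    #include <cstdint>

    inline int full_sync(int fd)
    {
    #if defined(__APPLE__) && defined(F_FULLFSYNC)
        if (::fcntl(fd, F_FULLFSYNC) != -1)
            return 0;
        // Fall back to fsync() if the filesystem rejects F_FULLFSYNC.
    #endif
    #if defined(_POSIX_SYNCHRONIZED_IO) && (_POSIX_SYNCHRONIZED_IO > 0)
        // POSIX guarantees fsync() reaches synchronized-I/O completion here.
        return ::fsync(fd);
    #else
        // Weaker guarantee (or needs a runtime pathconf(_PC_SYNC_IO) check);
        // fsync() is still the best portable effort.
        return ::fsync(fd);
    #endif
    }

    struct log_header
    {
        std::uint64_t contents_size;  // bytes of bucket data that follow
        std::uint64_t hash;           // placeholder field; a real design
                                      // would store a cryptographic digest
    };

    // Placeholder FNV-1a, standing in for a real cryptographic hash.
    inline std::uint64_t weak_hash(const void* p, std::size_t n)
    {
        const unsigned char* b = static_cast<const unsigned char*>(p);
        std::uint64_t h = 0xcbf29ce484222325ULL;
        for (std::size_t i = 0; i < n; ++i)
            h = (h ^ b[i]) * 0x100000001b3ULL;
        return h;
    }

    // Hypothetical commit order: contents first, sync, then the header that
    // vouches for the contents, sync again.
    bool commit_log(int fd, const void* buckets, std::size_t size)
    {
        if (::pwrite(fd, buckets, size, sizeof(log_header))
            != static_cast<ssize_t>(size))
            return false;
        if (full_sync(fd) != 0)                  // checkpoint #1: contents
            return false;

        log_header h;
        h.contents_size = size;
        h.hash = weak_hash(buckets, size);       // hash covers the contents
        if (::pwrite(fd, &h, sizeof(h), 0)
            != static_cast<ssize_t>(sizeof(h)))
            return false;
        return full_sync(fd) == 0;               // checkpoint #2: header
    }

On recovery, the header would only be trusted if re-hashing `contents_size` bytes of bucket data reproduces the stored hash; otherwise the log is treated as empty/incomplete.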
[...snip...]

> You can never assume writes to one inode will reach storage before another in portable code. You can only assume in portable code that writes to the same inode via the same fd will reach storage in the order issued.
You chopped my response here too - I think this part was in response to my COW + inode suggestion. If the design knew COW was available for the filesystem in use, couldn't it also know whether data + metadata are synchronized as expected? The suggestion clearly was not portable anyway.
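As a purely illustrative sketch of what that (non-portable) check could look like on Linux - `looks_like_cow_fs` is a made-up name, and only the btrfs magic number is handled here:

    // Sketch: detect a COW filesystem from its statfs magic number, so a
    // file-concept implementation could skip work the filesystem already
    // guarantees.
    #include <sys/vfs.h>   // statfs (Linux)

    #ifndef BTRFS_SUPER_MAGIC
    #define BTRFS_SUPER_MAGIC 0x9123683E  // value from <linux/magic.h>
    #endif

    bool looks_like_cow_fs(const char* path)
    {
        struct statfs s;
        if (::statfs(path, &s) != 0)
            return false;  // unknown: fall back to the conservative path
        // ZFS-on-Linux and friends would need their own magic values here.
        return s.f_type == BTRFS_SUPER_MAGIC;
    }

This is the kind of check that would live in an implementation of the file concept rather than in NuDB itself.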
> COW filing systems generally offer much stronger guarantees than non-COW filing systems. You are correct that if you are operating on one of those, you can skip a ton of work to implement durability. This is why AFIO v2 has "storage profiles" where ZFS, ReFS and BtrFS are all top of tree in terms of disabling work done by AFIO and its clients. FAT32, meanwhile, sits at the very bottom.
Getting the NuDB file concept to work with AFIO v2 seems like it could be very useful then. Does v2 have a method for specifying dependency order on writes? I couldn't find one. I thought v1 had this feature - does v2 drop it?

Lee