NuDB: A fast key/value insert-only database for SSD drives in C++11

Vinnie Falco

21 Mar 2017 21 Mar '17

8:36 p.m.

I just released version 1.0.0 of my header-only C++11 library NuDB. Its quite a mature library, haven't had to make any changes in a while. Its been running on production servers for 2 years or so with no problems, managing ever-growing databases of 2TB. I'm wondering what the level of interest, if any, there would be for making this a part of Boost. You can check it out here: https://github.com/vinniefalco/NuDB

Show replies by date

Peter Dimov

21 Mar 21 Mar

9:01 p.m.

Vinnie Falco wrote:

...

I just released version 1.0.0 of my header-only C++11 library NuDB. Its quite a mature library, haven't had to make any changes in a while. Its been running on production servers for 2 years or so with no problems, managing ever-growing databases of 2TB.

I'm wondering what the level of interest, if any, there would be for making this a part of Boost. You can check it out here:

https://github.com/vinniefalco/NuDB

The obvious question one might have is, in what scenarios is a database that only supports insertions, but not updates or deletions, useful? A follow-up to that is, is it not possible to add update/delete support to NuDB by just appending the new data and making the key file point to the new location (using zero-sized data for deletions)? Last, the README says "Value sizes from 1 to 2^32 bytes (4GB)", but the file format says uint48_t. Which is it? 2^32 is perhaps a bit limiting nowadays.

Vinnie Falco

9:13 p.m.

On Tue, Mar 21, 2017 at 5:01 PM, Peter Dimov via Boost <boost@lists.boost.org> wrote:

...

The obvious question one might have is, in what scenarios is a database that only supports insertions, but not updates or deletions, useful?

So this actually comes up quite a bit in decentralized systems, where the cryptographic digest (or hash) of a binary object is used as the key to identify the object. Its called "Content-addressable storage:" https://en.wikipedia.org/wiki/Content-addressable_storage In digital currencies such as Bitcoin and Ripple these values represent immutable information such as transactions that have taken place in the past. It also comes up in distributed storage systems such as file sharding, to represent pieces of files. There's no question that this library is quite specialized and applicable only in niche use-cases. But for those cases, it works amazingly well.

...

A follow-up to that is, is it not possible to add update/delete support to NuDB by just appending the new data and making the key file point to the new location (using zero-sized data for deletions)?

That is certainly possible, But space for the unused objects created as a result of updates and deletes could not be reclaimed in real-time. Administrators would have to use an offline process to take a database and re-create it to exclude those objects. This could take quite a long time for databases in the tens of terabytes, on the order of days or weeks. It is also not a use-case required for the applications that I developed NuDB for, so I haven't done it. But it could be explored.

...

Last, the README says "Value sizes from 1 to 2^32 bytes (4GB)", but the file format says uint48_t. Which is it? 2^32 is perhaps a bit limiting nowadays.

Probably the documentation is a touch out of date. The database imposes a limit of 2^32 bytes for individual values. I think that's big enough. NuDB is designed for database that have an enormous number of keys, not a small number of keys with enormous values. If an application needs a key/value store where typical values are in excess of 4GB, they don't need support for billions of keys (since a billion of those values would exceed all known storage limits). Also I think I changed that limit from 2^48 down to 2^32 to make NuDB work identically when built for 32-bit processor targets as when built for 64-bit processor targets. Thanks for the questions!

Niall Douglas

11:36 p.m.

On 21/03/2017 20:36, Vinnie Falco via Boost wrote:

...

I just released version 1.0.0 of my header-only C++11 library NuDB. Its quite a mature library, haven't had to make any changes in a while. Its been running on production servers for 2 years or so with no problems, managing ever-growing databases of 2TB.

I'm wondering what the level of interest, if any, there would be for making this a part of Boost. You can check it out here:

https://github.com/vinniefalco/NuDB

As I said to you at CppCon, I'm very keen on the concept. It's why I started AFIO back in 2012, is to one day deliver a low level key-value store suitable for C++ standardisation. And I'm still at that five years later, with Outcome being my next step to getting that achieved as the key-value store's API uses a ton of outcome::result<T>. But as I also said to you at CppCon, I'm not keen on your NuDB. I have problems with its API, with its flexibility and scalability, and lots more problems with its implementation. I am unsure what benefit it confers over the much safer choices of SQLite or UnQLite both of which are battle tested and written by database experts to ensure that other people's data is never, ever lost. I felt when I reviewed NuDB's implementation some time ago that your implementation would be highly susceptible to data loss in a wide set of circumstances. I also felt it wouldn't scale well to the > 1M IOPS storage just around the corner. I appreciate that as you say, it's been running in production for two years, but I would suspect it is being used in a highly restricted use case where corner cases don't present. As a library in Boost, you would find people doing all sorts of daft things with it, and it would therefore need to be absolutely rock solid reliable on a wide variety of operating systems, filing systems and use cases e.g. over a Samba share. This is why it's taking me so long, writing the testing for this stuff takes a very, very long time. All that said, I haven't delivered a key-value store after five years of outside-of-work effort. You have. I would therefore endorse you bringing this to review, but be aware that in a review I would be fairly certain it'll be rejected. That said, the feedback you'd get would be very useful, not just to you but also to me for my planned key-value store, so if you are up for getting this to a state ready to submit to Boost, I'd welcome it. Niall -- ned Productions Limited Consulting http://www.nedproductions.biz/ http://ie.linkedin.com/in/nialldouglas/

Vinnie Falco

22 Mar 22 Mar

midnight

On Tue, Mar 21, 2017 at 7:36 PM, Niall Douglas via Boost <boost@lists.boost.org> wrote:

...

On 21/03/2017 20:36, Vinnie Falco via Boost wrote: I am unsure what benefit it confers over the much safer choices of SQLite or UnQLite both of which are battle tested and written by database experts to ensure that other people's data is never, ever lost.

SQLite is painfully slow compared to NuDB. UnQLite supports features not present in NuDB such as cursor iteration, moving, and deleting. The tradeoff for not supporting these features is that NuDB can make better claims about its performance.

...

I felt when I reviewed NuDB's implementation some time ago that your implementation would be highly susceptible to data loss in a wide set of circumstances.

NuDB makes some assumptions about the underlying file system. In particular, that when a call to fsync() returns it is assured the data is reliably written. If these invariants are met there is no data loss. In fact, one of NuDB's unit tests exhaustively verifies that the database recovery process performs correctly after an I/O failure. The unit test works by instantiating a database with a custom File template parameter. This custom file implementation simulates an I/O failure on the Nth operation. The test runs with N=1 to failure, then increments N, and re-runs the test until there are no failures. At each iteration, the test verifies that the database recover process left the database in a consistent state. You can see this code here: https://github.com/vinniefalco/NuDB/blob/master/test/recover.cpp#L125 I don't doubt that you think this database is vulnerable to data loss under some condition. However, I would appreciate specifics so that I can verify under what conditions your claim is correct, or if it is correct at all.

...

I also felt it wouldn't scale well to the > 1M IOPS storage just around the corner.

While it hasn't yet been tested on these non-existent devices, I am optimistic, since NuDB was specifically designed for very intense fetch workloads with light concurrent insertions. Fetches can happen concurrently, it should be possible to saturate the read throughput of a device. See: https://github.com/vinniefalco/NuDB/blob/master/include/nudb/impl/basic_stor...

...

I would suspect it is being used in a highly restricted use case where corner cases don't present.

Maybe. I'll counter by saying, there are only three operations possible on an open database: insert, fetch, and close.

...

if you are up for getting this to a state ready to submit to Boost, I'd welcome it.

Cool! The docs just need a bit of work, the library code is more than ready. Thanks for taking the time to go through it!

Lee Clagett

2:50 a.m.

On Tue, 21 Mar 2017 20:00:39 -0400 Vinnie Falco via Boost <boost@lists.boost.org> wrote:

...

On Tue, Mar 21, 2017 at 7:36 PM, Niall Douglas via Boost <boost@lists.boost.org> wrote:

...
On 21/03/2017 20:36, Vinnie Falco via Boost wrote: I am unsure what benefit it confers over the much safer choices of SQLite or UnQLite both of which are battle tested and written by database experts to ensure that other people's data is never, ever lost.

SQLite is painfully slow compared to NuDB. UnQLite supports features not present in NuDB such as cursor iteration, moving, and deleting. The tradeoff for not supporting these features is that NuDB can make better claims about its performance.

Have you compared this to LMDB? That should be the best comparison for performance and features, although I've never seen LMDB compared directly to UnQLite. One thing that I noticed is that the documentation of NuDB indirectly indicates that recent data loss can occur since an insertion is not immediately synced to disk. This is going to give NuDB a write advantage when comparing against many databases.

...

...
I felt when I reviewed NuDB's implementation some time ago that your implementation would be highly susceptible to data loss in a wide set of circumstances.

NuDB makes some assumptions about the underlying file system. In particular, that when a call to fsync() returns it is assured the data is reliably written. If these invariants are met there is no data loss.

In fact, one of NuDB's unit tests exhaustively verifies that the database recovery process performs correctly after an I/O failure. The unit test works by instantiating a database with a custom File template parameter. This custom file implementation simulates an I/O failure on the Nth operation. The test runs with N=1 to failure, then increments N, and re-runs the test until there are no failures. At each iteration, the test verifies that the database recover process left the database in a consistent state. You can see this code here: https://github.com/vinniefalco/NuDB/blob/master/test/recover.cpp#L125

I don't doubt that you think this database is vulnerable to data loss under some condition. However, I would appreciate specifics so that I can verify under what conditions your claim is correct, or if it is correct at all.

The pathological case is power loss, which cannot be unit tested easily (and probably isn't possible). I _think_ you took this into consideration with the log file approach. But one potential issue is when power loss occurs before the log file header is completely synch'ed. The filesystem could sync the FILE metadata first (size, etc), so the file appears large enough to contain the entire header metadata, but actually contains nothing or a portion of the data. Niall thoughts ? Also: - The key-file is append only so NuDB has multiple blocks of sorted records. So hopefully the insertions do not cause lots of blocks or the algorithm becomes a linear search. - There does not appear to be an advantage to writing the blocks to the log file first. There is already a time window after an insertion where data loss can occur, so the data and key file sizes in the header seem sufficient for rollback.

...

...
I also felt it wouldn't scale well to the > 1M IOPS storage just around the corner.

While it hasn't yet been tested on these non-existent devices, I am optimistic, since NuDB was specifically designed for very intense fetch workloads with light concurrent insertions. Fetches can happen concurrently, it should be possible to saturate the read throughput of a device. See: https://github.com/vinniefalco/NuDB/blob/master/include/nudb/impl/basic_stor...

This is surprisingly unlikely to scale to the IOPS Niall mentioned. The reader lock requires exclusive cacheline access to synchronize with the writer, which forces the core to wait for a response from other cores/processors. IIRC the problem is amplified in multi-sockets systems where the communication must leave the "die". The speculative locking feature may help here, but Niall tested this a while ago and showed some surprising performance quirks (although I do not recall specifics). Niall or Dimov should be better informed on this subject, and may correct me a bit. Generally this is why something like LMDB does so well in benchmarks; the design/implementation was tailored to modern virtual page caching and cachelines. Which isn't to say its 100% perfect, just that its likely to perform better than any implementation which does not take those system designs into consideration.

...

...
I would suspect it is being used in a highly restricted use case where corner cases don't present.

Maybe. I'll counter by saying, there are only three operations possible on an open database: insert, fetch, and close.

...
if you are up for getting this to a state ready to submit to Boost, I'd welcome it.

Cool! The docs just need a bit of work, the library code is more than ready.

Thanks for taking the time to go through it!

Lee

Vinnie Falco

3:08 a.m.

On Tue, Mar 21, 2017 at 10:50 PM, Lee Clagett via Boost <boost@lists.boost.org> wrote:

...

Have you compared this to LMDB?

In our application (rippled) LMDB performs quite poorly even compared to LevelDB, which performed poorly compared to NuDB. Its true that the rippled workload is quite niche.

...

The pathological case is power loss, which cannot be unit tested easily (and probably isn't possible).

I think this can be unit tested, and I believe that NuDB's unit test covers the case of power loss. I think we can agree that power loss on a read is uninteresting (since it can't corrupt data). The unit test models a power loss as a fatal error during a write. The test exercises all possible fatal errors using an incremental approach (I alluded to this in my previous message). The database template requires a File type to instantiate: template<class Hasher, class File> class basic_store; NuDB comes with posix_file and win32_file. The unit test uses its own fail_file which wraps the native file and generates a simulated error on the Nth write: https://github.com/vinniefalco/NuDB/blob/master/extras/nudb/test/fail_file.h... The test implementation tries to insert a bunch of records into a new database using a fail_file, incrementing the counter as it goes. On a failure it tries to recover (which can also result in a failed write). After the recover fails, it verifies that the database is still consistent. You can see the test implementation: https://github.com/vinniefalco/NuDB/blob/master/test/recover.cpp#L125 I *think* this test proves that the implementation maintains consistency on any power loss. would very much like anyone who is interested to tell me whether this approach works.

...

- The key-file is append only so NuDB has multiple blocks of sorted records. So hopefully the insertions do not cause lots of blocks or the algorithm becomes a linear search.

There's no sorting going on here, the key file is an on-disk hash table where each bucket holds a good-sized number of keys (so that a single read from the key file does the most work).

...

This is surprisingly unlikely to scale to the IOPS Niall mentioned. The reader lock requires exclusive cacheline access to synchronize with the writer, which forces the core to wait for a response from other cores/processors.

That's possible. I haven't benchmarked it for that scenario and I am not knowledgeable on optimizing for memory access / cache performance. Thanks

Gavin Lambert

4:13 a.m.

On 22/03/2017 16:08, Vinnie Falco via Boost wrote:

...

I think this can be unit tested, and I believe that NuDB's unit test covers the case of power loss. I think we can agree that power loss on a read is uninteresting (since it can't corrupt data). The unit test models a power loss as a fatal error during a write. The test exercises all possible fatal errors using an incremental approach (I alluded to this in my previous message).

A power loss is more like a fatal error that fails to execute any subsequent clean-up code, so it might not be quite the same. There are also more pathological cases such as where a write has been partially successful and done some subset of increasing the file size, zeroing the extra file space, and writing some subset of the intended data. So it's not necessarily that data is missing; there might be invalid data in its place.

Karen Shaeffer

4:48 a.m.

On Wed, Mar 22, 2017 at 05:13:31PM +1300, Gavin Lambert via Boost wrote:

...

On 22/03/2017 16:08, Vinnie Falco via Boost wrote:

...
I think this can be unit tested, and I believe that NuDB's unit test covers the case of power loss. I think we can agree that power loss on a read is uninteresting (since it can't corrupt data). The unit test models a power loss as a fatal error during a write. The test exercises all possible fatal errors using an incremental approach (I alluded to this in my previous message).

A power loss is more like a fatal error that fails to execute any subsequent clean-up code, so it might not be quite the same.

Agreed. The hardware is losing power. The peripheral devices will likely not all lose power simultaneously. The kernel is suffering a fatal crash. In my experience, when you pull the power plug on a server, the execution of the kernel simply ends without any trace of what happened. The application will suffer a fatal error without any trace of it having happened. Karen. -- Karen Shaeffer The subconscious mind is driven by your deeply Neuralscape Services held beliefs -- not your deeply held desires.

Olaf van der Spek

8:01 a.m.

On Wed, Mar 22, 2017 at 5:13 AM, Gavin Lambert via Boost <boost@lists.boost.org> wrote:

...

On 22/03/2017 16:08, Vinnie Falco via Boost wrote:

...
I think this can be unit tested, and I believe that NuDB's unit test covers the case of power loss. I think we can agree that power loss on a read is uninteresting (since it can't corrupt data). The unit test models a power loss as a fatal error during a write. The test exercises all possible fatal errors using an incremental approach (I alluded to this in my previous message).

A power loss is more like a fatal error that fails to execute any subsequent clean-up code, so it might not be quite the same.

Recovery obviously runs on the next power on. ;)

...

There are also more pathological cases such as where a write has been partially successful and done some subset of increasing the file size, zeroing the extra file space, and writing some subset of the intended data. So it's not necessarily that data is missing; there might be invalid data in its place.

Do modern FS still allow that to happen? I guess some do.. Don't (log) records contain a checksum to guard against this? -- Olaf

Niall Douglas

11:23 a.m.

On 22/03/2017 04:13, Gavin Lambert via Boost wrote:

...

On 22/03/2017 16:08, Vinnie Falco via Boost wrote:

...
I think this can be unit tested, and I believe that NuDB's unit test covers the case of power loss. I think we can agree that power loss on a read is uninteresting (since it can't corrupt data). The unit test models a power loss as a fatal error during a write. The test exercises all possible fatal errors using an incremental approach (I alluded to this in my previous message).

A power loss is more like a fatal error that fails to execute any subsequent clean-up code, so it might not be quite the same.

There are also more pathological cases such as where a write has been partially successful and done some subset of increasing the file size, zeroing the extra file space, and writing some subset of the intended data. So it's not necessarily that data is missing; there might be invalid data in its place.

There are a few rings to testing data loss safety: Ring 1: Does my application code correctly handle all possible errors in all possible contexts? This can be tested using Monte Carlo methods, fuzzing, parameter permutations, unit and functional testing. Ring 2: Does my code correctly handle sudden stop? This can be tested using LXC containers where you kill -9 the container mid-test. Monte Carlo to verification. Ring 3: Does my code correctly handle sudden kernel stop? This can be tested using kvm or qemu where you kill -9 the virtualised OS mid-test. Ring 4: Does my code correctly handle sudden power loss to the CPU? This can be tested using a few dozen cheap odroid devices where you manually trip their watchdog hard reset hardware feature. This solution has the big advantage of not requiring the SSD used to be sudden power loss safe :) Ring 5: Does my code correctly handle sudden power loss to the storage? It requires more work and you'll find endless bugs in the kernel, filing system and the storage device, but you can install a hardware switch to cut power to the storage device mid-test. This is a never ending "fun" task, it's far too uncommonly tested by the kernel vendors, but it's a great simulation of how well faulty storage is handled. Ring 6: Does my code correctly handle sudden power loss to the system? Unlike Ring 5 this is actually a better tested situation. Sudden power loss to everything at once is probably less buggy than Ring 5. Still, you can get data loss at any level from the kernel, to the SATA chip, to the device itself. There are also other test rings not related to sudden power loss. For example, single and paired bit flips are not uncommon in terabytes of storage, either transient or permanent. These can be simulated using kvm with you manually flipping random bits in the disc image as it runs. You might become quite appalled at what data gets destroyed by bugs in the filing system when facing flipped bits. Niall -- ned Productions Limited Consulting http://www.nedproductions.biz/ http://ie.linkedin.com/in/nialldouglas/

Glen Fernandes

4:21 a.m.

On Tue, Mar 21, 2017 at 11:08 PM, Vinnie Falco via Boost wrote:

...

I just released version 1.0.0 of my header-only C++11 library NuDB. Its quite a mature library, haven't had to make any changes in a while. Its been running on production servers for 2 years or so with no problems, managing ever-growing databases of 2TB.

I'm wondering what the level of interest, if any, there would be for making this a part of Boost. You can check it out here:

I only had a brief look - one quick question: Is your arena allocating storage for a sequence of uint8_t's, and you're constructing 'element' objects in that storage? Glen

Vinnie Falco

25 Mar 25 Mar

3:04 p.m.

On Wed, Mar 22, 2017 at 12:21 AM, Glen Fernandes via Boost <boost@lists.boost.org> wrote:

...

I only had a brief look - one quick question: Is your arena allocating storage for a sequence of uint8_t's, and you're constructing 'element' objects in that storage?

Kind of, each object allocated in the arena is an `element` type, followed by a variable number of bytes of storage.

Lee Clagett

8:01 p.m.

On Tue, 21 Mar 2017 23:08:30 -0400 Vinnie Falco via Boost <boost@lists.boost.org> wrote:

...

On Tue, Mar 21, 2017 at 10:50 PM, Lee Clagett via Boost <boost@lists.boost.org> wrote:

[...snip...]

...

...
The pathological case is power loss, which cannot be unit tested easily (and probably isn't possible).

I think this can be unit tested, and I believe that NuDB's unit test covers the case of power loss. I think we can agree that power loss on a read is uninteresting (since it can't corrupt data). The unit test models a power loss as a fatal error during a write. The test exercises all possible fatal errors using an incremental approach (I alluded to this in my previous message).

The database template requires a File type to instantiate:

template<class Hasher, class File> class basic_store;

NuDB comes with posix_file and win32_file. The unit test uses its own fail_file which wraps the native file and generates a simulated error on the Nth write: https://github.com/vinniefalco/NuDB/blob/master/extras/nudb/test/fail_file.h...

The test implementation tries to insert a bunch of records into a new database using a fail_file, incrementing the counter as it goes. On a failure it tries to recover (which can also result in a failed write). After the recover fails, it verifies that the database is still consistent. You can see the test implementation: https://github.com/vinniefalco/NuDB/blob/master/test/recover.cpp#L125

I *think* this test proves that the implementation maintains consistency on any power loss. would very much like anyone who is interested to tell me whether this approach works.

The other responses to this thread reiterated what I thought could occur - there should be corruption "races" from a write call to file sync completion. Writing the blocks to the log file are superfluous because it is writing to multiple sectors and there is no mechanism to detect a partial write after power failure. So the information cannot be read from reliably. If these writes were omitted, the presence of the log file header indicates whether a commit failed to complete, and this header should fit in a single sector. The key and data file sizes in that header indicate what portion at the end of the key and data files should be ignored. If the log file header was moved to beginning of the key file, after the records are appended and sync'ed, the header could be overwritten with the new file lengths. The key file would not need to be opened with the `_POSIX_SYNCHRONIZED_IO` flag, and a fsync after the writing header can be omitted (which means a commit also has "relaxed" disk flushing). This should improve commit performance. At this point I'm sure Niall will tell me that writing <512 bytes to the beginning of the file is also not an atomic operation. Which is exactly what SQLite documentation assumes. The power-loss recovery algorithm gets tougher with this assumption.

...

...
- The key-file is append only so NuDB has multiple blocks of sorted records. So hopefully the insertions do not cause lots of blocks or the algorithm becomes a linear search.

There's no sorting going on here, the key file is an on-disk hash table where each bucket holds a good-sized number of keys (so that a single read from the key file does the most work).

I jumped into the internal fetch function which was sorted within a single bucket and had a linked list of spills. Reading the README first would've made it clear that there was more to the implementation. So the worst case performance is a link-list if a hash collision. Was the primary decision for the default hash implementation performance? Lee

Vinnie Falco

8:22 p.m.

On Sat, Mar 25, 2017 at 4:01 PM, Lee Clagett via Boost <boost@lists.boost.org> wrote:

...

The other responses to this thread reiterated what I thought could occur - there should be corruption "races" from a write call to file sync completion.

NuDB makes the same assumptions regarding the underlying file system capabilities as SQLite. In particular, if there are two calls to fsync in a row, it assumes that the first fsync will complete before the second one starts. And that upon return from a successful call to fsync, the data has been written. When there is a power loss or device failure, it is possible that recent insertions are lost. The library only guarantees that there will be no corruption. Specifically, any insertions which happen after a commit, might be rolled back if the recover process is invoked. Since the commit process runs every second, not much will be lost.

...

Writing the blocks to the log file are superfluous because it is writing to multiple sectors and there is no mechanism to detect a partial write after power failure.

Hmm, I don't think there's anything superfluous in this library. The log file is a "rollback file." It contains blocks from the key file in the state they were in before being modified. During the commit phase, nothing in the key file is modified until all of the blocks intended to be modified are first backed up to the log file. If the power goes out while these blocks are written to the log file, there is no loss.

...

I jumped into the internal fetch function which was sorted within a single bucket and had a linked list of spills. Reading the README first would've made it clear that there was more to the implementation.

The documentation for NuDB needs work! I can only vouch for the maturity of the source code, not the documentation :)

...

So the worst case performance is a link-list if a hash collision.

Right, although the creation parameters are tuned such that less than 1% of buckets have 1 spill record, and 0% of buckets have 2 or more spill records.

...

Was the primary decision for the default hash implementation performance?

If you're talking about xxhasher, it was chosen for being the best balance of performance, good distribution properties, and decent security. NuDB was designed to handle adversarial inputs since most envisioned use-cases insert data from untrusted sources / the network.

Lee Clagett

26 Mar 26 Mar

4:08 a.m.

On Sat, 25 Mar 2017 16:22:50 -0400 Vinnie Falco via Boost <boost@lists.boost.org> wrote:

...

On Sat, Mar 25, 2017 at 4:01 PM, Lee Clagett via Boost <boost@lists.boost.org> wrote:

...
The other responses to this thread reiterated what I thought could occur - there should be corruption "races" from a write call to file sync completion.

NuDB makes the same assumptions regarding the underlying file system capabilities as SQLite. In particular, if there are two calls to fsync in a row, it assumes that the first fsync will complete before the second one starts. And that upon return from a successful call to fsync, the data has been written.

I think SQLite makes more stringent assumptions - that between the write and the sync the file metadata and sectors can be written in any order. And one further - that a single sector could be partially written but only sequentially forwards or backwards. This last assumption sounds like a legacy assumption from spinning disks. I have been (poorly) suggesting that the order of the new buckets in the log and log file header be swapped. Writing the header which fits in a single sector is _more_ likely to be all-or-nothing than the bucket records. At least that's my understanding of this insane situation. The recover method in NuDB appears to be assuming that if it can read the entire bucket in the log, that its valid. Using the header to "post" completed changes seems like it has a smaller window for failure. If the header was not written correctly, truncate.

...

When there is a power loss or device failure, it is possible that recent insertions are lost. The library only guarantees that there will be no corruption. Specifically, any insertions which happen after a commit, might be rolled back if the recover process is invoked. Since the commit process runs every second, not much will be lost.

...
Writing the blocks to the log file are superfluous because it is writing to multiple sectors and there is no mechanism to detect a partial write after power failure.

Hmm, I don't think there's anything superfluous in this library. The log file is a "rollback file." It contains blocks from the key file in the state they were in before being modified. During the commit phase, nothing in the key file is modified until all of the blocks intended to be modified are first backed up to the log file. If the power goes out while these blocks are written to the log file, there is no loss.

I've proven twice in this thread I cannot read code quickly. Crap. I thought the key file was append-only like the data file, which is not possible. No idea why I thought this, other than it would make the implementation easier. I think the only way to get rid of the log file in this design is via a COW filesystem (but I have no experience with those).

...

...
I jumped into the internal fetch function which was sorted within a single bucket and had a linked list of spills. Reading the README first would've made it clear that there was more to the implementation.

The documentation for NuDB needs work! I can only vouch for the maturity of the source code, not the documentation :)

...
So the worst case performance is a link-list if a hash collision.

Right, although the creation parameters are tuned such that less than 1% of buckets have 1 spill record, and 0% of buckets have 2 or more spill records.

...
Was the primary decision for the default hash implementation performance?

If you're talking about xxhasher, it was chosen for being the best balance of performance, good distribution properties, and decent security. NuDB was designed to handle adversarial inputs since most envisioned use-cases insert data from untrusted sources / the network.

Yes, I asked because I was curious about the security analysis. The only security analysis I could find quickly was the SMHasher test, and it got the same score as `murmurhash 3a`. Murmurhash 3 was shown to vulnerable to collisions, even with a random seed. I could not find the significance of the 'a' (i.e. `3a`) listed in the table provided by xxHash. Lee

Niall Douglas

11:11 a.m.

On 26/03/2017 05:08, Lee Clagett via Boost wrote:

...

On Sat, 25 Mar 2017 16:22:50 -0400 Vinnie Falco via Boost <boost@lists.boost.org> wrote:

...
On Sat, Mar 25, 2017 at 4:01 PM, Lee Clagett via Boost <boost@lists.boost.org> wrote:

...
The other responses to this thread reiterated what I thought could occur - there should be corruption "races" from a write call to file sync completion.

NuDB makes the same assumptions regarding the underlying file system capabilities as SQLite. In particular, if there are two calls to fsync in a row, it assumes that the first fsync will complete before the second one starts. And that upon return from a successful call to fsync, the data has been written.

I think SQLite makes more stringent assumptions - that between the write and the sync the file metadata and sectors can be written in any order. And one further - that a single sector could be partially written but only sequentially forwards or backwards. This last assumption sounds like a legacy assumption from spinning disks.

If you want durability across multiple OSs and filing systems, you need to assume that fsync's are reordered with respect to one another. All major databases assume this, and so should NuDB, or else NuDB needs to remove all claims regarding durability of any kind. In fact, you might as well remove the fsync code path entirely, from everything I've seen to date its presence provides a false sense of assurance which is much worse than providing no guarantees at all. Additionally, strongly consider a O_SYNC based design instead of an fsync based design. fsync() performs pathologically awful on copy-on-write filing systems, it unavoidably forces a full RCU cycle of multiple blocks. Opening the fd with O_SYNC causes COW filing systems to use an alternate caching algorithm, one without pathological performance. Note that O_SYNC on some filing systems still has the metadata reordering problem. You should always assume that fsync/O_SYNC writes are reordered with respect to one another across inodes. They are only sequentially consistent within the same inode when performed on the same fd. Again, I'd personally recommend you just remove all durability claims entirely, and remove the code claiming to implement it as an unnecessary overhead. You need to start with a design that assumes the filing system reorders everything all the time, it can't be retrofitted.

...

...
When there is a power loss or device failure, it is possible that recent insertions are lost. The library only guarantees that there will be no corruption. Specifically, any insertions which happen after a commit, might be rolled back if the recover process is invoked. Since the commit process runs every second, not much will be lost.

...
Writing the blocks to the log file are superfluous because it is writing to multiple sectors and there is no mechanism to detect a partial write after power failure.

Hmm, I don't think there's anything superfluous in this library. The log file is a "rollback file." It contains blocks from the key file in the state they were in before being modified. During the commit phase, nothing in the key file is modified until all of the blocks intended to be modified are first backed up to the log file. If the power goes out while these blocks are written to the log file, there is no loss.

You can never assume writes to one inode will reach storage before another in portable code. You can only assume in portable code that writes to the same inode via the same fd will reach storage in the order issued.

...

...
...
Was the primary decision for the default hash implementation performance?

If you're talking about xxhasher, it was chosen for being the best balance of performance, good distribution properties, and decent security. NuDB was designed to handle adversarial inputs since most envisioned use-cases insert data from untrusted sources / the network.

In which case you did not make a great choice. Much, much better would be Blake2b. 2 cycles/byte, cryptographically secure, collision probability exceeds life of the universe. Niall -- ned Productions Limited Consulting http://www.nedproductions.biz/ http://ie.linkedin.com/in/nialldouglas/

Lee Clagett

1:57 p.m.

On Sun, 26 Mar 2017 12:11:14 +0100 Niall Douglas via Boost <boost@lists.boost.org> wrote:

...

On 26/03/2017 05:08, Lee Clagett via Boost wrote:

...
On Sat, 25 Mar 2017 16:22:50 -0400 Vinnie Falco via Boost <boost@lists.boost.org> wrote:

...
On Sat, Mar 25, 2017 at 4:01 PM, Lee Clagett via Boost <boost@lists.boost.org> wrote:

...
The other responses to this thread reiterated what I thought could occur - there should be corruption "races" from a write call to file sync completion.

NuDB makes the same assumptions regarding the underlying file system capabilities as SQLite. In particular, if there are two calls to fsync in a row, it assumes that the first fsync will complete before the second one starts. And that upon return from a successful call to fsync, the data has been written.

I think SQLite makes more stringent assumptions - that between the write and the sync the file metadata and sectors can be written in any order. And one further - that a single sector could be partially written but only sequentially forwards or backwards. This last assumption sounds like a legacy assumption from spinning disks.

If you want durability across multiple OSs and filing systems, you need to assume that fsync's are reordered with respect to one another. All major databases assume this, and so should NuDB, or else NuDB needs to remove all claims regarding durability of any kind. In fact, you might as well remove the fsync code path entirely, from everything I've seen to date its presence provides a false sense of assurance which is much worse than providing no guarantees at all.

You snipped my suggestion after this for swapping the order of the header write, which assumed/implied `_POSIX_SYNCHRONIZED_IO` due to the reasons you established previously. The only caveat you gave earlier was that Linux did not implement this properly for the metadata flushing on some of its file systems. Which means the log file could be "empty" after a restart in the current implementation, even if both fsync checkpoints completed. At that point the hope on my part was that the sector metadata for the log file header remained unchanged on the overwrite. And also hope that the log file open on initialization with a zeroed header would provide ample time for a metadata flush. This was a crappy (longer but still) race condition, AND of course there are the COW filesystems which have to change the sector for the feature! And since Linux `_POSIX_SYNCHRONIZED_IO` is (apparently) partially meaningless in Linux, BTRFS is madness. So the write order swap suggestion _possibly_ improves durability on some subset of filesystems. But most users are going to be ext4 Linux anyway, which sounds like one of the problematic cases.

...

Additionally, strongly consider a O_SYNC based design instead of an fsync based design. fsync() performs pathologically awful on copy-on-write filing systems, it unavoidably forces a full RCU cycle of multiple blocks. Opening the fd with O_SYNC causes COW filing systems to use an alternate caching algorithm, one without pathological performance.

Note that O_SYNC on some filing systems still has the metadata reordering problem. You should always assume that fsync/O_SYNC writes are reordered with respect to one another across inodes. They are only sequentially consistent within the same inode when performed on the same fd.

Again, I'd personally recommend you just remove all durability claims entirely, and remove the code claiming to implement it as an unnecessary overhead. You need to start with a design that assumes the filing system reorders everything all the time, it can't be retrofitted.

...
...
When there is a power loss or device failure, it is possible that recent insertions are lost. The library only guarantees that there will be no corruption. Specifically, any insertions which happen after a commit, might be rolled back if the recover process is invoked. Since the commit process runs every second, not much will be lost.

...
Writing the blocks to the log file are superfluous because it is writing to multiple sectors and there is no mechanism to detect a partial write after power failure.

Hmm, I don't think there's anything superfluous in this library. The log file is a "rollback file." It contains blocks from the key file in the state they were in before being modified. During the commit phase, nothing in the key file is modified until all of the blocks intended to be modified are first backed up to the log file. If the power goes out while these blocks are written to the log file, there is no loss.

You can never assume writes to one inode will reach storage before another in portable code. You can only assume in portable code that writes to the same inode via the same fd will reach storage in the order issued.

You chopped my response here too, and I think this was in response to the COW + inode suggestion. If the design knew COW was available for the filesystem in use, couldn't it also know whether data + metadata is synchronized as expected? The suggestion clearly was not portable anyway.

...

...
...
...
Was the primary decision for the default hash implementation performance?

If you're talking about xxhasher, it was chosen for being the best balance of performance, good distribution properties, and decent security. NuDB was designed to handle adversarial inputs since most envisioned use-cases insert data from untrusted sources / the network.

In which case you did not make a great choice.

Much, much better would be Blake2b. 2 cycles/byte, cryptographically secure, collision probability exceeds life of the universe.

Niall

Lee

Niall Douglas

10:07 p.m.

...

You snipped my suggestion after this for swapping the order of the header write, which assumed/implied `_POSIX_SYNCHRONIZED_IO` due to the reasons you established previously. The only caveat you gave earlier was that Linux did not implement this properly for the metadata flushing on some of its file systems. Which means the log file could be "empty" after a restart in the current implementation, even if both fsync checkpoints completed.

The reason I snipped it was because the original algorithm is broken, and so is yours. You are not conceptualising the problem correctly: consider storage after sudden power loss to be exactly the same as a malicious attacker on the internet capable of manipulating bits on storage to any value in order to cause misoperation. That includes making use of collisions in weak hashes to direct your power loss recovery implementation to sabotage and destroy additional data after the power loss. The claim of durability in ACID is a very, very strong claim. You are explicitly guaranteeing when you claim durability that after power loss your database will always *perfectly* match a correct state from some time before power loss. That specifically means that all transactions up to that point are whole and complete, and all data is exactly correct, and no partial anything or corrupted anything is there. NuDB is not using cryptographically strong hashing, and is therefore subject to collision induced post power loss data loss except on filing systems which provide strong guarantees that corrupted data will never appear. ZFS and ReFS are one of those, ext4 mounted with "data=journal" also is. If NuDB clearly said in its docs "no durability guarantees except on this list of filing systems and mount options: ..." I'd be happy. But it does not: it makes claims which are obviously wrong. And that's fine as some library somewhere on github, but if it wants to enter Boost, it needs to not mislead people or make claims which are patently untrue.

...

...
You can never assume writes to one inode will reach storage before another in portable code. You can only assume in portable code that writes to the same inode via the same fd will reach storage in the order issued.

You chopped my response here too, and I think this was in response to the COW + inode suggestion. If the design knew COW was available for the filesystem in use, couldn't it also know whether data + metadata is synchronized as expected? The suggestion clearly was not portable anyway.

COW filing systems generally offer much stronger guarantees than non-COW filing systems. You are correct that if you are operating on one of those, you can skip a ton of work to implement durability. This is why AFIO v2 has "storage profiles" where ZFS, ReFS and BtrFS are all top of tree in terms of disabling work done by AFIO and its clients. FAT32, meanwhile, sits at the very bottom. Niall -- ned Productions Limited Consulting http://www.nedproductions.biz/ http://ie.linkedin.com/in/nialldouglas/

Vinnie Falco

10:20 p.m.

On Sun, Mar 26, 2017 at 6:07 PM, Niall Douglas via Boost <boost@lists.boost.org> wrote:

...

...That includes making use of collisions in weak hashes to direct your power loss recovery implementation to sabotage and destroy additional data after the power loss.

No idea what you're going on about here, NuDB's recovery process doesn't recalculate hashes.

...

NuDB is not using cryptographically strong hashing, and is therefore subject to collision induced post power loss data loss.

Again no idea what you're going on about here, the presence of collisions in the hash function in no way harms the integrity of a NuDB database. When two keys have the same hash value, they simply occupy adjacent slots. The fetch() algorithm iterates all entries in the bucket which have the same hash value: https://github.com/vinniefalco/NuDB/blob/master/include/nudb/impl/basic_stor... The database constants were chosen to make a collision incredibly unlikely, and the implementation has code to explicitly handle the case where a collision happens. As I said before, NuDB was designed for adversarial inputs, each database has a randomly calculated "salt" which permutes the hash function. An attacker would have to compromise the machine and extract the salt in order to reliably produce keys which hash to the same value. NuDB assumes attackers do not have root access to the machine upon which it runs (or else anything is possible really).

...

If NuDB clearly said in its docs "no durability guarantees except on this list of filing systems and mount options: ..." I'd be happy.

There are definitely filing systems and mount options which are unsuitable for NuDB. All I can reliably say is that it works for some subset of all possible systems, without explicitly mentioning what those systems are - only because I am just a humble peon from a small podunk who doesn't know the way of things. However, if there was someone out there who was worldly in their travels with much knowledge and experience across a broad spectrum of filesystems, operating systems, mount options, and hardware, who perhaps has dedicated considerable time to developing file libraries of their own - maybe that person could leverage their immense wisdom to helping me compile such a list of filing systems and mount options? Thanks!

Lee Clagett

28 Mar 28 Mar

12:45 p.m.

On Sun, 26 Mar 2017 23:07:14 +0100 Niall Douglas via Boost <boost@lists.boost.org> wrote:

...

...
You snipped my suggestion after this for swapping the order of the header write, which assumed/implied `_POSIX_SYNCHRONIZED_IO` due to the reasons you established previously. The only caveat you gave earlier was that Linux did not implement this properly for the metadata flushing on some of its file systems. Which means the log file could be "empty" after a restart in the current implementation, even if both fsync checkpoints completed.

The reason I snipped it was because the original algorithm is broken, and so is yours. You are not conceptualising the problem correctly: consider storage after sudden power loss to be exactly the same as a malicious attacker on the internet capable of manipulating bits on storage to any value in order to cause misoperation. That includes making use of collisions in weak hashes to direct your power loss recovery implementation to sabotage and destroy additional data after the power loss.

I do understand the problem - I stated 3 times in this thread that even writing the small header to a single sector could still be problematic. My suggestions were primarily trying to tweak the existing design to improve durability with minimal impact. If I convinced Vinnie (and perhaps even myself) that writing the log header after its contents could reduce the probability of an undetected incomplete write, the next obvious suggestion was to append a cryptographic hash of the header(s). The buckets in the log would then be valid if fsync blocking until metadata + data completion can be assumed. Even if the hardware lies about immediately writing to the physical medium, it should still be reducing the time window where data loss can occur. Hashing over the entire log file would be a more portable/optimal solution, but adds _more_ CPU time and would deviate from the current implementation a bit more. I think there is a point where handling difficult filesystems and hardware is out of scope for this library. If the library cannot assume that a returned fsync call means the hardware "stored" the data + metadata, it could make the implementation more complex/costly. Checking for `_POSIX_SYNCHRONIZED_IO` and calling an OSX `fnctl` instead of `fsync` is probably the limit of actions a library like NuDB should take. NuDB already has a file concept that needs documenting and formalizing before any potential boost review. These harder edge cases could be provided by an implementation of this concept instead of NuDB directly. If the highly durable implementation required a noticeable amount of CPU cycles, existing and new users of the library could remain on the potentially less durable and faster direct platform versions that "steals" less CPU cycles from their system. [...snip...]

...

...
...
You can never assume writes to one inode will reach storage before another in portable code. You can only assume in portable code that writes to the same inode via the same fd will reach storage in the order issued.

You chopped my response here too, and I think this was in response to the COW + inode suggestion. If the design knew COW was available for the filesystem in use, couldn't it also know whether data + metadata is synchronized as expected? The suggestion clearly was not portable anyway.

COW filing systems generally offer much stronger guarantees than non-COW filing systems. You are correct that if you are operating on one of those, you can skip a ton of work to implement durability. This is why AFIO v2 has "storage profiles" where ZFS, ReFS and BtrFS are all top of tree in terms of disabling work done by AFIO and its clients. FAT32, meanwhile, sits at the very bottom.

Getting the NuDB file concept to work with AFIOv2 seems like it could be very useful then. Does v2 have a method for specifying dependency order on writes (I couldn't find any)? I thought v1 had this feature - does v2 drop it? Lee

Vinnie Falco

12:59 p.m.

On Tue, Mar 28, 2017 at 8:45 AM, Lee Clagett via Boost <boost@lists.boost.org> wrote:

...

...writing the log header after its contents could reduce the probability of an undetected incomplete write

The recovery test simulates partial writes: https://github.com/vinniefalco/NuDB/blob/master/extras/nudb/test/fail_file.h...

...

NuDB already has a file concept that needs documenting and formalizing before any potential boost review.

http://vinniefalco.github.io/nudb/nudb/types/File.html Thanks

Lee Clagett

29 Mar 29 Mar

12:26 a.m.

On Tue, 28 Mar 2017 08:59:10 -0400 Vinnie Falco via Boost <boost@lists.boost.org> wrote:

...

On Tue, Mar 28, 2017 at 8:45 AM, Lee Clagett via Boost <boost@lists.boost.org> wrote:

...
...writing the log header after its contents could reduce the probability of an undetected incomplete write

The recovery test simulates partial writes: https://github.com/vinniefalco/NuDB/blob/master/extras/nudb/test/fail_file.h...

This is simulating a write I/O error, not a power failure. Even with the assumption that a returned `fsync` has fully stored the data on disk, the recovery algorithm could be opening a log file which called `write` but never returned from `fsync`. That file could have enough "space" for a bucket, but lack the proper contents of the bucket itself. There is an inherent race between writing and the completion of an `fsync` that will go unnoticed by the current recovery algorithm on some filesystem configurations. The only "portable" fixes I've seen are: (1) cryptographic hashes, (2) hoping that changing path to inode mappings is all-or-nothing, or (3) hoping that _overwriting_ a single sector last will be all-or-nothing. Both (2) and (3) still depend on the filesystem + hardware AFAIK, BUT probably work with more filesystem and hardware configurations.

...

...
NuDB already has a file concept that needs documenting and formalizing before any potential boost review.

http://vinniefalco.github.io/nudb/nudb/types/File.html

I did miss this. Defining the concept in terms of "records" might be more useful for storing a cryptographic hash in the manner Niall was mentioning for SQLite. I think it could allow per record corruption detection, so that the entire DB wasn't punted after an incomplete write. Lee

Vinnie Falco

12:49 a.m.

On Tue, Mar 28, 2017 at 8:26 PM, Lee Clagett via Boost <boost@lists.boost.org> wrote:

...

This is simulating a write I/O error, not a power failure. Even with the assumption that a returned `fsync` has fully stored the data on disk, the recovery algorithm could be opening a log file which called `write` but never returned from `fsync`. That file could have enough "space" for a bucket, but lack the proper contents of the bucket itself. There is an inherent race between writing and the completion of an `fsync` that will go unnoticed by the current recovery algorithm on some filesystem configurations.

Okay, I can almost understand what you're saying. Perhaps a bit more guidance would help. Can you walk me through the scenario you're describing, making reference to the current commit algorithm? Its here: https://github.com/vinniefalco/NuDB/blob/master/include/nudb/impl/basic_stor... My goals is to improve the fail_file used in the tests to capture your case. Thanks

Niall Douglas

28 Mar 28 Mar

1:27 p.m.

...

I think there is a point where handling difficult filesystems and hardware is out of scope for this library.

I agree. Just don't claim any guarantees about reliability or data safety and I'm totally happy. I would then advise that if you're not implementing durability, you might as well remove the inefficiency of keeping write journals etc and use a design which goes even faster. I mentioned some time ago that even with AFIO v2 today in its very alpha state, I reckon in a few hundred lines of code I can meet or beat NuDB precisely because I would assume no data integrity guarantees and that lets me use a single file for everything. That, in turn, allows concurrent inserts, and I suspect I can also allow deletion. Simpler is better. I would of course not recommend that anyone use AFIO v2 in its present state. It's also currently not compiling on POSIX, it awaits some TLC from me after the ACCU conference next month.

...

...
COW filing systems generally offer much stronger guarantees than non-COW filing systems. You are correct that if you are operating on one of those, you can skip a ton of work to implement durability. This is why AFIO v2 has "storage profiles" where ZFS, ReFS and BtrFS are all top of tree in terms of disabling work done by AFIO and its clients. FAT32, meanwhile, sits at the very bottom.

Getting the NuDB file concept to work with AFIOv2 seems like it could be very useful then. Does v2 have a method for specifying dependency order on writes (I couldn't find any)? I thought v1 had this feature - does v2 drop it?

The AFIO v1 review indicated that the community did not want AFIO to implement a mechanism for dependency ordering. They wanted just a barebones file i/o abstraction layer, so you get a completion handler callback and that's all. Niall -- ned Productions Limited Consulting http://www.nedproductions.biz/ http://ie.linkedin.com/in/nialldouglas/

Vinnie Falco

1:42 p.m.

On Tue, Mar 28, 2017 at 9:27 AM, Niall Douglas via Boost <boost@lists.boost.org> wrote:

...

I reckon in a few hundred lines of code I can meet or beat NuDB

Is there a repository link where I can try out your code?

Niall Douglas

2:25 p.m.

On 28/03/2017 14:42, Vinnie Falco via Boost wrote:

...

On Tue, Mar 28, 2017 at 9:27 AM, Niall Douglas via Boost <boost@lists.boost.org> wrote:

...
I reckon in a few hundred lines of code I can meet or beat NuDB

Is there a repository link where I can try out your code?

AFIO v2 is a collection of primitive filesystem algorithms. The idea is that it's a toolkit for quickly mashing together custom file system programs like your NuDB in minimum time and effort. So there is nothing in there which looks like NuDB. Docs, such as they are, are at https://ned14.github.io/boost.afio/. Github repo is obvious. And as I mentioned, the CI has been reporting it doesn't build for some months now, so I'd doubt you can try anything until after the ACCU conference. If I were going to write in a few hundred lines something like NuDB, for the value store file, I'd make it sparse and do atomic append as you do, and use hole punching to delete and free storage. I'd create a shared memory mapped view of the key file. I'd use something like https://github.com/ned14/boost-lite/blob/master/include/algorithm/open_hash_... for the indexing algorithm in that memory mapped key file, it has a concurrency safe policy class called atomic_linear_memory_policy which puts a spinlock on each hash table item, and so is concurrent safe. That would be my first attempt, and it would be easy. But I think there is a better algorithm in there, one based on a very large single sparse file 16 Exabyte long which is itself is one giant open addressed hash table. With a very high quality hash, you simply open hash address to the nearest 4Kb block and your maximum key + value is 4Kb with larger values being spilled into a separate largevalue file. Small value cluster packing might be worth implementing as well if values are much smaller than 4Kb, so the idea is you pack nearby items into the same 4Kb extent where possible and use a linear scan to find them inside the 4Kb block. The second algorithm I think would be much faster than your algorithm, but it's pure guesswork on my behalf on the basis that you do one i/o per read or write as it's a single file. It easily might be much slower. Either way, the first algorithm is basically the same as yours, but simplified down to the bare essentials. It should perform as well as yours, and likely better due to no write log file etc. Niall -- ned Productions Limited Consulting http://www.nedproductions.biz/ http://ie.linkedin.com/in/nialldouglas/

Vinnie Falco

2:34 p.m.

On Tue, Mar 28, 2017 at 10:25 AM, Niall Douglas via Boost <boost@lists.boost.org> wrote:

...

If I were going to write...a better algorithm...would be much faster than your algorithm...It should perform as well as yours, and likely better...

I have not yet learned how to compare the performance of one library against another, yet-unwritten library so how much trouble would it be for you to put together a simple example - perhaps using these "primitive filesystem algorithms" which assist in rapid development?

Niall Douglas

3:16 p.m.

On 28/03/2017 15:34, Vinnie Falco via Boost wrote:

...

On Tue, Mar 28, 2017 at 10:25 AM, Niall Douglas via Boost <boost@lists.boost.org> wrote:

...
If I were going to write...a better algorithm...would be much faster than your algorithm...It should perform as well as yours, and likely better...

I have not yet learned how to compare the performance of one library against another, yet-unwritten library so how much trouble would it be for you to put together a simple example - perhaps using these "primitive filesystem algorithms" which assist in rapid development?

Not much effort. But I have much higher priority items between now and the ACCU conference as I'm one of the major speakers. Niall -- ned Productions Limited Consulting http://www.nedproductions.biz/ http://ie.linkedin.com/in/nialldouglas/

Asbjørn

4:58 p.m.

On 28.03.2017 15:27, Niall Douglas via Boost wrote:

...

...
I think there is a point where handling difficult filesystems and hardware is out of scope for this library.

I agree. Just don't claim any guarantees about reliability or data safety and I'm totally happy. I would then advise that if you're not implementing durability, you might as well remove the inefficiency of keeping write journals etc and use a design which goes even faster.

Isn't the point here that NuDB's guarantees depend on the OS, filesystem and hardware? If the OS filesystem and hardware guarantees that fsync's won't be reordered, then NuDB can guarantee that it is durable (modulo any misunderstanding on my part). If so, why throw it all away? Maybe the user has an OS, a filesystem and some hardware which can guarantee this? Regards - Asbjørn

Niall Douglas

29 Mar 29 Mar

6:18 a.m.

On 28/03/2017 17:58, Asbjørn via Boost wrote:

...

On 28.03.2017 15:27, Niall Douglas via Boost wrote:

...
...
I think there is a point where handling difficult filesystems and hardware is out of scope for this library.

I agree. Just don't claim any guarantees about reliability or data safety and I'm totally happy. I would then advise that if you're not implementing durability, you might as well remove the inefficiency of keeping write journals etc and use a design which goes even faster.

Isn't the point here that NuDB's guarantees depend on the OS, filesystem and hardware?

The point I am trying to make is that NuDB's guarantees need to NOT depend on the OS, filesystem and hardware. Otherwise they are not valuable guarantees. You can and should make your durability implementation totally independent of the system you are running on. It should perform as perfectly as possible in response to whatever messed up state turns up after power loss. Whatever is lost is lost, the *key* feature is that damaged data doesn't cause further data loss.

...

If the OS filesystem and hardware guarantees that fsync's won't be reordered, then NuDB can guarantee that it is durable (modulo any misunderstanding on my part).

If they work on all systems with all hardware then yes, with careful writing you can achieve durability using write ordering. BSD's UFS is the classic example of such an implementation. But far better to not rely on fsync working at all. Remember, fsync is permitted to do nothing, and it does return before data is on storage on at least one major OS (OS X).

...

If so, why throw it all away? Maybe the user has an OS, a filesystem and some hardware which can guarantee this?

Because a proper implementation of durability should be able to use no fsync and no O_SYNC at all. In that case, you get "late durability" where minutes of recent writes get lost after power loss. For users where that is unacceptable, O_SYNC should be turned on and you now have "early durability" where only seconds may be lost. You pay for that early durability with much reduced performance. fsync is the worst method you can use. It has the worst semantics, the worst performance and is unreliable. Everybody should be using O_SYNC where possible instead, and your code should still work perfectly without either O_SYNC nor fsync working. Then your implementation can correctly claim to implement durability. Niall -- ned Productions Limited Consulting http://www.nedproductions.biz/ http://ie.linkedin.com/in/nialldouglas/

Asbjørn

9:13 a.m.

On 29.03.2017 08:18, Niall Douglas via Boost wrote:

...

Whatever is lost is lost, the *key* feature is that damaged data doesn't cause further data loss.

I'm struggling to see how you can guarantee that without _any_ guarantees from the OS or hardware.

...

...
If so, why throw it all away? Maybe the user has an OS, a filesystem and some hardware which can guarantee this?

Because a proper implementation of durability should be able to use no fsync and no O_SYNC at all. In that case, you get "late durability" where minutes of recent writes get lost after power loss. For users where that is unacceptable, O_SYNC should be turned on and you now have "early durability" where only seconds may be lost. You pay for that early durability with much reduced performance.

Without O_SYNC and fsync, replace "minutes" with "hours" or "days". This may be entirely unacceptable. With O_SYNC you get horrible performance as you note, which may be entirely unacceptable. Also, I'm assuming the hardware may ignore the O_SYNC as much as it can ignore the fsync, in which case you're SOL anyway. Regards - Asbjørn

Niall Douglas

11:07 a.m.

On 29/03/2017 10:13, Asbjørn via Boost wrote:

...

On 29.03.2017 08:18, Niall Douglas via Boost wrote:

...
Whatever is lost is lost, the *key* feature is that damaged data doesn't cause further data loss.

I'm struggling to see how you can guarantee that without _any_ guarantees from the OS or hardware.

The lack of guarantees only refers to post-power-loss data integrity. And, as you've mentioned, it's only a portability concern. Specific combinations of OS kernel version, SCSI controller, SSD etc have excellent guarantees. The trouble is you can't know whether your particular combination works reliably or not, or whether it is still working reliably or not. For the implementation of Durability, one can assume that everything works perfectly in between power loss events. That in itself is a bit risky due to storage bit rot, cosmic ray bitflips and so on, but that's a separate matter to Durability. (incidentally, AFIO v2 provides a fast templated SECDED class letting you repair bitflips from parity info, handy for mature cold storage)

...

...
...
If so, why throw it all away? Maybe the user has an OS, a filesystem and some hardware which can guarantee this?

Because a proper implementation of durability should be able to use no fsync and no O_SYNC at all. In that case, you get "late durability" where minutes of recent writes get lost after power loss. For users where that is unacceptable, O_SYNC should be turned on and you now have "early durability" where only seconds may be lost. You pay for that early durability with much reduced performance.

Without O_SYNC and fsync, replace "minutes" with "hours" or "days". This may be entirely unacceptable. With O_SYNC you get horrible performance as you note, which may be entirely unacceptable.

A filing system which takes hours to send dirty blocks to storage is buggy or misconfigured. Most will write out dirty blocks within 30 seconds of modification whatever the situation. You are probably referring to "live blocks", so in the past, especially on Linux, if you repeatedly modified the same block frequently it would get its age counter reset and so might never be written out. I don't believe any recent Linux kernel has that problem any more. "First dirtied" timestamps are kept separate to "last dirtied" nowadays. If a dirtied block is too old, it'll get flushed. There is still a problem I believe on Windows with FAT where live blocks may take far too long to hit storage. But most USB disks are mounted with write through semantics, so you shouldn't see that problem in most modern systems which don't tend to have FAT drives with writeback caching.

...

Also, I'm assuming the hardware may ignore the O_SYNC as much as it can ignore the fsync, in which case you're SOL anyway.

Oh yes. You should also assume O_SYNC does nothing. On some systems, or some configurations of systems (e.g. inside lxc containers) it really does do nothing. Thankfully, most lxc containers I've seen in the wild only disable fsync, not O_SYNC. Which is another very good reason to not use fsync - people running your code inside a lxc container get a false sense of security. Niall -- ned Productions Limited Consulting http://www.nedproductions.biz/ http://ie.linkedin.com/in/nialldouglas/

Thijs van den Berg

11:42 a.m.

IMO it would be useful to have a “fast key/value insert-only” storage It’s design *could* isolate a storage component that currently uses fsync, but which could be swappable with other storage providers that have different performance/ durability guarantees. Different users would want different properties here. That storage provider could have value in its own right and the main challenge is to define a good public interface. NuDB can provide it's default storage and the claimed properties of that could be documented. Personally I can live with fsync as a default choice, I would look at hardware options instead when data is very important to me.

Asbjørn

6:54 p.m.

On 29.03.2017 13:07, Niall Douglas via Boost wrote:

...

On 29/03/2017 10:13, Asbjørn via Boost wrote:

...
On 29.03.2017 08:18, Niall Douglas via Boost wrote:

...
Whatever is lost is lost, the *key* feature is that damaged data doesn't cause further data loss.

I'm struggling to see how you can guarantee that without _any_ guarantees from the OS or hardware.

The lack of guarantees only refers to post-power-loss data integrity.

But that's not what you wrote. You said: "The point I am trying to make is that NuDB's guarantees need to NOT depend on the OS, filesystem and hardware. Otherwise they are not valuable guarantees." Surely any post-power-loss integrity guarantees are intimately related to between-power-loss guarantees as the data is being written in the "between" state right until the power goes. As an extreme example, if the OS does not guarantee your data will be written unmodified in the "between-power-loss" state, that is, it may write random data instead, then that directly affects the post-power-loss integrity. How could NuDB code around this? Surely at some point a program/library like NuDB must rely on _something_ from the OS, filesystem and hardware in order to claim anything about post-power-loss integrity of its data? Regards - Asbjørn

Niall Douglas

9:04 p.m.

On 29/03/2017 19:54, Asbjørn via Boost wrote:

...

On 29.03.2017 13:07, Niall Douglas via Boost wrote:

...
On 29/03/2017 10:13, Asbjørn via Boost wrote:

...
On 29.03.2017 08:18, Niall Douglas via Boost wrote:

...
Whatever is lost is lost, the *key* feature is that damaged data doesn't cause further data loss.

I'm struggling to see how you can guarantee that without _any_ guarantees from the OS or hardware.

The lack of guarantees only refers to post-power-loss data integrity.

But that's not what you wrote. You said:

"The point I am trying to make is that NuDB's guarantees need to NOT depend on the OS, filesystem and hardware. Otherwise they are not valuable guarantees."

I thought I was clear, but let me rephrase so: "The point I am trying to make is that NuDB's [post power loss data integrity] guarantees need to NOT depend on the OS, filesystem and hardware. Otherwise they are not valuable guarantees."

...

Surely any post-power-loss integrity guarantees are intimately related to between-power-loss guarantees as the data is being written in the "between" state right until the power goes.

As an extreme example, if the OS does not guarantee your data will be written unmodified in the "between-power-loss" state, that is, it may write random data instead, then that directly affects the post-power-loss integrity. How could NuDB code around this?

Surely at some point a program/library like NuDB must rely on _something_ from the OS, filesystem and hardware in order to claim anything about post-power-loss integrity of its data?

You are not wrong that memory corruption will inevitably affect what lands on storage. There is even a chance that the storage is fine, but a read got corrupted going into memory. There are nine sigma reliability CPUs that keep two parity bits per byte to solve the trustworthiness problem, but let's assume we're talking consumer hardware. When I say that you can assume that in between sudden power loss events everything works, but not across power loss events, I am talking about probabilities of data loss. So, between sudden power less events the chances of bits getting flipped somewhere important is very low. That's why your laptop, which may be turned on for weeks, doesn't blue screen usually even though quite a few bits will have been flipped by cosmic rays. The chances are also pretty good nowadays that your OS has been well tested, as has been your filesystem and your SSD, at least well tested in the code paths regularly executed which is 99.99% of what you will ever do. So given that almost all of the software on your computer isn't constantly crashing and file systems are not always corrupting, I'd say the chances of data loss between sudden power loss events is very low, perhaps no more frequently than once per month. Your hardware and OS and filesystem and SSD probably have been well tested *individually* for sudden power loss under certain canned test scenarios. The chances of them being tested together as a system is extremely low - perhaps only Apple do so with their MacBooks. For data loss to occur, just one thing needs to go wrong, whereas for data loss to not occur every single thing needs to go right. It's hard to come up with concrete probabilities, but as that paper Lee linked to points out, everybody writing storage algorithms - even the professionals - consistently gets sudden power loss wrong without fail. That paper found power loss bugs in the OS, filing systems, all the major databases and source control implementations and so on. This is despite all of those being written and tested very carefully for power loss correctness. They ALL made mistakes. So I'm going to throw out there that after power loss there is a strong probability that storage has got screwed up somehow by someone in the chain of bits of software between your code and the bits on the medium. That's why you should be paranoid about correct state after power loss, but between power loss events you can probably safely take off the tinfoil hat. Does this make more sense? Niall -- ned Productions Limited Consulting http://www.nedproductions.biz/ http://ie.linkedin.com/in/nialldouglas/

Vinnie Falco

9:18 p.m.

On Wed, Mar 29, 2017 at 5:04 PM, Niall Douglas via Boost <boost@lists.boost.org> wrote:

...

as that paper Lee linked to points out, everybody writing storage algorithms - even the professionals - consistently gets sudden power loss wrong without fail.

That paper found power loss bugs in the OS, filing systems, all the major databases and source control implementations and so on. This is despite all of those being written and tested very carefully for power loss correctness. They ALL made mistakes.

I have to agree. For now I am withdrawing NuDB from consideration - the paper that Lee linked is very informative. However just to make sure I have the scenario that Lee pointed out in my head, here's the sequence of events: 1. NuDB makes a system call to append to the file 2. The file system increases the metadata indicating the new file size 3. The file system writes the new data into the correct location Lee, and I think Niall (hard to tell through the noise), are saying that if a crash occurs after 2 but before or during 3, the contents of the new portion of the file may be undefined. Is this correct? If so, then I do need to go back and make improvements to prevent this. While its true that I have not seen a corrupted database despite numerous production deployments and over 2TB data file, it would seem this case is sufficiently rare (and data-center hardware sufficiently reliable) that it is unlikely to have come up.

Lee Clagett

30 Mar 30 Mar

12:34 p.m.

On Wed, 29 Mar 2017 17:18:23 -0400 Vinnie Falco via Boost <boost@lists.boost.org> wrote:

...

On Wed, Mar 29, 2017 at 5:04 PM, Niall Douglas via Boost <boost@lists.boost.org> wrote:

...
as that paper Lee linked to points out, everybody writing storage algorithms - even the professionals - consistently gets sudden power loss wrong without fail.

That paper found power loss bugs in the OS, filing systems, all the major databases and source control implementations and so on. This is despite all of those being written and tested very carefully for power loss correctness. They ALL made mistakes.

I have to agree. For now I am withdrawing NuDB from consideration - the paper that Lee linked is very informative.

However just to make sure I have the scenario that Lee pointed out in my head, here's the sequence of events:

1. NuDB makes a system call to append to the file 2. The file system increases the metadata indicating the new file size 3. The file system writes the new data into the correct location

Lee, and I think Niall (hard to tell through the noise), are saying that if a crash occurs after 2 but before or during 3, the contents of the new portion of the file may be undefined.

Is this correct?

The best explanation of the problem I have seen is [described in a paper discussing the implementation details of the ext filesystem on linux][0]. The paper also discusses the various issues that the filesystem designers have had to face, which has been helpful to me. The thing to remember is that the filesystem metadata is not just the filesize; the filesystem actually has to write information about which portions of the disk are in use for the file. This is why a crash during an append could contain old log file contents after a restart - the filesystem added pointers to new sectors but not the data at those pointers.

...

If so, then I do need to go back and make improvements to prevent this. While its true that I have not seen a corrupted database despite numerous production deployments and over 2TB data file, it would seem this case is sufficiently rare (and data-center hardware sufficiently reliable) that it is unlikely to have come up.

Yes it is likely pretty rare, especially on a journaled filesystem. The system has to halt a very specific point in time. This is why I suggested swapping the order of the writes to: write_buckets -> fsync -> write_header -> fsync ... write_header(zeroes) -> fsync -> truncate(header_size - not `0`) This still has implementation/system defined behavior, but overwriting a single sector is more likely to be "atomic" from the perspective of the filesystem (but not necessarily the hard-drive). And it didn't require massive structural changes. Writing out a cryptographic hash of the header would leave a single assumption - fsync is a proper write barrier in the OS/filesystem and in the hard-drive. Niall has been particularly harsh on fsync, but I do not think its all bad. With the exception of OSX, it seems that many filesystems implement it properly (might regret saying this), and a user can purchase an "enterprise" hard-drive that is not trying to artificial boost benchmarks stats. At the very least the number of assumptions has been decreased. FWIW, I _think_ Niall's suggestion to remove the log file also might be an interesting to investigate. Lee [0] http://pages.cs.wisc.edu/~remzi/OSTEP/file-journaling.pdf

Niall Douglas

1:56 p.m.

...

barrier in the OS/filesystem and in the hard-drive. Niall has been particularly harsh on fsync, but I do not think its all bad. With the exception of OSX, it seems that many filesystems implement it properly (might regret saying this), and a user can purchase an "enterprise" hard-drive that is not trying to artificial boost benchmarks stats. At the very least the number of assumptions has been decreased.

For a general purpose collection of libraries such as Boost, you can't be imposing things like users have to use specific hardware X. It is not particularly hard, nor performance killing, to design a fast key value store which has no reliance on fsync, trunc nor O_SYNC. A correctly designed library ready for Boost would therefore have no reliance on fsync, trunc or OS_SYNC for correct operation. That's what we need to aim for. Libraries useful across a wide set of hardware and configurations. Niall -- ned Productions Limited Consulting http://www.nedproductions.biz/ http://ie.linkedin.com/in/nialldouglas/

Lee Clagett

2:06 p.m.

On Thu, 30 Mar 2017 08:34:07 -0400 Lee Clagett <forum@leeclagett.com> wrote:

...

On Wed, 29 Mar 2017 17:18:23 -0400 Vinnie Falco via Boost <boost@lists.boost.org> wrote:

...
On Wed, Mar 29, 2017 at 5:04 PM, Niall Douglas via Boost <boost@lists.boost.org> wrote:

...
as that paper Lee linked to points out, everybody writing storage [...snip...]

This still has implementation/system defined behavior, but overwriting a single sector is more likely to be "atomic" from the perspective of the filesystem (but not necessarily the hard-drive). And it didn't require massive structural changes. Writing out a cryptographic hash of the header would leave a single assumption - fsync is a proper write barrier in the OS/filesystem and in the hard-drive. Niall has been particularly harsh on fsync, but I do not think its all bad. With the exception of OSX, it seems that many filesystems implement it properly (might regret saying this), and a user can purchase an "enterprise" hard-drive that is not trying to artificial boost benchmarks stats. At the very least the number of assumptions has been decreased.

I forgot that even with the cryptographic hash this algorithm is assuming that the filesystem does not change sectors during a single sector overwrite. Otherwise it could point back to a prior log header that was never actually overwritten. Crap. So a rare but still somewhat crummy non-portable issue. Lee

Peter Dimov

29 Mar 29 Mar

1:36 p.m.

Niall Douglas wrote:

...

Because a proper implementation of durability should be able to use no fsync and no O_SYNC at all. In that case, you get "late durability" where minutes of recent writes get lost after power loss. For users where that is unacceptable, O_SYNC should be turned on and you now have "early durability" where only seconds may be lost.

No such thing. "The durability property ensures that once a transaction has been committed, it will remain so, even in the event of power loss, crashes, or errors. In a relational database, for instance, once a group of SQL statements execute, the results need to be stored permanently (even if the database crashes immediately thereafter)." If you lose writes, it's not Durable, just Consistent. The page you linked, https://www.sqlite.org/howtocorrupt.html, says "Actually, if one is only concerned with atomic and consistent writes and is willing to forego durable writes, the sync operation does not need to wait until the content is completely stored on persistent media. Instead, the sync operation can be thought of as an I/O barrier. As long as all writes that occur before the sync are completed before any write that happens after the sync, no database corruption will occur. If sync is operating as an I/O barrier and not as a true sync, then a power failure or system crash might cause one or more previously committed transactions to roll back (in violation of the "durable" property of "ACID") but the database will at least continue to be consistent, and that is what most people care about."

Vinnie Falco

1:39 p.m.

On Wed, Mar 29, 2017 at 9:36 AM, Peter Dimov via Boost <boost@lists.boost.org> wrote:

...

once a group of SQL statements execute, the results need to be stored permanently (even if the database crashes immediately thereafter)."

Just a note, NuDB does not have this property. It is entirely possible that a call to nudb::store::insert will return, and if the database crashes afterwards the inserted value will not be present in the recovered database. This is by design. All that NuDB assures is that for a suitable set of operating systems and file systems, the recovered database will not be corrupt.

Peter Dimov

1:49 p.m.

Vinnie Falco wrote:

...

On Wed, Mar 29, 2017 at 9:36 AM, Peter Dimov via Boost <boost@lists.boost.org> wrote:

...
once a group of SQL statements execute, the results need to be stored permanently (even if the database crashes immediately thereafter)."

Just a note, NuDB does not have this property. It is entirely possible that a call to nudb::store::insert will return, and if the database crashes afterwards the inserted value will not be present in the recovered database. This is by design.

But you do guarantee that if this insert is present in the recovered database, all earlier inserts are too, right? This is what you need for the letter C.

Vinnie Falco

1:53 p.m.

On Wed, Mar 29, 2017 at 9:49 AM, Peter Dimov via Boost <boost@lists.boost.org> wrote:

...

But you do guarantee that if this insert is present in the recovered database, all earlier inserts are too, right? This is what you need for the letter C.

Yes. NuDB buffers insertions into memory (that's what the "cache" and "pool" classes are for). Once per second, or sooner depending on the insertion load, the implementation runs a "commit". The commit process acquires all the insertions buffered up to that point, renders the changed buckets in memory, creates a backup of the buckets in their before-modified state to the log file, applies the changes, and then resets the log file. The commit process is atomic, either it all goes through or it fails and the database is recovered to the state before the commit.

Olaf van der Spek

4:13 p.m.

On Wed, Mar 29, 2017 at 3:39 PM, Vinnie Falco via Boost <boost@lists.boost.org> wrote:

...

All that NuDB assures is that for a suitable set of operating systems and file systems, the recovered database will not be corrupt.

An empty DB matches that guarantee.. ;) -- Olaf

Niall Douglas

2:31 p.m.

On 29/03/2017 14:36, Peter Dimov via Boost wrote:

...

Niall Douglas wrote:

...
Because a proper implementation of durability should be able to use no fsync and no O_SYNC at all. In that case, you get "late durability" where minutes of recent writes get lost after power loss. For users where that is unacceptable, O_SYNC should be turned on and you now have "early durability" where only seconds may be lost.

No such thing.

It sure is. For example, https://sqlperformance.com/2014/04/io-subsystem/delayed-durability-in-sql-se.... Also, Linux and FreeBSD and even Windows lets you tune when dirty data is required to be sent to storage. That's late vs early durability too. In all storage algorithmic code, there is always a tension between late and early durability. There is no such thing as perfect durability, or rather, it is impossible to define what perfect durability would be given all the moving parts. So you have "as soon as possible" durability which usually comes with mediocre performance, or "somewhat later" durability with improving performance, with a gradual tradeoff continuum between. When designing durable storage algorithms, you choose where on that continuum you want to be, and you usually add a few knobs the user can turn too.

...

"The durability property ensures that once a transaction has been committed, it will remain so, even in the event of power loss, crashes, or errors. In a relational database, for instance, once a group of SQL statements execute, the results need to be stored permanently (even if the database crashes immediately thereafter)."

If you lose writes, it's not Durable, just Consistent.

Consistency refers to the referential integrity of the database. So, after power loss, if a database is Consistent then all its references are correct. Think of it like a directory containing references to inodes which don't exist. If that is never the case after power loss, your filing system implements Consistency. Durability refers to the writes themselves, so do they appear in whole as a transaction group or not? So if transaction A updates references which are used by transaction B, and if after power loss transaction A was damaged and transaction B was not, if you are Consistent then you also need to throw away transaction B during recovery. But if transaction B did not use any inputs from modifications by transaction A, then if you are Durable you MUST recover transaction B even though it occurred after the damaged transaction A which is thrown away. Does this make sense now?

...

The page you linked, https://www.sqlite.org/howtocorrupt.html, says

"Actually, if one is only concerned with atomic and consistent writes and is willing to forego durable writes, the sync operation does not need to wait until the content is completely stored on persistent media. Instead, the sync operation can be thought of as an I/O barrier. As long as all writes that occur before the sync are completed before any write that happens after the sync, no database corruption will occur. If sync is operating as an I/O barrier and not as a true sync, then a power failure or system crash might cause one or more previously committed transactions to roll back (in violation of the "durable" property of "ACID") but the database will at least continue to be consistent, and that is what most people care about."

fsync may be a reordering barrier per inode on some systems. It is rarely a reordering barrier across inodes, and as I keep saying but nobody appears to be listening, in portable code you should assume that fsync does nothing. POSIX allows fsync to do nothing, and common-in-the-wild configurations such as lxc containers routinely make fsync into a noop. So please stop saying "well only if we do X with fsync then it'll work". No it won't. Assume fsync = noop. Proceed from there when designing high quality, Boost-ready code. Niall -- ned Productions Limited Consulting http://www.nedproductions.biz/ http://ie.linkedin.com/in/nialldouglas/

Peter Dimov

3:06 p.m.

Niall Douglas wrote:

...

...
No such thing.

It sure is. For example, https://sqlperformance.com/2014/04/io-subsystem/delayed-durability-in-sql-se....

"Delayed durability" is not what traditionally the letter D in ACID means. It means that if the insert call has returned, the insert is in the database, period. It makes no provisions for up to 5 minutes of inserts being lost. This is simply not durable by the standard definition.

...

Consistency refers to the referential integrity of the database. ... So if transaction A updates references which are used by transaction B, and if after power loss transaction A was damaged and transaction B was not, if you are Consistent then you also need to throw away transaction B during recovery. But if transaction B did not use any inputs from modifications by transaction A, then if you are Durable you MUST recover transaction B even though it occurred after the damaged transaction A which is thrown away.

One definition of the letter C in ACID is that if you have issued transaction B after transaction A, the database is in one of three possible states: (none of A and B present), (A present), (both A and B present). It doesn't matter what references A or B update. You're absolutely not allowed to apply B without A on recovery. It's true that this meaning of consistency is not universally accepted. If you interpret C loosely, you have no letter with which to describe the above, which would now fall somewhere between C and D. So it probably makes sense to call it "delayed durability" if you consider "consistency" not that. As far as I can see, NuDB claims this exact property, subject to "NuDB's database integrity guarantees are only valid if the implementation of sync assures that all data is fully written to the underlying file before the call returns." "NuDB's database integrity guarantees are only valid if the implementation of trunc assures that subsequent calls to size will return o, even if the program is terminated or the device is taken offline before calling size." so I'm not quite sure what your objections are. You insist on him not calling NuDB durable, which he never did.

Niall Douglas

4:03 p.m.

On 29/03/2017 16:06, Peter Dimov via Boost wrote:

...

...
Consistency refers to the referential integrity of the database. ... So if transaction A updates references which are used by transaction B, and if after power loss transaction A was damaged and transaction B was not, if you are Consistent then you also need to throw away transaction B during recovery. But if transaction B did not use any inputs from modifications by transaction A, then if you are Durable you MUST recover transaction B even though it occurred after the damaged transaction A which is thrown away.

One definition of the letter C in ACID is that if you have issued transaction B after transaction A, the database is in one of three possible states: (none of A and B present), (A present), (both A and B present). It doesn't matter what references A or B update. You're absolutely not allowed to apply B without A on recovery.

Consistency is time invariant. If two transactions each affect totally unrelated data, if B was committed after A in time originally then during recovery if B was recoverable but A was not, B is indeed recovered without A. You'll find this effect breaking code sometimes because B will appear without the A because the database deduced they were not related. The cause is that you are supposed to create a reference in the database between pieces of data which are related in the program code, that way B always happens after A and B never appears without A. But sometimes programmers forget, or somebody doing optimisation removes the reference to increase speed, or sometimes the relationship is buried deep inside libraries and is non-obvious.

...

It's true that this meaning of consistency is not universally accepted. If you interpret C loosely, you have no letter with which to describe the above, which would now fall somewhere between C and D. So it probably makes sense to call it "delayed durability" if you consider "consistency" not that.

As far as I can see, NuDB claims this exact property, subject to

"NuDB's database integrity guarantees are only valid if the implementation of sync assures that all data is fully written to the underlying file before the call returns."

"NuDB's database integrity guarantees are only valid if the implementation of trunc assures that subsequent calls to size will return o, even if the program is terminated or the device is taken offline before calling size."

so I'm not quite sure what your objections are. You insist on him not calling NuDB durable, which he never did.

If he rewrote those two paragraphs above with the below I would have no objection apart from that his design is flawed and should not enter Boost without being substantially redesigned. "NuDB attempts to recover a usable database, but provides no guarantees of database integrity after unexpected power loss. Please do not use NuDB in situations where data stored is important." His earlier paragraphs suggest that if fsync and trunc behave according to the semantics he gives, you are guaranteed database integrity. As Lee showed earlier, that is a highly questionable claim in itself, and could cause users to lose data when they did not expect to. Most developers think fsync cannot be a noop, and don't know how delayed allocation and trunc works. They will get this wrong, and lose data. And that is misleading at best. So all I am asking for is to remove the misleading claim. Don't *guarantee* database integrity after power loss, say you try your best to recover the database, and warn don't use NuDB for important data. Then I'm happy. Niall -- ned Productions Limited Consulting http://www.nedproductions.biz/ http://ie.linkedin.com/in/nialldouglas/

Vinnie Falco

4:07 p.m.

On Wed, Mar 29, 2017 at 12:03 PM, Niall Douglas via Boost <boost@lists.boost.org> wrote:

...

If he rewrote those two paragraphs above with the below I would have no objection apart from that his design is flawed and should not enter Boost without being substantially redesigned.

I could take this seriously if you provide code in the form of a regression test which shows the scenario in which a power loss harms NuDB database integrity. Because, as they say, "talk is cheap."

Lee Clagett

4:32 p.m.

On Wed, 29 Mar 2017 12:07:11 -0400 Vinnie Falco via Boost <boost@lists.boost.org> wrote:

...

On Wed, Mar 29, 2017 at 12:03 PM, Niall Douglas via Boost <boost@lists.boost.org> wrote:

...
If he rewrote those two paragraphs above with the below I would have no objection apart from that his design is flawed and should not enter Boost without being substantially redesigned.

I could take this seriously if you provide code in the form of a regression test which shows the scenario in which a power loss harms NuDB database integrity. Because, as they say, "talk is cheap."

Read this [paper on crash-consistent applications][0]. Table 1 on page 5 should be of particular interest. I _think_ the bucket portion of NuDB's log has no size constraint, so its algorithm is either going to be "single sector append", "single block append", or "multi-block append/writes" depending on the total size of the buckets. The algorithm is always problematic when metadata journaling is disabled. Your assumptions of fsync have not been violated to achieve those inconsistencies. A test case could be writing completely random data to the bucket portion of the log to see what happens. The behavior should be indeterminate, because its the equivalent of reading uninitialized memory. Lee [0]https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-pillai.pd...

Vinnie Falco

4:44 p.m.

On Wed, Mar 29, 2017 at 12:32 PM, Lee Clagett via Boost <boost@lists.boost.org> wrote:

...

Read this [paper on crash-consistent applications][0] ... [0]https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-pillai.pd...

This is quite helpful, thank you! I think this paragraph is relevant to my use-case? "Current file systems do not provide atomic multi-block appends; appends can be broken down into multiple operations. However, most file systems seemingly guarantee that some prefix of the data written (e.g., the first 10 blocks of a larger append) will be appended atomically" It sounds to me like I have this case covered with the "partial write" failure mode of fail_file. Or is there another case I missed?

Lee Clagett

5:01 p.m.

On Wed, 29 Mar 2017 12:44:42 -0400 Vinnie Falco via Boost <boost@lists.boost.org> wrote:

...

On Wed, Mar 29, 2017 at 12:32 PM, Lee Clagett via Boost <boost@lists.boost.org> wrote:

...
Read this [paper on crash-consistent applications][0] ... [0]https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-pillai.pd...

This is quite helpful, thank you! I think this paragraph is relevant to my use-case?

"Current file systems do not provide atomic multi-block appends; appends can be broken down into multiple operations. However, most file systems seemingly guarantee that some prefix of the data written (e.g., the first 10 blocks of a larger append) will be appended atomically"

It sounds to me like I have this case covered with the "partial write" failure mode of fail_file. Or is there another case I missed?

This portion was worded poorly by the authors. If you look at table 1, a single block append doesn't work when the filesystem is **not** doing metadata journaling. Its inconceivable that multi-block appends would appear atomically for these configurations. Their intent was to point out that filesystem configurations achieving single block atomic append could actually do up to 10 blocks atomically. And the partial write failure test case does not cover what I am talking about. A filesystem is allowed to write the control structures before writing the data, and still meet your constraints for fsync. So the pointer to the sector has been stored but the data at that sector was never written. Lee

Peter Dimov

5:18 p.m.

Lee Clagett wrote:

...

And the partial write failure test case does not cover what I am talking about. A filesystem is allowed to write the control structures before writing the data, and still meet your constraints for fsync. So the pointer to the sector has been stored but the data at that sector was never written.

In terms of the File concept, this corresponds to File::write writing up to n random bytes, then calling abort(), right?

Lee Clagett

7:56 p.m.

On Wed, 29 Mar 2017 20:18:58 +0300 Peter Dimov via Boost <boost@lists.boost.org> wrote:

...

Lee Clagett wrote:

...
And the partial write failure test case does not cover what I am talking about. A filesystem is allowed to write the control structures before writing the data, and still meet your constraints for fsync. So the pointer to the sector has been stored but the data at that sector was never written.

In terms of the File concept, this corresponds to File::write writing up to n random bytes, then calling abort(), right?

A better test might be to use data from a previous log file, and ensure that its rejected. This would be testing the situation where the filesystem reused sectors from a previous log file. Lee

Niall Douglas

9:14 p.m.

...

...
"Current file systems do not provide atomic multi-block appends; appends can be broken down into multiple operations. However, most file systems seemingly guarantee that some prefix of the data written (e.g., the first 10 blocks of a larger append) will be appended atomically"

It sounds to me like I have this case covered with the "partial write" failure mode of fail_file. Or is there another case I missed?

This portion was worded poorly by the authors. If you look at table 1, a single block append doesn't work when the filesystem is **not** doing metadata journaling. Its inconceivable that multi-block appends would appear atomically for these configurations. Their intent was to point out that filesystem configurations achieving single block atomic append could actually do up to 10 blocks atomically.

I know from my empirical testing that atomic append only updates the file's maximum extent atomically in the kernel. The i/o writing the appended data is not atomic with respect to other appends nor reads, so if three threads each append to the same file a block, the appends will not be interleaved because the file high watermark is atomically incremented by the total write size before each write is started, but if you do a concurrent read of the end of the file you will see torn writes. I don't know if the torn writes as seen by the kernel reflects what reaches storage, but I'd suggest it would be wise to assume they land into storage even more reordered again. Now, journaled file systems may give the appearance of atomic appends because I would assume that the journal has a mutex on it, and appending unavoidably incurs extent allocation which is a journal operation. So it could simply be that atomicity of appends is an artefact of how the journal has been implemented for that file system, and is not a guarantee. Niall -- ned Productions Limited Consulting http://www.nedproductions.biz/ http://ie.linkedin.com/in/nialldouglas/

Niall Douglas

5:06 p.m.

On 29/03/2017 17:32, Lee Clagett via Boost wrote:

...

Read this [paper on crash-consistent applications][0]. Table 1 on page 5

I particularly like the sentence: "However, not issuing such an fsync() is perhaps more safe in modern file systems than out-of-order persistence of directory operations. We believe the developers’ interest in fixing this problem arises from the Linux documentation explicitly recommending an fsync() after creating a file." I agree with them. fsync() gives false assurance. Better to not use it, and certainly never rely on it.

...

should be of particular interest. I _think_ the bucket portion of NuDB's log has no size constraint, so its algorithm is either going to be "single sector append", "single block append", or "multi-block append/writes" depending on the total size of the buckets. The algorithm is always problematic when metadata journaling is disabled. Your assumptions of fsync have not been violated to achieve those inconsistencies.

One of my biggest issues with NuDB is the log file. Specifically, it's worse than useless, it actively interferes with database integrity. If you implemented NuDB as a simple data file and a memory mapped key file and always atomic appended transactions to the data file when inserting items, then after power loss you could check if the key file mentions extents not possible given the size of the data file. You then can rebuild the key file simply by replaying through the data file, being careful to ignore any truncated final append. That would be a reasonable power loss recovery algorithm. A little slow to do recovery for large databases, but safe, reliable, predictable and it would only run on a badly closed database. You can also turn off fsync entirely, and let the atomic appends land on storage in an order probably close to the append order. Ought to be quicker than NuDB by a fair bit, much fewer i/o ops, simpler design. Niall -- ned Productions Limited Consulting http://www.nedproductions.biz/ http://ie.linkedin.com/in/nialldouglas/

Lee Clagett

30 Mar 30 Mar

12:51 p.m.

On Wed, 29 Mar 2017 18:06:02 +0100 Niall Douglas via Boost <boost@lists.boost.org> wrote:

...

On 29/03/2017 17:32, Lee Clagett via Boost wrote:

...
Read this [paper on crash-consistent applications][0]. Table 1 on page 5

I particularly like the sentence:

"However, not issuing such an fsync() is perhaps more safe in modern file systems than out-of-order persistence of directory operations. We believe the developers’ interest in fixing this problem arises from the Linux documentation explicitly recommending an fsync() after creating a file."

I think in this instance the authors were referring to the recommendation to fsync a file after creation. The paper is primarily about the properties of the filesystem, not lies by the hardware. Later on they comment about how developers generally disregard fsync as being unreliable, but that its possible that the root cause to their problems is an incorrect assumption about filesystem properties/behavior (based on the number of problems they found with common software).

...

I agree with them. fsync() gives false assurance. Better to not use it, and certainly never rely on it.

...
should be of particular interest. I _think_ the bucket portion of NuDB's log has no size constraint, so its algorithm is either going to be "single sector append", "single block append", or "multi-block append/writes" depending on the total size of the buckets. The algorithm is always problematic when metadata journaling is disabled. Your assumptions of fsync have not been violated to achieve those inconsistencies.

One of my biggest issues with NuDB is the log file. Specifically, it's worse than useless, it actively interferes with database integrity.

If you implemented NuDB as a simple data file and a memory mapped key file and always atomic appended transactions to the data file when inserting items, then after power loss you could check if the key file mentions extents not possible given the size of the data file. You then can rebuild the key file simply by replaying through the data file, being careful to ignore any truncated final append.

I think this is trading an atomic `truncate(0)` assumption with an atomic multi-block overwrite assumption. So this seems like something that is more likely to have a torn write that is hard to notice.

...

That would be a reasonable power loss recovery algorithm. A little slow to do recovery for large databases, but safe, reliable, predictable and it would only run on a badly closed database. You can also turn off fsync entirely, and let the atomic appends land on storage in an order probably close to the append order. Ought to be quicker than NuDB by a fair bit, much fewer i/o ops, simpler design.

How would it notice that a bucket was partially overwritten though? Wouldn't it have to _always_ inspect the entire key file? Lee

Vinnie Falco

1:13 p.m.

On Thu, Mar 30, 2017 at 8:51 AM, Lee Clagett via Boost <boost@lists.boost.org> wrote:

...

How would it notice that a bucket was partially overwritten though? Wouldn't it have to _always_ inspect the entire key file?

One way to fix this is to prefix each bucket in the log file with a cryptographic digest, making it easy to verify the integrity. This means an extra pass through the log file, which isn't too bad at all (much better than a pass through the entire data file).

Niall Douglas

2:08 p.m.

...

...
If you implemented NuDB as a simple data file and a memory mapped key file and always atomic appended transactions to the data file when inserting items, then after power loss you could check if the key file mentions extents not possible given the size of the data file. You then can rebuild the key file simply by replaying through the data file, being careful to ignore any truncated final append.

I think this is trading an atomic `truncate(0)` assumption with an atomic multi-block overwrite assumption. So this seems like something that is more likely to have a torn write that is hard to notice.

Implied in my proposal was that every record appended to the data file would be hashed. You need this as atomic append does not send those appends to storage in order, so appends could be torn.

...

...
That would be a reasonable power loss recovery algorithm. A little slow to do recovery for large databases, but safe, reliable, predictable and it would only run on a badly closed database. You can also turn off fsync entirely, and let the atomic appends land on storage in an order probably close to the append order. Ought to be quicker than NuDB by a fair bit, much fewer i/o ops, simpler design.

How would it notice that a bucket was partially overwritten though? Wouldn't it have to _always_ inspect the entire key file?

My memory mapped key file would be something like: struct key_file { std::atomic<off_t> data_file_extent, data_file_written; open_hash_index<key_type, std::pair<off_t, off_t>> keys; // map key to offset into data file and size of value stored }; When opening the database for use, you'd first check that data_file_extent matches the actual size of the data file. You'd then check that every key refers to an extent before the data_file_extent. If anything doesn't match up, you need to rebuild. The data_file_written atomic BTW is atomically updated when an append completes, whereas data_file_extent is atomically updated before an append is begun. This should keep a correct window of extents to check for torn writes if O_SYNC is working. But what if O_SYNC isn't working? I'd solve this by writing a timestamp with each transaction record written to the data file. So you parse backwards from whatever survived the power loss at the end of the file by say one minute's worth of writes. Most sane filing systems won't not write out dirty data after a minute. That would be my automatic "load the database" implementation, it's a reasonable balance of performance with checks. Of course one could and should provide a "scan the entire database for consistency and repair" routine too. Niall -- ned Productions Limited Consulting http://www.nedproductions.biz/ http://ie.linkedin.com/in/nialldouglas/

Niall Douglas

29 Mar 29 Mar

4:36 p.m.

...

...
If he rewrote those two paragraphs above with the below I would have no objection apart from that his design is flawed and should not enter Boost without being substantially redesigned.

I could take this seriously if you provide code in the form of a regression test which shows the scenario in which a power loss harms NuDB database integrity. Because, as they say, "talk is cheap."

I and others have found serious concerns in our review of your choice of implementation and the database integrity guarantees you make. Your options: 1. You could remove the strong claims from your documentation. 2. You could redesign NuDB so it does implement your strong guarantees. 3. Ignore or attack the feedback you've been given. So far you appear to be following option 3. Be therefore not surprised if nobody gives you further feedback on NuDB. Niall -- ned Productions Limited Consulting http://www.nedproductions.biz/ http://ie.linkedin.com/in/nialldouglas/

Peter Dimov

4:19 p.m.

Niall Douglas wrote:

...

Consistency is time invariant. If two transactions each affect totally unrelated data, if B was committed after A in time originally then during recovery if B was recoverable but A was not, B is indeed recovered without A.

You can define it to be time-invariant if you assume that all invariants of the data are contained in the database and none are external to it; or, in other words, that the state "B without A" can never be considered inconsistent by the outside world if it's not described as inconsistent in the invariants the database knows about. Or in yet other words, you assume that the database knows which data are related and which unrelated, and that none of this knowledge is external to it. We could argue whether this definition is sensible or not, but it doesn't even matter in this context. NuDB doesn't contain any such knowledge, as far as I can see, and therefore can't assume anything about the unrelatedness of A and B. Whether B references A is not known to it, so it must assume that it does.

Gavin Lambert

10:53 p.m.

On 30/03/2017 05:19, Peter Dimov via Boost wrote:

...

Niall Douglas wrote:

...
Consistency is time invariant. If two transactions each affect totally unrelated data, if B was committed after A in time originally then during recovery if B was recoverable but A was not, B is indeed recovered without A.

You can define it to be time-invariant if you assume that all invariants of the data are contained in the database and none are external to it; or, in other words, that the state "B without A" can never be considered inconsistent by the outside world if it's not described as inconsistent in the invariants the database knows about. Or in yet other words, you assume that the database knows which data are related and which unrelated, and that none of this knowledge is external to it.

I don't think any database can generally make this claim, as usually it has no knowledge what the application is doing with the data. The app can insert into table A, read data from table B, and insert into table C. To the database, these all appear to be completely independent (especially if they're on different connections or have no foreign key relationships), and indeed they might be -- but it could also be true that the insert into C used some values calculated from and dependent on what was inserted into A or read from B, and the application might be very confused if the table C data is present without the table A data. The problem compounds if there was also an insert into table B just beforehand. So I agree with Peter: it should be illegal for transaction C to appear in the recovered database if transaction A was committed first but recovery resulted in rolling it back. You could argue that it might be ok if you can prove that transaction A and C were concurrent (and thus most likely independent), or from different connections, but this still provides no guarantees that the app wasn't still calculating one from the other, or assuming that the order it chose to commit the transactions could be inverted or even partially undone. Commit time is the ultimate proof of unrelated data, due to causality. (Even then, you can be tripped up by the app committing related transactions in the "wrong" order, but in this case the app shouldn't be surprised by the database enforcing that order during recovery.)

Vinnie Falco

26 Mar 26 Mar

2:17 p.m.

On Sun, Mar 26, 2017 at 7:11 AM, Niall Douglas via Boost <boost@lists.boost.org> wrote:

...

If you want durability across multiple OSs and filing systems, you need to assume that fsync's are reordered with respect to one another.

Does SQLite assume this? I make no claims that NuDB works on all filing systems. It will definitely not do well on network shares - it wasn't designed for that. Nor will it perform well on spinning disks. It is specifically designed for directly connected SSD drives, on mainstream operating systems. As I said in the original post, its use-cases are niche (hence, why I am probing for the level of interest).

...

NuDB needs to remove all claims regarding durability of any kind. In fact, you might as well remove the fsync code path entirely

No, I don't think that I'll be doing that. If you feel that this is important, you have my blessing to clone the repository and continue with your own fork.

...

fsync() performs pathologically awful on copy-on-write filing systems

The library is not designed for exotic file systems like the one you describe. Its meant for simple commodity hardware and operating systems such as what you might find on a bare metal amazon web instance. There is no need for a copy on write file system, as long as the invariants are met (that fsyncs aren't reordered).

...

In which case you did not make a great choice.

Much, much better would be Blake2b. 2 cycles/byte, cryptographically secure, collision probability exceeds life of the universe.

Hmm, no, it seems that you are the one who "did not make a great choice." The requirements of Hasher do not include cryptographic security. Blake2b is a cryptographically secure hash function which computes digests up to 32 bytes, while xxhasher is a non cryptographically secure hash function which computes a 64-bit digest. NuDB requires a Hasher more like std::hash and less like SHA-1. Blake2b can achieve almost 1Gb/s while xxhash can achieve 110Gb/s. I think I'll be sticking with xxhash but again you have my full support if you want to instantiate nudb::store objects with your own user defined Hasher that uses Blake2b - here's a link to the type requirements so you can get started with it: http://vinniefalco.github.io/nudb/nudb/types/Hasher.html

Niall Douglas

11:14 p.m.

...

...
fsync() performs pathologically awful on copy-on-write filing systems

The library is not designed for exotic file systems like the one you describe. Its meant for simple commodity hardware and operating systems such as what you might find on a bare metal amazon web instance. There is no need for a copy on write file system, as long as the invariants are met (that fsyncs aren't reordered).

Except, as has already been established, retrievability of fsyncs after power loss *are* reordered. So your invariant is not met. For the record, ZFS is hardly an exotic file system. All my servers are running ZFS on Linux because that Linux distro (Proxmox) defaults to ZFS. My FreeBSD install on my laptop runs ZFS because that's also the default filing system for FreeBSD. Additionally, ext4 can be mounted in COW mode via "data=journal". So none of this is exotic, merely it's commonplace outside where you've been using NuDB to date. As I said to you at the very beginning of all this, your database is aimed at a very limited use case. If it entered Boost, you'd find people doing all sorts of crazy stuff with it, and running it on ZFS would be very mild compared to what some would do.

...

...
In which case you did not make a great choice.

Much, much better would be Blake2b. 2 cycles/byte, cryptographically secure, collision probability exceeds life of the universe.

Hmm, no, it seems that you are the one who "did not make a great choice." The requirements of Hasher do not include cryptographic security. Blake2b is a cryptographically secure hash function which computes digests up to 32 bytes, while xxhasher is a non cryptographically secure hash function which computes a 64-bit digest. NuDB requires a Hasher more like std::hash and less like SHA-1.

I already explained in my reply to Lee why one very good approach to portably achieving durability is to use an acyclic graph of chained cryptographic hashes to maintain a secure history of time. Exactly as git or mercurial does in fact. If you're not using cryptographically strong hashing, it's highly unlikely your database can be durable.

...

Blake2b can achieve almost 1Gb/s while xxhash can achieve 110Gb/s.

Your maths are seriously out. Your hash function (which runs per thread) only needs to be as fast as your storage device is at a queue depth of 1. So, taking a top of the range NVM SSD, the Samsung 960 Pro, it can write at QD1 about 50k IOPS. That's around 200Mb/sec/thread. Blake2b runs at 1Gb/sec, so it should comfortably fit with a bit of room to spare on a single thread. Obviously, more threads gets you more queue depth and performance should rise linearly until you run out of CPUs. Note I mention you only cryptographically hash on write. I wouldn't suggest it for lookups and reads except on first database open. For lookups and reads I'd strongly recommend SpookyHash v2 over xxhash (you'll find a header only edition of SpookyHash v2 in AFIO v2). Spooky was designed by a renowned crypto hash expert, unlike the MurmorHash derived xxhash which definitely was not. Spooky is fast on all architectures, not just Intel x64. Spooky uses the same internal mechanism as a cryptographically strong hash, just fewer rounds for performance. Spooky is as ddos resistant as siphash, and empirically proved collision resistant to 2^72 inputs, whereas xxhash will collide at best at 2^32 inputs. And the 128 bit hash Spooky creates fits perfectly into a single SSE or NEON register making working with them single cycle, and that is exactly what AFIO uses to work with them via its uint128 type. So for a content addressable database like yours, please use SpookyHash v2, even if you XOR into 64 bits. And if you decide to stick with 64 bit hashes, you need to document that collision is mathematically certain after 4 billion items have been inserted, and mathematically likely long before that. Niall -- ned Productions Limited Consulting http://www.nedproductions.biz/ http://ie.linkedin.com/in/nialldouglas/

Olaf van der Spek

27 Mar 27 Mar

7:03 a.m.

On Mon, Mar 27, 2017 at 1:14 AM, Niall Douglas via Boost <boost@lists.boost.org> wrote:

...

Your hash function (which runs per thread) only needs to be as fast as your storage device is at a queue depth of 1. So, taking a top of the

Why? Running that function means the core can't do other work, so I'd think a faster function is still beneficial.

...

range NVM SSD, the Samsung 960 Pro, it can write at QD1 about 50k IOPS. That's around 200Mb/sec/thread. Blake2b runs at 1Gb/sec, so it should

Is that Mbit/s or MByte/s? I think MByte. -- Olaf

Niall Douglas

10:57 a.m.

...

...
Your hash function (which runs per thread) only needs to be as fast as your storage device is at a queue depth of 1. So, taking a top of the

Why? Running that function means the core can't do other work, so I'd think a faster function is still beneficial.

The point I was making was not that crypto hashing isn't expensive and doesn't consume most of a CPU when paired with a top end SSD. Rather I was pointing out that current CPUs are plenty more than fast enough to crypto hash for current SSDs such that maximum performance is maintained. Obviously, if you paired a 1Ghz Atom CPU with a PCIe SSD, crypto hashing would have a severe impact on performance, and achieving durability there likely not worth the cost. Copy on write filing systems typically do the hashing for you, that's why on those filing systems you don't need to. Similarly, mounting ext4 with "journal=data" effectively applies hashing to written data so again you don't have to. The point being made here is that you need to use a high quality secure hash when writing data to storage. That way after power loss you can determine if the database store is truly in a consistent and valid state rather than being tricked into thinking it is, and then unwittingly destroying more user data.

...

...
range NVM SSD, the Samsung 960 Pro, it can write at QD1 about 50k IOPS. That's around 200Mb/sec/thread. Blake2b runs at 1Gb/sec, so it should

Is that Mbit/s or MByte/s? I think MByte.

200 Megabytes per second per thread. As most i/o loads are QD1, we care most about that performance during filing system algorithm design. The headline figures you get for SSDs are usually at QD32, it's very deceptive as most filing system algorithms do not parallelise well. This is why the new XPoint SSDs from Intel are so exciting. Their QD1 performance is double anything from before. Very exciting! Niall -- ned Productions Limited Consulting http://www.nedproductions.biz/ http://ie.linkedin.com/in/nialldouglas/

Vinnie Falco

11:21 a.m.

On Mon, Mar 27, 2017 at 6:57 AM, Niall Douglas via Boost <boost@lists.boost.org> wrote:

...

The point being made here is that you need to use a high quality secure hash when writing data to storage. That way after power loss you can determine if the database store is truly in a consistent and valid state rather than being tricked into thinking it is, and then unwittingly destroying more user data.

Oh...I misunderstood. The hash function in NuDB is not used to calculate a digest of the value, its used only to determine which bucket to place the key in the hash table. NuDB is essentially an on-disk unordered_map. That is why the hash function does not need to be cryptographically secure. In the use-cases for which NuDB is designed (blockchain applications) the key used for insertion is usually a cryptographic digest of the value. So its up to the application to use a cryptographically secure hash function for computing keys. However, this is part of the caller's implementation and not part of the library.

Niall Douglas

12:58 p.m.

On 27/03/2017 12:21, Vinnie Falco via Boost wrote:

...

On Mon, Mar 27, 2017 at 6:57 AM, Niall Douglas via Boost <boost@lists.boost.org> wrote:

...
The point being made here is that you need to use a high quality secure hash when writing data to storage. That way after power loss you can determine if the database store is truly in a consistent and valid state rather than being tricked into thinking it is, and then unwittingly destroying more user data.

Oh...I misunderstood. The hash function in NuDB is not used to calculate a digest of the value, its used only to determine which bucket to place the key in the hash table. NuDB is essentially an on-disk unordered_map. That is why the hash function does not need to be cryptographically secure.

This is why I was mentioning that to achieve durability, during writes you crypto hash AND fast hash the new item. During lookups you use the fast hash only. The issue I am taking here is the claim of durability which is a very strong guarantee. If you don't claim that, I have no issue. Niall -- ned Productions Limited Consulting http://www.nedproductions.biz/ http://ie.linkedin.com/in/nialldouglas/

Vinnie Falco

1:05 p.m.

On Mon, Mar 27, 2017 at 8:58 AM, Niall Douglas via Boost <boost@lists.boost.org> wrote:

...

...to achieve durability, during writes you crypto hash AND fast hash the new item. During lookups you use the fast hash only.

NuDB assumes that bytes written to the file system are not changed. It doesn't verify a checksum or digest on the values. That's up to the application. If another process corrupts the data in a database in between when the database is opened, NuDB does not detect that (except for visible corruption in the key file format, such as an invalid size or offset).

...

The issue I am taking here is the claim of durability

This is the first I'm hearing that the "Durability" in ACID implies protection from data corruption after the fact - please provide a source.

Niall Douglas

1:42 p.m.

...

...
The issue I am taking here is the claim of durability

This is the first I'm hearing that the "Durability" in ACID implies protection from data corruption after the fact - please provide a source.

It doesn't imply that. Durability means that corruption to the database will not cause further data loss during subsequent use. For example, if you use a single bit to indicate that a record is deleted, and corruption flips that bit to deleted, and your implementation has no way of noticing the corruption, you have lost data after the corruption. Ideally when a user next accesses that record, they should see an error like "Record corrupt". There is a fun history of corruption and SQLite at https://www.sqlite.org/howtocorrupt.html. Last time I looked, there was a popular fork of SQLite which implements per-row checksumming, but the default build does not (the canonical advice is: "use a proper filing system like ZFS if you don't want bit errors"). But SQLite is very carefully written to check consistency during modifications, and where it can it will refuse to modify data when the metadata doesn't match up. So, a database can be durable and not detect arbitrary damage to user data. It just cannot lose further user data due to corruption of its own structures. Niall -- ned Productions Limited Consulting http://www.nedproductions.biz/ http://ie.linkedin.com/in/nialldouglas/

Dominique Devienne

2:05 p.m.

On Mon, Mar 27, 2017 at 3:42 PM, Niall Douglas via Boost < boost@lists.boost.org> wrote:

...

...
...
The issue I am taking here is the claim of durability

This is the first I'm hearing that the "Durability" in ACID implies protection from data corruption after the fact - please provide a source.

It doesn't imply that.

Durability means that corruption to the database will not cause further data loss during subsequent use. For example, if you use a single bit to indicate that a record is deleted, and corruption flips that bit to deleted, and your implementation has no way of noticing the corruption, you have lost data after the corruption. Ideally when a user next accesses that record, they should see an error like "Record corrupt".

There is a fun history of corruption and SQLite at https://www.sqlite.org/howtocorrupt.html. Last time I looked, there was a popular fork of SQLite which implements per-row checksumming, but the default build does not (the canonical advice is: "use a proper filing system like ZFS if you don't want bit errors"). But SQLite is very carefully written to check consistency during modifications, and where it can it will refuse to modify data when the metadata doesn't match up.

So, a database can be durable and not detect arbitrary damage to user data. It just cannot lose further user data due to corruption of its own structures.

Given that SQLite doesn't do any checksum'ing of its data (i.e. its pages), I don't see how it could be durable in the way you seem to imply Niall. --DD

Niall Douglas

3:08 p.m.

...

...
Durability means that corruption to the database will not cause further data loss during subsequent use. For example, if you use a single bit to indicate that a record is deleted, and corruption flips that bit to deleted, and your implementation has no way of noticing the corruption, you have lost data after the corruption. Ideally when a user next accesses that record, they should see an error like "Record corrupt".

There is a fun history of corruption and SQLite at https://www.sqlite.org/howtocorrupt.html. Last time I looked, there was a popular fork of SQLite which implements per-row checksumming, but the default build does not (the canonical advice is: "use a proper filing system like ZFS if you don't want bit errors"). But SQLite is very carefully written to check consistency during modifications, and where it can it will refuse to modify data when the metadata doesn't match up.

So, a database can be durable and not detect arbitrary damage to user data. It just cannot lose further user data due to corruption of its own structures.

Given that SQLite doesn't do any checksum'ing of its data (i.e. its pages), I don't see how it could be durable in the way you seem to imply Niall.

I would agree with you (and those maintaining the fork of SQLite which does checksum its rows) that row checksumming should be done if one is claiming durability. But I can see the point of those in SQLite who say that the code does carefully check that metadata is sensible before changing things. I personally don't think that goes far enough, but equally, if a database did just that and claimed durability I'd grudgingly accept it because of SQLite's stature. But you are right, I wouldn't personally say SQLite can claim it implements durability as a personal opinion. It needs to do more, and it's not like it's much more, the code allows a checksum to be added per row and validated very easily. The reason it is not in the main repo is purely due to unresolved philosophical differences between factions of opinion in the devs, not a lack of technical implementation nor even much of a performance penalty. The only rational argument I've heard against row checksumming is that for really small embedded devices, the extra storage and memory used could be a problem. Personally, I'd make the row checksumming optional, and again there is no technical reason that isn't easy. I believe the fork makes it a user definable setting. But differences of philosophy often trump technical arguments, and people have taken stands on opinion. It's no different there than here. Niall -- ned Productions Limited Consulting http://www.nedproductions.biz/ http://ie.linkedin.com/in/nialldouglas/

Vinnie Falco

3:30 p.m.

On Mon, Mar 27, 2017 at 11:08 AM, Niall Douglas via Boost <boost@lists.boost.org> wrote:

...

I can see the point of those...who say that...you are right...

Could I trouble you (or anyone) to submit a tiny patch to fix this issue? I'm not fluent in Travis: https://github.com/vinniefalco/NuDB/issues/52 I have no idea why Travis is using gcc4.6.3 even though the .yml file asks for gcc5: https://travis-ci.org/vinniefalco/NuDB/jobs/213555724#L375 Thanks!

Peter Dimov

3:57 p.m.

Vinnie Falco wrote:

...

Could I trouble you (or anyone) to submit a tiny patch to fix this issue? I'm not fluent in Travis: https://github.com/vinniefalco/NuDB/issues/52

I have no idea why Travis is using gcc4.6.3 even though the .yml file asks for gcc5:

The .yml file only installs gcc5, it doesn't ask for it. The default gcc/g++ is still the same (4.6). gcc5 is gcc-5/g++-5. I don't know how this needs to be passed to cmake, presumably by setting CC/CXX?

Vinnie Falco

4:01 p.m.

On Mon, Mar 27, 2017 at 11:57 AM, Peter Dimov via Boost <boost@lists.boost.org> wrote:

...

Vinnie Falco wrote: The .yml file only installs gcc5, it doesn't ask for it. The default gcc/g++ is still the same (4.6). gcc5 is gcc-5/g++-5. I don't know how this needs to be passed to cmake, presumably by setting CC/CXX?

I'm totally clueless when it comes to Linux toolchains. But I don't think that this should be passed to cmake. Instead, the operating system should be configured so that the commands "gcc" and "g++" will use the newer version. Otherwise, people who build with CMake are going to be pigeonholed to a specific gcc version? That doesn't seem like the right thing to do.

Peter Dimov

4:18 p.m.

Vinnie Falco wrote:

...

On Mon, Mar 27, 2017 at 11:57 AM, Peter Dimov via Boost <boost@lists.boost.org> wrote:

...
The .yml file only installs gcc5, it doesn't ask for it. The default gcc/g++ is still the same (4.6). gcc5 is gcc-5/g++-5. I don't know how this needs to be passed to cmake, presumably by setting CC/CXX?

I'm totally clueless when it comes to Linux toolchains. But I don't think that this should be passed to cmake. Instead, the operating system should be configured so that the commands "gcc" and "g++" will use the newer version.

You can install may different g++ versions on Travis simultaneously. The system default is always 4.6. If you install 4.7, 4.8, 4.9, 5 and 6, their corresponding executables are g++-4.7, g++-4.8, g++-4.9, g++-5 and g++-6. Same if you install many versions of Clang (the default one is 3.4).

Vinnie Falco

4:27 p.m.

On Mon, Mar 27, 2017 at 12:18 PM, Peter Dimov via Boost <boost@lists.boost.org> wrote:

...

You can install may different g++ versions on Travis simultaneously. The system default is always 4.6. If you install 4.7, 4.8, 4.9, 5 and 6, their corresponding executables are g++-4.7, g++-4.8, g++-4.9, g++-5 and g++-6. Same if you install many versions of Clang (the default one is 3.4).

I guess what I need is for the travis.yml or other script, to change 'gcc' and 'g++' to link to a different version? Something about npm-alternatives or something, I don't know...this is all foreign

Peter Dimov

4:40 p.m.

Vinnie Falco wrote:

...

I guess what I need is for the travis.yml or other script, to change 'gcc' and 'g++' to link to a different version?

I'm not a Cmake person, but I strongly suspect that CC=gcc-5 CXX=g++-5 will work. It's possible that "compiler: gcc-5" will work as well, instead of the current "compiler: gcc" at https://github.com/vinniefalco/NuDB/blob/master/.travis.yml#L36 but I've never tried that. Either way, the result will be CXX and CC being set. Their current default values are g++ and gcc, as can be seen at https://travis-ci.org/vinniefalco/NuDB/jobs/213555724#L354

Vinnie Falco

4:53 p.m.

On Mon, Mar 27, 2017 at 12:40 PM, Peter Dimov via Boost <boost@lists.boost.org> wrote:

...

It's possible that "compiler: gcc-5" will work as well, instead of the current "compiler: gcc" at https://github.com/vinniefalco/NuDB/blob/master/.travis.yml#L36

Yes! https://travis-ci.org/vinniefalco/NuDB/jobs/215604522#L379

Vinnie Falco

2:13 p.m.

On Mon, Mar 27, 2017 at 9:42 AM, Niall Douglas via Boost <boost@lists.boost.org> wrote:

...

I am...here

Just wondering, but has anyone actually built and used the library to poke around with or are these conversations theoretical? Thanks

Vinnie Falco

29 Mar 29 Mar

4:19 p.m.

On Mon, Mar 27, 2017 at 10:13 AM, Vinnie Falco <vinnie.falco@gmail.com> wrote:

...

Just wondering, but has anyone actually built and used the library to poke around with or are these conversations theoretical?

ping

Asbjørn

27 Mar 27 Mar

11:29 a.m.

On 27.03.2017 01:14, Niall Douglas via Boost wrote:

...

Your hash function (which runs per thread) only needs to be as fast as your storage device is at a queue depth of 1. So, taking a top of the range NVM SSD, the Samsung 960 Pro, it can write at QD1 about 50k IOPS. That's around 200Mb/sec/thread. Blake2b runs at 1Gb/sec, so it should comfortably fit with a bit of room to spare on a single thread.

I'm probably being dense, but I don't get your argument. NuDB hashes the value to use as a key, from what I understood. If you have a large value, the write will be much faster than with random 4k IOPS, no? For reference, on my i7 4770k running Win 8.1 a 1TB 960 Pro has about 1700MB/s sequential write speed and 150MB/s 4k random write, both at QD1. Cheers - Asbjørn

Vinnie Falco

11:31 a.m.

On Mon, Mar 27, 2017 at 7:29 AM, Asbjørn via Boost <boost@lists.boost.org> wrote:

...

NuDB hashes the value to use as a key, from what I understood.

Hmm, no - again, sorry for the confusion (the docs still need work!) The key is defined by the application. The hash function is used to calculate in which bucket to place the key, like std::hash.

Asbjørn

12:54 p.m.

On 27.03.2017 13:31, Vinnie Falco via Boost wrote:

...

On Mon, Mar 27, 2017 at 7:29 AM, Asbjørn via Boost <boost@lists.boost.org> wrote:

...
NuDB hashes the value to use as a key, from what I understood.

Hmm, no - again, sorry for the confusion (the docs still need work!) The key is defined by the application. The hash function is used to calculate in which bucket to place the key, like std::hash.

Oh, my bad then. Not sure what I was thinking, I blame monday... Sorry for the noise. - Asbjørn

Niall Douglas

1:24 p.m.

...

...
Your hash function (which runs per thread) only needs to be as fast as your storage device is at a queue depth of 1. So, taking a top of the range NVM SSD, the Samsung 960 Pro, it can write at QD1 about 50k IOPS. That's around 200Mb/sec/thread. Blake2b runs at 1Gb/sec, so it should comfortably fit with a bit of room to spare on a single thread.

I'm probably being dense, but I don't get your argument. NuDB hashes the value to use as a key, from what I understood. If you have a large value, the write will be much faster than with random 4k IOPS, no?

For reference, on my i7 4770k running Win 8.1 a 1TB 960 Pro has about 1700MB/s sequential write speed and 150MB/s 4k random write, both at QD1.

Obviously enough, if you write 16Kb to storage, the OS will issue 4x 4Kb writes (why 4Kb? That's the PCIe DMA maximum). Your 960 Pro very likely will issue all four of those in parallel, so the write of 16Kb will complete in the same time as writing a single 4Kb block. Taking just your figures above, dividing 1700 by 150 yields about 11, so in your test and/or your filing system and/or your OS, sequential writes are being issued on average at 11 at a time. If you are using the kernel page cache, all that stuff is taken care of for you. NuDB uses the kernel page cache, so all that is out of your control. But for me when I was designing AFIO, I was starting on the premise that many if not most would be running with O_DIRECT|O_SYNC and they will be doing their own multiplexing of 4Kb block i/o onto storage. So the async backend AFIO v2 uses is specifically a non-multithreaded implementation - per thread you run an afio::io_service, and you issue an i/o, while that completes you prepare the next i/o, upon completion you issue the next i/o and so on. You keep your thread always writing data, when not writing then preparing the next data to write etc. Thus becomes the challenge of how much work to pack onto each kernel thread, and that's where calculating hash overheads etc comes in. All I was saying earlier was that modern CPUs are fast enough to do crypto strength hashing on i/o even to the very fastest SSDs available right now. If you want durability, you won't give up write performance to achieve it so long as your CPU is as top end as your SSD. If you don't want durability, then none of this matters. Crypto hashes are very expensive, and say SpookyHash is pretty amazing for its performance. So use SpookyHash, but don't claim durability unless you've implemented a hash chain of history where one can wind arbitrarily backwards and forwards in time and thus have the hashes checking one another's correctness such that a collision in one would be spotted by another. Niall -- ned Productions Limited Consulting http://www.nedproductions.biz/ http://ie.linkedin.com/in/nialldouglas/

Niall Douglas

25 Mar 25 Mar

11:16 p.m.

On 25/03/2017 20:01, Lee Clagett via Boost wrote:

...

At this point I'm sure Niall will tell me that writing <512 bytes to the beginning of the file is also not an atomic operation. Which is exactly what SQLite documentation assumes. The power-loss recovery algorithm gets tougher with this assumption.

That is a really fascinating topic I could write a really long email about, but one I won't bore the list with. See http://stackoverflow.com/a/35258623/805579. In short, if O_DIRECT is off, assume no atomicity at all, if O_DIRECT is on, probably infinity on all very recent POSIX systems, Windows varies between 64 bytes and 4096 bytes depending on DMA. This refers *purely* to *user observed effects*, and has no relation to what order stuff actually hits physical storage. Note that even if a device claims 512 byte blocks, no SSD actually implements that in flash and you are reliant on the SSD doing a reliable RCU under sudden power loss. Niall -- ned Productions Limited Consulting http://www.nedproductions.biz/ http://ie.linkedin.com/in/nialldouglas/

Niall Douglas

22 Mar 22 Mar

10:43 a.m.

...

...
I am unsure what benefit it confers over the much safer choices of SQLite or UnQLite both of which are battle tested and written by database experts to ensure that other people's data is never, ever lost.

SQLite is painfully slow compared to NuDB. UnQLite supports features not present in NuDB such as cursor iteration, moving, and deleting. The tradeoff for not supporting these features is that NuDB can make better claims about its performance.

You are correct that NuDB is a single purpose design, and so will always be faster. However SQLite configured into "I don't care about data loss" mode is pretty swift. "Painfully slow" is not a correct term to use except for its default data loss caring mode.

...

...
I felt when I reviewed NuDB's implementation some time ago that your implementation would be highly susceptible to data loss in a wide set of circumstances.

NuDB makes some assumptions about the underlying file system. In particular, that when a call to fsync() returns it is assured the data is reliably written. If these invariants are met there is no data loss.

In fact, one of NuDB's unit tests exhaustively verifies that the database recovery process performs correctly after an I/O failure. The unit test works by instantiating a database with a custom File template parameter. This custom file implementation simulates an I/O failure on the Nth operation. The test runs with N=1 to failure, then increments N, and re-runs the test until there are no failures. At each iteration, the test verifies that the database recover process left the database in a consistent state. You can see this code here: https://github.com/vinniefalco/NuDB/blob/master/test/recover.cpp#L125

I don't doubt that you think this database is vulnerable to data loss under some condition. However, I would appreciate specifics so that I can verify under what conditions your claim is correct, or if it is correct at all.

The single biggest thing which struck me when reading your implementation was that you make very strong assumptions about your storage, and that generally leads to corner case data loss. Plucking straight from random as it was too long ago I examined your source, but a classic mistake is to assume this is sequentially consistent: int keyfd, storefd; write(storefd, data) fsync(storefd) write(keyfd, key) fsync(keyfd) Here the programmer writes the value being stored, persists it, then writes the key to newly stored value, persists that. Most programmers unfamiliar with filing systems will assume that the fsync to the storefd cannot happen after the fsync to the keyfd. They are wrong, that is a permitted reorder. fsyncs are only guaranteed to be sequentially consistent *on the same file descriptor* not different file descriptors. Technically speaking, you could even dupfd() the same fd to the same inode and fsync is not required to be sequentially consistent between those two fds, though most major filing systems do implement that because the locking is kept at inode level. I can't remember if NuDB makes that sort of mistake, but the single biggest thing which struck me was your assumption that storage is reliable and predictable and that the syscalls you wrote will be executed in the order you wrote them. None of those are true except in a very limited circumstance i.e. power never goes off, storage never fails, ECC RAM is present, bitflips never occur, you are on a very recent Linux kernel, your ext4 filing system hasn't been mounted with unusual flags etc. You also rely heavily on the kernel page caching architecture to give your performance. That's a good move as most people trying to invent their own caching architecture on top of O_DIRECT do a far worse job. However to really get IOPS out of storage just around the corner where hitting 1M 4K IOPS is achievable, you can't avoid O_DIRECT for that.

...

...
I also felt it wouldn't scale well to the > 1M IOPS storage just around the corner.

While it hasn't yet been tested on these non-existent devices, I am optimistic, since NuDB was specifically designed for very intense fetch workloads with light concurrent insertions. Fetches can happen concurrently, it should be possible to saturate the read throughput of a device. See: https://github.com/vinniefalco/NuDB/blob/master/include/nudb/impl/basic_stor...

The Macbook Pro I am typing on right now hits 300k 4Kb IOPS just fine. Some of the systems with big RAM drives I've tested AFIO were busting past 2M IOPS. These are not non-existent devices, they are just currently high end, and rapidly heading towards the consumer.

...

...
I would suspect it is being used in a highly restricted use case where corner cases don't present.

Maybe. I'll counter by saying, there are only three operations possible on an open database: insert, fetch, and close.

I was more referring to use environment. For example, atomic appending to a file is super slow on ZFS. It may be fast on ext4 and there you should use atomic appends, but on other filing systems you need to set the extent to maximum and use lazy allocation via an atomically updated manually kept watermark. There's lots of this stuff, it's why AFIO v2 has a concept of "storage profiles" which keep a precanned mix of AFIO filing system algorithms for some storage combo. That way if you load your key-value store on ZFS on Linux, the storage profile says to use algorithms A, B and D. If you load it on NTFS on Windows, use algorithms B, C and F instead. AFIO provides many algorithms all accessible via an abstracted base class so things using them don't need to care.

...

...
if you are up for getting this to a state ready to submit to Boost, I'd welcome it.

Cool! The docs just need a bit of work, the library code is more than ready.

Thanks for taking the time to go through it!

I appreciate I'm full of nothing but criticism, but I wanted to give you some idea of how a peer review would go. I do want to say thank you for writing your library, and intending to bring it to Boost. It may well be that the community will accept NuDB if it very substantially reduces the guarantees it offers to essentially none i.e. this is a fast key-value insert only store with no guarantees. But I will mention that if I don't need to provide any guarantees, I believe I could roll something with AFIO v2 even in its current very alpha state that will match or beat NuDB in less than a hundred lines of code and about one week of work. I even reckon I can achieve key-value deletion if users are willing to accept a reasonable ceiling on total item count of maybe a few million items, and use of a recent extents based filing system. I don't wish to do your work down, but a key value store with no guarantees really is that easy to implement with AFIO v2, right now. It's the guarantees and achieving consistent performance across environments which is hard. Niall -- ned Productions Limited Consulting http://www.nedproductions.biz/ http://ie.linkedin.com/in/nialldouglas/

Olaf van der Spek

10:50 a.m.

On Wed, Mar 22, 2017 at 11:43 AM, Niall Douglas via Boost <boost@lists.boost.org> wrote:

...

Plucking straight from random as it was too long ago I examined your source, but a classic mistake is to assume this is sequentially consistent:

int keyfd, storefd; write(storefd, data) fsync(storefd) write(keyfd, key) fsync(keyfd)

Here the programmer writes the value being stored, persists it, then writes the key to newly stored value, persists that. Most programmers unfamiliar with filing systems will assume that the fsync to the storefd cannot happen after the fsync to the keyfd. They are wrong, that is a permitted reorder. fsyncs are only guaranteed to be sequentially consistent *on the same file descriptor* not different file descriptors.

Just curious, how is that permitted? Isn't fsync() supposed to ensure data is on durable storage before it returns?

Niall Douglas

11:46 a.m.

On 22/03/2017 10:50, Olaf van der Spek wrote:

...

On Wed, Mar 22, 2017 at 11:43 AM, Niall Douglas via Boost <boost@lists.boost.org> wrote:

...
Plucking straight from random as it was too long ago I examined your source, but a classic mistake is to assume this is sequentially consistent:

int keyfd, storefd; write(storefd, data) fsync(storefd) write(keyfd, key) fsync(keyfd)

Here the programmer writes the value being stored, persists it, then writes the key to newly stored value, persists that. Most programmers unfamiliar with filing systems will assume that the fsync to the storefd cannot happen after the fsync to the keyfd. They are wrong, that is a permitted reorder. fsyncs are only guaranteed to be sequentially consistent *on the same file descriptor* not different file descriptors.

Just curious, how is that permitted?

Isn't fsync() supposed to ensure data is on durable storage before it returns?

A common misconception. Here is the POSIX wording: "The fsync() function shall request that all data for the open file descriptor named by fildes is to be transferred to the storage device associated with the file described by fildes. The nature of the transfer is implementation-defined. The fsync() function shall not return until the system has completed that action or until an error is detected. [SIO] [Option Start] If _POSIX_SYNCHRONIZED_IO is defined, the fsync() function shall force all currently queued I/O operations associated with the file indicated by file descriptor fildes to the synchronized I/O completion state. All I/O operations shall be completed as defined for synchronized I/O file integrity completion. [Option End]" So, without _POSIX_SYNCHRONIZED_IO, all fsync() guarantees is that it will not return until the *request* for the transfer of outstanding data to storage has completed. In other words, it pokes the OS to start flushing data now rather than later, and returns immediately. OS X implements this sort of fsync() for example. With _POSIX_SYNCHRONIZED_IO, you get stronger guarantees that upon return from the syscall, "synchronized I/O file integrity completion" has occurred. Linux infamously claims _POSIX_SYNCHRONIZED_IO, yet ext2/ext3/ext4 don't implement it fully and will happily reorder fsyncs of the metadata needed to later retrieve a fsynced write of data. So the data itself is written on fsync return sequentially consistent, but not the metadata to later retrieve it, that can be reordered with respect to other fsyncs. AFIO v1 and v2 take care of this sort of stuff for you. If you tell AFIO you want a handle to a file to write reliably, AFIO does what is needed to make it reliable. Be prepared to give up lots of performance however (and hence where async file i/o starts to become very useful because you can queue up lots of writes, and completion handlers will fire when the write really has reached storage in a way always retrievable in the future - excluding bugs in the kernel, filing system, storage device etc). Niall -- ned Productions Limited Consulting http://www.nedproductions.biz/ http://ie.linkedin.com/in/nialldouglas/

Gavin Lambert

10:49 p.m.

On 23/03/2017 00:46, Niall Douglas via Boost wrote:

...

On 22/03/2017 10:50, Olaf van der Spek wrote:

...
Isn't fsync() supposed to ensure data is on durable storage before it returns? [...] So, without _POSIX_SYNCHRONIZED_IO, all fsync() guarantees is that it will not return until the *request* for the transfer of outstanding data to storage has completed. In other words, it pokes the OS to start flushing data now rather than later, and returns immediately. OS X implements this sort of fsync() for example.

With _POSIX_SYNCHRONIZED_IO, you get stronger guarantees that upon return from the syscall, "synchronized I/O file integrity completion" has occurred. Linux infamously claims _POSIX_SYNCHRONIZED_IO, yet ext2/ext3/ext4 don't implement it fully and will happily reorder fsyncs of the metadata needed to later retrieve a fsynced write of data. So the data itself is written on fsync return sequentially consistent, but not the metadata to later retrieve it, that can be reordered with respect to other fsyncs.

There's other factors, too; even if the OS has fully flushed the data to the storage device, the storage device itself might be holding some of it in an internal volatile buffer and may possibly even perform actual non-volatile writes out of order. (Disclaimer: I haven't looked at the internals of storage hardware for a long time so perhaps they make better guarantees than I think.)

Dominique Devienne

9:30 a.m.

On Wed, Mar 22, 2017 at 12:36 AM, Niall Douglas via Boost < boost@lists.boost.org> wrote:

...

[...] much safer choices of SQLite or UnQLite both of which are battle tested

and written by database experts to ensure that other people's data is

...

never, ever lost.

Very much so for SQLite, but not so for UnQLite IMHO, which is a "mostly abandoned" project of changing SQLite3 to use a new key-value store back-end, thus create in the process a new "SQLite4". The original was lead by Dr Hipp, who later concentrated back on SQLite3, and that project was basically "ripped-off" by some random people. You shouldn't associate the latter with the former, or claim it's battle tested. --DD

Niall Douglas

11:27 a.m.

On 22/03/2017 09:30, Dominique Devienne via Boost wrote:

...

On Wed, Mar 22, 2017 at 12:36 AM, Niall Douglas via Boost < boost@lists.boost.org> wrote:

...
[...] much safer choices of SQLite or UnQLite both of which are battle tested

and written by database experts to ensure that other people's data is

...
never, ever lost.

Very much so for SQLite, but not so for UnQLite IMHO, which is a "mostly abandoned" project of changing SQLite3 to use a new key-value store back-end, thus create in the process a new "SQLite4". The original was lead by Dr Hipp, who later concentrated back on SQLite3, and that project was basically "ripped-off" by some random people.

That's interesting, thank you.

...

You shouldn't associate the latter with the former, or claim it's battle tested. --DD

My claim was based on UnQLite using the same storage abstraction layer as SQLite. I am very confident in that layer. The stuff UnQLite adds on top, that could lose data for sure, but it'll do it much less frequently and in much fewer of the usual cases thanks to using the SQLite abstraction layer. Niall -- ned Productions Limited Consulting http://www.nedproductions.biz/ http://ie.linkedin.com/in/nialldouglas/

3027

Age (days ago)

3036

Last active (days ago)

List overview

Download

91 comments

11 participants

participants (11)

Asbjørn
Dominique Devienne
Gavin Lambert
Glen Fernandes
Karen Shaeffer
Lee Clagett
Niall Douglas
Olaf van der Spek
Peter Dimov
Thijs van den Berg
Vinnie Falco