On 22/03/2017 04:13, Gavin Lambert via Boost wrote:
On 22/03/2017 16:08, Vinnie Falco via Boost wrote:
I think this can be unit tested, and I believe that NuDB's unit test covers the case of power loss. I think we can agree that power loss on a read is uninteresting (since it can't corrupt data). The unit test models a power loss as a fatal error during a write. The test exercises all possible fatal errors using an incremental approach (I alluded to this in my previous message).
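A minimal sketch of what such an incremental fault-injection test could look like, in Python for brevity. This is not NuDB's actual test code: `FaultyFile`, `append_records`, `recover` and the length-prefixed record format are all hypothetical, invented here for illustration. The idea is to fail on write 1, then write 2, and so on, checking recovery after every injected fault, until a run completes with no fault at all.

```python
import os
import struct
import tempfile


class InjectedFault(Exception):
    """Stands in for a fatal error (a modelled power loss) during a write."""


class FaultyFile:
    """Hypothetical file wrapper that fails on the n-th write call."""

    def __init__(self, path, fail_on):
        self._f = open(path, "ab")
        self._fail_on = fail_on
        self._count = 0

    def write(self, data):
        self._count += 1
        if self._count == self._fail_on:
            # Persist only half of the bytes, then "lose power".
            self._f.write(data[: len(data) // 2])
            self._f.close()
            raise InjectedFault(f"write #{self._count}")
        self._f.write(data)

    def close(self):
        self._f.close()


def append_records(f, records):
    """Code under test: one write for the header, one for the payload."""
    for payload in records:
        f.write(struct.pack(">I", len(payload)))  # 4-byte length header
        f.write(payload)


def recover(path):
    """Return the complete records and truncate any incomplete tail."""
    with open(path, "rb") as f:
        blob = f.read()
    records, pos = [], 0
    while pos + 4 <= len(blob):
        (size,) = struct.unpack_from(">I", blob, pos)
        if pos + 4 + size > len(blob):
            break  # payload was cut short
        records.append(blob[pos + 4 : pos + 4 + size])
        pos += 4 + size
    with open(path, "r+b") as f:
        f.truncate(pos)
    return records


def run_incremental_test(records):
    """Fail on write 1, then 2, ... until a run completes without a fault."""
    faults = 0
    n = 1
    while True:
        fd, path = tempfile.mkstemp()
        os.close(fd)
        try:
            f = FaultyFile(path, fail_on=n)
            try:
                append_records(f, records)
                completed = True
            except InjectedFault:
                completed = False
                faults += 1
            finally:
                f.close()
            recovered = recover(path)
            # Whatever survives must be a prefix of what was intended.
            assert recovered == records[: len(recovered)]
            if completed:
                assert recovered == records
                return faults
        finally:
            os.remove(path)
        n += 1


fault_points = run_incremental_test([b"alpha", b"beta", b"gamma"])
```

The loop needs no prior knowledge of how many writes the operation performs: it stops at the first run in which the injected fault is never reached.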
A power loss is more like a fatal error that fails to execute any subsequent clean-up code, so it might not be quite the same.
There are also more pathological cases such as where a write has been partially successful and done some subset of increasing the file size, zeroing the extra file space, and writing some subset of the intended data. So it's not necessarily that data is missing; there might be invalid data in its place.
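To make those pathological states concrete, here is a small Python sketch that builds a few torn-write images in memory and checks that a scanner rejects each one. The record layout (length, CRC-32, payload) and the names `frame` and `scan` are assumptions made for this example only, not anything NuDB is known to do.

```python
import struct
import zlib

HEADER = struct.Struct(">II")  # payload length, CRC-32 of payload


def frame(payload):
    """Serialise one record as header + payload."""
    return HEADER.pack(len(payload), zlib.crc32(payload)) + payload


def scan(blob):
    """Return the payloads of the leading run of valid records."""
    out, pos = [], 0
    while pos + HEADER.size <= len(blob):
        size, crc = HEADER.unpack_from(blob, pos)
        body = blob[pos + HEADER.size : pos + HEADER.size + size]
        # size == 0 is rejected because an all-zero header would otherwise
        # pass: the CRC-32 of an empty payload is 0.
        if size == 0 or len(body) != size or zlib.crc32(body) != crc:
            break
        out.append(body)
        pos += HEADER.size + size
    return out


good = frame(b"first record")
new = frame(b"second record")

# Each image models a different way the second write could be torn.
torn_states = {
    # File grew, but only part of the record reached the disk.
    "truncated": good + new[: len(new) // 2],
    # File grew and the new space was zeroed; no data was written.
    "extended_and_zeroed": good + bytes(len(new)),
    # Header landed, payload area still holds zeros.
    "header_only": good + new[: HEADER.size] + bytes(len(new) - HEADER.size),
    # All but the last four payload bytes landed.
    "partial_payload": good + new[:-4] + bytes(4),
}

results = {name: scan(blob) for name, blob in torn_states.items()}
intact = scan(good + new)
```

The zero-length check is the design point worth noting: a scanner that trusted the checksum alone would accept a freshly zeroed region as a run of valid empty records.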
There are a few rings to testing data loss safety:

Ring 1: Does my application code correctly handle all possible errors in all possible contexts? This can be tested using Monte Carlo methods, fuzzing, parameter permutations, unit and functional testing.

Ring 2: Does my code correctly handle sudden stop? This can be tested using LXC containers where you kill -9 the container mid-test. Use Monte Carlo runs for verification.

Ring 3: Does my code correctly handle sudden kernel stop? This can be tested using kvm or qemu where you kill -9 the virtualised OS mid-test.

Ring 4: Does my code correctly handle sudden power loss to the CPU? This can be tested using a few dozen cheap odroid devices where you manually trip their watchdog hard reset hardware feature. This solution has the big advantage of not requiring the SSD used to be sudden power loss safe :)

Ring 5: Does my code correctly handle sudden power loss to the storage? This requires more work, and you'll find endless bugs in the kernel, filing system and the storage device, but you can install a hardware switch to cut power to the storage device mid-test. This is a never-ending "fun" task, and it's far too uncommonly tested by the kernel vendors, but it's a great simulation of how well faulty storage is handled.

Ring 6: Does my code correctly handle sudden power loss to the system? Unlike Ring 5, this is actually a better tested situation. Sudden power loss to everything at once is probably less buggy than Ring 5. Still, you can get data loss at any level, from the kernel, to the SATA chip, to the device itself.

There are also other test rings not related to sudden power loss. For example, single and paired bit flips are not uncommon in terabytes of storage, either transient or permanent. These can be simulated using kvm, with you manually flipping random bits in the disc image as it runs. You might become quite appalled at what data gets destroyed by bugs in the filing system when facing flipped bits.
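As a rough illustration of the bit flip test, here is a Python sketch that flips a pair of random bits in an in-memory stand-in for a disc image and checks that per-block checksums flag exactly the damaged blocks. It is an offline simplification: a real run would flip bits in the image file while the VM is using it, and the block size, seed and helper names here are arbitrary choices for the example.

```python
import random
import zlib

BLOCK = 4096  # assumed block size for the example


def checksums(image):
    """CRC-32 of every BLOCK-sized slice of the image."""
    return [zlib.crc32(image[i : i + BLOCK]) for i in range(0, len(image), BLOCK)]


def flip_bits(image, count, rng):
    """Return a copy of image with `count` distinct bits flipped,
    together with the bit positions that were flipped."""
    damaged = bytearray(image)
    positions = rng.sample(range(len(image) * 8), count)
    for bit in positions:
        damaged[bit // 8] ^= 1 << (bit % 8)
    return bytes(damaged), positions


rng = random.Random(2017)  # fixed seed so the run is repeatable
image = bytes(rng.getrandbits(8) for _ in range(16 * BLOCK))

before = checksums(image)
damaged, positions = flip_bits(image, 2, rng)  # a paired bit flip
after = checksums(damaged)

# Blocks whose checksum changed, versus blocks that really were damaged.
bad_blocks = sorted(i for i, (a, b) in enumerate(zip(before, after)) if a != b)
expected = sorted({bit // 8 // BLOCK for bit in positions})
```

Checksums only tell you that a block is bad. The interesting part of the real test is watching what the filing system then does with that block.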
Niall

-- 
ned Productions Limited Consulting
http://www.nedproductions.biz/ http://ie.linkedin.com/in/nialldouglas/