On 14 March 2019 12:54 Andrey Semashev wrote:
On 3/14/19 3:46 PM, Andrey Semashev wrote:
On 3/14/19 2:29 PM, Florian Lindner via Boost wrote:
On 14.03.19 at 10:11, Andrey Semashev via Boost wrote:
I haven't had experience with Lustre, but I'm guessing it may be related. Did you try calling fsync between close and rename?
No, I was assuming that close() does this. I have modified the code to
    {
        namespace fs = boost::filesystem;
        auto path = getFilename();
        auto tmp = fs::path(path + "~");
        fs::create_directories(tmp.parent_path());
        boost::iostreams::stream<boost::iostreams::file_descriptor_sink> ofs(tmp);
        ofs << info;
        ::fdatasync(ofs->handle());
        ofs.close();
        fs::rename(tmp, path);
    }
Reproducing the bug is hard: so far it has only appeared on really huge runs with more than 4000 processors.
As another possibility, creating and writing to the file is not atomic with subsequent renaming. It is always possible that another process removes or renames the written file before you attempt to rename it. It may not be intended in your setup, but you should verify this possibility, and even if that isn't supposed to happen, be prepared for it to happen anyway (e.g. due to human actions or some sort of system failure).
Having spent ages in the past trying to catch a similar very rare bug - is there some sort of virus detection getting a false positive here?