Hello,

I have this innocent piece of code:

    namespace fs = boost::filesystem;
    auto path = getFilename(); // returns a string
    fs::create_directories(fs::path(path).parent_path());
    std::ofstream ofs(path + "~");
    ofs << info;
    ofs.close();
    fs::rename(path + "~", path);

which causes the exception:

    boost::filesystem::rename: No such file or directory: "../9f/061b4f7a5e529c964659226eedd4e5~", "../9f/061b4f7a5e529c964659226eedd4e5"

However, I have no idea how that could happen. I use the rename so that a reading process never sees an empty file, but only either no file or a file filled with info. Is there any race involved between ofs.close() and fs::rename()? The code was executed on a distributed network filesystem (Lustre).

Any ideas anyone?

Thanks,
Florian
Sorry for the missing subject, I hit the send button accidentally.

On 14.03.19 at 09:31, Florian Lindner via Boost wrote:
> I have this innocent piece of code:
> [...]
On 3/14/19 12:04 PM, Florian Lindner via Boost wrote:
> [...] Is there any race involved between ofs.close() and fs::rename()? The code was executed on a distributed network filesystem (lustre).
>
> Any ideas anyone?

I have no experience with Lustre, but I'm guessing the problem may be related to it. Did you try calling fsync between close and rename?
On 14.03.19 at 10:11, Andrey Semashev via Boost wrote:
> I haven't had experience with Lustre, but I'm guessing it may be related. Did you try calling fsync between close and rename?

No, I was assuming that close() does this. I have modified the code to

    {
        namespace fs = boost::filesystem;
        auto path = getFilename();
        auto tmp = fs::path(path + "~");
        fs::create_directories(tmp.parent_path());
        boost::iostreams::stream<boost::iostreams::file_descriptor_sink> ofs(tmp);
        ofs << info;
        ::fdatasync(ofs->handle());
        ofs.close();
        fs::rename(tmp, path);
    }

Reproducing the bug is hard: so far it has only appeared on really huge runs with more than 4000 processors.

Best,
Florian
On 3/14/19 2:29 PM, Florian Lindner via Boost wrote:
> No, I was assuming that close() does this. I have modified the code to
> [...]
> Reproducing the bug is hard, as so far, it only has appeared on really huge runs with more than 4000 processors.

close doesn't guarantee that written data or metadata has reached the media. In other words, other processes may not observe the file creation immediately after close. fdatasync only guarantees that for data but not metadata. fsync guarantees that for both, which is why I explicitly mentioned it and not fdatasync. For distributed filesystems, "media" typically means something other than the physical storage on the nodes. Exactly what it means depends on the filesystem.

Normally, one would expect that the OS (and the filesystem driver in the OS, in particular) guarantees that file creation is visible at least to the same process (thread) that created the file, even if that operation has not reached the media. It is possible that Lustre doesn't maintain this guarantee, and if so, I would think this is a filesystem problem, not a problem of the user's application or Boost.Filesystem. This may be a design choice (which would be wrong, IMHO) or even a configurable option with some tradeoff, not necessarily a programming bug.
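To make the fsync suggestion concrete, here is a minimal sketch of the write, sync, close, rename sequence using plain POSIX calls. The helper name `write_synced_then_rename` and its error convention are hypothetical and not from the thread; whether this changes anything on Lustre is untested here.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdio>    // std::rename
#include <string>
#include <fcntl.h>   // open
#include <unistd.h>  // write, fsync, close

// Hypothetical helper: writes `data` to `path + "~"`, syncs it, then renames
// it to `path`. Returns 0 on success or the errno of the first failing call.
int write_synced_then_rename(const std::string& path, const std::string& data)
{
    const std::string tmp = path + "~";
    int fd = ::open(tmp.c_str(), O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return errno;

    // write() may write fewer bytes than requested, so loop until done.
    const char* p = data.data();
    std::size_t left = data.size();
    while (left > 0) {
        ssize_t n = ::write(fd, p, left);
        if (n < 0) {
            if (errno == EINTR) continue;  // interrupted, retry
            int err = errno;
            ::close(fd);
            return err;
        }
        p += n;
        left -= static_cast<std::size_t>(n);
    }

    // fsync covers the file's data and its metadata; fdatasync may skip
    // metadata that is not needed to read the data back.
    if (::fsync(fd) != 0) {
        int err = errno;
        ::close(fd);
        return err;
    }
    if (::close(fd) != 0) return errno;
    if (std::rename(tmp.c_str(), path.c_str()) != 0) return errno;
    return 0;
}
```

A call such as `write_synced_then_rename("out.txt", "info")` returns 0 on success. Two caveats: with a buffered stream (as in the posted code) the stream would have to be flushed before syncing the descriptor, and fsync on the file does not necessarily persist the directory entry, which would need an fsync on the directory itself.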
On 3/14/19 3:46 PM, Andrey Semashev wrote:
> [...] I allow that Lustre doesn't maintain this guarantee, and if so, I would think this is a filesystem problem, not that of user's application or Boost.Filesystem. [...]

As another possibility, creating and writing to the file is not atomic with the subsequent renaming. It is always possible that another process removes or renames the written file before you attempt to rename it. This may not be intended in your setup, but you should verify this possibility, and even if it isn't supposed to happen, be prepared for it to happen anyway (e.g. due to human actions or some sort of system failure).
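Being "prepared that it happens anyway" could look like the sketch below: the rename result is classified instead of being allowed to surface as an exception, so the caller can decide to rewrite the file or give up. The enum and function names are hypothetical; the sketch uses POSIX `rename` via `std::rename`, though Boost.Filesystem also offers a `rename` overload that reports through an `error_code` instead of throwing.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdio>   // std::rename, std::fopen, std::remove
#include <string>

// Hypothetical result type for this sketch.
enum class RenameResult { Ok, NotFound, OtherError };

// Hypothetical helper: renames `from` to `to` and classifies the outcome.
RenameResult rename_checked(const std::string& from, const std::string& to)
{
    if (std::rename(from.c_str(), to.c_str()) == 0)
        return RenameResult::Ok;
    // ENOENT means `from` does not exist (any more), or a directory
    // component of `to` is missing.
    return errno == ENOENT ? RenameResult::NotFound
                           : RenameResult::OtherError;
}
```

On `RenameResult::NotFound` a caller might recreate the temporary file and retry a bounded number of times, logging each occurrence so that the rare failure leaves a trace.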
On 14 March 2019 12:54, Andrey Semashev wrote:
> As another possibility, creating and writing to the file is not atomic with subsequent renaming. It is always possible that another process removes or renames the written file before you attempt to rename it. [...]

Having spent ages in the past trying to catch a similar, very rare bug: is there some sort of virus detection getting a false positive here?
On 3/14/19 9:31 AM, Florian Lindner via Boost wrote:
> However, I have no idea how that could happen. [...] Is there any race involved between ofs.close() and fs::rename()? The code was executed on a distributed network filesystem (lustre).
>
> Any ideas anyone?

Run `strace -e trace=%file,close program` and complain to the file system developers about it not working. This doesn't seem to have anything to do with boost::filesystem; it has more to do with the underlying filesystem.

It's not close that matters, but open. Your program should work just fine without the explicit close.

- Adam
On 14.03.19 at 11:19, Adam Majer via Boost wrote:
> `strace -e trace=%file,close program` and complain to file system for not working. This doesn't seem to have anything to do with boost::filesystem, more with the underlying filesystem.
>
> It's not close that matters, but open. Your program should work just fine without the explicit close.

Hey,

I was not trying to blame boost::filesystem, just trying to get ideas about what can cause this. I have modified the code to:

    {
        namespace fs = boost::filesystem;
        auto path = getFilename();
        auto tmp = fs::path(path + "~");
        fs::create_directories(tmp.parent_path());
        boost::iostreams::stream<boost::iostreams::file_descriptor_sink> ofs(tmp);
        ofs << info;
        ::fdatasync(ofs->handle());
        ofs.close();
        fs::rename(tmp, path);
    }

Maybe the fdatasync helps. Using strace is complicated, as I haven't managed to reliably reproduce the bug.

Best,
Florian
> I have this innocent piece of code:
> [...]
> Any ideas anyone?

Your code is racy. POSIX offers no guarantee that a file entry continues to exist after an open(). At any moment the file entry may be renamed, deleted, or otherwise disappear or mutate. Your filing system is therefore entirely within specification if the file entry is not there after an open() returns.

You should adjust your code to be correct.

(To write correct code, you may wish to look into renameat(). You would create an anonymous inode using O_TMPFILE, write its contents, open a handle to the destination directory using O_PATH, and renameat() your temporary inode over the destination file entry, atomically replacing it. If this sounds involved, LLFIO, the reference implementation library for P1031 "Low level file i/o", lets you do this portably using a somewhat less low-level interface.)

Niall
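One way the posted snippet could race, stated here as an assumption rather than something confirmed in the thread: if two processes ever compute the same `path`, they also share the temporary name `path~`, and whichever process renames second finds the entry gone. The sketch below swaps in `mkstemp()` and plain `rename()` instead of the O_TMPFILE approach described above, so that each writer gets its own temporary name. The helper name `write_via_unique_temp` is hypothetical.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdio>    // std::rename
#include <cstdlib>   // mkstemp (POSIX)
#include <string>
#include <vector>
#include <unistd.h>  // write, fsync, close, unlink

// Hypothetical helper: writes `data` to a uniquely named temporary file next
// to `path`, then renames it over `path`. Returns 0 or an errno value.
int write_via_unique_temp(const std::string& path, const std::string& data)
{
    // mkstemp rewrites the trailing XXXXXX in place, so it needs a
    // writable, NUL-terminated buffer.
    const std::string tmpl = path + ".XXXXXX";
    std::vector<char> name(tmpl.begin(), tmpl.end());
    name.push_back('\0');

    int fd = ::mkstemp(name.data());  // creates the file with mode 0600
    if (fd < 0) return errno;

    // Cleanup for failures while the descriptor is still open.
    auto fail = [&](int err) {
        ::close(fd);
        ::unlink(name.data());
        return err;
    };

    const char* p = data.data();
    std::size_t left = data.size();
    while (left > 0) {
        ssize_t n = ::write(fd, p, left);
        if (n < 0) {
            if (errno == EINTR) continue;  // interrupted, retry
            return fail(errno);
        }
        p += n;
        left -= static_cast<std::size_t>(n);
    }
    if (::fsync(fd) != 0) return fail(errno);

    if (::close(fd) != 0) {
        int err = errno;
        ::unlink(name.data());
        return err;
    }
    // rename() replaces an existing `path` atomically on POSIX filesystems.
    if (std::rename(name.data(), path.c_str()) != 0) {
        int err = errno;
        ::unlink(name.data());
        return err;
    }
    return 0;
}
```

Because the temporary name is unique per call, concurrent writers of the same `path` no longer remove each other's temporary file; the last rename simply wins. Note that `mkstemp` creates the file readable only by its owner, so readers running as another user would need the permissions widened before the rename.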
participants (5)
- Adam Majer
- Alex Perry
- Andrey Semashev
- Florian Lindner
- Niall Douglas