[filesystem] How to remove specific files from a directory?
Hi,

I wonder what the recommended boost::filesystem approach is for solving the following problem. I want to remove all files from a given directory that satisfy a certain predicate, e.g., only those whose names start with the letter "s".

It is my understanding that calling filesystem::remove may invalidate the iterator, and therefore the following solution is incorrect:

    fsys::directory_iterator it{path}, itEnd{};
    for ( ; it != itEnd ; ++it )
    {
        if (predicate(*it))
            fsys::remove(it->path());
    }

But is the following guaranteed to work?

    fsys::directory_iterator it{path}, itEnd{};
    for ( ; it != itEnd ; )
    {
        if (predicate(*it))
            fsys::remove((it++)->path());
        else
            ++it;
    }

If not, does there exist a dedicated solution for solving this problem that would not require N traversals through the directory?

Thanks,
&rzej;
On 09/07/16 16:06, Andrzej Krzemienski wrote:
Hi, I wonder what the recommended boost::filesystem approach is for solving the following problem.
I want to remove all files from a given directory that satisfy a certain predicate, e.g., only those whose names start with the letter "s".
It is my understanding that calling filesystem::remove may invalidate the iterator, and therefore the following solution is incorrect:

    fsys::directory_iterator it{path}, itEnd{};
    for ( ; it != itEnd ; ++it )
    {
        if (predicate(*it))
            fsys::remove(it->path());
    }

But is the following guaranteed to work?

    fsys::directory_iterator it{path}, itEnd{};
    for ( ; it != itEnd ; )
    {
        if (predicate(*it))
            fsys::remove((it++)->path());
        else
            ++it;
    }
From the documentation, it seems the behavior should be similar to readdir, in which case it would seem that both pieces of code above are valid, although I would prefer the second one as it is more in line with common C++ practice.
If not, does there exist a dedicated solution for solving this problem that would not require N traversals through the directory?
If you still want a different solution, you could collect the matching file names during the directory traversal and then delete the files in a second loop.
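[A minimal sketch of that two-pass approach, assuming the fsys alias from the original post and a predicate callable with a directory_entry; the function name is made up for the example:]

    #include <boost/filesystem.hpp>
    #include <vector>

    namespace fsys = boost::filesystem;

    template <class Predicate>
    void remove_matching(fsys::path const& dir, Predicate predicate)
    {
        std::vector<fsys::path> victims;

        // Pass 1: collect the paths of all matching entries.
        for (fsys::directory_iterator it{dir}, itEnd{}; it != itEnd; ++it)
            if (predicate(*it))
                victims.push_back(it->path());

        // Pass 2: remove them once the iteration is finished.
        for (fsys::path const& p : victims)
            fsys::remove(p);
    }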
On 7 September 2016 at 15:46, Andrey Semashev wrote:
On 09/07/16 16:06, Andrzej Krzemienski wrote:
Hi, I wonder what the recommended boost::filesystem approach is for solving the following problem.
I want to remove all files from a given directory that satisfy a certain predicate, e.g., only those whose names start with the letter "s".
It is my understanding that calling filesystem::remove may invalidate the iterator, and therefore the following solution is incorrect:

    fsys::directory_iterator it{path}, itEnd{};
    for ( ; it != itEnd ; ++it )
    {
        if (predicate(*it))
            fsys::remove(it->path());
    }

But is the following guaranteed to work?

    fsys::directory_iterator it{path}, itEnd{};
    for ( ; it != itEnd ; )
    {
        if (predicate(*it))
            fsys::remove((it++)->path());
        else
            ++it;
    }
From the documentation, it seems the behavior should be similar to readdir, in which case it would seem that both pieces of code above are valid.
Indeed, boost::filesystem and std::filesystem (ISO/IEC TS 18822:2015) as well as POSIX readdir behave in the same way. But, for all of them, the behaviour is unspecified if the contents of the directory being traversed change.

Best regards,
--
Mateusz Loskot, http://mateusz.loskot.net
On 09/07/16 17:07, Mateusz Loskot wrote:
On 7 September 2016 at 15:46, Andrey Semashev wrote:
On 09/07/16 16:06, Andrzej Krzemienski wrote:
Hi, I wonder what the recommended boost::filesystem approach is for solving the following problem.
I want to remove all files from a given directory that satisfy a certain predicate, e.g., only those whose names start with the letter "s".
It is my understanding that calling filesystem::remove may invalidate the iterator, and therefore the following solution is incorrect:

    fsys::directory_iterator it{path}, itEnd{};
    for ( ; it != itEnd ; ++it )
    {
        if (predicate(*it))
            fsys::remove(it->path());
    }

But is the following guaranteed to work?

    fsys::directory_iterator it{path}, itEnd{};
    for ( ; it != itEnd ; )
    {
        if (predicate(*it))
            fsys::remove((it++)->path());
        else
            ++it;
    }
From the documentation, it seems the behavior should be similar to readdir, in which case it would seem that both pieces of code above are valid.
Indeed, boost::filesystem and std::filesystem (ISO/IEC TS 18822:2015) as well as POSIX readdir behave in the same way. But, for all of them, the behaviour is unspecified if the contents of the directory being traversed change.
It is unspecified whether the removed/added files will be discovered as part of the traversal. Other than that, the behavior is defined. For instance, the implementation should not crash and the iterator should still reach the end at some point.
On 7 September 2016 at 16:45, Andrey Semashev wrote:
On 09/07/16 17:07, Mateusz Loskot wrote:
On 7 September 2016 at 15:46, Andrey Semashev wrote:
On 09/07/16 16:06, Andrzej Krzemienski wrote:
Hi, I wonder what the recommended boost::filesystem approach is for solving the following problem.
I want to remove all files from a given directory that satisfy a certain predicate, e.g., only those whose names start with the letter "s".
It is my understanding that calling filesystem::remove may invalidate the iterator, and therefore the following solution is incorrect:

    fsys::directory_iterator it{path}, itEnd{};
    for ( ; it != itEnd ; ++it )
    {
        if (predicate(*it))
            fsys::remove(it->path());
    }

But is the following guaranteed to work?

    fsys::directory_iterator it{path}, itEnd{};
    for ( ; it != itEnd ; )
    {
        if (predicate(*it))
            fsys::remove((it++)->path());
        else
            ++it;
    }
From the documentation, it seems the behavior should be similar to readdir, in which case it would seem that both pieces of code above are valid.
Indeed, boost::filesystem and std::filesystem (ISO/IEC TS 18822:2015) as well as POSIX readdir behave in the same way. But, for all of them, the behaviour is unspecified if the contents of the directory being traversed change.
It is unspecified whether the removed/added files will be discovered as part of the traversal. Other than that, the behavior is defined. For instance, the implementation should not crash and the iterator should still reach the end at some point.
Right, the iteration following any changes in the filesystem content remains valid. Thanks for the correction.

Best regards,
--
Mateusz Loskot, http://mateusz.loskot.net
On 7 Sep 2016 at 17:45, Andrey Semashev wrote:
Indeed, boost::filesystem and std::filesystem (ISO/IEC TS 18822:2015) as well as POSIX readdir behave in the same way. But, for all of them, behaviour is unspecified in case content of directory that is being traversed changes.
It is unspecified whether the removed/added files will be discovered as part of the traversal. Other than that, the behavior is defined. For instance, the implementation should not crash and the iterator should still reach the end at some point.
Be aware that Windows can take quite a bit of time to delete a file; the deletion happens asynchronously from the syscall. In this situation it's entirely possible that one could delete every entry in a directory, yet every entry would still be there. Windows may take until the next reboot to delete a file, and until then it cannot be opened by anybody.

(AFIO v1 took special measures to quietly filter out pending-delete files. AFIO v2, in line with its new bare metal approach, simply renames to-be-deleted files to <32 random bytes>.deleted before requesting deletion.)

Niall
--
ned Productions Limited Consulting
http://www.nedproductions.biz/ http://ie.linkedin.com/in/nialldouglas/
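[A minimal sketch, not AFIO's actual code, of the rename-before-delete pattern described above, using Boost.Filesystem; the function name and the 16-character random-name model are assumptions for the example:]

    #include <boost/filesystem.hpp>

    namespace fsys = boost::filesystem;

    // Rename the file to a random ".deleted" name in the same directory
    // (same volume, so the rename is cheap), then request the deletion.
    // The original name becomes reusable immediately, even while Windows
    // still has the real deletion pending.
    void remove_posix_style(fsys::path const& p)
    {
        fsys::path hidden = p.parent_path() /
            fsys::unique_path("%%%%%%%%%%%%%%%%.deleted");

        boost::system::error_code ec;
        fsys::rename(p, hidden, ec);
        if (ec)
            fsys::remove(p, ec);      // rename failed: fall back to a plain remove
        else
            fsys::remove(hidden, ec); // delete under the throwaway name
    }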
On 9/8/2016 12:24 AM, Niall Douglas wrote: <snip>
... Windows may take until the next reboot to delete a file, and until then it cannot be opened by anybody.
This is news to me. Do you have any links to documentation on this?

eg
On 12 Sep 2016 at 7:51, eg wrote:
... Windows may take until the next reboot to delete a file, and until then it cannot be opened by anybody.
This is news to me. Do you have any links to documentation on this?
Short answer: as the Win32 documentation for DeleteFile() says: "The DeleteFile function marks a file for deletion on close. Therefore, the file deletion does not occur until the last handle to the file is closed [in the system]. Subsequent calls to CreateFile to open the file fail with ERROR_ACCESS_DENIED." https://msdn.microsoft.com/en-us/library/windows/desktop/aa363915%28v=vs.85%29.aspx [square brackets stuff added by me]

Longer answer: the Windows NT kernel was originally the next edition of the VAX VMS kernel, and was considered by many to have a set of superior, if more conservative, design choices than the Unixes of the day which ended up becoming POSIX. Many of the NT kernel APIs are very similar to those of VMS as a result. One interesting feature of the VMS filesystem was that when you deleted a file, the system went off and securely scrubbed the contents before doing the deletion, a process which could take considerable time. NTFS and Windows NT inherited that behaviour, and it was furthermore considered valuable for secure-by-design code to lock out use of a file being deleted, because it makes inode swapping tricks and other such security holes on POSIX systems impossible on VMS and NT systems. NT absolutely allows you to explicitly opt into POSIX semantics by renaming a file before deletion, as AFIO does; the NT defaults are thus more secure than the POSIX default behaviour. Ultimately of course the ship has sailed, and POSIX is now the standard. NT reflects a continuing objection to many design failures in POSIX, especially around the filesystem, where POSIX has many deeply flawed design decisions.

As a result of the above behaviours, unfortunately the lion's share of code out there written for Windows which deals with the filesystem is simply wrong. It just happens to work most of the time, and people are too ignorant and/or don't care that it is racy and will cause misoperation for some users. A lot of big famous open source projects indeed refuse to fix incorrect code after a bug report because they just don't believe it's a problem, mainly through not fully understanding what files that can't be unlinked mean for correct filesystem code design. There is a group of dedicated AFIO repo followers who have tried logging bugs about these bad design patterns with various projects, and it's been quite amazing how little people care that their code will always fail under the right conditions.

But in the end the file system has always been treated as unchanging when programmers write code for it, thus leading to a large number of race bugs and security holes caused by unconsidered concurrent third-party induced changes. AFIO is intended to help programmers avoid these sorts of issues more easily than is possible with just the Boost and standard C++ library facilities, where it is currently quite hard to write completely bug-free filesystem code without resorting to proprietary APIs.

Those wanting lots more detail may find my conference presentations worth watching:

20150924 CppCon Racing the File System Workshop
https://www.youtube.com/watch?v=uhRWMGBjlO8

20160421 ACCU Distributed Mutual Exclusion using Proposed Boost.AFIO
https://www.youtube.com/watch?v=elegewDwm64

The third in the series is coming next week at CppCon.

Niall
--
ned Productions Limited Consulting
http://www.nedproductions.biz/ http://ie.linkedin.com/in/nialldouglas/
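[A small Win32 sketch, not from the thread, illustrating the documented behaviour quoted above: DeleteFile only marks the file for deletion, and while another handle is still open a new CreateFile fails with ERROR_ACCESS_DENIED. The file name is made up for the example:]

    #include <windows.h>
    #include <cstdio>

    int main()
    {
        // Create the file and keep the handle open. FILE_SHARE_DELETE lets
        // DeleteFile succeed while this handle still exists.
        HANDLE h = ::CreateFileW(L"victim.tmp", GENERIC_WRITE,
            FILE_SHARE_READ | FILE_SHARE_WRITE | FILE_SHARE_DELETE,
            nullptr, CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, nullptr);

        ::DeleteFileW(L"victim.tmp");  // marks the file for deletion on close

        // The entry is now pending deletion, so this open is expected to
        // fail with ERROR_ACCESS_DENIED (5).
        HANDLE h2 = ::CreateFileW(L"victim.tmp", GENERIC_READ, FILE_SHARE_READ,
            nullptr, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr);
        if (h2 == INVALID_HANDLE_VALUE)
            std::printf("reopen failed, GetLastError() = %lu\n", ::GetLastError());

        ::CloseHandle(h);              // the deletion actually happens here
        return 0;
    }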
On 13 Sep 2016 at 7:37, degski wrote:
... Longer answer ...
Thanks for the write-up... It's a shame Windows doesn't do the VMS file-shredding though...
It would be hard to implement in NTFS. Each file is stored in a chain of 64Kb extents. Modifying a segment is a read-copy-update operation plus relinking of the chain, so as a file is updated you are basically leaking bits of data all over the free space list over time. Therefore shredding on delete is not particularly effective at truly deleting the file contents on NTFS, and that's why running their defrag API on a cronjob is a much better way of doing it (and I think what the DoD C2 secure edition does).

I should apologise to the list for not actually explaining yesterday why deleted files take a while to delete on Windows. All I can say is it's very busy as Boost Summer of Code winds down and CppCon nears. It's too easy to brain dump. The historical reason for that behaviour was explained, but not why it's still done today.

The reason is that NTFS and Windows really do care about your data, and force a metadata fsync to the journal on the containing directory when you delete a file entry within it. Obviously this forces a journal write per file entry deleted, and if you're deleting say 1m file entries from a directory that would mean 1m fsyncs. To solve this, Windows actively avoids deleting files if the filesystem is busy, despite all handles being closed and the file being marked with the delete-on-close flag. I've seen up to two seconds in testing here locally. It'll then do a batch pass of writing a new MFT record with all the deleted files removed and fsync that, so instead of 1m fsyncs there is just one.

Some might ask why not immediately unlink it in RAM as Linux does? Linux historically really didn't try hard to avoid data loss on sudden power loss, and even today it uniquely requires programmers to explicitly call fsync on containing directories in order to achieve sudden power loss safety. NTFS and Windows try much harder, and always try to keep what *metadata* the program sees via the kernel syscalls equal to what is on physical storage (actual file data is a totally separate matter). It makes programming reliable filesystem code much easier on Windows than on Linux, which was traditionally a real bear.

(ZFS on FreeBSD interestingly takes a middle approach between Windows' and Linux's: it allows a maximum 5-second reordering window, after which writes arrive on physical storage exactly in the order issued. This lets the program get ahead of storage by up to 30 seconds or so, but because you get a fairly total sequentially consistent ordering it makes sudden power loss recovery vastly easier, because you only need to scan +/- 5 seconds to recover a valid state.)

Niall
--
ned Productions Limited Consulting
http://www.nedproductions.biz/ http://ie.linkedin.com/in/nialldouglas/
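[For reference, a minimal POSIX sketch, not from the thread, of the "explicitly fsync the containing directory" pattern mentioned above for Linux; the function name is made up:]

    #include <fcntl.h>
    #include <unistd.h>

    // Unlink a file and then fsync its containing directory, so that the
    // removal of the directory entry survives sudden power loss.
    bool unlink_durably(const char* dir, const char* name)
    {
        int dfd = ::open(dir, O_RDONLY | O_DIRECTORY);
        if (dfd < 0)
            return false;

        bool ok = ::unlinkat(dfd, name, 0) == 0  // remove the entry
               && ::fsync(dfd) == 0;             // flush the directory metadata
        ::close(dfd);
        return ok;
    }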
On 09/13/16 21:51, Niall Douglas wrote:
Some might ask why not immediately unlink it in RAM as Linux does? Linux historically really didn't try hard to avoid data loss on sudden power loss, and even today it uniquely requires programmers to explicitly call fsync on containing directories in order to achieve sudden power loss safety. NTFS and Windows try much harder, and always try to keep what *metadata* the program sees via the kernel syscalls equal to what is on physical storage (actual file data is a totally separate matter). It makes programming reliable filesystem code much easier on Windows than on Linux, which was traditionally a real bear.
I'm not sure I understand how the Windows behavior you described provides better protection against power loss. If the power is lost before metadata is flushed to media, then the file stays present after reboot. The same happens in Linux, AFAICT, only you can influence the FS behavior with mount options.

The irritating difference is that even though the file is deleted (by all the means the application has to observe it), the OS still doesn't allow you to delete the containing folder because it's not empty. I'm seeing this effect nearly every time I boot into Windows - when I delete the bin.v2 directory created by Boost.Build. There may be historical reasons for it, but seriously, if the OS tries to cheat and pretends the file is deleted then it should go the whole way and act as if it is.

Workarounds like rename+delete are a sorry excuse because it's really difficult to say where to rename the file in the presence of reparse points, quotas and permissions. And most importantly - why should one jump through these hoops in one specific case, on Windows? The same goes for the inability to delete/move open files.
On 13 Sep 2016 at 22:18, Andrey Semashev wrote:
Some might ask why not immediately unlink it in RAM as Linux does? Linux historically really didn't try hard to avoid data loss on sudden power loss, and even today it uniquely requires programmers to explicitly call fsync on containing directories in order to achieve sudden power loss safety. NTFS and Windows try much harder, and always try to keep what *metadata* the program sees via the kernel syscalls equal to what is on physical storage (actual file data is a totally separate matter). It makes programming reliable filesystem code much easier on Windows than on Linux, which was traditionally a real bear.
I'm not sure I understand how the Windows behavior you described provides better protection against power loss. If the power is lost before metadata is flushed to media, then the file stays present after reboot. The same happens in Linux, AFAICT, only you can influence the FS behavior with mount options.
You're thinking in terms of "potential loss of user data", and in that sense you're right. I'm referring to "writing multi-process concurrent filesystem code which is algorithmically correct and won't lose data". In this situation, having the kernel only tell you what is actually physically on disk makes life much easier when writing correct code. In Linux in particular you have to spam fsync all over the place and pray the user hasn't set "journal=writeback" or turned barriers off etc, and also such design patterns are inefficient as you end up doing too many directory fsyncs.

During AFIO v1 I used to get very annoyed that metadata views on Windows from other processes did not match the modifying process' view until the updates reached physical storage, so process A could extend a file and process B wouldn't see the extension until potentially many seconds later (same goes for hard links, timestamps etc). It seemed easier if every process saw the same thing and had a sequentially consistent view. But with the benefit of getting used to it, and also the fact that Linux (+ ext4) would appear to be the exceptional outlier here, it does have a compelling logic and it definitely can be put to very good use when writing algorithmically correct filesystem code.
The irritating difference is that even though the file is deleted (by all the means the application has to observe it), the OS still doesn't allow you to delete the containing folder because it's not empty.
Ah, but the file is not deleted, so refusing to delete the containing folder is correct. It is "pending deletion", which means anything still using it can continue to do so, but nothing new can use it [1]. You can, of course, also unmark a file marked for deletion in Windows. Linux has a similar feature by letting you create a file entry to an anonymous inode.

[1]: Also an opt-out Windows behaviour.
I'm seeing this effect nearly every time I boot into Windows - when I delete the bin.v2 directory created by Boost.Build. There may be historical reasons for it, but seriously, if the OS tries to cheat and pretends the file is deleted then it should go the whole way and act as if it is.
Are you referring to Windows Explorer hiding stuff you delete with it when it's not really deleted? That's a relatively recent addition to Windows Explorer. It's very annoying.
Workarounds like rename+delete are a sorry excuse because it's really difficult to say where to rename the file in the presence of reparse points, quotas and permissions. And most importantly - why should one jump through these hoops in one specific case, on Windows? The same goes for the inability to delete/move open files.
You can delete, rename and move open files just fine on Windows. Indeed, an AFIO v1 unit test fires up a thread randomly renaming a few dozen files and directories, and then ensures that a loop of filesystem operations on a rapidly changing filesystem does not race nor misoperate.

You are correct that you must opt in to being able to do this. The Windows kernel folk correctly observed that most programmers, even otherwise expert ones, consistently write unsafe filesystem code. They therefore defaulted an abundance of options to safety (and I would agree too much so, especially making symbolic links effectively an unusable feature).

Regarding an ideally efficient way of correctly deleting a directory tree on Windows, AFIO v1 had an internal algorithm which, when faced with pending-delete files during a directory tree deletion, would probe around for suitable locations to rename them to in order to scrub the directory tree immediately. It was pretty effective, especially if %TEMP% is on the same volume, and the NT kernel API makes figuring out what's also on your volume trivial compared to, say, statfs() on Linux, which is awful. AFIO v2 will at some point expose that algorithm as a generic templated edition in afio::algorithm so anybody can use it.

In the end, these platform-specific differences are indeed annoying. But that's the whole point of system libraries and abstraction libraries like many of those in Boost: you write code once and it works equally everywhere.

Niall
--
ned Productions Limited Consulting
http://www.nedproductions.biz/ http://ie.linkedin.com/in/nialldouglas/
participants (6)
- Andrey Semashev
- Andrzej Krzemienski
- degski
- eg
- Mateusz Loskot
- Niall Douglas