boost::filesystem - directory_iterator makes optimization hard

Hi, I recently had a need to do directory lookups in C++ and thought I'd take a look at boost::filesystem. I ran the sample_ls.cpp from the documentation, and it worked great. The problem is that directory_iterator appears to return a pointer to a path object. This path object is then passed to functions such as is_directory() to find the type. On systems that support type information in the directory entry itself, this structure limits what data can be returned in an iteration. I would suggest something like (roughly): class path; class dirent; class directory_iterator { // ... const dirent * operator-> () const; // ... }; class dirent { // some easy method to convert to path operator path (); // or perhaps a safer method would just be // to duplicate the path functions const std::string & leaf() const; // etc. // and then possible optimized versions bool is_directory() const; // ... }; The members of dirent would mirror the available functions that use the path object. If no optimization is possible, it just calls what the user would have had to call anyway. But if it is possible to optimize certain items (such as a struct dirent containing d_type), this would be used, possibly saving a call to stat(). Normally optimization should be left as implementation details... but in this case, I believe the class design limits what optimization is allowed. I would be pleased if I'm wrong on this. Thanks for reading this far. - Chris

"Chris Frey" <cdfrey@netdirect.ca> wrote in message news:20050602055255.GA32207@netdirect.ca...
Hi,
I recently had a need to do directory lookups in C++ and thought I'd take a look at boost::filesystem. I ran the sample_ls.cpp from the documentation, and it worked great.
The problem is that directory_iterator appears to return a pointer to a path object. This path object is then passed to functions such as is_directory() to find the type.
On systems that support type information in the directory entry itself, this structure limits what data can be returned in an iteration.
I would suggest something like (roughly):
class path; class dirent; class directory_iterator { // ... const dirent * operator-> () const; // ... };
class dirent { // some easy method to convert to path operator path (); // or perhaps a safer method would just be // to duplicate the path functions const std::string & leaf() const; // etc.
// and then possible optimized versions bool is_directory() const;
// ... };
The members of dirent would mirror the available functions that use the path object. If no optimization is possible, it just calls what the user would have had to call anyway. But if it is possible to optimize certain items (such as a struct dirent containing d_type), this would be used, possibly saving a call to stat().
Normally optimization should be left as implementation details... but in this case, I believe the class design limits what optimization is allowed. I would be pleased if I'm wrong on this.
Thanks for reading this far.
The question has come up in the past, although I don't recall anyone proposing exactly the solution you suggest. So the thoughts that follow are based on considerable thought, although that doesn't always mean much. Here is the analysis: * That is a lot of additional interface complexity to support an optimization that applies to Windows but not POSIX. Some of the other schemes (which involved additional overloads to specific operations functions) had less visible impact on the interface. * There have been no timings to indicate the inefficiency of the current interface actually impacts production applications. * AFAICS, there is nothing in the current interface which prevents caching of directory entries. Caching would probably aid more use cases than proposed changes to interfaces. But both caching and user dirent storage introduce serious additional race conditions. Not a good thing. In fact a showstopper unless cache management is introduced, further complicating the interface. * There is another outstanding issue (lack of directory iterator filtering and/or globbing) that does in fact impact both timing and ease of use of real-world applications and so is the highest priority for future work. Those are enough concerns to make the current interface look pretty good, IMO. Thanks for your interest, --Beman

Beman Dawes wrote:
* That is a lot of additional interface complexity to support an optimization that applies to Windows but not POSIX. Some of the other schemes (which involved additional overloads to specific operations functions) had less visible impact on the interface.
* There have been no timings to indicate the inefficiency of the current interface actually impacts production applications.
* AFAICS, there is nothing in the current interface which prevents caching of directory entries. Caching would probably aid more use cases than proposed changes to interfaces. But both caching and user dirent storage introduce serious additional race conditions. Not a good thing. In fact a showstopper unless cache management is introduced, further complicating the interface.
As a data point, PHP does stat caching: http://www.php.net/manual/en/function.is-dir.php because of performance issues. Stat caching does apply to POSIX systems as well. Either a transparent caching scheme or is_* overloads taking a directory iterator may work. The race conditions argument isn't very strong. The information returned by is_directory is inherently race-prone, caching or no caching. I won't be surprised if an OS does its own per-process stat caching, for instance.

"Peter Dimov" <pdimov@mmltd.net> wrote in message news:000901c56770$deb96fe0$6401a8c0@pdimov2...
Beman Dawes wrote:
* That is a lot of additional interface complexity to support an optimization that applies to Windows but not POSIX. Some of the other schemes (which involved additional overloads to specific operations functions) had less visible impact on the interface.
* There have been no timings to indicate the inefficiency of the current interface actually impacts production applications.
* AFAICS, there is nothing in the current interface which prevents caching of directory entries. Caching would probably aid more use cases than proposed changes to interfaces. But both caching and user dirent storage introduce serious additional race conditions. Not a good thing. In fact a showstopper unless cache management is introduced, further complicating the interface.
As a data point, PHP does stat caching:
Interesting. Thanks for the reference.
because of performance issues. Stat caching does apply to POSIX systems as well. Either a transparent caching scheme or is_* overloads taking a directory iterator may work.
Yes, and one or the other of those would be my preferred solution if we ever decide to do something about the issue.
The race conditions argument isn't very strong. The information returned by is_directory is inherently race-prone, caching or no caching. I won't be surprised if an OS does its own per-process stat caching, for instance.
I'd be very surprised if modern OS implementations didn't do some caching, at least at the disk page level, if not also at the directory or dirent level. But the OS is more likely than a user program to know when to refresh or flush a cache entry. The belief that there is probably OS caching is one of the reasons I'd like to see actual performance measurements in a realistic use case before worrying about efficiency. --Beman

On Thu, Jun 02, 2005 at 08:16:59AM -0400, Beman Dawes wrote:
* That is a lot of additional interface complexity to support an optimization that applies to Windows but not POSIX. Some of the other schemes (which involved additional overloads to specific operations functions) had less visible impact on the interface.
Windows is not the only system with d_type. I was writing with Linux in mind.
* There have been no timings to indicate the inefficiency of the current interface actually impacts production applications.
I'm not sure this is the right view to take when it comes to something as low level and highly used as a filesystem. Taking a bird's eye view, an "strace ls" compared with an "strace simple_ls" results in a stat kernel call for every file and directory found. This can add up. The application I'm writing is a replacement for a mail script on a heavily loaded system. This script runs regularly and the disk is the biggest bottleneck (not just for my application, but in general too). Since the system is loaded, I'm looking to cut the number of system calls as much as possible, especially since I'm going to the trouble of rewriting this in C++. :-) Taking this view, I decided to run a test, comparing directory_iterator with the find system utility, since the style of work is similar to one part of my application, and the result set will be large enough to see any differences. Code for my modified simple_ls.cpp is below. If I've made a horrendous performance blunder, please let me know, because the difference seems quite high. My test system is linux 2.4.x, on a PII, IDE disk, with 128 megs of RAM. This is enough to hold my home directory info in memory (the disk doesn't thrash after my initial run). My home directory has a number of files: bash-2.05b$ find | wc -l 33370 First the simple_ls, run 3 times back to back after an initial run: bash-2.05b$ time ./simple_ls > /dev/null real 0m7.763s user 0m7.000s sys 0m0.700s bash-2.05b$ time ./simple_ls > /dev/null real 0m7.724s user 0m6.970s sys 0m0.710s bash-2.05b$ time ./simple_ls > /dev/null real 0m7.702s user 0m7.110s sys 0m0.580s strace summary: bash-2.05b$ strace -c ./simple_ls > /dev/null execve("./simple_ls", ["./simple_ls"], [/* 44 vars */]) = 0 % time seconds usecs/call calls errors syscall ------ ----------- ----------- --------- --------- ---------------- 78.40 0.760433 23 33498 15 stat64 8.11 0.078632 10 7540 getdents64 7.91 0.076689 20 3757 open 1.63 0.015779 4 3757 close 1.58 0.015302 4 3757 fstat64 1.18 0.011491 5 2420 write 1.15 0.011201 3 3749 fcntl64 0.02 0.000174 11 16 mmap2 0.01 0.000084 14 6 read 0.01 0.000064 32 2 munmap 0.00 0.000024 6 4 brk 0.00 0.000015 15 1 getcwd 0.00 0.000007 7 1 uname 0.00 0.000004 4 1 1 ioctl ------ ----------- ----------- --------- --------- ---------------- 100.00 0.969899 58509 16 total Next I ran find: bash-2.05b$ time find > /dev/null real 0m0.310s user 0m0.140s sys 0m0.170s bash-2.05b$ time find > /dev/null real 0m0.339s user 0m0.180s sys 0m0.140s bash-2.05b$ time find > /dev/null real 0m0.313s user 0m0.160s sys 0m0.150s strace summary: bash-2.05b$ strace -c find > /dev/null execve("/usr/bin/find", ["find"], [/* 44 vars */]) = 0 % time seconds usecs/call calls errors syscall ------ ----------- ----------- --------- --------- ---------------- 37.51 0.111190 7 15750 lstat64 21.50 0.063734 8 7535 getdents64 15.62 0.046298 6 7493 chdir 10.97 0.032505 9 3752 open 4.90 0.014519 4 3751 close 4.06 0.012038 3 3751 fstat64 3.14 0.009313 2 3747 fcntl64 2.23 0.006596 3 2036 write 0.02 0.000072 12 6 mmap2 0.02 0.000050 25 2 munmap 0.01 0.000039 20 2 read 0.01 0.000021 4 5 brk 0.00 0.000009 5 2 fchdir 0.00 0.000007 7 1 uname 0.00 0.000003 3 1 time 0.00 0.000003 3 1 1 ioctl ------ ----------- ----------- --------- --------- ---------------- 100.00 0.296397 47835 1 total
* AFAICS, there is nothing in the current interface which prevents caching of directory entries. Caching would probably aid more use cases than
Caching would likely help most long-running applications, but mine has a very short lifetime and needs to run this new each time.
* There is another outstanding issue (lack of directory iterator filtering and/or globbing) that does in fact impact both timing and ease of use of real-world applications and so is the highest priority for future work.
This would indeed help my application if I could filter on directory / file / symlink type, etc. Anyway, an optimization based on overloaded is_* functions taking a straight iterator would work. I didn't think of that earlier. I fear that the performance implications of the different calls will not be obvious, but I guess that's a job for documentation. :-) Thanks, - Chris The simple_ls.cpp modified into a simple_find.cpp: // simple_ls program -------------------------------------------------------// // � Copyright Jeff Garland and Beman Dawes, 2002 // Use, modification, and distribution is subject to the Boost Software // License, Version 1.0. (See accompanying file LICENSE_1_0.txt or copy at // http://www.boost.org/LICENSE_1_0.txt) // See http://www.boost.org/libs/filesystem for documentation. #include "boost/filesystem/operations.hpp" #include "boost/filesystem/path.hpp" #include <iostream> namespace fs = boost::filesystem; void print_dir(const fs::path &full_path) { std::cout << full_path.native_directory_string() << "\n"; fs::directory_iterator end_iter; for ( fs::directory_iterator dir_itr( full_path ); dir_itr != end_iter; ++dir_itr ) { try { if ( fs::is_directory( *dir_itr ) ) { print_dir(*dir_itr); } else { std::cout << dir_itr->native_file_string() << "\n"; } } catch ( const std::exception & ex ) { std::cout << dir_itr->leaf() << " " << ex.what() << std::endl; } } } int main( int argc, char* argv[] ) { fs::path full_path( fs::initial_path() ); if ( argc > 1 ) full_path = fs::system_complete( fs::path( argv[1], fs::native ) ); else std::cout << "\nusage: simple_ls [path]" << std::endl; if ( !fs::exists( full_path ) ) { std::cout << "\nNot found: " << full_path.native_file_string() << std::endl; return 1; } if ( fs::is_directory( full_path ) ) { print_dir(full_path); } else // must be a file { std::cout << "\nFound: " << full_path.native_file_string() << "\n"; } return 0; }

At 02:31 PM 6/8/2005, Chris Frey wrote:
On Thu, Jun 02, 2005 at 08:16:59AM -0400, Beman Dawes wrote:
* That is a lot of additional interface complexity to support an optimization that applies to Windows but not POSIX. Some of the other schemes (which involved additional overloads to specific operations functions) had less visible impact on the interface.
Windows is not the only system with d_type. I was writing with Linux in mind.
Sorry. I usually just look at the POSIX docs when I want to know what Unix-style operating system functions are available, and forget that there are also operating system specific extensions available.
* There have been no timings to indicate the inefficiency of the current interface actually impacts production applications.
I'm not sure this is the right view to take when it comes to something as low level and highly used as a filesystem.
I should have put that in a broader context. The first priority is getting the interface functionality, safety, and ease-of-use right. I've been tied up with those issues as related to the i18n branch for a long time now, but as of yesterday finished that work to my satisfaction. It will take a mini-review to see if that work satisfies others, of course. Once functionality, safety, and ease-of-use concerns are addressed, then efficiency becomes more important. But because efficiency is such a notoriously slippery issue, the first step has got to be timings in realistic use scenarios.
... I decided to run a test, comparing directory_iterator with the find system utility...
Your tests clearly raise directory_iterator performance concerns. Terence Wilson also reported directory_iterator performance concerns, and also provided a timing program. The concern I had with both of these timing tests was that while not totally apples-and-oranges comparisons, there were still a lot of uncontrolled variables. I put together a timing test program (see below) that depends entirely on Boost.Filesystem operations. Since the only difference between the two modes of operation is the use of boost::filesystem::status(), any timing differences are caused by that alone. The timing differences between the two modes are dramatic. With Windows XP SP 2, 1 gigabyte main memory, compiled with VC++ 7.1 in release mode, in an NTFS directory with 15,046 files, run from a freshly booted machine, average of three runs: 6.06 seconds with status() 1.04 seconds without status() Additional runs (showing no disk activity whatsoever because of disk caching): 1.03 seconds with status() .31 seconds without status() The timing differences are explained by watching file activity (using the Diskmon utility from http://www.sysinternals.com/). Without status() enabled, there is 1 DIRECTORY action every 34 or so files, and 1 READ action every 17 or so files. With status() enabled, there is 1 DIRECTORY action every 34 or so files, and 1 INFORMATION QUERY action _every_ file. Each of those is actually causing disk activity , too, based on the state of the disk light (but only if the cache is cold). In other words, use of status() is causing roughly 17 times as much disk activity. So it looks like a status() overload which takes an iterator directly (with the one byte status value cached in the iterator) would be a big performance plus. Thanks for going to the trouble of doing timings. They were very motivating! --Beman

"Beman Dawes" <bdawes@acm.org> wrote in message news:6.0.3.0.2.20050609093713.03bf6990@mailhost.esva.net...
I put together a timing test program (see below) that depends entirely on Boost.Filesystem operations. Since the only difference between the two modes of operation is the use of boost::filesystem::status(), any timing differences are caused by that alone.
I've now done a trial implementation that keeps a copy of the status byte in each directory iterator on operating systems which support it. That means at least Windows, Linux, Mac OS X, and any OS derived from BSD.
The timing differences between the two modes are dramatic. With Windows XP SP 2, 1 gigabyte main memory, compiled with VC++ 7.1 in release mode, in an NTFS directory with 15,046 files, run from a freshly booted machine, average of three runs:
6.06 seconds with status() 1.04 seconds without status()
.91 seconds with status() from iterator I doubt "status() from iterator is actually faster than "without status()". The timing abnormality was probably because the tests were separated by two days, and an automatic disk defrag ran in the meantime.
Additional runs (showing no disk activity whatsoever because of disk caching):
1.03 seconds with status() .31 seconds without status()
.31 seconds with status() from iterator. Given that there is a six times performance difference on real disk pages, a three times difference even on cached pages, and these differences have a practical impact on real applications, Boost.Filesystem needs to address this. My trial implementation depended on an additional overload for the predicate functions. Using it led to three conclusions: (1) enable_if is wonderful (but we already knew that:-), (2) with overloaded predicate functions, the "too similar" interface problem that Chris Frey mentioned is more serious than I thought, and (3) overloaded predicate functions do work but are a kludge. It is important to understand that caching a copy of the status byte in directory iterators, and then later referencing that copy rather than going back to the disk, has to be under user control because it alters the behavior of programs. (I suspect that's why the operating systems themselves don't do hidden status caching.) Because in some applications an iterator can quickly become very stale, the user needs explicit control to go to the disk if concerned about race conditions, or to used the iterator status if efficiency is of greater concern. The "too similar" problem with overloading predicate functions on directory_iterator is that two very similar calls have different behavior: if ( is_directory( *itr ) ) // uses current status if ( is_directory( itr ) ) // uses cached status, which may be very stale That looks like a race-condition trap for unwary users. The kludge aspect of directory iterator overloads for certain functions is that these depend on secret sharing (via friendship) of information between the directory iterator implementation and the predicate function implementation. Not a good sign. If the interface were explicit, the user controls when and where to use the cached copy, and does so via syntax that is different enough to eliminate at least some cases of inadvertent use: if( itr->is_directory() ) // uses cached status That's a long way of saying that Chris Frey's original suggestion to have directory_iterator's value type be a directory_entry class is looking like a better design. I hate adding more visible interface, but that looks like the right thing to do. I'll do a trial implementation and verify it doesn't impact existing code, etc. --Beman

Beman Dawes wrote:
The "too similar" problem with overloading predicate functions on directory_iterator is that two very similar calls have different behavior: if ( is_directory( *itr ) ) // uses current status
if ( is_directory( itr ) ) // uses cached status, which may be very stale
That looks like a race-condition trap for unwary users.
I could argue that the second form is less "trappy" for unwary users, actually, because all is_directory( itr ) calls will always yield the same value (and exists(itr) will never yield false for an enumerated item). This comes into play in situations like: void f( itr ) { if( is_directory(itr) ) { g( itr ); } else { h( itr ); } } void g( itr ) { if( is_directory(itr) ) { // ... } else { // ... } }

"Peter Dimov" <pdimov@mmltd.net> wrote in message news:005901c57053$c77160e0$6401a8c0@pdimov2...
Beman Dawes wrote:
The "too similar" problem with overloading predicate functions on directory_iterator is that two very similar calls have different behavior: if ( is_directory( *itr ) ) // uses current status
if ( is_directory( itr ) ) // uses cached status, which may be very stale
That looks like a race-condition trap for unwary users.
I could argue that the second form is less "trappy" for unwary users, actually, because all is_directory( itr ) calls will always yield the same value (and exists(itr) will never yield false for an enumerated item).
Yes. I'm probably worrying too much. --Beman
participants (3)
-
Beman Dawes
-
Chris Frey
-
Peter Dimov