
On Thu, Jun 02, 2005 at 08:16:59AM -0400, Beman Dawes wrote:
* That is a lot of additional interface complexity to support an optimization that applies to Windows but not POSIX. Some of the other schemes (which involved additional overloads to specific operations functions) had less visible impact on the interface.
Windows is not the only system with d_type. I was writing with Linux in mind.
* There have been no timings to indicate the inefficiency of the current interface actually impacts production applications.
I'm not sure this is the right view to take when it comes to something as low level and highly used as a filesystem. Taking a bird's eye view, an "strace ls" compared with an "strace simple_ls" results in a stat kernel call for every file and directory found. This can add up. The application I'm writing is a replacement for a mail script on a heavily loaded system. This script runs regularly and the disk is the biggest bottleneck (not just for my application, but in general too). Since the system is loaded, I'm looking to cut the number of system calls as much as possible, especially since I'm going to the trouble of rewriting this in C++. :-) Taking this view, I decided to run a test, comparing directory_iterator with the find system utility, since the style of work is similar to one part of my application, and the result set will be large enough to see any differences. Code for my modified simple_ls.cpp is below. If I've made a horrendous performance blunder, please let me know, because the difference seems quite high. My test system is linux 2.4.x, on a PII, IDE disk, with 128 megs of RAM. This is enough to hold my home directory info in memory (the disk doesn't thrash after my initial run). My home directory has a number of files: bash-2.05b$ find | wc -l 33370 First the simple_ls, run 3 times back to back after an initial run: bash-2.05b$ time ./simple_ls > /dev/null real 0m7.763s user 0m7.000s sys 0m0.700s bash-2.05b$ time ./simple_ls > /dev/null real 0m7.724s user 0m6.970s sys 0m0.710s bash-2.05b$ time ./simple_ls > /dev/null real 0m7.702s user 0m7.110s sys 0m0.580s strace summary: bash-2.05b$ strace -c ./simple_ls > /dev/null execve("./simple_ls", ["./simple_ls"], [/* 44 vars */]) = 0 % time seconds usecs/call calls errors syscall ------ ----------- ----------- --------- --------- ---------------- 78.40 0.760433 23 33498 15 stat64 8.11 0.078632 10 7540 getdents64 7.91 0.076689 20 3757 open 1.63 0.015779 4 3757 close 1.58 0.015302 4 3757 fstat64 1.18 0.011491 5 2420 write 1.15 0.011201 3 3749 fcntl64 0.02 0.000174 11 16 mmap2 0.01 0.000084 14 6 read 0.01 0.000064 32 2 munmap 0.00 0.000024 6 4 brk 0.00 0.000015 15 1 getcwd 0.00 0.000007 7 1 uname 0.00 0.000004 4 1 1 ioctl ------ ----------- ----------- --------- --------- ---------------- 100.00 0.969899 58509 16 total Next I ran find: bash-2.05b$ time find > /dev/null real 0m0.310s user 0m0.140s sys 0m0.170s bash-2.05b$ time find > /dev/null real 0m0.339s user 0m0.180s sys 0m0.140s bash-2.05b$ time find > /dev/null real 0m0.313s user 0m0.160s sys 0m0.150s strace summary: bash-2.05b$ strace -c find > /dev/null execve("/usr/bin/find", ["find"], [/* 44 vars */]) = 0 % time seconds usecs/call calls errors syscall ------ ----------- ----------- --------- --------- ---------------- 37.51 0.111190 7 15750 lstat64 21.50 0.063734 8 7535 getdents64 15.62 0.046298 6 7493 chdir 10.97 0.032505 9 3752 open 4.90 0.014519 4 3751 close 4.06 0.012038 3 3751 fstat64 3.14 0.009313 2 3747 fcntl64 2.23 0.006596 3 2036 write 0.02 0.000072 12 6 mmap2 0.02 0.000050 25 2 munmap 0.01 0.000039 20 2 read 0.01 0.000021 4 5 brk 0.00 0.000009 5 2 fchdir 0.00 0.000007 7 1 uname 0.00 0.000003 3 1 time 0.00 0.000003 3 1 1 ioctl ------ ----------- ----------- --------- --------- ---------------- 100.00 0.296397 47835 1 total
* AFAICS, there is nothing in the current interface which prevents caching of directory entries. Caching would probably aid more use cases than
Caching would likely help most long-running applications, but mine has a very short lifetime and needs to run this new each time.
* There is another outstanding issue (lack of directory iterator filtering and/or globbing) that does in fact impact both timing and ease of use of real-world applications and so is the highest priority for future work.
This would indeed help my application if I could filter on directory / file / symlink type, etc. Anyway, an optimization based on overloaded is_* functions taking a straight iterator would work. I didn't think of that earlier. I fear that the performance implications of the different calls will not be obvious, but I guess that's a job for documentation. :-) Thanks, - Chris The simple_ls.cpp modified into a simple_find.cpp: // simple_ls program -------------------------------------------------------// // � Copyright Jeff Garland and Beman Dawes, 2002 // Use, modification, and distribution is subject to the Boost Software // License, Version 1.0. (See accompanying file LICENSE_1_0.txt or copy at // http://www.boost.org/LICENSE_1_0.txt) // See http://www.boost.org/libs/filesystem for documentation. #include "boost/filesystem/operations.hpp" #include "boost/filesystem/path.hpp" #include <iostream> namespace fs = boost::filesystem; void print_dir(const fs::path &full_path) { std::cout << full_path.native_directory_string() << "\n"; fs::directory_iterator end_iter; for ( fs::directory_iterator dir_itr( full_path ); dir_itr != end_iter; ++dir_itr ) { try { if ( fs::is_directory( *dir_itr ) ) { print_dir(*dir_itr); } else { std::cout << dir_itr->native_file_string() << "\n"; } } catch ( const std::exception & ex ) { std::cout << dir_itr->leaf() << " " << ex.what() << std::endl; } } } int main( int argc, char* argv[] ) { fs::path full_path( fs::initial_path() ); if ( argc > 1 ) full_path = fs::system_complete( fs::path( argv[1], fs::native ) ); else std::cout << "\nusage: simple_ls [path]" << std::endl; if ( !fs::exists( full_path ) ) { std::cout << "\nNot found: " << full_path.native_file_string() << std::endl; return 1; } if ( fs::is_directory( full_path ) ) { print_dir(full_path); } else // must be a file { std::cout << "\nFound: " << full_path.native_file_string() << "\n"; } return 0; }