Re: [boost] boost::filesystem - directory_iterator makes optimization hard

13 Jun 2005

      "Beman Dawes" <bdawes@acm.org> wrote in message
news:6.0.3.0.2.20050609093713.03bf6990@mailhost.esva.net...
...
I put together a timing test program (see below) that depends entirely on
Boost.Filesystem operations. Since the only difference between the two
modes of operation is the use of boost::filesystem::status(), any timing
differences are caused by that alone.
I've now done a trial implementation that keeps a copy of the status byte in
each directory iterator on operating systems which support it. That means at
least Windows, Linux, Mac OS X, and any OS derived from BSD.
...
The timing differences between the two modes are dramatic. With Windows XP
SP 2, 1 gigabyte main memory, compiled with VC++ 7.1 in release mode, in
an NTFS directory with 15,046 files, run from a freshly booted machine,
average of three runs:
     6.06 seconds with status()
     1.04 seconds without status()
     0.91 seconds with status() from iterator

I doubt "status() from iterator" is actually faster than "without status()".
The timing abnormality was probably because the tests were separated by two
days, and an automatic disk defrag ran in the meantime.
...
Additional runs (showing no disk activity whatsoever because of disk
caching):
     1.03 seconds with status()
     0.31 seconds without status()
     0.31 seconds with status() from iterator

Given that there is a six times performance difference on real disk pages, a
three times difference even on cached pages, and these differences have a
practical impact on real applications, Boost.Filesystem needs to address
this.

My trial implementation depended on an additional overload for the predicate
functions. Using it led to three conclusions: (1) enable_if is wonderful
(but we already knew that:-), (2) with overloaded predicate functions, the
"too similar" interface problem that Chris Frey mentioned is more serious
than I thought, and (3) overloaded predicate functions do work but are a
kludge.

It is important to understand that caching a copy of the status byte in
directory iterators, and then later referencing that copy rather than going
back to the disk, has to be under user control because it alters the
behavior of programs. (I suspect that's why the operating systems themselves
don't do hidden status caching.) Because in some applications an iterator
can quickly become very stale, the user needs explicit control to go to the
disk if concerned about race conditions, or to use the iterator status if
efficiency is of greater concern.

The "too similar" problem with overloading predicate functions on
directory_iterator is that two very similar calls have different behavior:

     if ( is_directory( *itr ) ) // uses current status

     if ( is_directory( itr ) ) // uses cached status, which may be very stale

That looks like a race-condition trap for unwary users.
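
To make the trap concrete, here is a minimal self-contained sketch. It is not Boost.Filesystem code: fake_disk, cached_itr, and make_itr are hypothetical stand-ins invented for illustration, and the "disk" is just an in-memory map so the staleness is reproducible.

```cpp
#include <map>
#include <string>

// Hypothetical stand-in for the disk: path -> "is a directory" flag.
using fake_disk = std::map<std::string, bool>;

// Hypothetical iterator-like object that snapshots the status when created.
struct cached_itr {
    std::string path;
    bool cached_is_dir;
    const std::string& operator*() const { return path; }
};

// Build the iterator, copying the status as it is right now.
cached_itr make_itr(const fake_disk& disk, const std::string& p) {
    return cached_itr{p, disk.at(p)};
}

// Overload 1: takes a path and consults the "disk" on every call.
bool is_directory(const fake_disk& disk, const std::string& p) {
    auto it = disk.find(p);
    return it != disk.end() && it->second;
}

// Overload 2: takes the iterator and answers from the snapshot.
bool is_directory(const fake_disk&, const cached_itr& itr) {
    return itr.cached_is_dir;
}
```

If the entry changes after make_itr() runs, is_directory( disk, *itr ) and is_directory( disk, itr ) return different answers, and the only visible difference at the call site is one asterisk.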

The kludge aspect of directory iterator overloads for certain functions is
that these depend on secret sharing (via friendship) of information between
the directory iterator implementation and the predicate function
implementation. Not a good sign. If the interface were explicit, the user
would control when and where to use the cached copy, and would do so via
syntax that is different enough to eliminate at least some cases of
inadvertent use:

     if( itr->is_directory() ) // uses cached status

That's a long way of saying that Chris Frey's original suggestion to have
directory_iterator's value type be a directory_entry class is looking like a
better design. I hate adding more visible interface, but that looks like the
right thing to do.  I'll do a trial implementation and verify it doesn't
impact existing code, etc.
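
A rough sketch of what such a value type might look like, under stated assumptions: this is not the trial implementation, the member names are guesses, and fake_disk and scan() are hypothetical helpers that replace real directory access with an in-memory map.

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the disk: path -> "is a directory" flag.
using fake_disk = std::map<std::string, bool>;

// Free predicate: always asks the "disk" for the current status.
bool is_directory(const fake_disk& disk, const std::string& p) {
    auto it = disk.find(p);
    return it != disk.end() && it->second;
}

// Hypothetical value type for the directory iterator.
class directory_entry {
public:
    directory_entry(std::string p, bool is_dir)
        : path_(std::move(p)), is_dir_(is_dir) {}

    const std::string& path() const { return path_; }

    // Answers from the status captured during the scan; may be stale.
    bool is_directory() const { return is_dir_; }

private:
    std::string path_;
    bool is_dir_;
};

// Snapshot every entry once, as a directory scan would.
std::vector<directory_entry> scan(const fake_disk& disk) {
    std::vector<directory_entry> out;
    for (const auto& kv : disk) out.emplace_back(kv.first, kv.second);
    return out;
}
```

With this shape, entry.is_directory() is visibly the cached answer, while is_directory( disk, entry.path() ) goes back to the source, so the choice between speed and freshness is spelled out at each call site and no friendship between the iterator and the predicates is needed.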

--Beman