Any interest in a boost::glob() function?

Dear all, I've taken Richard Johnson's glob_iterator that he posted to the list back in January and re-written it completely, using Boost.Spirit to transform the input 'glob' into an equivalent regex. The result is a 'boost::glob' function that I believe is fully POSIX-conformant. Example usage: fs::path const starting_directory("."); std::list<fs::path> const matches = boost::glob("../te?t/foo*bar", starting_directory); The docs can be browsed online at http://www.devel.lyx.org/~leeming/globbing/index.html The library can be found at http://www.devel.lyx.org/~leeming/globbing.zip It should unzip to give: boost/glob.hpp boost/detail/glob.hpp libs/glob + ensuing sub directories and files. The code has been tested on Linux (g++ 3.3.2) and on Windows using MinGW/MinSys. There are still a few "issues" that I don't know how to resolve. Any pointers would be most welcome. DOCS ==== * It would be nice to insert a table inside an itemized list item, but quickbook doesn't like doing that. I need to add a separate paragraph. * boostbook/doxygen fails to document the glob_flags enum. * I generate a BOOST_LIB_NAME.html page and would like not to. BUILD ===== I have a couple of targets, lib/glob/example/real_glob.cpp and lib/glob/test/test_real_glob.cpp that I'd like to build only on POSIX machines. (They're wrappers for the system glob function.) Any pointers on how to instruct bjam to do that? REGEX ===== One test is failing on Windows "foo[[:alpha:]]bar". It works when compiled under Linux. Is there anything I should be aware of about Boost.Regex under MinGW/MinSYS? Regards, Angus

Gennadiy Rozental wrote:
Hi, Gennadiy. Rich Johnson's original proposal was for class glob_iterator : public filter_iterator< glob_predicate<>, filesystem::directory_iterator> struct glob_iterator {}; Indeed, I've retained this, although I shoved it into namespace boost::detail. The problem is that this glob_iterator will iterate only over the contents of a single directory. Moreover, good implementations of the system glob require that: In order to have access to a pathname, this function requires search permission on every component of a pathname except the last, and read permission on each directory of any filename component in @c pattern that contains a wild card. which means that you *shouldn't* use glob_iterator if the component does not contain wildcards. Would you be happy if I changed the interface to: std::list<fs::path> matches; boost::glob(matches, "../te?t/foo*bar", starting_directory); If not, could you expand further? Regards, Angus

Why? Because of filesystem::directory_iterator? Use different implementation
Sorry. It's out of scope of my understanding. Maybe filesystem experts could comment on that. Or you could rephrase it.
It slightly better, but I still prefer iterator interface. What I am going to do with above list? Most probably I a mgoing to iterate through it to do somethins, or find simething. What if I only need first matching element? We have several years expirience now with container/iterator alternatives. I believe iterator interface is more natual here (and ultimately more efficient).
Gennadiy.

Gennadiy Rozental wrote:
Don't get me wrong, I'm not disagreeing. It's just that I'm on a steep learning curve here. Maybe it helps us both if I present some code. The first thing that boost::glob does is parse the "../foo*bar/*.cpp" pattern into a list of predicates, one per component in the pattern tree (3 in this case): std::list<detail::glob_predicate<> > predicates; bool contains_wildcards = false; for (fs::path::iterator it = glob_path.begin(), end = glob_path.end(); it != end; ++it) { detail::glob_predicate<> const predicate(*it, flags); contains_wildcards += predicate.contains_wildcards(); predicates.push_back(predicate); } The glob_predicate is a functor that is used by glob_iterator to filter files: template <typename TraitsT = glob_traits> struct glob_predicate { /** @returns true if @c p.leaf() matches the glob expression. */ bool operator()(filesystem::path const & p) const }; However, if the pattern I'm trying to match doesn't contain any wild cards at all, then there's no need for any iteration at all: std::list<fs::path> matches; if (contains_wildcards) { glob_impl(matches, predicates.begin(), prior(predicates.end()), working_dir, flags); } else { fs::path abs_path = working_dir / glob_path; if (exists(abs_path)) matches.push_back(abs_path); } That doesn't seem to fit nicely with an iterator interface to me, but I emphasise I'm coming from a position of zero knowledge. Similarly, glob_impl is a recursive function that calls itself for each component in the "predicates" tree. At each level, it checks whether that component contains a wild card. If not, it just adds the component to the matching path: void glob_impl(std::list<fs::path> & matches, predicate_vec::const_iterator first, predicate_vec::const_iterator last, fs::path const & working_dir, glob_flags flags) { detail::glob_predicate<> const & predicate = *first; if (first == last) { // We've reached the destination directory. // Add any matches to @c matches. if (!predicate.contains_wildcards()) { fs::path const candidate = working_dir / fs::path(predicate.file(), fs::no_check); if (exists(candidate) && (!(flags & glob_onlydir) || is_directory(candidate))) matches.push_back(candidate); } else { detail::glob_iterator git(predicate, working_dir); detail::glob_iterator const gend; for (; git != gend; ++git) { if (!(flags & glob_onlydir) || is_directory(*git)) matches.push_back(*git); } } return; } if (!predicate.contains_wildcards()) { // No testing for the existence of the directory. // That would place additional, unnecessary requirements on glob(). glob_impl(matches, next(first), last, working_dir / fs::path(predicate.file(), fs::no_check), flags); } else { detail::glob_iterator git(predicate, working_dir); detail::glob_iterator const gend; for (; git != gend; ++git) { if (is_directory(*git)) glob_impl(matches, next(first), last, *git, flags); } } } Now, presumably you're going to come back at me and tell me that this is all just implementation detail. That's true, it is. But my problem is that I don't yet see how I could attack the problem more efficiently using some other approach. Does any of this make sense? Regards, Angus

[...]
You need to implement iterator that match InputIterator concept. Also it is very similar to iteration though N-ary matrix. You need to build a collection of directory iterators on each level. Here is a sketch of how next function would look like (if you use input_iterator_adaptor). void next() { starting from end find first level iterator that is not equal to end If not found, finite la comedi. if found last level, increment it and that's all If found level in the middle increment it, rebuild all directory iterators for following levels and start from the beginning. } I think there should be iterator like this (generic, over matrix) somewhere.If not it would be interesting undertaking. Regards, Gennadiy.

Angus Leeming wrote:
Ok, I've thought a bit and think I understand your proposal. Let's use a class to store the predicates that will be used to initialise the filter_iterators at each level of the tree: class glob_pattern { vector<glob_predicate<> > predicates; }; Ie, "../foo*bar/baz?.eps" would require that glob_pattern stored a vector of 3 predicates. Let's call these "filter_iterators at each level of the tree" glob_directory_iterators. A glob_iterator will store a vector of glob_directory_iterators. (Also 3 in the example's case). Incrementing a glob_iterator will use a strategy similar to the one you outline above. class glob_pattern { vector<glob_predicate<> > predicates; public: glob_iterator begin() const; glob_iterator end() const; }; class glob_iterator { vector<glob_directory_iterator> dir_iterators; public: glob_iterator(vector<glob_predicate<> > const & p) { typedef vector<glob_predicate<> > p_vec; dir_iterators.reserve(p.size()); p_vec::const_iterator it = p.begin(); p_vec::const_iterator const end = p.end(); for (; it != end; ++it) dir_iterators.push_back( glob_directory_iterator(*it)); } glob_iterator operator++() { /* implement the "next" strategy */ } }; This strategy moves all the nonsense about "does the predicate contain wildcards or not" into glob_directory_iterator. Fine. No doubt there's lots of implementation details left, (like using input_iterator_adaptor), but I think that this pseudo code covers the idea. Am I on the right track? Angus

Looks Ok.
You may use iterator_facade also. I just found that implementing Input iterators using input_iterator_adaptor is easier. It's located in Boost.Test details derectory along with some usage examples.
Am I on the right track? Angus
Definitely. Looking forward to see it implemented. Regards, Gennadiy.

Gennadiy Rozental wrote:
Maybe, we can just decide that all reasonable compilers support NRVO, and stop worring about returning vectors/list by value? At least, when I write function which returns a vector, I start it with vector<...> result; so NRVO would surely work. The primary reason for 'result' is that I like all functions to have a single return, but as a side effect it make NRVO always working. - Volodya

Hi Volodya.
Maybe, we can just decide that all reasonable compilers support NRVO, and stop worring about returning vectors/list by value?
Hmmm... I do not think VC7.1 supports NRVO (it does support unnamed RVO though). Do you consider it a 'reasonable compiler'? Or should we say that it does not matter if that compiler generated slower code? Best regards, Jurko

"Slower" is relative to the speed it takes to glob the files in the first place. No comment on whether you should use iterators or containers though. Depending on how you implement iterators and how client works, the semantics might get really ugly. I can imagine that at least the semantics of underlying readdir() calls and concurrent changes to the directory get really ugly if you expect to be able to copy iterators and/or change the directory while the iterator is scanning it. Obviously there are still race conditions even if you use containers. Lassi -- Another month ends. All targets met. All systems working. All customers satisfied. All staff eager and enthusiastic. All pigs fed and ready for take off.

Hi Jurko,
Apparently, I'd have to decide it's not 100% reasonable ;-)
Or should we say that it does not matter if that compiler generated slower code?
Probably. The interface which returns vector<> is nicer, and we should not kill a good interface because of a specific compiler. Besides, in this case, both vector<> returning function and an iterator interface are reasonable, so if you're on VC7.1 and care about globbing performance, you can use the iterators interface. - Volodya

How do I use it in windows where case-sensitivity is a file attribute (FILE_FLAG_POSIX_SEMANTICS). boost::glob("*.[tT][xX][tT]", ...) might return more files than FindFirstFile("*.txt")/FindNextFile() e.g. on a mounted case-sensitive filesystem

Martin wrote:
Sorry, I don't understand your "how do I use it" question. Do you want to match files irrespective of case? If so, I don't know of a way other than "*.[tT][xX][tT]". That's how "glob" works. All I'm doing is filtering the output of filesystem::directory_iterator. Essentially, you're asking filesystem::directory_iterator to dereference to a (lowercase) filename which can then be matched against the regex, "*.txt", right? Hmmmm. I'll add it to the list of things to think about. Angus

If I write *.txt in a windows file selection dialog box, I will see all files in the directory ending with ".txt" (in lowercase) but also all files ending in ".TXT", ".tXt" etc that doesn't have the FILE_FLAG_POSIX_SEMANTICS attribute set. I get the same result by using "*.txt" with the win32 native functions FindFirstFile/FindNextFile. It is normally not a problem but when trying to write portable code it is not clear if I should write "*.txt" or "*.[tT][xX][tT]". Both might give non- expected results. The best thing would be if the glob function treated case sensitivity in the same way as the underlying filsystem but then it would have to be part of the filesystem library. Another thing: The boost.filesystem function is_directory might throw for security reasons on files in a directory. I recommend that you create your own non-throwing overload bool is_directory(const path& aPath, bool aErrorRet) { try { return is_directory(aPath); } catch (...) { return aErrorRet; } }

Martin wrote:
Ok, understood. Thanks.
I think so too. However, the problem becomes "Design an interface to glob so that I can pass flags to the underlying filesystem::directory_iterator". Assuming that this is the solution that Beman adopts. Perhaps you should direct your problem towards him in the first instance.
Good. Thanks. Actually I think I'll just use: bool nothrow_is_directory(filesystem::path const & path) { try { return is_directory(path); } catch (filesystem::filesystem_error const &) { return false; } } Since the function description talks explicitly about the permissions needed to find a match. I guess that I'll need to wrap "exists(path)" too. Will have a look. Regards, Angus

Martin wrote:
Good morning, Martin. Having slept on it, I think that my original answer of "ask Beman" is wrong. I think that handling this requirement is easy. In boost/details/glob.hpp you'll find template <typename TraitsT = glob_traits> class glob_predicate { ... public: /** @returns true if @c p.leaf() matches the glob expression. */ bool operator()(filesystem::path const & p) const { if (contains_wildcards_) { boost::cmatch match_; return regex_match(p.leaf().c_str(), match_, regex_); } return p.leaf() == pattern_; } }; I have two suggestions: 1. Add the ability to ignore case for all users: #include <boost/algorithm/case_conv.hpp> enum glob_flags { ... glob_case_insensitive = (1 << 5) }; template <typename TraitsT = glob_traits> class glob_predicate { bool case_insensitive_; public: glob_predicate(std::string const & pattern, glob_flags flags) : regex_(reg_expression()) , case_insensitive(flags & glob_case_insensitive) { ... } bool operator()(filesystem::path const & p) const { std::string const leaf = case_insensitive_ ? to_lower_copy(p.leaf()) : p.leaf(); if (contains_wildcards_) { boost::cmatch match_; return regex_match(leaf.c_str(), match_, regex_); } return leaf == pattern_; } }; 2. Interrogate the value of FILE_FLAG_POSIX_SEMANTICS on Windows boxes. I'm no Windows programmer, but a quick google suggests that this might do the trick. Feel free to improve: bool operator()(filesystem::path const & p) const { bool case_insensitive = case_insensitive_; #ifdef BOOST_WINDOWS if (!case_insensitive) { WIN32_FILE_ATTRIBUTE_DATA data; if (GetFileAttributesEx(p.native_file_string(), 0, data)) { case_insensitive = !(data.fileAttributes & FILE_FLAG_POSIX_SEMANTICS); } } #endif ... } I have one further question for you, however. If the 'top level' pattern doesn not contain any wild cards (foo.cpp in ../*/foo.cpp), then I use fs::path const candidate = working_dir /= predicate.file(); if (exists(candidate) matches.push_back(candidate); So, will 'exists' do the right thing here? Any other comments? Regards, Angus

Hi Angus, Both your solutions will work ofcourse but I think it solves the problem in the wrong end. Why not drop the dependency on fs:directory_iterator and solve the problem at the root. My suggestion is that you create your own iterators. 1. filtered_directory_iterator. A simple filter that matches leafs in one directory only using '*' & '?' only. Filtering can also be on files and directories. filtered_directory_iterator(const path& p, const char* pattern, enum { {no_filter, no_files, no_directories, no_devices...} filter); This would solve all problems with case sensitivity since you pass the pattern direct to filesystem (findfirstfile) (glob() is needed for posix). It would also be a "cheap" operation for the most common case like finding all files in a directory: filtered_directory_iterator(p, "*", ~no_files) 2. glob_iterator. A regex based iterator which might even recurse down directories to find matches. Personally I can't really see the application for a regex file search so you know the requirement better than I do. - I have one further question for you, however. If the 'top level' - pattern doesn not contain any wild cards (foo.cpp in ../*/foo.cpp), - then I use - fs::path const candidate = working_dir /= predicate.file(); - if (exists(candidate) - matches.push_back(candidate); - - So, will 'exists' do the right thing here? Don't know. "exists" only return false if the filesystem says "no such file exist". If the filesystem says "access denied" the files is assumed to exist but I don't know what happens if you are not allowed to descend working_dir.

Martin wrote:
Hi Angus,
Hi, Martin. Thanks for your continued feedback.
An interesting idea and certainly much less work ;-) However, as I understand it, you're suggesting limiting the wildcards simply to ensure that the filtered_directory_iterator behaves the same on posix and windows systems? Of course that is possible, but my suggestion gives windows users the full power of glob(). Don't you ever search for things like "[a-d]*.{cxx,hpp}"? Sorry, but I don't see why such a proposal is an improvement. Also, how do you limit the wildcards? I take it you don't, but that the underlying matcher (findfirstfile, glob) will behave differently on receipt of the same pattern.
I haven't created a regex-filtering iterator. I have created a glob-filtering iterator that uses boost::regex as an implementation detail. Having said that, all the effort in the proposal went into transforming a true globbing pattern into an equivalent regex. It would be trivial to create a true regex-filtering iterator, although I don't see the application either. The recursion down directories bit is a separate layer that is built on top of whatever predicate the filter_iterator uses to match the leaves in one directory. Regards, Angus

No. The main reason was to have a simple iterator for simple (and what I think is the most common) cases which also avoid the need to go via a list.
Of course that is possible, but my suggestion gives windows users the full power of glob().
So did I but I put it into a separate iterator where you can define the rules completely independent of the filesystem.
Don't you ever search for things like "[a-d]*.{cxx,hpp}"?
I do it in the shell but I have never had the need to do it inside an application. I'm sure there are such applications.
Sorry, but I don't see why such a proposal is an improvement.
The filesystems already behave differently since one is case-sensitive and the other is not. Anyway, I think it is reasonable to limit the wildcards to some portable syntax e.g. max 2 '*' are allowed and they must either be the last character or followed by a '.'.

Martin wrote:
That's two separate requirements. 1. A simple iterator. By 'simple', you mean one using the underlying API. Right? 2. Avoid the list in list<path> glob(string const & pattern, path const & working_dir); Actually, this second requirement is contradictory to the first because glob()'s results must be stored internally for the iterator to then iterate over. No?
So did I but I put it into a separate iterator where you can define the rules completely independent of the filesystem.
This is, in essence, what I am proposing. I have now reworked the interface following Gennadiy's suggestion. Here's a glob_iterator that can recurse down directories: class BOOST_GLOB_DECL glob_iterator : public iterator_facade< glob_iterator // Derived type , filesystem::path const // value_type , single_pass_traversal_tag > { public: glob_iterator() {} glob_iterator(std::string const & pattern, filesystem::path const & wd, glob_flags flags); private: ... }; It works, but is considerably slower than the function returning a list. No doubt profiling will help track down what I'm doing inefficiiently. # A simple wrapper for the real glob() $ time ./real_glob_rls '*/*/*.hpp' '/home/angus/boost/cvs/' | wc -l 934 real 0m0.042s user 0m0.010s sys 0m0.010s # The glob() function I posted earlier in the week. $ time ./glob_fun_rls '*/*/*.hpp' '/home/angus/boost/cvs/' | wc -l 934 real 0m0.099s user 0m0.070s sys 0m0.010s # The new glob_iterator. $ time ./glib_it_rls '*/*/*.hpp' '/home/angus/boost/cvs/' | wc -l 934 real 0m0.236s user 0m0.200s sys 0m0.010s I'm never sure whether to pay attention to the 'real' or to the 'user' times... Anyway, there's a clear heirarchy ATM.
Here's one. Qt (QProcess), gtk (gspawn*) and ACE (ACE_Process) all enable the user to spawn a child process in a portable way. However, what they all lack is a *powerful* way to initialise their data from a string containing a "command-line like" syntax. (And, no, passing an arbitrary "ls `rm -f *` foo.cpp" to the system() command isn't a viable alternative.) I've been playing around writing something that can parse a subset of the Bourne shell. Enough to make it easy and safe to launch a single process from a string. "parse_pseudo_command_line" fills a "spawn_data" variable. It's then simple to ascertain whether the request is safe or not. Now *this* is a function that would benefit from a portable glob. http://www.devel.lyx.org/~leeming/libs/child/doc/html/parse_pseudo_command_l... Equivalent URL: http://tinyurl.com/4c4v9
Again, how do you *limit* them? That implies that you must prescan the pattern, presumably throwing once you've determined that the thing is breaking your "reasonable" limits. Regards, Angus

Hi Angus,
The standard fs::directory_iterator iterates over everything (files, directories, links, devices) in a directory. My assumption is that most applications are only interested in a subset of the entries and that the filtering in most cases is "simple". e.g. only files ending with ".txt" or only directories. This has nothing to do with the underlying API. Currently I have to do somthing like this for portable code. for (fs::directory_iterator itr(p); p!= fs::direcory_iterator(); ++p) if (fs::is_directory(p, true) || // my own non-throwing overload #ifdef BOOST_POSIX fs::extension(p) == ".txt" #else algorithm::iequals(fs::extension(p), ".txt") #endif ) constinue; ... I have no problem with the above code but I was hoping that the glob_iterator would simplify it. It just seem like an overkill to pull in a glob/regex parser to just match "*.txt" on files in one directory when that function is already available in both windows "FindFirstFile" and posix "fnmatch".
For me it is enough if the pattern is just passed to FindFirstFile/fnmatch without parsing. An option could be to do it like the fs library and have different pattern checkers (e.g. native, posix, windows, portable). Since your library is intended for much more complicated things than what I have ever had the need for I think I end the discussion here. Good Luck with your library submission.

Martin wrote:
Sorry, by "simple" I meant one that used fnmatch or findfirstfile to perform the filtering rather than a boost::regex.
Martin, I think you need class filtered_directory_iterator : public filter_iterator<filter_predicate, filesystem::directory_iterator> { typedef filter_iterator<filter_predicate, filesystem::directory_iterator> base; typedef filesystem::directory_iterator raw_iterator; public: filtered_directory_iterator() {} filtered_directory_iterator(std::string const & pattern, int flags, filesystem::path const & wd) : base(filter_predicate(pattern, flags), raw_iterator(wd)) {} }; where class filter_predicate { std::string pattern; int flags; public: filter_predicate(std::string const & p, int f) : pattern(p), flags(f) {} bool operator()(filesystem::path const & p) const { // Insert fnmatch, FindFirstFile magic here. ... } }; Doesn't that cover it?
You mean 'native' as above and 'portable' as in my mess? 'posix' and 'windows' get subsumed by 'native', no?
I think that I have a way to go yet ;-) Angus

From: Angus Leeming <angus.leeming@btopenworld.com>
I think you misunderstood what Gennadiy meant. (If not, I think this is a good idea, anyway.) Your function should accept an output iterator through which you save the results. That enables the caller to decide where the results go instead of you deciding that they go into a list which then must be iterated. Thus, I'm proposing something like this: template <typename OutIt, typename Char> OutIt glob(OutIt & out, std::basic_string<Char> const & pattern, filesystem::path const & pathname, glob_flags flags); Appropriate output iterators can forward the matches to a collection, to a GUI control, to a file, etc.
It works, but is considerably slower than the function returning a list. No doubt profiling will help track down what I'm doing inefficiiently.
No doubt. -- Rob Stewart stewart@sig.com Software Engineer http://www.sig.com Susquehanna International Group, LLP using std::disclaimer;

Rob Stewart wrote:
No need to parameterize on Char type. This thing uses filesystem to do the iteration over the contents of a directory and filesystem knows nothing of wide strings. We can/should revisit that later as and when filesystem is changed. As for the output iterator idea, I don't see how to use it. For example, with my suggestion I could populate an STL container with: namespace fs = filesystem; std::list<fs::path> const matches( glob_iterator("*.cpp", fs::path("."), glob_brace), glob_iterator()); How do I do that with your idea? Once you've explained it, can you tell me why it's better than glob_iterator, above? Angus (ever ignorant)

From: Angus Leeming <angus.leeming@btopenworld.com>
Gennadiy already weighed in on my interpretation of his statement. As to what I've suggested, you already had code that built a std::list -- your original version, IIRC. Where you appended to the list, you would assign to the dereferenced output iterator. That deref/assignment can be coded to do many things by the author of the output iterator, including appending to a std::list. -- Rob Stewart stewart@sig.com Software Engineer http://www.sig.com Susquehanna International Group, LLP using std::disclaimer;

From: "Gennadiy Rozental" <gennadiy.rozental@thomson.com>
Sorry, I misunderstood you, then.
would I iterate using above interface?
You wouldn't, of course. You'd use it to write to something else, including std::cout.
What he described sounded like he populated a container with the results and then iterated that. Using the resulting iterator to populate another container would, thus, reduce efficiency. -- Rob Stewart stewart@sig.com Software Engineer http://www.sig.com Susquehanna International Group, LLP using std::disclaimer;

Earlier, Gennadiy Rozental wrote: AL>> No doubt there's lots of implementation details left, (like AL>> using input_iterator_adaptor), but I think that this AL>> using idea. GR> You may use iterator_facade also. I just found that GR> implementing Input iterators using input_iterator_adaptor GR> is easier. It's located in Boost.Test details derectory GR> along with some usage examples. AL>> Am I on the right track? GR> Definitely. Looking forward to see it implemented. Later, Rob Stewart wrote: AL>>>> It works, but is considerably slower than the function AL>>>> returning a list. No doubt profiling will help track AL>>>> down what I'm doing inefficiently. RS>>> No doubt. GR>> I do not see anything in iterator implementation that GR>> suggest inherent inefficiency. RS> What he described sounded like he populated a container with the RS> results and then iterated that. Using the resulting iterator to RS> populate another container would, thus, reduce efficiency. Ahhhh. Now I understand you. Actually, the glob_iterator I've implemented and have just posted here: http://www.devel.lyx.org/~leeming/glob.zip (Docs can be viewed on line at: http://www.devel.lyx.org/~leeming/libs/glob/doc/html/) doesn't store anything but remains two to three times slower than the glob function that stores its results in a std::list. $ cd libs/glob/example $ time ./glob_function '*/*/*.hpp' '/home/angus/boost/cvs/' > /dev/null count_operator 1997 count_regex_match 1997 real 0m0.095s user 0m0.080s sys 0m0.020s $ time ./glob_iterator '*/*/*.hpp' '/home/angus/boost/cvs/' > /dev/null count_operator 1997 count_regex_match 1997 real 0m0.241s user 0m0.220s sys 0m0.010s I'm pretty sure I'm not doing anything daft. The 'count' data shows how many times I call the glob_predicate::operator() that decides whether a given file matches the globbing pattern. As you can see, this function is called the same number of times in each case. The algorithm to iterate over the files '*/*/*.hpp' is basically the same in the two cases, but the iterator version obviously has more 'if' statements in the inner loop. I guess that this is the killer and I don't think that there's anything we can do about it. Soooooooooo... do we flag that one up as a valient effort and move on or can the iterator be as efficient as the function and I just haven't found the right algorithm? Angu

On Tuesday, October 5, 2004, at 04:18 AM, Angus Leeming wrote:
Hi folks-- I'm sorry I've been offline for a while and missed this thread. Angus, GREAT job completing the grammar and packaging everything up! Your modifications look like a _major_ improvement over my QnD implementation--especially incorporating spirit. I can't wait to try them out. My only quibble is with the Windows #ifdef's and specialization flags--I much prefer subclassing. Before everyone jumps on that, I know it's primarily a matter of style--and that performance often rules the day. I especially abhor #ifdef's in the body of the code--when unavoidable they should be restricted to a configuration section--such as initializing a class variable. There's something funny with your performance stats: the glob_function results show an elapsed time that's less than the sum of system and user time. How can that be? --rich

Rich Johnson wrote:
Hi, Rich. Apologies for the delay in replying. I've been out of town. Stats on dual processor machines are too weird to query too hard. I wouldn't worry. I have been away for a couple of weeks but was busy playing around with 'glob' before I left. It transpires that a boost::glob function that stores its data in a std::vector has similar performance characteristics to the system glob function. The interface is template <typename ContainerT> void glob(ContainerT & container, std::string const & pattern, filesystem::path const & working_dir, glob_flags flags); The only requirements on ContainerT is that it should have a push_back member function. Using a std::list rather than a std::vector degrades performance through all those calls to operator new. Using Gennadiy's glob_iterator suggestion as the interface has performance an order of magnitude worse than that of the glob function with a std::vector<filesystem> as the container. Regards, Angus

On another topic.... I seem to recall a boost::filesystem 'is_equivalent_name()' or 'name_equals()' convenience function mentioned sometime over the last week or so. There's currently a fair bit of debate over character codings, etc.; this function would handle that. AFAIK, these issues have (to date) been outside the scope of the globber. As they get settled out, this service should be utilized in the glob predicate. My main concern would relate to performance--especially regarding the frequency of constructing parsers. --rich

Rich Johnson wrote:
Thanks for the heads up. I skipped many of the 1200+ postings to the Boost Devel list since I read it last... I think that I understand what you're talking about. The fact that Win32 regards 'FileName' and 'filename' as equivalent file names unless the file has the FILE_FLAG_POSIX_SEMANTICS attribute set, right?
You mean the transformation of the globbing pattern '*/*.cpp' into the equivalent regex? That's done only once, right at the very start of the process. My tests indicate that my glob function has similar performance to the system glob on a 'nix box, so long as an appropriate container is used to store the results. I'll try and update asap the posted code to reflect my experiments. Angus

On Tuesday, November 23, 2004, at 10:27 AM, Angus Leeming wrote:
My bias would be that if this sort of stuff is maintained below the FS abstraction, then it's up to the FS to report whether there's a match or not. Reverse-engineering the FS's logic should be avoided as much as possible. [ Perhaps this would be better phrased as a requirement of the boost::filesystem library. ]
Yup. This is getting further into hypothetical design than is probably appropriate, but....assuming the FS provides an is_name_equivalent() convenience function (which I think it should): A monolithic 'is_name_equivalent()' function that sets up and tears down a regex each time it's invoked would have performance issues when used in the context of the globber. A reusable 'name_comparator' object(functor) supplied by the FS layer could be instantiated once as part of the glob_predicate. Of course this raises the issue of what happens when you glob across a FS boundary(mount points and links)--must the name_comparator change accordingly? --rich

At 10:27 AM 11/23/2004, Angus Leeming wrote:
Not entirely. 'FileName' and 'filename' could be two distinct and separate files, and could have a variety of names. Use the new equivalent() function to determine path equivalence. See http://www.boost.org/libs/filesystem/doc/operations.htm#equivalent We've already had one bug report involving subst drives; equivalence on Win32 systems is really tricky to determine. --Beman
participants (9)
-
Angus Leeming
-
Beman Dawes
-
Gennadiy Rozental
-
Jurko Gospodnetic
-
Lassi A.Tuura
-
Martin
-
Rich Johnson
-
Rob Stewart
-
Vladimir Prus