[filesystem]Sorting based on equivalence information?
Hi all, I have written a class around the glob_iterator class that W. Richard Johnson put into the file vault last year. My class finds all the paths and/or files matching a multi-level wildcard pattern. (i.e. the matching expressions may appear in more than one place, such as "/Users/*/MySource/??foo/*.cpp") I need to detect loops caused by links between directories, and I also want to detect multiple paths to the same file, and only keep one of them. The only method that I have come up with so far is to keep the paths that I have found in a collection and test each new path against all of them by calling equivalent() once for each collection member. Needless to say, this is a little slow when the number of files or directories becomes large. It could be faster if I only tested symlinks, but I think that the existence of hard links forces me to test everything. My first question: Can anyone suggest a better method of achieving the duplicates detection? My second question (This is probably directed at Beman Dawes): I read the code for equivalent(), and it seems like I could derive a class from path that would implement operator< in such a way that it sorted by the equivalence information, rather than by the path string. If I could sort the paths by their equivalence information, then I could binary search the collection when I look for duplicates. I would have to copy the code in equivalent(), which is an ugly thing to do, and I guess I would need to define all of the operators that are defined in path, since they are not virtual. Assuming that I did these things, though, does this sound like it's even valid? From the comments that I read in the equivalent() source, it seems that the file equivalence information can change, at least on a windows system. I guess I could create a handle_wrapper for each path that I add to the collection in order to keep the file from being moved while it's in the collection. Of course, I can't be the first person who has gone down this path, or the library would probably already support this sort of sorting. Is this just a stupid idea? Thanks, Rush
Rush Manbert wrote:
I have written a class around the glob_iterator class that W. Richard Johnson put into the file vault last year. My class finds all the paths and/or files matching a multi-level wildcard pattern. (i.e. the matching expressions may appear in more than one place, such as "/Users/*/MySource/??foo/*.cpp")
I need to detect loops caused by links between directories, and I also want to detect multiple paths to the same file, and only keep one of them. The only method that I have come up with so far is to keep the paths that I have found in a collection and test each new path against all of them by calling equivalent() once for each collection member. Needless to say, this is a little slow when the number of files or directories becomes large. It could be faster if I only tested symlinks, but I think that the existence of hard links forces me to test everything.
My first question: Can anyone suggest a better method of achieving the duplicates detection?
Maybe you could use the graph library to build a graph of your directory structure and then work with that to find duplicate paths, loops, etc. I know next to nothing about the library, but this sounds like a graph-theoretic problem to me. Paul
troy d. straszheim wrote:
My first question: Can anyone suggest a better method of achieving the duplicates detection?
Have you considered checking files' inode numbers?
Needs to work on Windows too, which I think rules out inodes. But thanks for the suggestion. - Rush
Paul Giaccone wrote:
Rush Manbert wrote:
My first question: Can anyone suggest a better method of achieving the duplicates detection?
Maybe you could use the graph library to build a graph of your directory structure and then work with that to find duplicate paths, loops, etc. I know next to nothing about the library, but this sounds like a graph-theoretic problem to me.
Thanks for the suggestion. The big issue with this approach is actually the same problem I have with my "second question" in the original post. The equivalence information is what really identifies a directory or file, not the path represented as a string. Once I start inheriting from or modifying the path class to get at that information (which I will need in order to uniquely identify the graph nodes), I might as well put them all in a set and test whether a new path is already in the set. - Rush
participants (3)
-
Paul Giaccone
-
Rush Manbert
-
troy d. straszheim