[filesystem] adding path equality and path equivalence

Equality -------- The path relational operators will be defined purely in terms of string comparison between the textual representation of the two paths. There are two ways to do this: (1) string comparison using std::string relational operators or (2) string comparison using std::lexicographical_compare. Given two paths, "foo.bar" and "foo/bar", (1) will order them: foo.bar foo/bar While (2) will order them: foo/bar foo.bar The primary use case I know of for operator<() is default ordering for paths used as keys in associative containers. I can't see that either approach is superior for this use, so unless someone comes up with a compelling argument, (1) will be used. Equivalence ----------- Two paths will be considered equivalent if they resolve to the same physical directory or file. Question 1: What is a use case that requires this function? Verifying that source and target files are different before some modifying operation is the only one I've come up with. I guess that is sufficient to justify adding the function. Question 2: What if neither exist? Only one exists? My initial thought is that these are likely to be errors, so treat them as such. It could be argued that if either or both don't exist, they can't be equivalent, so return false. Question 3: The implementation on Windows (see below) leaves a small hole in that duplicated media (such as two CD's) mounted on devices with the same device id on two different networked machines would be reported as equivalent. POSIX requires that such networked devices have different device id's, avoiding the problem. Is the fact that Windows and POSIX implementations would perform slightly differently on this corner case a showstopper? I think not. Windows logic for path equivalent: same device id AND same media volume serial number AND same physical location on disk AND same creation time. This works even in degenerate cases like camera formatted FAT flash memory cards or floppy disks with volume serial numbers incorrectly initialized to 0. POSIX logic: same device id AND same physical location on disk AND same modification time. The modification time is in theory redundant, but is an added protection in case the device id on networked devices failed to meet the POSIX specs. --Beman

Equivalence -----------
Two paths will be considered equivalent if they resolve to the same physical directory or file.
Question 1: What is a use case that requires this function? Verifying that source and target files are different before some modifying operation is the only one I've come up with. I guess that is sufficient to justify adding the function.
I have another: if you are walking a directory tree and following links, then you need to be able to detect whether two directories are the same in order to prevent infinite recursion, caused by cyclic directory graphs.
Question 2: What if neither exist? Only one exists? My initial thought is that these are likely to be errors, so treat them as such. It could be argued that if either or both don't exist, they can't be equivalent, so return false.
If only one exists, it's not an error, the two are necessarily non-equivalent. If neither exists, you have a problem ;-) Probably an error, unless someone has a compelling use case.
Question 3: The implementation on Windows (see below) leaves a small hole in that duplicated media (such as two CD's) mounted on devices with the same device id on two different networked machines would be reported as equivalent. POSIX requires that such networked devices have different device id's, avoiding the problem. Is the fact that Windows and POSIX implementations would perform slightly differently on this corner case a showstopper? I think not.
Windows logic for path equivalent: same device id AND same media volume serial number AND same physical location on disk AND same creation time. This works even in degenerate cases like camera formatted FAT flash memory cards or floppy disks with volume serial numbers incorrectly initialized to 0.
POSIX logic: same device id AND same physical location on disk AND same modification time. The modification time is in theory redundant, but is an added protection in case the device id on networked devices failed to meet the POSIX specs.
Wow, I'm impressed, that looks good to me though. John.

On Sun, 1 Feb 2004 16:46:18 -0000, John Maddock wrote
Question 2: What if neither exist? Only one exists? My initial thought is that these are likely to be errors, so treat them as such. It could be argued that if either or both don't exist, they can't be equivalent, so return false.
If only one exists, it's not an error, the two are necessarily non-equivalent. If neither exists, you have a problem ;-) Probably an error, unless someone has a compelling use case.
How about 2 symbolic links pointing at the same unmounted volume? I'm presuming that symbolic links will be followed (perhaps incorrectly) based on the first statement: "Two paths will be considered equivalent if they resolve to the same physical directory or file." One possibility is that they are equivalent since they point at the same target. The other is that it is an error...
Windows logic for path equivalent: same device id AND same media volume serial number AND same physical location on disk AND same creation time.
I've recently seen a problem where the creation time between some windows platforms was off by one second. I haven't finished tracking it down, but at first blush it looked like the creation time when measured from a local Windows XP machine was different from the time reported for that same file from a Windows 2K machine mounted across the network. And now that I think about it this would never create a problem within the same program anyway :-)
Wow, I'm impressed, that looks good to me though.
I agree, this looks pretty good. Jeff

"Jeff Garland" <jeff@crystalclearsoftware.com> writes:
[snip]
How about 2 symbolic links pointing at the same unmounted volume? I'm presuming that symbolic links will be followed (perhaps incorrectly) based on the first statement: "Two paths will be considered equivalent if they resolve to the same physical directory or file."
One possibility is that they are equivalent since they point at the same target. The other is that it is an error...
I would think that the default behavior should be to resolve symbolic links. To better handle symbolic links, several functions could be added, which might include is_symbolic_link, resolve_relative (or resolve_relatively), which works the same way as readlink, and resolve (or resolve_absolute or resolve_absolutely or resolve_complete), which would be equivalent to complete(resolve_relative(ph), ph). In addition, it would be useful to have a function unresolved(path) or dont_resolve(path) which returns a path wrapped such that the various functions do not resolve it if it is a symbolic link.
[snip]
-- Jeremy Maitin-Shepard

Beman Dawes writes:
Equivalence
Two paths will be considered equivalent if they resolve to the same physical directory or file
Question 2: What if neither exist?
"Jeff Garland" <jeff@crystalclearsoftware.com> writes:
How about 2 symbolic links pointing at the same unmounted volume?
On Sun, 01 Feb 2004 17:58:17 -0500, Jeremy Maitin-Shepard wrote: I would think that the default behavior should be to resolve symbolic links. To better handle symbolic links, several functions could be added, which might include is_symbolic_link, resolve_relative (or resolve_relatively), which works the same way as readlink, and resolve (or resolve_absolute or resolve_absolutely or resolve_complete), which would be equivalent to complete(resolve_relative(ph), ph). In addition, it would be useful to have a function unresolved(path) or dont_resolve(path) which returns a path wrapped such that the various functions do not resolve it if it is a symbolic link.
All this sounds reasonble, but it complicates things alot for both the user and Beman. I'd certainly be ok without these functions for the first release with future addition subject to user need as judged by Beman. Jeff

"Jeff Garland" <jeff@crystalclearsoftware.com> writes:
[snip]
All this sounds reasonble, but it complicates things alot for both the user and Beman. I'd certainly be ok without these functions for the first release with future addition subject to user need as judged by Beman.
Yes, I certainly think any symbolic link support can be added separately from this path equivalence support. However, I do think the functionality provided by the functions I described is needed. -- Jeremy Maitin-Shepard

At 11:46 AM 2/1/2004, John Maddock wrote:
Question 2: What if neither exist? Only one exists? My initial thought is that these are likely to be errors, so treat them as such. It could be argued that if either or both don't exist, they can't be equivalent, so return false.
If only one exists, it's not an error, the two are necessarily non-equivalent. If neither exists, you have a problem ;-) Probably an error, unless someone has a compelling use case.
That makes sense to me, unless someone can come up with counter arguments. --Beman

Beman Dawes wrote:
Equality --------
The path relational operators will be defined purely in terms of string comparison between the textual representation of the two paths. There are two ways to do this: (1) string comparison using std::string relational operators or (2) string comparison using std::lexicographical_compare.
I'm not not entirely sure that I understand you correctly but I think that you mean p.string() < q.string() vs lexicographical_compare( p.begin(), p.end(), q.begin(), q.end() ) (equivalent to p.m_name < q.m_name). There are two reasons to prefer lexicographical_compare. First, container requirements. Second, consider the case where p is { "first/", "second" } and q is { "first", "/second" }.

Beman Dawes wrote:
Equality --------
The path relational operators will be defined purely in terms of string comparison between the textual representation of the two paths. There are two ways to do this: (1) string comparison using std::string relational operators or (2) string comparison using std::lexicographical_compare.
I'm not not entirely sure that I understand you correctly but I think
At 11:47 AM 2/1/2004, Peter Dimov wrote: that
you mean
p.string() < q.string()
vs
lexicographical_compare( p.begin(), p.end(), q.begin(), q.end() )
(equivalent to p.m_name < q.m_name).
Yes, exactly.
There are two reasons to prefer lexicographical_compare. First, container requirements.
Ha! Interesting point! I hadn't though of that. Peter is point out that a good way to conceptualize class path is as a container-like sequence of elements. Now if it actually were specified as a sequence, it should meet the requirements of 23.1 table 65, and that means lexicographical_compare. Since class path isn't actually specified as a container, that isn't a totally killer argument to me, but it does at least tilt the scale more towards lexicographical_compare.
Second, consider the case where p is { "first/", "second" } and q is { "first", "/second" }.
That's seriously perverse and devious. I'm very impressed! The case would arise if the native format allowed slashes in names, and used, say ':' as a name separator. p.native_file_string() would return "first/:second" and q.native_file_string() would return "first:/second". But both p.string() and q.string() would return "first//second". In other words, there are some native paths that path::string() doesn't yield an unambiguous result, as noted in the docs. path::, and that's a problem if used as a key in a unique associative container. That problem as Peter points out would be cleared by using lexicographical_compare. An alternative solution is to escape the '/' in path string(), something we might want to consider anyhow. I've been avoiding that mostly from laziness. The reason I'd like to avoid lexicographical_compare is that iteration over the elements in a path is serious inefficient, at least in the current Boost implementation. That's OK, because it is really only provided as a safety-value, and is rarely used. But if it becomes the basis for lexicographical_compare, performance might become more of an issue. I'll give it some more thought. One thing to remember is that this is only the default comparison for containers. Whatever we go with is overridable by the user. Anyhow, interesting issues. Thanks! --Beman

Beman Dawes <bdawes@acm.org> writes:
[snip: path ordering]
The primary use case I know of for operator<() is default ordering for paths used as keys in associative containers. I can't see that either approach is superior for this use, so unless someone comes up with a compelling argument, (1) will be used.
I would suggest lexicographical ordering of the components, i.e. option (2). Ordering based on the ``portable'' path representation would, I think, be confusing on platforms which do not have a native path format which is identical or very similar to the portable format. Furthermore, the primary if not exclusive purpose of the portable path format is to allow storing (relative) paths as string constants, which is functionality that many users may not need, and thus will not be using the portable path representation.
Equivalence -----------
Two paths will be considered equivalent if they resolve to the same physical directory or file.
Question 1: What is a use case that requires this function? Verifying that source and target files are different before some modifying operation is the only one I've come up with. I guess that is sufficient to justify adding the function.
Following directory trees is the common use case. Of course, without a reliable file identifier number, actually using this function would be highly inefficient. A link_count function would also be useful for supporting certain logic for dealing with making backup files, such as move if the file is linked only once, otherwise copy. In addition, useful functionality that could be implemented at a later point would be a unique file identifier object, which keeps an open file handle/descriptor, to ensure that the identifier remains valid. Then the object could be used as a key in associative containers, and allow for efficient implementation of directory recursion.
Question 2: What if neither exist? Only one exists? My initial thought is that these are likely to be errors, so treat them as such. It could be argued that if either or both don't exist, they can't be equivalent, so return false.
I would suggest that the function throw an exception if either file does not exist. The exception would allow the user to determine exactly which paths exist or do not exist. Any other behavior, given that the function can return only true or false, would in some circumstances give the user less information than desired.
Question 3: The implementation on Windows (see below) leaves a small hole in that duplicated media (such as two CD's) mounted on devices with the same device id on two different networked machines would be reported as equivalent.
Does Windows actually assign networked devices device ids which are also used for local devices? If it does, then disregard comments below about use of device id exclusively.
POSIX requires that such networked devices have different device id's, avoiding the problem. Is the fact that Windows and POSIX implementations would perform slightly differently on this corner case a showstopper? I think not.
Windows logic for path equivalent: same device id AND same media volume serial number AND same physical location on disk AND same creation time. This works even in degenerate cases like camera formatted FAT flash memory cards or floppy disks with volume serial numbers incorrectly initialized to 0.
Why not use exclusively the device id and ignore the media volume serial number? Shouldn't that solve the problems? I wouldn't be too worried about broken device ids, and I don't like the idea of using hacks like modification time. Before using modification time, it would be useful to determine if there are versions of Windows that sometimes give two devices the same device id (this really does seem highly unlikely).
POSIX logic: same device id AND same physical location on disk AND same modification time. The modification time is in theory redundant, but is an added protection in case the device id on networked devices failed to meet the POSIX specs.
As with Windows, do you know of any POSIX platforms that sometimes give two devices the same device id? Note: the sample code I posted incorrectly used stat(2) instead of fstat(2) -- fstat should be used to ensure that the file identifier remains valid, and that the file is not removed, changed, etc. -- Jeremy Maitin-Shepard

At 05:51 PM 2/1/2004, Jeremy Maitin-Shepard wrote:
...
Question 2: What if neither exist? Only one exists? My initial thought is that these are likely to be errors, so treat them as such. It could be argued that if either or both don't exist, they can't be equivalent, so return false.
I would suggest that the function throw an exception if either file does not exist. The exception would allow the user to determine exactly which paths exist or do not exist. Any other behavior, given that the function can return only true or false, would in some circumstances give the user less information than desired.
Regardless of what the function does, the user can always call exists() beforehand if a complete understanding of what is present and what isn't is required.
Question 3: The implementation on Windows (see below) leaves a small hole in that duplicated media (such as two CD's) mounted on devices with the same device id on two different networked machines would be reported as equivalent.
Does Windows actually assign networked devices device ids which are also used for local devices?
Yes, and I confirmed that by testing. The device id is just an ordinal number corresponding to the drive letter. a=0, b=1, c=2, etc. So two networked machines have the same device id for their c: drives.
POSIX logic: same device id AND same physical location on disk AND same modification time. The modification time is in theory redundant, but is an added protection in case the device id on networked devices failed to meet the POSIX specs.
As with Windows, do you know of any POSIX platforms that sometimes give two devices the same device id?
Not for sure, but knowing the history of device id's and volume serial numbers, it wouldn't surprise me if that happened when Unix was first ported to mainframes. In a world where there may be dozens of mounts per second, performed by robotic tape librarians on a drive available basis, only volume serial numbers are seen as reliable to establish media identity, while device id's are seen as physical hardware addresses which should be accurately reported. Networking hadn't been invented yet. If that happened, then that existing practice could have been preserved right to this day.
Note: the sample code I posted incorrectly used stat(2) instead of fstat(2) -- fstat should be used to ensure that the file identifier remains valid, and that the file is not removed, changed, etc.
Interesting. I'll give that some thought. --Beman

Beman Dawes <bdawes@acm.org> writes:
At 05:51 PM 2/1/2004, Jeremy Maitin-Shepard wrote:
[snip]
I would suggest that the function throw an exception if either file does not exist. The exception would allow the user to determine exactly which paths exist or do not exist. Any other behavior, given that the function can return only true or false, would in some circumstances give the user less information than desired.
Regardless of what the function does, the user can always call exists() beforehand if a complete understanding of what is present and what isn't is required.
Yes, but using two separate calls introduces race conditions. Filesystem race conditions, which can be a source of security vulnerabilities in programs, are extremely common, and I think it is important for this library to discourage them.
Question 3: The implementation on Windows (see below) leaves a small hole in that duplicated media (such as two CD's) mounted on devices with the same device id on two different networked machines would be reported as equivalent.
Does Windows actually assign networked devices device ids which are also used for local devices?
Yes, and I confirmed that by testing. The device id is just an ordinal number corresponding to the drive letter. a=0, b=1, c=2, etc. So two networked machines have the same device id for their c: drives.
Hmm okay, in that case I would agree that the use of things like volume serial numbers and file times is warranted.
[snip]
As with Windows, do you know of any POSIX platforms that sometimes give two devices the same device id?
Not for sure, but knowing the history of device id's and volume serial numbers, it wouldn't surprise me if that happened when Unix was first ported to mainframes. In a world where there may be dozens of mounts per second, performed by robotic tape librarians on a drive available basis, only volume serial numbers are seen as reliable to establish media identity, while device id's are seen as physical hardware addresses which should be accurately reported. Networking hadn't been invented yet. If that happened, then that existing practice could have been preserved right to this day.
Okay. -- Jeremy Maitin-Shepard

At 11:19 PM 2/1/2004, Jeremy Maitin-Shepard wrote:
Beman Dawes <bdawes@acm.org> writes:
At 05:51 PM 2/1/2004, Jeremy Maitin-Shepard wrote:
[snip]
I would suggest that the function throw an exception if either file does not exist. The exception would allow the user to determine exactly which paths exist or do not exist. Any other behavior, given that the function can return only true or false, would in some circumstances give the user less information than desired.
Regardless of what the function does, the user can always call exists() beforehand if a complete understanding of what is present and what isn't is required.
Yes, but using two separate calls introduces race conditions.
Currently, the possibility for this kind of race condition occurs in a number of places within the library. AFAIK, every function which uses more than one system call to accomplish behavior has the problem.
Filesystem race conditions, which can be a source of security vulnerabilities in programs, are extremely common, and I think it is important for this library to discourage them.
I'm intrigued by your idea in a prior posting of holding an open handle whenever such multiple system calls are needed, to prevent race conditions. There are actually two cases I can see, (1) where we wish to prevent deletion of a file, and (2) where we wish to prevent creation of a file. I see how your technique could prevent (1) but not (2). Even so, (1) is enough of a worry that the technique is well worth thinking about. Coming back to equivalence, I still like John Maddock's suggestion of throwing only if both do not exist. If fits my intuitive sense of what "equivalence" means for files and directories. However, such thoughts in the past sometimes changed when tests cases were being constructed. So no firm decision has been made yet. Thanks, --Beman

Beman Dawes <bdawes@acm.org> writes:
Coming back to equivalence, I still like John Maddock's suggestion of throwing only if both do not exist. If fits my intuitive sense of what "equivalence" means for files and directories. However, such thoughts in the past sometimes changed when tests cases were being constructed. So no firm decision has been made yet.
I don't think I'd want an exception in this case either. I think you should treat a path that doesn't exist like a non-signalling NaN. IIRC, two NaNs never compare equal. -- Dave Abrahams Boost Consulting www.boost-consulting.com

At 09:41 AM 2/2/2004, David Abrahams wrote:
Beman Dawes <bdawes@acm.org> writes:
Coming back to equivalence, I still like John Maddock's suggestion of throwing only if both do not exist. If fits my intuitive sense of what "equivalence" means for files and directories. However, such thoughts in the past sometimes changed when tests cases were being constructed. So no firm decision has been made yet.
I don't think I'd want an exception in this case either. I think you should treat a path that doesn't exist like a non-signalling NaN. IIRC, two NaNs never compare equal.
That makes sense, and it also occurred to me earlier. What would push me strongly in one direct or another is if someone came up with a case where the non-throwing approach was more dangerous. But I can't think of such a case. The NaN analogy is a clear way to explain the rationale for not throwing. --Beman

Beman Dawes <bdawes@acm.org> writes:
[snip]
Coming back to equivalence, I still like John Maddock's suggestion of throwing only if both do not exist. If fits my intuitive sense of what "equivalence" means for files and directories. However, such thoughts in the past sometimes changed when tests cases were being constructed. So no firm decision has been made yet.
Perhaps the function could return optional<bool> or something similar. -- Jeremy Maitin-Shepard

At 11:09 AM 2/2/2004, Jeremy Maitin-Shepard wrote:
Perhaps the function could return optional<bool> or something similar.
Give the stated aim of the library to be useful for simple, script-like programs, that seems a bit like overkill. Also, I'm grooming this library for submission to the Standards committee. The less dependencies on components not in the standard, the better. Of course, boost::optional may get submitted someday too... --Beman
participants (6)
-
Beman Dawes
-
David Abrahams
-
Jeff Garland
-
Jeremy Maitin-Shepard
-
John Maddock
-
Peter Dimov