[filesystem] On POSIX, any reliable way to find file size?

There have been requests for a Boost.Filesystem function which returns file size. But I don't want to provide such a function if it will be unreliable. On POSIX compliant 32-bit systems, is there any reliable way to find the size of a file? It looks to me as if stat::st_size's type of off_t isn't large enough on 32-bit systems to cope with large file sizes. Am I wrong, or missing something? How big is the largest possible file on, say, Linux? Isn't it larger than the 2 gigs a 32-bit signed integer can represent? --Beman

Beman Dawes wrote:
On POSIX compliant 32-bit systems, is there any reliable way to find the size of a file?
First, POSIX originally had only stat(), whose st_size (of type off_t) is limited to 32 bits on 32-bit systems, as you found out. When files got large enough that this wasn't a good assumption any more even on 32-bit systems, a new interface was introduced: stat64(), which has, for example, a 64-bit stat64::st_size. (Btw, Linux 2.6 supports up to 16 TB file sizes on 32-bit machines, AFAIK.)
Of course, 64-bit environments have their regular stat() use a 64-bit stat::st_size.
You get stat64() by #define _LARGEFILE64_SOURCE . You get stat() to use a 64-bit stat::st_size even on 32-bit environments by #define _FILE_OFFSET_BITS 64 . Either macro must be defined before the first system header is included.
Both options require reasonably recent Unix systems. The latter option may impede binary compatibility if some modules are compiled with the #define and others aren't, because the perceived "struct stat" size will be different.
Jens Maurer
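A minimal sketch of the second option, assuming a system whose C library honours _FILE_OFFSET_BITS (glibc does). The helper names off_t_bits and st_size_bits are made up for illustration; they only report what this translation unit sees.

```cpp
// Must come before the first system header in this translation unit,
// otherwise it has no effect.
#define _FILE_OFFSET_BITS 64

#include <sys/types.h>
#include <sys/stat.h>
#include <climits>

// Width, in bits, of off_t as this translation unit sees it.
int off_t_bits()
{
    return static_cast<int>(sizeof(off_t) * CHAR_BIT);
}

// Width, in bits, of the st_size member of struct stat.
int st_size_bits()
{
    struct stat s;  // only used inside sizeof, never read
    return static_cast<int>(sizeof(s.st_size) * CHAR_BIT);
}
```

On a platform that ignores the macro, both helpers would simply report the native width, which is one way to detect at configuration time whether large files are supported.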

At 12:44 PM 1/31/2004, Jens Maurer wrote:
Beman Dawes wrote:
On POSIX compliant 32-bit systems, is there any reliable way to find the size of a file?
First, POSIX only had stat(), which is limited to 32 bits, as you found out. When files got large enough that this wasn't a good assumption any more even on 32bit systems, a new interface was introduced: stat64() which has, for example, a 64bit stat64::st_size . (Btw, Linux 2.6 supports up to 16 TB file sizes on 32bit machines, AFAIK.)
Of course, 64bit environments have their regular stat() use 64bit stat::st_size.
You get stat64() by #define _LARGEFILE64_SOURCE . You get stat() to use a 64bit stat::st_size even on 32bit environments by #define _FILE_OFFSET_BITS 64 .
So it sounds like Boost.Filesystem could have an implementation-defined type, say "file_offset_type", which would be either long or long long. Then Boost.Filesystem would provide:
    file_offset_type size( const path & );
The implementation would go something like this:
    #if defined( BOOST_WINDOWS )
      // use the Windows native API; file_offset_type is long long.
      ...
    #else
    #  define _FILE_OFFSET_BITS 64 // this may or may not do anything
      // use stat(); file_offset_type is same as off_t,
      // which may be 32-bits on 32-bit platforms which don't
      // respond to _FILE_OFFSET_BITS
      ...
    #endif
Does that make sense? It will take some configuration work to know whether to typedef file_offset_type to long or long long; we don't want to expose platform header files in the Boost header. Need to think about that.
Both options require reasonably recent Unix systems. The latter option may impede binary compatibility if some modules are compiled with the #define and others aren't, because the perceived "struct stat" size will be different.
If the use of "struct stat" is hidden within a Boost.Filesystem implementation file, will that be a problem? I wouldn't think so. Thanks, --Beman

"Beman" == Beman Dawes <bdawes@acm.org> writes: Beman> The implementation would go something like this:
Beman> #if defined( BOOST_WINDOWS )
Beman> // use the Windows native API; file_offset_type is long long.
Beman> ...
Beman> #else
Beman> #define _FILE_OFFSET_BITS 64 // this may or may not do anything
Beman> // use stat(); file_offset_type is same as off_t,
Beman> // which may be 32-bits on 32-bit platforms which don't
Beman> // respond to _FILE_OFFSET_BITS
Beman> ...
Beman> #endif
All files must be compiled with the same _FILE_OFFSET_BITS. You don't want to get binary-incompatible objects just because one happened to include the above header[1] and the other didn't.
~velco
[1] The program may have reasons to use struct stat independent of the presence of Boost.

At 04:33 AM 2/1/2004, Momchil Velikov wrote:
All files must be compiled with the same _FILE_OFFSET_BITS. You don't want to get binary incompatible objects just because one happened to include the above header[1] and the other didn't.
~velco
[1] The program may have reasons to use struct stat independent of the presence of Boost.
Hum... Perhaps mistakenly, I assumed stat() was a macro that expanded to call stat64() if _FILE_OFFSET_BITS=64 was defined. Put another way, is a program linking these two files OK?
file_a.cpp
    #define _FILE_OFFSET_BITS 64
    #include <sys/stat.h>
    long long func_a()
    {
      struct stat s;
      if ( stat( "foo", &s ) != 0 ) { throw ... }
      return s.st_size;
    }
file_b.cpp
    #include <sys/stat.h>
    long func_b()
    {
      struct stat s;
      if ( stat( "bar", &s ) != 0 ) { throw ... }
      return s.st_size;
    }
If that causes an ABI clash, then we would have to use the explicit 64-bit stat64() interface if available, and otherwise default to stat(). Is there a de facto standard feature availability macro to determine if _LARGEFILE64_SOURCE is supported?
Thanks,
--Beman

"Beman" == Beman Dawes <bdawes@acm.org> writes:
Beman> At 04:33 AM 2/1/2004, Momchil Velikov wrote:
>> All files must be compiled with the same _FILE_OFFSET_BITS. You
>> don't want to get binary incompatible objects just because one
>> happened to include the above header[1] and the other didn't.
>>
>> ~velco
>>
>> [1] The program may have reasons to use struct stat independent of the
>> presence of boost.
Beman> Hum... Perhaps mistakenly, I assumed stat() was a macro that expanded
Beman> to call stat64() if _FILE_OFFSET_BITS=64 was defined.
Indeed, it is probably "redirected" in some way to the appropriate syscall/function name. E.g. glibc on GNU/Linux can do something like:
a) #define stat stat64, or
b) extern int stat (struct stat *) asm ("stat64"), or
c) extern int stat (struct stat *) __attribute__ ((weak, alias("stat64")));
Beman> Put another way,
Beman> is a program linking these two files OK?
Beman> file_a.cpp
Beman> #define _FILE_OFFSET_BITS 64
Beman> #include <sys/stat.h>
Beman> long long func_a()
Beman> {
Beman> struct stat s;
Beman> if ( stat( "foo", &s ) != 0 ) { throw ... }
Beman> return s.st_size;
Beman> }
Beman> file_b.cpp
Beman> #include <sys/stat.h>
Beman> long func_b()
Beman> {
Beman> struct stat s;
Beman> if ( stat( "bar", &s ) != 0 ) { throw ... }
Beman> return s.st_size;
Beman> }
Yes, this is probably OK. The problem arises if parts of the program attempt to communicate ``struct stat'' values.
~velco

At 11:25 AM 2/1/2004, Momchil Velikov wrote:
Beman> Hum... Perhaps mistakenly, I assumed stat() was a macro that expanded
Beman> to call stat64() if _FILE_OFFSET_BITS=64 was defined.
Indeed, it is probably "redirected" in some way to the appropriate syscall/function name. E.g. glibc on GNU/Linux can do something like:
a) #define stat stat64, or
b) extern int stat (struct stat *) asm ("stat64"), or
c) extern int stat (struct stat *) __attribute__ ((weak, alias("stat64")));
And of course there is always (d) "compiler magic" :-)
Beman> Put another way,
Beman> is a program linking these two files OK?
Beman> file_a.cpp
Beman> #define _FILE_OFFSET_BITS 64
Beman> #include <sys/stat.h>
Beman> long long func_a()
Beman> {
Beman> struct stat s;
Beman> if ( stat( "foo", &s ) != 0 ) { throw ... }
Beman> return s.st_size;
Beman> }
Beman> file_b.cpp
Beman> #include <sys/stat.h>
Beman> long func_b()
Beman> {
Beman> struct stat s;
Beman> if ( stat( "bar", &s ) != 0 ) { throw ... }
Beman> return s.st_size;
Beman> }
Yes, this is probably ok. The problem is if parts of the program attempt to communicate ``struct stat'' values.
OK, thanks for the clarification. I'll be careful not to do that. A trial implementation should be in CVS in a few days; that will allow months of testing before the next release.
--Beman
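To illustrate the point about not communicating struct stat values, here is a rough sketch in which struct stat stays local to one function and only fixed-width values cross the boundary. The names file_facts and query_file are hypothetical and are not part of Boost.Filesystem.

```cpp
// Defined here only; callers never need to agree on this setting because
// struct stat does not appear in the interface below.
#define _FILE_OFFSET_BITS 64

#include <sys/stat.h>
#include <cstdint>

// Hypothetical boundary type: its layout does not depend on
// _FILE_OFFSET_BITS, so every translation unit sees the same struct.
struct file_facts {
    std::int64_t size;  // st_size widened to 64 bits
    bool         ok;    // false if stat() failed
};

file_facts query_file(const char* path)
{
    struct stat s;  // local only; never leaves this function
    file_facts f = { 0, false };
    if (::stat(path, &s) == 0) {
        f.size = static_cast<std::int64_t>(s.st_size);
        f.ok = true;
    }
    return f;
}
```

A caller compiled without the macro can still use query_file() safely, since it only ever handles file_facts.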

As far as the name ``file_offset_type'' goes, perhaps ``file_size_type'' would be better (since size_type is used in the standard library for container offsets and size).
-- Jeremy Maitin-Shepard

At 06:03 PM 2/1/2004, Jeremy Maitin-Shepard wrote:
As far as the name ``file_offset_type'' goes, perhaps ``file_size_type'' would be better (since size_type is used in the standard library for container offsets and size).
I'm having second thoughts about the typedef and the size() return type. intmax_t would be simpler, both to implement and understand. --Beman
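A sketch of what an intmax_t-returning size function could look like on POSIX, assuming _FILE_OFFSET_BITS is honoured. The name size_of_file and the choice of std::runtime_error are assumptions for illustration; this is not the implementation that went into CVS.

```cpp
#define _FILE_OFFSET_BITS 64  // request a 64-bit off_t where supported

#include <sys/stat.h>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical free function: returns the size of the file in bytes,
// or throws if stat() reports an error.
std::intmax_t size_of_file(const std::string& path)
{
    struct stat s;
    if (::stat(path.c_str(), &s) != 0)
        throw std::runtime_error("stat() failed for " + path);
    // intmax_t is at least as wide as off_t, so this never truncates.
    return static_cast<std::intmax_t>(s.st_size);
}
```

Returning intmax_t keeps off_t and every platform header out of the public signature, which is the simplification suggested above.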

On Sat, Jan 31, 2004 at 11:26:11AM -0500, Beman Dawes wrote:
It looks to me as if stat::st_size's type of off_t isn't large enough on 32-bit systems to cope with large file sizes. Am I wrong, or missing something?
How big is the largest possible file on, say, Linux? Isn't it larger than the 2 gigs a 32-bit signed integer can represent?
I don't know whether there is a solution within any formally defined Unix or Posix standard. But according to
http://www.suse.de/~aj/linux_lfs.html
the Linux 2.4 kernel supports the Large File Support (LFS) interface. There is also a link to the LFS specification:
http://ftp.sas.com/standards/large.file/x_open.20Mar96.html
But I don't have any first-hand knowledge of how complete and stable the LFS support is and how it interacts with binaries that use the traditional interface.
Regards
Christoph
--
http://www.informatik.tu-darmstadt.de/TI/Mitarbeiter/cludwig.html
LiDIA: http://www.informatik.tu-darmstadt.de/TI/LiDIA/Welcome.html
participants (5)
- Beman Dawes
- Christoph Ludwig
- Jens Maurer
- Jeremy Maitin-Shepard
- Momchil Velikov