A library: out-of-core containers and algorithms

Hello,

I am developing a library called Stxxl. It is an implementation of the STL for external-memory (out-of-core) computation, i.e. Stxxl implements containers and algorithms that can process huge volumes of data that fit only on disk. Currently I have implemented vector, stack, and priority_queue. External-memory map, list, and queue are coming soon.

The containers use only a specified, fixed amount of main memory, but can hold more elements than fit into main memory. The containers are compatible with STL algorithms. An example:

  #include <stxxl>

  // ten billion doubles
  stxxl::vector<double> HugeVector(10ULL * 1000000000ULL);
  std::fill(HugeVector.begin(), HugeVector.end(), 0.0);

STL algorithms that rely on Output, Input, Forward and Bidirectional iterators work I/O-efficiently with Stxxl containers. For the RandomAccessIterator algorithms (sorting) I have implemented specialized I/O-efficient versions:

  // ten billion doubles
  stxxl::vector<double> HugeVector(10ULL * 1000000000ULL);
  std::generate(HugeVector.begin(), HugeVector.end(), MyRandom());
  // sort the vector using only 128 MB of main memory
  stxxl::sort(HugeVector.begin(), HugeVector.end(), 128*1024*1024);

or another example (sort a file):

  stxxl::file myfile("../myfile.dat", stxxl::file::RDWR);
  stxxl::vector<double> HugeVector(&myfile);
  // sort the vector using only 128 MB of main memory
  stxxl::sort(HugeVector.begin(), HugeVector.end(), 128*1024*1024);

More details about the Stxxl library: http://i10www.ira.uka.de/dementiev/stxxl.shtml

Would such a library be interesting for Boost users? Does it fit here? Is it worth to boostify it?

I would be very glad to know your opinion.

With best regards,
Roman

"Roman Dementiev" <dementiev@ira.uka.de> wrote in message news:cu62pe$a1i$1@sea.gmane.org...
| Hello,
|
| I am developing a library called Stxxl.
| Would such a library be interesting for Boost users? Does it fit here?
| Is it worth to boostify it?
|
| I would be very glad to know your opinion.

if you intend to submit it, would it then be possible to make it work on more platforms, unix (mac), windows?

-Thorsten

Thorsten Ottosen wrote:
"Roman Dementiev" <dementiev@ira.uka.de> wrote in message news:cu62pe$a1i$1@sea.gmane.org...
| Hello,
|
| I am developing a library called Stxxl.
| Would such a library be interesting for Boost users? Does it fit here?
| Is it worth to boostify it?
|
| I would be very glad to know your opinion.
if you intend to submit it, would it then be possible to make it work on more platforms, unix (mac), windows?
yes. One must rewrite only the lower layer of Stxxl using native file access methods and native multithreading. The port to Windows is ongoing. Roman

"Roman Dementiev" <dementiev@ira.uka.de> wrote in message news:cu7m3l$lfo$1@sea.gmane.org...
| Thorsten Ottosen wrote:
| > if you intend to submit it, would it then be possible to make it
| > work on more platforms, unix (mac), windows?
|
| yes. One must rewrite only the lower layer of Stxxl using native
| file access methods and native multithreading.

is that because you need memory-mapped files? Otherwise, why don't you use boost filesystem and boost threads?

-Thorsten

Thorsten Ottosen wrote:
"Roman Dementiev" <dementiev@ira.uka.de> wrote in message news:cu7m3l$lfo$1@sea.gmane.org...
| yes. One must rewrite only the lower layer of Stxxl using native
| file access methods and native multithreading.
is that because you need memory-mapped files? Otherwise, why don't you use boost filesystem and boost threads?
I do not necessarily need memory mapped files. The Boost filesystem library is more about manipulating files and directories. But what I need is the file access itself: create/open/read/write/close file. I would have used std::fstream as a portable file access method, but it lacks support for files larger than 2 GB. That is a really big disadvantage in my case. The boost threads library does fit my requirements; I would use it when it comes to boostifying Stxxl.

Roman

A very interesting idea - and potentially very useful. I'm not sure "out of core" is a great name - since we haven't used core memory in 25 years as far as I know.
create/open/read/write/close file. I would have used std::fstream as a portable file access method, but it lacks support for files larger than 2 GB. That is a really big disadvantage in my case.
I faced this problem some time ago. The problem was that file sizes over 2 GB were not supported by the standard fopen, etc. on Windows. I don't remember how I dealt with unix. I ended up reimplementing fopen and fseek in terms of the Windows API (wfopen? - I don't remember). At the same time I also implemented wrappers for POSIX async I/O around the very unwieldy Windows async API. I have no idea if this information is useful.

Robert Ramey
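For illustration only (not Robert's code): the native calls involved are small. A minimal sketch of a 64-bit seek, assuming _lseeki64 from the MSVC runtime on Windows and plain lseek with a 64-bit off_t on POSIX:

  // Sketch only.  On 32-bit Linux, compile with -D_FILE_OFFSET_BITS=64 so
  // that off_t is 64 bits wide.
  #include <cstdio>        // SEEK_SET
  #include <fcntl.h>
  #ifdef _WIN32
  # include <io.h>
    typedef __int64 file_offset;
    inline int raw_open(const char* path) { return _open(path, _O_RDWR | _O_BINARY); }
    inline file_offset raw_seek(int fd, file_offset off) { return _lseeki64(fd, off, SEEK_SET); }
    inline void raw_close(int fd) { _close(fd); }
  #else
  # include <sys/types.h>
  # include <unistd.h>
    typedef off_t file_offset;
    inline int raw_open(const char* path) { return open(path, O_RDWR); }
    inline file_offset raw_seek(int fd, file_offset off) { return lseek(fd, off, SEEK_SET); }
    inline void raw_close(int fd) { close(fd); }
  #endif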

At Monday 2005-02-07 08:04, you wrote:
Thorsten Ottosen wrote:
"Roman Dementiev" <dementiev@ira.uka.de> wrote in message news:cu7m3l$lfo$1@sea.gmane.org...
| yes. One must rewrite only the lower layer of Stxxl using native
| file access methods and native multithreading.

is that because you need memory-mapped files? Otherwise, why don't you use boost filesystem and boost threads?
I do not necessarily need memory mapped files.
The Boost filesystem library is more about manipulating files and directories. But what I need is the file access itself: create/open/read/write/close file. I would have used std::fstream as a portable file access method, but it lacks support for files larger than 2 GB. That is a really big disadvantage in my case.
not on MY system (Windows XP, VC++7.1). Perhaps your basic standard library needs fixing
The boost threads library does fit my requirements; I would use it when it comes to boostifying Stxxl.
Roman
Victor A. Wagner Jr. http://rudbek.com The five most dangerous words in the English language: "There oughta be a law"

----- Original Message -----
From: "Roman Dementiev" <dementiev@ira.uka.de>
To: <boost@lists.boost.org>
Sent: Monday, February 07, 2005 10:04 AM
Subject: [boost] Re: A library: out-of-core containers and algorithms
Thorsten Ottosen wrote:
| > | Would such a library be interesting for Boost users? Does it fit here?
| > | Is it worth to boostify it?
Yes.
| > if you intend to submit it, would it then be possible to make it
| > work on more platforms, unix (mac), windows?
I'd suggest that it should work on *all* platforms which support files, not just a handful.
| yes. One must rewrite only the lower layer of Stxxl using native
| file access methods and native multithreading.
[snip]
But what I need is the file access itself: create/open/read/write/close file. I would have used std::fstream as a portable file access method, but it lacks support for files larger than 2 GB. That is a really big disadvantage in my case.
On most (many/all?) windows platforms you can not have files of more than 2GB, so I'd suggest you write the code to work with this restriction.
The boost threads library does fit my requirements; I would use it when it comes to boostifying Stxxl.
I would strongly support the introduction of a new library such as proposed by Roman, however on the sole condition that it is portable. The current proposal to require native file access methods I think is too limiting.

I would propose instead writing a new version of fstream which operates on vectors of files. This can be done relatively easily by using the boost::iostreams library by Jonathan Turkanis. I have posted some very preliminary prototype code (i.e. untested, naive, non-portable) at http://www.cdiggins.com/big_file.hpp just to give a glimpse of one possible approach. The code is modeled after the boost::iostreams::seekable concept ( http://home.comcast.net/~jturkanis/iostreams/libs/iostreams/doc/concepts/see... )

If Jonathan heeds our call, he could probably finish what I started in less than an hour. ;-)

Christopher Diggins
Object Oriented Template Library (OOTL)
http://www.ootl.org
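To make the "vector of files" idea concrete, here is a hedged, hypothetical sketch (deliberately not the prototype at http://www.cdiggins.com/big_file.hpp; the class name, the 1 GB chunk size and the ".N" naming scheme are invented for illustration) of the offset arithmetic such a device has to do: a 64-bit logical position is split into a chunk index and a small in-chunk offset that always fits in a 32-bit seek.

  #include <cstddef>
  #include <cstdio>
  #include <string>

  class chunked_file {
  public:
      typedef long long offset_type;

      explicit chunked_file(const std::string& base,
                            offset_type chunk_size = 1LL << 30)   // 1 GB chunks
          : base_(base), chunk_size_(chunk_size), pos_(0) {}

      void seek(offset_type pos) { pos_ = pos; }

      // Read up to n bytes from the current logical position; a read never
      // crosses a chunk boundary, so callers loop if they want more.
      std::size_t read(char* dst, std::size_t n) {
          std::size_t chunk = static_cast<std::size_t>(pos_ / chunk_size_); // which file
          offset_type local = pos_ % chunk_size_;                           // offset in it
          offset_type room  = chunk_size_ - local;
          std::size_t want  = n < static_cast<std::size_t>(room)
                                ? n : static_cast<std::size_t>(room);
          std::FILE* f = std::fopen(chunk_name(chunk).c_str(), "rb");
          if (!f) return 0;
          std::fseek(f, static_cast<long>(local), SEEK_SET);  // local fits in a long
          std::size_t got = std::fread(dst, 1, want, f);
          std::fclose(f);
          pos_ += static_cast<offset_type>(got);
          return got;
      }

  private:
      std::string chunk_name(std::size_t i) const {
          char suffix[32];
          std::sprintf(suffix, ".%lu", static_cast<unsigned long>(i));
          return base_ + suffix;
      }

      std::string base_;
      offset_type chunk_size_;
      offset_type pos_;
  };

A real device would of course keep the current chunk's stream open, handle writes and growth, and create new chunk files atomically (see the discussion of naming below).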

christopher diggins wrote:
Would such a library be interesting for Boost users? Does it fit here? Is it worth to boostify it?
Yes.
if you intend to submit it, would it then be possible to make it work on more platforms, unix (mac), windows?
I'd suggest that it should work on *all* platforms which support files, not just a handful.
I would strongly support the introduction of a new library such as proposed by Roman, however on the sole condition that it is portable.
I don't see why it couldn't be made to work on a wide variety of systems. However, the initial goal should just be to support the most widely used systems. Porting to other systems can be done later as needed, preferably with the people who need it helping with the porting.
The current proposal to require native file access methods I think is too limiting. I would propose instead writing a new version of fstream which operates on vectors of files. This can be done relatively easily by using the boost::iostreams library by Jonathan Turkanis, I have posted some very preliminary prototype code (i.e. untested, naive, non-portable) at http://www.cdiggins.com/big_file.hpp just to give a glimpse of one possible approach. The code is modeled after the boost::iostreams::seekable concept ( http://home.comcast.net/~jturkanis/iostreams/libs/iostreams/doc/concepts/see... )
This is an interesting idea; I'd like to be convinced that it's necessary before implementing it. I think there will be some tricky issues similar to the ones encountered when implementing temp files. Specifically, you can't assume that the names file.1, file.2, ... will always be available; when you need a new file, you have to look for a name which is not used and create the file atomically. Also, the naming convention should be customizable.
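For illustration, a minimal sketch of such an atomic-creation loop (POSIX flavour only; the function name and the fixed ".N" suffix scheme are invented, and a real version would make the naming convention customizable as just noted):

  #include <cerrno>
  #include <cstdio>
  #include <string>
  #include <fcntl.h>
  #include <unistd.h>

  // Returns an open descriptor for a freshly created file "<base>.N" and
  // reports the chosen name; O_CREAT | O_EXCL makes the creation atomic, so
  // a name that another process grabbed first simply fails with EEXIST.
  int create_unique_chunk(const std::string& base, std::string& name_out)
  {
      for (unsigned n = 0; ; ++n) {
          char suffix[16];
          std::sprintf(suffix, ".%u", n);
          std::string candidate = base + suffix;
          int fd = open(candidate.c_str(), O_CREAT | O_EXCL | O_RDWR, 0600);
          if (fd >= 0) {                 // we created it; nobody else owns this name
              name_out = candidate;
              return fd;
          }
          if (errno != EEXIST)           // a real error, not just a taken name
              return -1;
      }                                  // EEXIST: try the next suffix
  }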
If Jonathan heeds our call, he could probably finish what I started in less than an hour. ;-)
:-)
Christopher Diggins Object Oriented Template Library (OOTL) http://www.ootl.org
Jonathan

Hi Roman,

----- Original Message -----
From: "Jonathan Turkanis" <technews@kangaroologic.com>
To: <boost@lists.boost.org>
Sent: Tuesday, February 08, 2005 11:05 AM
Subject: [boost] Re: Re: A library: out-of-core containers and algorithms (a job for boost::iostreams)

<snip>
The current proposal to require native file access methods I think is too limiting. I would propose instead writing a new version of fstream which operates on vectors of files. This can be done relatively easily by using the boost::iostreams library by Jonathan Turkanis, I have posted some very preliminary prototype code (i.e. untested, naive, non-portable) at http://www.cdiggins.com/big_file.hpp just to give a glimpse of one possible approach. The code is modeled after the boost::iostreams::seekable concept (
http://home.comcast.net/~jturkanis/iostreams/libs/iostreams/doc/concepts/seekable_device.html )
This is an interesting idea; I'd like to be convinced that it's necessary before implementing it.
I think there will be some tricky issues similar to the ones encountered when implementing temp files. Specifically, you can't assume that the names file.1, file.2, ... will always be available; when you need a new file, you have to look for a name which is not used and create the file atomically. Also, the naming convention should be customizable.
If Jonathan heeds our call, he could probably finish what I started in less than an hour. ;-)
I have been working on something similar for a while. Maybe some experiences along the way are relevant (helpful?).

The functional requirements were in the area of network logging. The ability to speedily collect and randomly access huge amounts of data were fundamental goals. Huge files were a detail issue, i.e. how do you store and access over 2 GB in a normal OS file? Over 4 GB? More?

Huge solitary files have a reputation for unexpectedly bad performance. In testing I have found that huge files are bad. But understanding the true significance of that trivial sample is time-consuming, and that's before you consider all platforms. Pretty early on I moved to a striping strategy, i.e. a single virtual storage file comprising a sequence of OS files. I also went as far as file hierarchies (i.e. folders of folders of files), as eventually folders have a reputation for performance problems beyond certain numbers of entries. (NB: striping has turned out to be rather convenient. It was quite simple to go on to develop a "sliding" version - FIFO.)

Having sorted out the mass storage issue I still had to deal with the "huge integers" thing. I suspect I have done a sidestep that may eventually turn around and byte me <ahem> but so far local requirements have been fulfilled even with the following limitations:

* there is no knowledge of bytes consumed; instead I only remember MB
* the only "addressing" is by ordinal, e.g. log[ 68755 ]

so my maximum addressable space is a function of 32-bit integers (the ordinal) and the bytes consumed by each log entry. I have a GUI accessing millions of logging entries over GBs of data and getting constant performance.

Cheers.

Scott Woods wrote:
Hi Roman,
Christopher Diggins wrote:
The current proposal to require native file access methods I think is too limiting. I would propose instead writing a new version of fstream which operates on vectors of files. This can be done relatively easily
I have been working on something similar for a while. Maybe some experiences along the way are relevant (helpful?).
<snip interesting details>
* there is no knowledge of bytes consumed; instead I only remember MB
* the only "addressing" is by ordinal, e.g. log[ 68755 ]

so my maximum addressable space is a function of 32-bit integers (the ordinal) and the bytes consumed by each log entry.
Could you elaborate on these points? What is the interface for accessing the data?
Cheers.
Jonathan

----- Original Message -----
From: "Jonathan Turkanis" <technews@kangaroologic.com>
To: <boost@lists.boost.org>
Sent: Tuesday, February 08, 2005 4:27 PM

* there is no knowledge of bytes consumed; instead I only remember MB
* the only "addressing" is by ordinal, e.g. log[ 68755 ]

so my maximum addressable space is a function of 32-bit integers (the ordinal) and the bytes consumed by each log entry.

Could you elaborate on these points? What is the interface for accessing the data?
The limitations of 32-bit integers first arose when dealing with the huge solitary files - I moved to striping. It cropped up again once I was dealing with seriously large volumes of logging.

Starting with Roman's template names and adding mine:

  typedef stxxl::vector<double> actual_stripe;
  typedef striped_vector<actual_stripe> stripes;
  typedef striped_vector<stripes> striped_stripes;

Knowing when to "close off" a stripe and open the next is driven by an "enum OPTIMAL_MAXIMUM". For a vector this was some reasonable byte figure for the local filesystem. The "striped_vector" expects and conforms to a minimal concept, i.e. it can also be passed as a "stripe type". The calculation of OPTIMAL_MAXIMUM within striped_vector<striped_vector<...> > quickly blew out the 32-bit limit. So I moved it to a calculation of MBs.

I would include real examples of template usage but I suspect that unrelated material would confuse things (serialization techniques). So here is some modulated source:

  struct stored_log {};

  typedef stxxl::vector<stored_log> log_stripe;
  typedef striped_vector<log_stripe> log_vector;    // folder of files

  class application_storage {
      folder_device home;
      log_vector log;
      ..
  };

  ..
  log.open( home .. );
  ..
  stored_log sl;
  ..
  log.push_back( sl );
  ..

  stored_log &application_storage::line( unsigned long i )
  {
      return log[ i ];
  }

My implementation of an out-of-core vector is nowhere near as complete as Roman's. My container is required to collect a sequence of data and provide random access to it. Taking this to heart there is no means by which you can modify something once it has been "push_back'd". The ramifications of this with respect to STL concepts and the algorithms that rely on them are obvious. The container is write-once-read-many.

Hope this was a successful elaboration.

Cheers.
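(A schematic of the ordinal addressing described above - not Scott's code, and the names are invented; it assumes every stripe stores the same fixed number of records.)

  #include <cstddef>

  // Which stripe file and which slot inside it hold record number 'ordinal'.
  struct stripe_address {
      std::size_t stripe;   // index of the physical file
      std::size_t slot;     // record index inside that file
  };

  inline stripe_address locate(unsigned long ordinal, std::size_t records_per_stripe)
  {
      stripe_address a;
      a.stripe = ordinal / records_per_stripe;
      a.slot   = ordinal % records_per_stripe;
      return a;
  }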

----- Original Message ----- From: "Jonathan Turkanis" <technews@kangaroologic.com>
----- Original Message ----- From: "Roman Dementiev" <dementiev@ira.uka.de>
I would strongly support the introduction of a new library such as proposed by Roman, however on the sole condition that it is portable.
I don't see why it couldn't be made to work on a wide variety of systems. However, the initial goal should just be to support the most widely used systems. Porting to other systems can be done later as needed, preferably with the people who need it helping with the porting.
Sounds to me like a great deal of work for something that isn't that complicated to begin with. The vector fstream solution I proposed could be up and running in a couple of days and would be extremely portable.
The current proposal to require native file access methods I think is too limiting. I would propose instead writing a new version of fstream which operates on vectors of files. This can be done relatively easily by using the boost::iostreams library by Jonathan Turkanis, I have posted some very preliminary prototype code (i.e. untested, naive, non-portable) at http://www.cdiggins.com/big_file.hpp just to give a glimpse of one possible approach. The code is modeled after the boost::iostreams::seekable concept ( http://home.comcast.net/~jturkanis/iostreams/libs/iostreams/doc/concepts/see... )
This is an interesting idea; I'd like to be convinced that it's necessary before implementing it.
Because it's far easier to implement and is more portable than the other method of writing a whole bunch of ports, and testing all of the ports. Also I think it would be an excellent example for your iostreams library. I think Scott had something to say about performance of huge files. The file vector solution is probably significantly more efficient.
I think there will be some tricky issues similar to the ones encountered when implementing temp files.
The library, I assume, has already dealt with the problem of temp files, so the solution should likely be the same.
Specifically, you can't assume that the names file.1, file.2, ... will always be available; when you need a new file, you have to look for a name which is not used and create the file atomically. Also, the naming convention should be customizable.
Yes.
If Jonathan heeds our call, he could probably finish what I started in less than an hour. ;-)
:-)
Done yet? CD

christopher diggins wrote:
----- Original Message ----- From: "Jonathan Turkanis" <technews@kangaroologic.com>
----- Original Message ----- From: "Roman Dementiev" <dementiev@ira.uka.de>
I would strongly support the introduction of a new library such as proposed by Roman, however on the sole condition that it is portable.
I don't see why it couldn't be made to work on a wide variety of systems. However, the initial goal should just be to support the most widely used systems. Porting to other systems can be done later as needed, preferably with the people who need it helping with the porting.
Sounds to me like a great deal of work for something that isn't that complicated to begin with.
It's really not very hard to support several platforms using native APIs. See, e.g., file_descriptor.cpp: http://www.kangaroologic.com/iostreams/libs/iostreams/src/file_descriptor.cp...
The vector fstream solution I proposed could be up and running in a couple of days and would be extremely portable.
I'm not yet sure what the portability issues are. If it's just the ability to use large offsets, a couple of small changes to file_descriptor would be sufficient.
This is an interesting idea; I'd like to be convinced that it's necessary before implementing it.
Because it's far easier to implement and is more portable than the other method of writing a whole bunch of ports, and testing all of the ports.
Often you just need two or three versions of a small bit of code; the boost regression testing system makes testing on many platforms easy.
Also I think it would be an excellent example for your iostreams library. I think Scott had something to say about performance of huge files. The file vector solution is probably significantly more efficient.
What interests me about this solution is that it uses a collection of files. Whether the files are accessed using std::filebuf's or file_descriptors may turn out to be not so important. Or it may turn out that using one or the other is actually better for some reason. So there are several different issues here.
I think there will be some tricky issues similar to the ones encountered when implementing temp files.
The library, I assume, has already dealt with the problem of temp files, so the solution should likely be the same.
No, unfortunately I haven't done it yet.
Specifically, you can't assume that the names file.1, file.2, ... will always be available; when you need a new file, you have to look for a name which is not used and create the file atomically. Also, the naming convention should be customizable.
Yes.
If Jonathan heeds our call, he could probably finish what I started in less than an hour. ;-)
:-)
Done yet?
Give me five more minutes.
CD
Jonathan

At 01:56 PM 2/7/2005, christopher diggins wrote:
On most (many/all?) windows platforms you can not have files of more than
2GB, so I'd suggest you write the code to work with this restriction.
That is not correct. Windows has supported large files for many years. I've got a Win32 Programmer's Reference copyright 1993 which includes the functions supporting file sizes greater than 2 GB, so support has existed at least since '93 for all Windows versions which support the Win32 API. Most, if not all, current POSIX implementations also support large file sizes. Like Windows, this support has been available for many years. Both Windows and POSIX cases use non-standard API's for at least portions of their large file support, but the number of functions involved is so small that it isn't hard for Boost code to support both. --Beman

Beman Dawes wrote:
At 01:56 PM 2/7/2005, christopher diggins wrote:
On most (many/all?) windows platforms you can not have files of more than 2GB, so I'd suggest you write the code to work with this restriction.
That is not correct. Windows has supported large files for many years. I've got a Win32 Programmer's Reference copyright 1993 which includes the functions supporting file sizes greater than 2 GB, so support has existed at least since '93 for all Windows versions which support the Win32 API.
Yes, large file sizes have been supported by the Windows API for many years, e.g. the seeking function:

  DWORD SetFilePointer(
      HANDLE hFile,
      LONG   lDistanceToMove,
      PLONG  lpDistanceToMoveHigh,
      DWORD  dwMoveMethod
  );

But the question is whether the 64-bit offset functionality is really implemented. The recent MSDN tells that under Windows Me/98/95 lpDistanceToMoveHigh is not supported :(
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/fileio/base...

Popular Windows file systems do not support large files by their design, e.g. FAT16, FAT32.
http://www.microsoft.com/resources/documentation/Windows/XP/all/reskit/en-us...

I fear in order to support these environments one has to follow "the vector of files" approach by Christopher.
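For illustration, a minimal sketch (not Stxxl or Boost code) of the documented idiom for pushing a full 64-bit offset through that call: the offset is split into low and high halves, and the 0xFFFFFFFF return value has to be disambiguated with GetLastError, since it can also be a legitimate low word.

  #include <windows.h>

  // Seek to an absolute 64-bit offset with SetFilePointer.
  bool seek64(HANDLE file, LONGLONG offset)
  {
      LARGE_INTEGER li;
      li.QuadPart = offset;    // split into LowPart / HighPart
      DWORD low = SetFilePointer(file, static_cast<LONG>(li.LowPart),
                                 &li.HighPart, FILE_BEGIN);
      return !(low == INVALID_SET_FILE_POINTER && GetLastError() != NO_ERROR);
  }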
Most, if not all, current POSIX implementations also support large file sizes. Like Windows, this support has been available for many years.
not really true:
http://www.suse.de/~aj/linux_lfs.html
e.g. Linux supports large files only since the 2.3 kernel. But it seems that the design of POSIX/Unix file systems has supported large files from the beginning.

Roman

----- Original Message -----
From: "Beman Dawes" <bdawes@acm.org>
To: <boost@lists.boost.org>; <boost@lists.boost.org>
Sent: Monday, February 07, 2005 9:04 PM
Subject: Re: [boost] Re: A library: out-of-core containers and algorithms (a job for boost::iostreams)
At 01:56 PM 2/7/2005, christopher diggins wrote:
On most (many/all?) windows platforms you can not have files of more than
2GB, so I'd suggest you write the code to work with this restriction.
That is not correct. Windows has supported large files for many years. I've got a Win32 Programmer's Reference copyright 1993 which includes the functions supporting file sizes greater than 2 GB, so support has existed at least since '93 for all Windows versions which support the Win32 API.
Most, if not all, current POSIX implementations also support large file sizes. Like Windows, this support has been available for many years.
Both Windows and POSIX cases use non-standard API's for at least portions of their large file support, but the number of functions involved is so small that it isn't hard for Boost code to support both.
I stand corrected. I made that assumption based on the partition limits on certain filesystems. CD

At 10:01 AM 2/8/2005, christopher diggins wrote:
I stand corrected. I made that assumption based on the partition limits on certain filesystems.
It is confusing because as Roman Dementiev pointed out, even though the Windows API has supported large files for a long time, some older versions of that operating system may not implement the functionality, particularly if they only supported file systems which were limited to smaller files. I'd hate to see otherwise useful Boost code held back because it didn't support large files on systems which don't normally support large amounts of data anyhow. --Beman

Roman Dementiev wrote:
I do not necessarily need memory mapped files.
The Boost filesystem library is more about manipulating files and directories. But what I need is the file access itself: create/open/read/write/close file. I would have used std::fstream as a portable file access method, but it lacks support for files larger than 2 GB. That is a really big disadvantage in my case.
The boost iostreams library, which is in CVS now and will be part of the next boost release, contains a class file_descriptor for accessing files using OS or runtime library file descriptors.

http://www.kangaroologic.com/iostreams/libs/iostreams/doc/?path=6.1.3

Currently, random access is performed using offsets of type std::streamoff, which is often implemented as a 32-bit unsigned long. However, I recently decided to relax this restriction and allow offsets to be specified using a larger integral type. On windows, seeking will be implemented using _lseek64 where available. On Posix, lseek will be used; it uses off_t, which is (I think) often 64 bits.

Access via iostreams will still have to use std::streamoff, since that is out of my control; but you should be able to use an instance of file_descriptor directly.

The new code should be available soon.

Best Regards,
Jonathan

Jonathan Turkanis wrote:
The boost iostreams library, which is in CVS now and will be part of the next boost release, contains a class file_descriptor for accessing files using OS or runtime library file descriptors.
http://www.kangaroologic.com/iostreams/libs/iostreams/doc/?path=6.1.3
Currently, random access is performed using offsets of type std::streamoff, which is often implemented as a 32-bit unsigned long. However, I recently decided to relax this restriction and allow offsets to be specified using a larger integral type. On windows, seeking will be implemented using _lseek64 where available. On Posix, lseek will be used; it uses off_t, which is (I think) often 64 bits.
Under Linux, if the preprocessor macros _LARGEFILE_SOURCE, _LARGEFILE64_SOURCE, _FILE_OFFSET_BITS=64 are defined, then off_t is 64-bit; otherwise it is 32-bit.
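A trivial way to see the effect of those macros (illustrative only): built with -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 the program below prints 8 on a 32-bit Linux/glibc system, and typically 4 without them.

  #include <cstdio>
  #include <sys/types.h>

  int main()
  {
      std::printf("sizeof(off_t) = %u\n", static_cast<unsigned>(sizeof(off_t)));
      return 0;
  }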
Access via iostreams will still have to use std::streamoff, since that is out of my control; but you should be able to use an instance of file_descriptor directly.
The new code should be available soon.
I think boost::iostreams::file_descriptor is what I need. Of course with 64 bit seek.

Another reason why I do not like std::fstream is its bad performance. It introduces superfluous copying into its internal buffer when doing I/O on a user buffer.

Another issue is that the performance of my library will benefit from unbuffered file system access. Many operating systems support it (the O_DIRECT open option on Unix systems, the FILE_FLAG_NO_BUFFERING|FILE_FLAG_WRITE_THROUGH option for the "CreateFile" Windows API call).

Roman
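For illustration, a minimal sketch of the Unix side (Linux flavour; "data.bin" and the 4 KB block size are placeholders): O_DIRECT hands the user's buffer straight to the device, which is why the buffer address, the transfer length and the file offset all have to be suitably aligned.

  #ifndef _GNU_SOURCE
  #define _GNU_SOURCE 1       // exposes O_DIRECT in <fcntl.h> on Linux
  #endif
  #include <cstdlib>          // posix_memalign, std::free
  #include <fcntl.h>
  #include <unistd.h>

  int main()
  {
      const size_t block = 4096;                      // sector-size multiple
      void* buf = 0;
      if (posix_memalign(&buf, block, block) != 0)    // aligned buffer required
          return 1;
      int fd = open("data.bin", O_RDONLY | O_DIRECT); // file name is illustrative
      if (fd < 0) { std::free(buf); return 1; }
      ssize_t got = read(fd, buf, block);             // aligned length, aligned offset
      close(fd);
      std::free(buf);
      return got < 0 ? 1 : 0;
  }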

Roman Dementiev wrote:
I think boost::iostreams::file_descriptor is what I need. Of course with 64 bit seek.
Okay, I'll make sure to add this.
Another reason why I do not like std::fstream is its bad performance. It introduces superfluous copying into its internal buffer when doing I/O on a user buffer.
You can also use a FILE* and specify _IONBF with setvbuf.
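For illustration (the file name is invented), the setvbuf call in question turns off stdio's internal buffer so that fread/fwrite move data more directly between the file and the caller's own buffer:

  #include <cstdio>

  std::FILE* open_unbuffered(const char* path)
  {
      std::FILE* f = std::fopen(path, "rb");
      if (f)
          std::setvbuf(f, 0, _IONBF, 0);   // _IONBF: unbuffered
      return f;
  }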
Another issue is that the performance of my library will benefit from unbuffered file system access. Many operating systems support it (the O_DIRECT open option on Unix systems, the FILE_FLAG_NO_BUFFERING|FILE_FLAG_WRITE_THROUGH option for the "CreateFile" Windows API call).
Could you point me to documentation for O_DIRECT? It looks like FILE_FLAG_NO_BUFFERING requires some care to use properly, so I think it would have to go into a specialized component instead of being added to file_descriptor.
Roman
Jonathan

Jonathan Turkanis wrote:
Could you point me to documentation for O_DIRECT?

man 2 open
http://www.die.net/doc/linux/man/man2/open.2.html
http://www.mcsr.olemiss.edu/cgi-bin/man-cgi?open+2
It looks like FILE_FLAG_NO_BUFFERING requires some care to use properly, so I think it would have to go into a specialized component instead of being added to file_descriptor.

Do you mean the offset requirements described in http://msdn.microsoft.com/library/default.asp?url=/library/en-us/fileio/base... ?
In particular Stxxl file access patterns meet these requirements. Roman
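For illustration, a sketch of the flags in question (not Stxxl source): with FILE_FLAG_NO_BUFFERING the buffer address, the transfer size and the file offset must all be multiples of the volume sector size - the MSDN requirements referred to above.

  #include <windows.h>

  HANDLE open_unbuffered(const wchar_t* path)
  {
      return CreateFileW(path,
                         GENERIC_READ | GENERIC_WRITE,
                         0,                      // no sharing
                         NULL,                   // default security
                         OPEN_EXISTING,
                         FILE_FLAG_NO_BUFFERING | FILE_FLAG_WRITE_THROUGH,
                         NULL);
  }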

Roman Dementiev <dementiev@ira.uka.de> writes:
Would such a library be interesting for Boost users? Does it fit here? Is it worth to boostify it?
I think the answer to all those questions should be yes, even though I don't have any need for it personally.

Have you considered that algorithms tuned for segmented data structures (per the Austern paper) might be suitable for out-of-core data without modification?

--
Dave Abrahams
Boost Consulting
www.boost-consulting.com
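The segmented idea can be sketched as follows (hypothetical interface, not Austern's actual proposal): an algorithm that knows about segmentation visits one in-memory segment at a time, which is exactly the block-wise access pattern an out-of-core container wants.

  #include <algorithm>
  #include <cstddef>
  #include <utility>

  template <class SegmentedContainer, class F>
  F for_each_segmented(SegmentedContainer& c, F f)
  {
      // Assumed members: segment_count() and segment(i), the latter giving a
      // pair of ordinary iterators over the i-th block once it is in memory.
      for (std::size_t i = 0; i != c.segment_count(); ++i) {
          std::pair<typename SegmentedContainer::local_iterator,
                    typename SegmentedContainer::local_iterator> seg = c.segment(i);
          f = std::for_each(seg.first, seg.second, f);
      }
      return f;
  }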
participants (9)
- Beman Dawes
- christopher diggins
- David Abrahams
- Jonathan Turkanis
- Robert Ramey
- Roman Dementiev
- Scott Woods
- Thorsten Ottosen
- Victor A. Wagner Jr.