library for managing a large number of FILE objects

I know that C++ IOStreams are supposed to take over the world, and cstdio with FILE is considered taboo by some C++ aficionados. However, for many reasons, lots of code still uses cstdio FILE I/O. One issue that comes up from time to time is the need to manage more files than the OS will allow a single process to have open simultaneously. I process tons of data, and this is an issue for me.

I have developed a small library that provides on-demand caching of FILE pointers, so that an application can "open" as many FILEs as necessary and use them as normal. A simple LRU eviction algorithm is used to reclaim FILEs when "all" have been used. I was discussing this library with another developer, and he said he has seen several questions recently about a similar issue, and he advised me to ask if there was interest here. This library, however, does not seem very "on the edge"...

A very simple example of how you can use the library (of course there are better ways to do the following, but it is meant to be a small, easy to use example):

    #include <cerrno>
    #include <cstdio>
    #include <cstdlib>
    #include <cstring>
    #include <iostream>
    #include <sstream>
    #include <vector>

    // Open 10,000 FILEs.
    wjh::file_cache file_cache;
    std::vector<wjh::cached_fptr> fp;
    for (int i = 0; i < 10000; ++i)
    {
        std::stringstream strm;
        strm << "FILE_" << i;
        wjh::cached_fptr fptr = file_cache.open(strm.str().c_str(), "w");
        if (!fptr)
        {
            std::cerr << strm.str() << ": " << strerror(errno) << std::endl;
            break;
        }
        fp.push_back(fptr);
    }

    // Randomly write to a particular file.
    for (int i = 0; i < 200000; ++i)
    {
        int x = rand() % fp.size();
        fprintf(fp[x], "file %d, iteration %d\n", x, i);
    }

Is it something useful for more than just me (and the pitiful souls who work for me and must use it)? Is it something worth posting on this list?
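For the curious, the core LRU mechanism described above might look something like the following minimal sketch. This is an illustration only, not the actual wjh::file_cache source; the class name, the reopen modes, and the repositioning policy are all assumptions:

    #include <cstdio>
    #include <list>
    #include <map>
    #include <string>

    // Hypothetical sketch of an on-demand FILE cache with LRU eviction.
    class lru_file_cache
    {
        struct entry
        {
            entry() : fp(0), offset(0), created(false) {}
            FILE * fp;          // null while evicted
            long   offset;      // stream position to restore on reopen
            bool   created;     // first open truncates; reopens must not
            std::list<std::string>::iterator lru_pos;
        };

        std::size_t max_open_;              // OS-imposed descriptor limit
        std::size_t open_count_;
        std::map<std::string, entry> files_;
        std::list<std::string> lru_;        // front = most recently used

    public:
        explicit lru_file_cache(std::size_t max_open)
            : max_open_(max_open), open_count_(0) {}

        // Return a usable FILE* for name, opening (or reopening) on demand.
        FILE * get(std::string const & name)
        {
            entry & e = files_[name];
            if (e.fp)                                        // cache hit
            {
                lru_.splice(lru_.begin(), lru_, e.lru_pos);  // mark as MRU
                return e.fp;
            }
            if (open_count_ >= max_open_)    // no free slot:
                evict_lru();                 // close the LRU FILE
            // The first open truncates ("w"); a reopen must preserve the
            // contents ("r+") and restore the old position.
            e.fp = std::fopen(name.c_str(), e.created ? "r+" : "w");
            if (!e.fp)
                return 0;
            if (e.created)
                std::fseek(e.fp, e.offset, SEEK_SET);
            e.created = true;
            lru_.push_front(name);
            e.lru_pos = lru_.begin();
            ++open_count_;
            return e.fp;
        }

    private:
        void evict_lru()
        {
            entry & victim = files_[lru_.back()];
            victim.offset = std::ftell(victim.fp);  // remember position
            std::fclose(victim.fp);
            victim.fp = 0;
            lru_.pop_back();
            --open_count_;
        }
    };

A fuller version would also remember the exact open mode, flush before evicting, and close any remaining handles in its destructor; those details are omitted here for brevity.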

Jody Hagins wrote:
> I know that C++ IOStreams are supposed to take over the world, and
> cstdio with FILE is considered taboo by some C++ aficionados. However,
> for many reasons, lots of code still uses cstdio FILE I/O. One issue
> that comes up from time to time is the need to manage more files than
> the OS will allow a single process to have open simultaneously. I
> process tons of data, and this is an issue for me. I have developed a
> [...]
But don't you close a file after you've used it? You mean to tell me you want to actually keep all 10,000 files open at all times? That seems a little extreme, or crazy, or something ;)

If you need something like this (which I think would happen very rarely), you can simply have a cache which opens FILEs on demand. You just request a read or write with a file name. Internally, if that file name is open, do the read or write. Otherwise, open it and do the same. This should be a very simple class - no more than 100 lines of code.

Best,
John
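For concreteness, the kind of class John describes might look roughly like this. All names here are hypothetical; eviction picks an arbitrary victim, and append mode keeps reopening safe:

    #include <cstdio>
    #include <map>
    #include <string>

    // Hypothetical name-keyed writer that opens FILEs on demand.
    class on_demand_writer
    {
        std::size_t limit_;                 // max simultaneously open FILEs
        std::map<std::string, FILE *> open_;

    public:
        explicit on_demand_writer(std::size_t limit) : limit_(limit) {}

        ~on_demand_writer()
        {
            for (std::map<std::string, FILE *>::iterator i = open_.begin();
                 i != open_.end(); ++i)
                if (i->second)
                    std::fclose(i->second);
        }

        // Write len bytes to the named file, opening it on demand.  Append
        // mode ("a") means a file closed to free a descriptor can simply
        // be reopened later without losing what was already written.
        bool write(std::string const & name, void const * buf,
                   std::size_t len)
        {
            FILE *& fp = open_[name];
            if (!fp)
            {
                if (open_.size() > limit_)  // at the limit: close another
                    close_other(name);
                fp = std::fopen(name.c_str(), "a");
            }
            return fp && std::fwrite(buf, 1, len, fp) == len;
        }

    private:
        // Close and forget some entry other than 'keep' (an arbitrary
        // victim; an LRU policy would go here in a serious version).
        void close_other(std::string const & keep)
        {
            std::map<std::string, FILE *>::iterator i = open_.begin();
            if (i->first == keep)
                ++i;
            if (i->second)
                std::fclose(i->second);
            open_.erase(i);
        }
    };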

On Fri, 16 Apr 2004 08:16:44 +0200 John Torjo <john.lists@torjo.com> wrote:
> But don't you close a file after you've used it? You mean to tell me
> you want to actually keep all 10,000 files open at all times? That
> seems a little extreme, or crazy, or something ;)
I've been called worse ;-> I really appreciate your input; a little more rationale is at the end, specific to how I use this library.
> If you need something like this (which I think would happen very
> rarely), you can simply have a cache which opens FILEs on demand. You
> just request a read or write with a file name. Internally, if that
> file name is open, do the read or write. Otherwise, open it and do the
> same. This should be a very simple class - no more than 100 lines of
> code.
I must not have explained well earlier, because that's what I was trying to describe. Actually, that is what this library does (and once I wrote it, I found I use it quite a bit). However, it is considerably more than 100 lines. Granted, a large part of the code wraps the handles and the normal stdio functions so applications can safely use almost all the normal stdio functions exactly the same way with these cached file pointers. In addition, I suppose my coding, like my writing, is a bit on the verbose side.

So, what do I use it for? I use this library for many things now, but the original need was post-processing a large amount of US stock market information. I have a stream of data that gets processed at the end of each day. The data contains lots of information about each specific stock (e.g. quote and trade information). However, there are more than 10,000 different symbols in this file. The post-processing splits the information up by symbol, into a separate file for each symbol. The vast majority of the information pertains to a smallish subset of the symbols (a few hundred). I have found it much easier to handle this information like so:

    fwrite(buffer, recsz, nrecs, symbol_info[sym]<file_ptr>.get());

SIDE NOTE: symbol_info is a std::map<symbol, wjh::dynamic_tuple>. A dynamic_tuple is kinda like a boost::tuple, except you can add members by type at run time (you still get compile-time type checking, though), and you can access them through named type tags instead of just integrals. The call to symbol_info[sym]<file_ptr>.get() returns a reference to an object of type wjh::stdio::cached_file, and the proper overload of fwrite() is called.

So I keep the "cached" file handle as an attribute of the symbol. The first time that symbol is seen, the file is opened, and the handle is put into the dynamic_tuple, associated with the type tag file_ptr. Thus, any time I want to write to the file associated with that symbol, I simply do so. The file will be automatically reopened (with the proper mode, and the file pointer repositioned) if it has been swapped out to accommodate other accesses.

I find it nice to use virtual FILE pointers, so I do not have to worry about running out. In practice, for my apps, I do not experience a terrible amount of swapping (relative to the number of positive cache hits). However, your point is well taken, and while I have many uses for this library, others may not (unless you need to use lots of files, or use an OS with very limiting restrictions on the number of open files).

Thanks!!!

--
Jody Hagins
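The proxy handle Jody describes - one that revalidates its FILE* through the cache on every stdio call, so eviction and the subsequent reopen/reposition stay invisible to the caller - might be sketched like this. The names are hypothetical, not the real wjh::stdio API:

    #include <cstdio>
    #include <string>

    // Hypothetical proxy handle: it stores the cache and the file name,
    // and fetches a live FILE* on every use.
    template <class Cache>  // Cache provides: FILE * get(std::string const &)
    class cached_file
    {
        Cache *     cache_;
        std::string name_;

    public:
        cached_file(Cache & c, std::string const & n)
            : cache_(&c), name_(n) {}

        // Reopens and repositions behind the scenes if evicted.
        FILE * get() const { return cache_->get(name_); }
    };

    // Overloading the stdio functions lets existing call sites compile
    // unchanged against the cached handle, as the fwrite() example above
    // shows.
    template <class Cache>
    std::size_t fwrite(void const * buf, std::size_t sz, std::size_t n,
                       cached_file<Cache> const & f)
    {
        return std::fwrite(buf, sz, n, f.get());
    }

    // Usage, with the lru_file_cache sketched earlier in the thread:
    //     lru_file_cache cache(256);
    //     cached_file<lru_file_cache> out(cache, "IBM.dat");
    //     fwrite(rec, sizeof rec, 1, out);  // reopens if evicted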