[interprocess] Reading huge files
Dear all,
I am new to boost memory mapping, so this question might look simplistic.
I need to read huge amounts of data (for instance, a 20GB file), and
since memory mapping is quite fast, I was going to use it. However, I
don't know what would be faster when, due to memory constraints, I
need to partition the file into regions. Moreover, I should treat the
file as a string (I need to perform string operations).
What I'm trying now is just to read the entire file:
boost::interprocess::file_mapping mmap(input_filename.c_str(),
    boost::interprocess::read_only);
boost::interprocess::mapped_region map(mmap,
    boost::interprocess::read_only);

std::size_t l = map.get_size(), tot_read = 0;
void *ptr = map.get_address();

while (tot_read < l)
{
    register std::size_t x = std::min(l - tot_read,
        static_cast<std::size_t>(prealloc));
    std::copy_n(static_cast<char*>(ptr) + tot_read, x, line.begin());
    // Do something here...
    tot_read += x;
}

So, when the file is huge, do I need to create a mapped_region inside the loop? I didn't see anywhere in the documentation the possibility to move the mapped region.
Another side-question, if you don't mind. I'm not sure that what I'm doing is efficient, especially the need to copy from the region to a string. If you have suggestions, I'm more than happy to hear these.
On 10/11/2013 02:55 PM, Sensei wrote:
Another side-question, if you don't mind. I'm not sure that what I'm doing is efficient, especially the need to copy from the region to a string. If you have suggestions, I'm more than happy to hear these.
You may consider using boost::interprocess::string to avoid the copying.
On 10/11/13 4:40pm, Bjorn Reese wrote:
On 10/11/2013 02:55 PM, Sensei wrote:
Another side-question, if you don't mind. I'm not sure that what I'm doing is efficient, especially the need to copy from the region to a string. If you have suggestions, I'm more than happy to hear these.
You may consider using boost::interprocess::string to avoid the copying.
Hi Bjorn, I've tried to understand how boost::interprocess::string might work in conjunction with an mmapped file, but I'm lost in the documentation. All I've found concerns shared memory between processes, and as far as I understand I don't really need shared memory, since all my processing will (for now) run in a single process; in the future I'll use threads, so even then I won't need shmem. Is there a document explaining how to construct a container (or better, a boost::interprocess::string) without shared memory? I'm not hopeful that a doc covering boost::interprocess::string with mapped_region exists :) Thanks!
On Fri, Oct 11, 2013 at 5:55 AM, Sensei wrote:
Dear all,
I am new to boost memory mapping, so this question might look simplistic.
I need to read huge amounts of data (for instance, a 20GB file), and since memory mapping is quite fast, I was going to use it. However, I don't know what would be faster when, due to memory constraints, I need to partition the file into regions. Moreover, I should treat the file as a string (I need to perform string operations).
What I'm trying now is just to read the entire file:
boost::interprocess::file_mapping mmap(input_filename.c_str(),
    boost::interprocess::read_only);
boost::interprocess::mapped_region map(mmap,
    boost::interprocess::read_only);

std::size_t l = map.get_size(), tot_read = 0;
void *ptr = map.get_address();

while (tot_read < l)
{
    register std::size_t x = std::min(l - tot_read,
        static_cast<std::size_t>(prealloc));
    std::copy_n(static_cast<char*>(ptr) + tot_read, x, line.begin());
    // Do something here...
    tot_read += x;
}
So, when the file is huge, do I need to create a mapped_region inside the loop? I didn't see anywhere in the documentation the possibility to move the mapped region.
If you're on a 64-bit system, you can simply mmap the entire file. There is no need to break the file into regions just because it's huge :) The OS will page the data in as required. On 32-bit, you do need to manage regions because you would otherwise exceed your address space. This might be kinda crappy as you'd ideally want to split your regions at EOL boundaries, and you need to parse your file before you know where these are. In practice, you'd be stuck worrying about straddling EOL, but hey, that's the price you pay if you want to run 32-bit code.
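For illustration, a minimal sketch of the whole-file approach described above (the file name and the process() function are placeholders; the string processing itself is left abstract):

// Sketch: map the entire file read-only on a 64-bit system and work on the
// bytes in place; the OS pages data in on demand, and nothing is copied.
#include <boost/interprocess/file_mapping.hpp>
#include <boost/interprocess/mapped_region.hpp>
#include <cstddef>

void process(const char *data, std::size_t size);   // your string logic here

void read_whole_file(const char *input_filename)
{
    boost::interprocess::file_mapping fm(input_filename,
        boost::interprocess::read_only);
    boost::interprocess::mapped_region region(fm,
        boost::interprocess::read_only);             // maps the whole file

    const char *data = static_cast<const char*>(region.get_address());
    std::size_t size = region.get_size();
    process(data, size);
}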
Another side-question, if you don't mind. I'm not sure that what I'm doing is efficient, especially the need to copy from the region to a string. If you have suggestions, I'm more than happy to hear these.
I would use boost's new string_ref instead of string. The obvious solution would be to use boost.tokenizer to break up the giant string into string_ref lines; however, I'm unsure that this is supported yet. An EOL tokenizer should be only a few lines of code though, and you could fairly trivially tokenize your string into string_refs. Brian
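For illustration, a rough sketch of the EOL tokenizer idea suggested above, producing boost::string_ref lines over the mapped bytes without copying (split_lines is a made-up helper; string_ref requires Boost 1.53 or later):

// Sketch: split the mapped bytes into lines as non-owning string_refs.
#include <boost/utility/string_ref.hpp>
#include <algorithm>
#include <vector>
#include <cstddef>

std::vector<boost::string_ref> split_lines(const char *data, std::size_t size)
{
    std::vector<boost::string_ref> lines;
    const char *end = data + size;
    while (data != end)
    {
        const char *eol = std::find(data, end, '\n');    // next line break
        lines.push_back(boost::string_ref(data, eol - data));
        data = (eol == end) ? end : eol + 1;             // skip the '\n'
    }
    return lines;
}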
On 10/11/13 7:14 PM, Brian Budge wrote:
If you're on a 64-bit system, you can simply mmap the entire file. There is no need to break the file into regions just because it's huge :) The OS will page the data in as required. On 32-bit, you do need to manage regions because you would otherwise exceed your address space. This might be kinda crappy as you'd ideally want to split your regions at EOL boundaries, and you need to parse your file before you know where these are. In practice, you'd be stuck worrying about straddling EOL, but hey, that's the price you pay if you want to run 32-bit code.
But on 32-bit systems I need to say "hey this program will go fubar as you load big files, use it at your own peril!" :)
Another side-question, if you don't mind. I'm not sure that what I'm doing is efficient, especially the need to copy from the region to a string. If you have suggestions, I'm more than happy to hear these.
I would use boost's new string_ref instead of string. The obvious solution would be to use boost.tokenizer to break up the giant string into string_ref lines; however, I'm unsure that this is supported yet. An EOL tokenizer should be only a few lines of code though, and you could fairly trivially tokenize your string into string_refs.
Awesome classes, I will try them! Thanks!
I need to read huge amounts of data (for instance, a 20GB file), and since memory mapping is quite fast, I was going to use it.
Illusion. The first time you access your data, you'll incur "soft" page faults. If your code has an initialization phase where it maps the memory, I advise you to read some of the data there (use a loop that reads one element every N elements). That way, when you later use the data where you really need it, access will be faster.
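For illustration, a sketch of that pre-touch idea applied to a mapped_region: reading one byte per page during initialization forces the soft page faults up front (prefault is a made-up helper; the page size is queried via mapped_region::get_page_size()):

// Sketch: touch one byte per page so the soft page faults happen during
// initialization rather than during the real processing.
#include <boost/interprocess/mapped_region.hpp>
#include <cstddef>

void prefault(const boost::interprocess::mapped_region &region)
{
    const std::size_t page =
        boost::interprocess::mapped_region::get_page_size();
    const volatile char *p =
        static_cast<const char*>(region.get_address());
    const std::size_t size = region.get_size();
    char sink = 0;
    for (std::size_t off = 0; off < size; off += page)
        sink += p[off];     // the read forces the page to be mapped in
    (void)sink;             // keep the loop from being optimised away
}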
participants (4)
- Bjorn Reese
- Brian Budge
- Oodini
- Sensei