On Fri, Oct 11, 2013 at 5:55 AM, Sensei wrote:
Dear all,
I am new to boost memory mapping, so this question might look simplistic.
I need to read huge amounts of data (for instance, a 20GB file), and since memory mapping is quite fast, I was going to use it. However, I don't know what would be faster when, due to memory constraints, I need to partition the file into regions. Moreover, I should treat the file as a string (I need to perform string operations).
What I'm trying now is just to read the entire file:
boost::interprocess::file_mapping mmap(input_filename.c_str(), boost::interprocess::read_only);
boost::interprocess::mapped_region map(mmap, boost::interprocess::read_only);
std::size_t l = map.get_size(), tot_read = 0;
void *ptr = map.get_address();
while (tot_read < l) {
    register std::size_t x = std::min(l - tot_read, static_cast<std::size_t>(prealloc));
    std::copy_n(static_cast<const char *>(ptr) + tot_read, x, line.begin());
    // Do something here...
    tot_read += x;
}
So, when the file is huge, do I need to create a mapped_region inside the loop? I didn't see anywhere in the documentation the possibility to move the mapped region.
If you're on a 64-bit system, you can simply mmap the entire file. There is no need to break the file into regions just because it's huge :) The OS will page the data in as required. On 32-bit, you do need to manage regions because you would otherwise exceed your address space. This might be kinda crappy as you'd ideally want to split your regions at EOL boundaries, and you need to parse your file before you know where these are. In practice, you'd be stuck worrying about straddling EOL, but hey, that's the price you pay if you want to run 32-bit code.
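For the 32-bit case, the straddling-EOL bookkeeping can be sketched roughly as below. This is only an illustration under some assumptions: each window would come from a fresh mapped_region built with an explicit offset and size (its constructor accepts both after the mode argument), LineAssembler is a hypothetical helper and not part of Boost, and std::string_view (C++17) is used so the sketch needs nothing beyond the standard library.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical helper: feed it one window of the file at a time.
// Complete lines are collected in `out`; an unterminated tail is kept
// in `carry` until the next window supplies the rest of the line.
struct LineAssembler {
    std::string carry;             // partial line left over from the previous window
    std::vector<std::string> out;  // completed lines (copied, since windows get unmapped)

    void feed(std::string_view window) {
        std::size_t begin = 0;
        for (;;) {
            std::size_t end = window.find('\n', begin);
            if (end == std::string_view::npos) break;
            carry.append(window.substr(begin, end - begin));
            out.push_back(carry);
            carry.clear();
            begin = end + 1;
        }
        carry.append(window.substr(begin));  // tail that has no newline yet
    }

    // Call once after the last window to flush a final line lacking '\n'.
    void finish() {
        if (!carry.empty()) {
            out.push_back(carry);
            carry.clear();
        }
    }
};
```

Lines are copied here because a window's memory goes away when its region is destroyed. On a 64-bit build with the whole file mapped, none of this is needed.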
Another side-question, if you don't mind. I'm not sure that what I'm doing is efficient, especially the need to copy from the region to a string. If you have suggestions, I'm more than happy to hear these.
I would use Boost's new string_ref instead of string. The obvious solution would be to use Boost.Tokenizer to break up the giant string into string_ref lines; however, I'm not sure this is supported yet. An EOL tokenizer should only be a few lines of code, though, and you could fairly trivially tokenize your string into string_refs.
Brian
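A minimal sketch of such an EOL tokenizer follows. It assumes C++17 and uses std::string_view as a stand-in for boost::string_ref (the two expose a very similar interface); split_lines is a hypothetical name, and stripping a trailing '\r' is an extra assumption for files with Windows line endings.

```cpp
#include <cstddef>
#include <string_view>
#include <vector>

// Split a character buffer into one view per line without copying any data.
// Each view points into the original buffer, so the buffer (for example the
// mapped region) must outlive the returned views.
std::vector<std::string_view> split_lines(std::string_view text) {
    std::vector<std::string_view> lines;
    std::size_t begin = 0;
    while (begin < text.size()) {
        std::size_t end = text.find('\n', begin);
        if (end == std::string_view::npos) end = text.size();  // last line, no '\n'
        std::string_view line = text.substr(begin, end - begin);
        if (!line.empty() && line.back() == '\r') line.remove_suffix(1);
        lines.push_back(line);
        begin = end + 1;
    }
    return lines;
}
```

With the whole file mapped, the input could be built directly from the region, along the lines of std::string_view(static_cast<const char *>(map.get_address()), map.get_size()), which avoids the copy into a string. For a 20GB file a vector holding every line may itself be large, so handing each line to a callback instead of storing it is a reasonable variation.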