
Dear All,

Following are some comments about Ion's proposed Boost.Containers. I have not really reviewed the code, but frankly that doesn't seem necessary: Ion knows what he's doing, much of this code has already been in Boost for a while as part of Interprocess, and much of it is designed to implement standard interfaces. I'll confine my comments to a few words about the documentation, and some observations about the "flat" containers.

Regarding the documentation, I believe it would benefit from the following:

- The "Acknowledgements" page is currently empty. There are various (c) notices in the source referring to sources from which parts of the code may have come. This page should clarify these sources, including any cases where the original code has now all been replaced, and assert that the licenses are all in order. It should also point out the history of the code in Boost, i.e. that parts were previously in Boost.Interprocess, so that anyone with previous exposure to the Interprocess containers is not confused.

- A few lines of rationale for stateful allocators would be useful, i.e. pointing out that although std::allocator has a typedef for pointer, the containers can assume it is T*, and the implications of that. (I dealt with all this back in about 2004, and it took me a long time to get up to speed with the issues.)

- Similarly, some rationale for string (pointing out how it differs from std::string in the common implementations) would be useful. (At the moment I can't even find where it says that it is non-COW; I know that I've seen that written somewhere. A page specifically about string, shown in the top-level table of contents, would be appropriate.)

- The Doxygen reference docs seem to jump to the header docs from the table of contents; I think it would be more useful to jump primarily to the classes. Is this the same issue that I saw in the last review?

Regarding the "flat" containers, I've previously posted about this in a couple of threads on this list. I have used this sort of structure in two applications recently (rough sketches of both follow below):

- In the first case, I store read-only (key,value) data in a binary file. I have a utility that prepares this file offline by inserting into a vector<pair<>>; I then sort it and save the contents of the vector to the file. In the program that uses the data, I memory-map the file and have an adaptor class that takes a pair of pointers to the begin and end of the mapped data and permits read-only access, like Ion's flat_multimap<>.

- In the second case, I have an insert-only key-value data structure. New data is inserted in batches: I have a vector<pair<>> and insert new data at the end; at the end of each batch, I sort the new values and merge them into the body of the vector using std::inplace_merge.
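To make those two patterns concrete, here are minimal sketches. All of the names (flat_map_view, commit_batch, etc.) are invented for illustration; they are not part of the proposed library, and this is only the general shape of what I do, not production code.

    #include <algorithm>
    #include <utility>

    // Sketch of the read-only adaptor over a sorted, contiguous range of
    // (key,value) pairs -- e.g. the contents of a memory-mapped file.
    template <typename Key, typename Value>
    class flat_map_view {
    public:
        typedef std::pair<Key,Value> value_type;

        flat_map_view(const value_type* begin, const value_type* end)
            : begin_(begin), end_(end) {}

        // Lookup is just a binary search over the mapped data.
        std::pair<const value_type*, const value_type*>
        equal_range(const Key& k) const {
            return std::equal_range(begin_, end_, k, key_compare());
        }

    private:
        struct key_compare {
            bool operator()(const value_type& a, const Key& b) const { return a.first < b; }
            bool operator()(const Key& a, const value_type& b) const { return a < b.first; }
        };
        const value_type* begin_;
        const value_type* end_;
    };

The second case is essentially just this (comparing whole pairs for simplicity, which is adequate for multimap-like data):

    #include <vector>
    #include <algorithm>
    #include <utility>
    #include <cstddef>

    // Sketch of the batched insert: new values have been appended at the
    // end, starting at index n_old; sort the new tail and fold it into
    // the already-sorted body.
    void commit_batch(std::vector<std::pair<int,int> >& v, std::size_t n_old) {
        std::sort(v.begin() + n_old, v.end());
        std::inplace_merge(v.begin(), v.begin() + n_old, v.end());
    }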
Neither of these cases can be implemented efficiently using the current containers. Doing so would need:

- Access to the underlying container, at least for reading.
- An "adaptor" version that could wrap a pair of pointers to raw memory, or more generally a pair of iterators, at least for reading.
- Some sort of efficient batched insert mode.

Regarding the batched insert, we have previously discussed the complexity of inserting a range. Fundamentally, however, I don't think it is acceptable to have efficient insertion ONLY in the case of the insert(begin,end) method: it is too common to want to insert several values in a loop without creating a temporary container. I'm not aware of any established (i.e. std:: or boost::) pattern for doing this; one possibility would be:

    class flat_container {
        void insert(val);  // appends at the end of the vector
        void commit();     // merges new values into the body of the vector
    };

- Between calling insert() and calling commit(), what do lookups do? Undefined? Only look at old values? Sort on demand?
- Is there also an "insert one" method? How are the variants named?
- What about erase()?
- How about a "batch" or "transaction" object? E.g.:

    flat_container c;
    {
        flat_container::insert_batch b(c);
        b.insert(data1);
        b.insert(data2);
    }  // b's dtor merges the new values

A further interesting case is when the new values are already sorted, because insertion can then be done more efficiently:

    void insert_sorted(iter first, iter last)
    {
        resize the vector to make space for the new items
        for each new item, in reverse order {
            find the position where it should be inserted, using binary chop
            move the old items after that point forward
            move the new item in behind them
        }
    }

I could imagine a general insert_batch and a specialised insert_sorted_batch, for example. (There is a rough sketch of the batch object in the P.S. below.)

Anyway, this is all just food for thought really. It would be interesting to hear from other people who have used data structures like this. In my own code I have taken a rather ad-hoc approach, but that is the nature of custom vs. library development: to be useful, a library needs to try to second-guess the requirements of all potential users, yet do so without becoming too complex for anyone to learn.

Hopefully there will be some more discussion about some of this, but as far as the review is concerned we should allow Ion to decide what he eventually commits.

Regards,

Phil.
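P.S. For concreteness, here is a minimal sketch of what that batch object might look like, assuming a plain std::vector<std::pair<>> as the underlying storage. The names and interface are invented for illustration only; they are not proposed as the library's actual API, and the obvious questions above (lookups during a batch, doing potentially-throwing work in a destructor) are simply ignored here.

    #include <vector>
    #include <algorithm>
    #include <utility>

    // Hypothetical RAII "insert batch": values are appended while the
    // batch is open, and merged into the sorted body when it is destroyed.
    template <typename Key, typename Value>
    class insert_batch {
    public:
        typedef std::vector<std::pair<Key,Value> > container_type;

        explicit insert_batch(container_type& c)
            : c_(c), old_size_(c.size()) {}

        void insert(const std::pair<Key,Value>& v) {
            c_.push_back(v);  // just append; defer the merge to the dtor
        }

        ~insert_batch() {
            // Sort the newly appended tail, then fold it into the body.
            std::sort(c_.begin() + old_size_, c_.end());
            std::inplace_merge(c_.begin(),
                               c_.begin() + old_size_,
                               c_.end());
        }

    private:
        container_type& c_;
        typename container_type::size_type old_size_;
    };

A specialised insert_sorted_batch would differ only in the destructor, which could use the backward-merge described above instead of sort + inplace_merge.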