[Containers] Review

Dear All, Following are some comments about Ion's proposed Boost.Containers. I have not really reviewed the code, but frankly that doesn't seem necessary: Ion knows what he's doing, much of this code has already been in Boost for a while as part of Interprocess, and much of it is designed to implement standard interfaces. I'll confine my comments to a few words about the documentation, and some observations about the "flat" containers. Regarding the documentation, I believe that it would benefit from: - The "Acknowledgements" page is currently empty. There are various (c) notices in the source referring to sources from which parts of the code may have come. This page should clarify these sources, including any cases where the original code has now all been replaced, and assert that the licenses are all in order. It should also point out the history of the code in Boost, i.e. that parts were previously in Boost.Interprocess, so that anyone with previous exposure to the Interprocess containers is not confused. - A few lines of rationale for stateful allocators would be useful, i.e. pointing out that although the std::allocator has a typedef for pointer the containers can assume it is T*, and the implications of that. (I dealt with all this back in about 2004, and it took me a long time to get up-to-speed with the issues.) - Similarly, some rationale for string (and pointing out how it differs from std::string in the common implementations) would be useful. (Now, I can't even find where it says that it is non-COW. I know that I've seen that written somewhere. A page specifically about string shown on the top-level table of contents would be appropriate.) - The Doxygen reference docs seem to jump to the header docs from the table of contents; I think it would be more useful to jump primarily to the classes. Is this the same issue that I saw in the last review? Regarding the "flat" containers, I've previously posted about this in a couple of threads on this list. I have used this sort of structure in two applications recently: - In the first case, I store read-only (key,value) data in a binary file. I have a utility that prepares this file offline by inserting into a vector<pair<>>. I then sort it, and save the contents of the vector in the file. In the program that uses the data, I memory-map the file. I then have an adaptor class that takes a pair of pointers to the begin and end of the mapped data and permits read-only access like Ion's flat_multimap<>. - In the second case, I have an insert-only key-value data structure. New data is inserted in batches. I have a vector<pair<>> and insert new data at the end; at the end of each batch, I sort the new values and merge them into the body of the vector using std::inplace_merge. Neither of these cases can be implemented efficiently using the current containers. Doing so would need: - Access to the underlying container, at least for reading. - An "adaptor" version that could wrap a pair of pointers to raw memory, or more generally a pair of iterators, at least for reading. - Some sort of efficient batched insert mode. Regarding the batched insert, we have previously discussed the complexity of inserting a range. Fundamentally, however, I don't think it is acceptable to have efficient insertion ONLY in the case of the insert(begin,end) method: it is too common to want to insert several values in a loop without creating a temporary container. I'm not aware of any established (i.e. std:: or boost::) pattern for doing this: class flat_container { void insert(val); // appends at the end of the vector void commit(); // merges new values into the body of the vector }; - Between calling insert() and calling sort(), what do lookups do? Undefined? Only look at old values? Sort-on-demand? - Is there also an "insert one" method? How are the variants named? - What about erase()? - How about a "batch" or "transaction" object?: flat_container c; { flat_container::insert_batch b(c); b.insert(data1); b.insert(data2); } // b's dtor merges new values. A further interesting case is when the new values are already sorted, because insertion can be more efficient in this case: void insert_sorted(iter first, iter last) { resize vector to make space for new items for each new item in reverse order { find the position where it should be inserted using binary chop move old items after that point forward move the new item in behind them } } I could imagine a general insert_batch and a specialised insert_sorted_batch, for example. Anyway, this is all just food for thought really. It would be interesting to hear from other people who have used data structures like this. In my own code I have taken a rather ad-hoc approach, but that is the nature of custom vs. library development: to be useful, a library needs to try to second-guess the requirements of all potential users, yet do so without becoming too complex for anyone to learn. Hopefully there will be some more discussion about some of this, but as far as the review is concerned we should allow Ion to decide what he eventually commits. Regards, Phil.

Regarding the batched insert, we have previously discussed the complexity of inserting a range. Fundamentally, however, I don't think it is acceptable to have efficient insertion ONLY in the case of the insert(begin,end) method: it is too common to want to insert several values in a loop without creating a temporary container. I'm not aware of any established (i.e. std:: or boost::) pattern for doing this:
There is no need to create a temporary container to use a method that takes a pair of iterators. With the help of the Boost.Range library and some lambdas (Boost.Phoenix or C++0x, depending on your taste and constraints), you should be able to create a lazy range view of wherever you're getting your data from in your loop. I do this fairly regularly in the code I write, and it has worked well for me. On occasion I need to write a custom iterator, but that is by far the exception and not the rule. There is also a Boost.Range extension library called P-Stade Oven [1] which provides many additional range algorithms and range adaptors; I've found these very helpful too. Regards, Nate [1] http://p-stade.sourceforge.net/oven/doc/html/index.html

Nathan Ridge wrote:
Regarding the batched insert, we have previously discussed the complexity of inserting a range. Fundamentally, however, I don't think it is acceptable to have efficient insertion ONLY in the case of the insert(begin,end) method: it is too common to want to insert several values in a loop without creating a temporary container. I'm not aware of any established (i.e. std:: or boost::) pattern for doing this:
There is no need to create a temporary container to use a method that takes a pair of iterators. With the help of the Boost.Range library and some lambdas (Boost.Phoenix or C++0x, depending on your taste and constraints), you should be able to create a lazy range view of wherever you're getting your data from in your loop.
Well, yes and no. If I show this code to a Java or C programmer, they can probably work out what it does very easily: int i = 0; while (i != 10) { container.insert(i); i = f(i); } How easily would they understand your version? There are also cases where it is fundamentally hard to write a range generator without using e.g. co-routines: void towers_of_hanoi(container& c, string from, string to, string tmp, int n) { if (n==1) { c.append(make_pair(from,to)); // "Move a disk from " from " to " to." } else { towers_of_hanoi(c,from,tmp,to,n-1); towers_of_hanoi(c,from,to,tmp,1); towers_of_hanoi(c,tmp,to,from,n-1); } } I really don't think it's acceptable to require users to use tricky and limited features like lazy range generators just to get efficient insertion into a container. Regards, Phil.

El 10/08/2011 23:26, Phil Endecott escribió:
Dear All,
Following are some comments about Ion's proposed Boost.Containers. I have not really reviewed the code, but frankly that doesn't seem necessary: Ion knows what he's doing, much of this code has already been in Boost for a while as part of Interprocess, and much of it is designed to implement standard interfaces. I'll confine my comments to a few words about the documentation, and some observations about the "flat" containers.
Thanks for your time.
Regarding the documentation, I believe that it would benefit from:
- The "Acknowledgements" page is currently empty. There are various (c) notices in the source referring to sources from which parts of the code may have come. This page should clarify these sources, including any cases where the original code has now all been replaced, and assert that the licenses are all in order. It should also point out the history of the code in Boost, i.e. that parts were previously in Boost.Interprocess, so that anyone with previous exposure to the Interprocess containers is not confused.
Ok.
- A few lines of rationale for stateful allocators would be useful, i.e. pointing out that although the std::allocator has a typedef for pointer the containers can assume it is T*, and the implications of that. (I dealt with all this back in about 2004, and it took me a long time to get up-to-speed with the issues.)
Ok, this already has been requested.
- Similarly, some rationale for string (and pointing out how it differs from std::string in the common implementations) would be useful. (Now, I can't even find where it says that it is non-COW. I know that I've seen that written somewhere. A page specifically about string shown on the top-level table of contents would be appropriate.)
Ok.
- The Doxygen reference docs seem to jump to the header docs from the table of contents; I think it would be more useful to jump primarily to the classes. Is this the same issue that I saw in the last review?
Doxygen reference is auto-generated by Quickbook, but I'll check what we need to do to improve this.
Regarding the "flat" containers, I've previously posted about this in a couple of threads on this list.
I have used this sort of structure in two applications recently:
- In the first case, I store read-only (key,value) data in a binary file. I have a utility that prepares this file offline by inserting into a vector<pair<>>. I then sort it, and save the contents of the vector in the file. In the program that uses the data, I memory-map the file. I then have an adaptor class that takes a pair of pointers to the begin and end of the mapped data and permits read-only access like Ion's flat_multimap<>.
- In the second case, I have an insert-only key-value data structure. New data is inserted in batches. I have a vector<pair<>> and insert new data at the end; at the end of each batch, I sort the new values and merge them into the body of the vector using std::inplace_merge.
Neither of these cases can be implemented efficiently using the current containers. Doing so would need:
- Access to the underlying container, at least for reading.
I think we can get something useful now that underlying implementations are movable or swappable. Something to extract the implementation, manipulate it and reinsert it (with some checks in debug mode about invariants).
- An "adaptor" version that could wrap a pair of pointers to raw memory, or more generally a pair of iterators, at least for reading.
We could write an adaptor taking a read-only random-access iterator range (boost::range) that could be used as implementation, only implementing read-only support, so that flat_xxx fails (static_asserts with "no permission to modify container: read only implementation") at compile time if you call non-const member functions.
- Some sort of efficient batched insert mode.
It would be nice if you could measure your inplace_merge strategy with insert(iterator begin, iterator end) for different input ranges and already inserted values.
Regarding the batched insert, we have previously discussed the complexity of inserting a range. Fundamentally, however, I don't think it is acceptable to have efficient insertion ONLY in the case of the insert(begin,end) method: it is too common to want to insert several values in a loop without creating a temporary container. I'm not aware of any established (i.e. std:: or boost::) pattern for doing this:
I agree we should find a solution. However, I think this shouldn't stop flat_xxx adoption, since it's already quite useful in its current form as a drop-in replacement for set/map. We can work in the mailing list until we find a satisfactory solution.
class flat_container { void insert(val); // appends at the end of the vector void commit(); // merges new values into the body of the vector };
- Between calling insert() and calling sort(), what do lookups do? Undefined? Only look at old values? Sort-on-demand? - Is there also an "insert one" method? How are the variants named? - What about erase()? - How about a "batch" or "transaction" object?:
flat_container c; { flat_container::insert_batch b(c); b.insert(data1); b.insert(data2); } // b's dtor merges new values.
Sounds very interesting, insert_batch could reuse memory from the container, we could even "suggest" the insertion range in the constructor (so that we can "reserve" in the flat container). each insertion could write at the end of the container ("reserved" memory) destroying already constructed objects in case reserve fails and we have already inserted some values (we could also "move" them, right?). In the destructor we can just use your inplace_merge algorithm, if that's more efficient that current insert loop.
A further interesting case is when the new values are already sorted, because insertion can be more efficient in this case:
So similar to constructors taking already sorted ranges, right? We can also add these "extensions" to map/set, although I'm not an expert in trees so we'd need to rebalance the tree for each insertion. But I guess that with help from boosters we can find an efficient batch insertion for tree-like containers ;-) Thanks for your valuable input! Ion

2011/8/11 Ion Gaztañaga <igaztanaga@gmail.com>:
- The Doxygen reference docs seem to jump to the header docs from the table of contents; I think it would be more useful to jump primarily to the classes. Is this the same issue that I saw in the last review?
Doxygen reference is auto-generated by Quickbook, but I'll check what we need to do to improve this.
It's actually generated by Boostbook (the most important distinction being that it's implemented in XSL). One thing you could try is using autoindex to generate a class index. Another possibility is to use the doxygen to quickbook converter developed for the geometry documentation, which you can read about at: http://www.boost.org/doc/libs/1_47_0/libs/geometry/doc/html/geometry/aboutdo... I haven't tried it myself, but feedback seems good so far. It isn't integrated into the build system, so it might need a bit of work to use it.

El 11/08/2011 9:43, Daniel James escribió:
It's actually generated by Boostbook (the most important distinction being that it's implemented in XSL). One thing you could try is using autoindex to generate a class index. Another possibility is to use the doxygen to quickbook converter developed for the geometry documentation, which you can read about at:
http://www.boost.org/doc/libs/1_47_0/libs/geometry/doc/html/geometry/aboutdo...
I haven't tried it myself, but feedback seems good so far. It isn't integrated into the build system, so it might need a bit of work to use it.
Thanks Daniel, I will have a look to them. Ion

Phil Endecott wrote:
- How about a "batch" or "transaction" object?:
flat_container c; { flat_container::insert_batch b(c); b.insert(data1); b.insert(data2); } // b's dtor merges new values.
That's an interesting idea, but aren't exceptions possible in merging the new values? If so, you don't want the dtor doing the work since it can't emit an exception and may not be able to undo the work it began. Also, if flat_container::insert_batch::insert() or the constructor do anything to c, the dtor should be able to revert those changes unless the merge occurs. Here's another approach: flat_container c; { flat_container::insert_batch b(c); b.insert(data1); b.insert(data2); b.merge(); } That allows the dtor to clean up should an exception occur within the compound statement, including from merge(). Of course it requires explicitly asking that the data be merged rather than doing so automatically. _____ Rob Stewart robert.stewart@sig.com Software Engineer using std::disclaimer; Dev Tools & Components Susquehanna International Group, LLP http://www.sig.com ________________________________ IMPORTANT: The information contained in this email and/or its attachments is confidential. If you are not the intended recipient, please notify the sender immediately by reply and immediately delete this message and all its attachments. Any review, use, reproduction, disclosure or dissemination of this message or any attachment by an unintended recipient is strictly prohibited. Neither this message nor any attachment is intended as or should be construed as an offer, solicitation or recommendation to buy or sell any security or other financial instrument. Neither the sender, his or her employer nor any of their respective affiliates makes any warranties as to the completeness or accuracy of any of the information contained herein or that this message or any of its attachments is free of viruses.

El 11/08/2011 13:16, Stewart, Robert escribió:
Here's another approach:
flat_container c; { flat_container::insert_batch b(c); b.insert(data1); b.insert(data2); b.merge(); }
That's the idea I had, some "commit", otherwise the destructor can't know if it's being called because an exception has occurred. The destructor can rollback actions if merge or commit has not been requested. Best, Ion

-----Mensaje original----- De: boost-bounces@lists.boost.org [mailto:boost-bounces@lists.boost.org] En nombre de Ion Gaztañaga Enviado el: viernes, 12 de agosto de 2011 11:58 Para: boost@lists.boost.org Asunto: Re: [boost] [Containers] Review
El 11/08/2011 13:16, Stewart, Robert escribió:
Here's another approach:
flat_container c; { flat_container::insert_batch b(c); b.insert(data1); b.insert(data2); b.merge(); }
It might be more convenient to use operator() rather than insert() for adding items to the batch, so as to allow for easy interoperability with for_each, for instance. Joaquín M López Muñoz Telefónica, Investigación y Desarrollo Este mensaje se dirige exclusivamente a su destinatario. Puede consultar nuestra política de envío y recepción de correo electrónico en el enlace situado más abajo. This message is intended exclusively for its addressee. We only send and receive email on the basis of the terms set out at. http://www.tid.es/ES/PAGINAS/disclaimer.aspx
participants (6)
-
Daniel James
-
Ion Gaztañaga
-
JOAQUIN M. LOPEZ MUÑOZ
-
Nathan Ridge
-
Phil Endecott
-
Stewart, Robert