[allocators] Article and library about allocator improvements

Hi,

While developing Interprocess, I started to think about optimizations for STL-like allocators. I even wrote a proposal about some new features for standard containers (N2045, http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2006/n2045.html) that was not received with much interest by the committee. I refined those improvements for shared memory containers and measured the performance benefits when containers take advantage of more advanced allocators.

In the last few months I've taken another step forward and developed a small Boost library based on modified DLmalloc code. I've built some allocators on top of the modified DLmalloc and measured the performance benefits that can be obtained with optimized allocators offering new features like buffer expansion, burst allocation, etc. Tests were performed using Interprocess containers, but instead of shared memory allocators I compared DLmalloc-based heap allocators against the standard std::allocator<>.

I've written a long article with my explanations and conclusions:

http://www.drivehq.com/web/igaztanaga/allocplus/

The library (which includes the article) can be downloaded here:

http://www.drivehq.com/web/igaztanaga/allocplus.zip

The library has no docs (the article explains all the basics) and surely does not meet Boost standards. The only goal of this article, the library and this post is to share information with other Boost container authors and discuss possible improvements that could perhaps be applied to Boost containers. If those improvements are considered interesting, it would be great if we could write a library with allocators and develop some "existing practice" with Boost containers before writing a future proposal to improve allocators.

I plan to publish the article somewhere (supposing someone is crazy enough to publish it ;-) ) but first I wanted to share this with boosters to try to improve the article/library and find out whether developers are interested in allocator issues.
Comments are welcome! Regards, Ion

Ion Gaztañaga wrote:
I plan to publish the article somewhere (supposing someone is crazy enough to publish it ;-) ) but first I wanted to share this with boosters to try to improve the article/library and know if developers are interested in allocator issues. Comments are welcome!
Hi Ion, Are you aware of the "scoped allocator" support that has recently been accepted into the WP? -- Dave Abrahams BoostPro Computing http://www.boostpro.com

David Abrahams wrote:
Ion Gaztañaga wrote:
I plan to publish the article somewhere (supposing someone is crazy enough to publish it ;-) ) but first I wanted to share this with boosters to try to improve the article/library and know if developers are interested in allocator issues. Comments are welcome!
Hi Ion,
Are you aware of the "scoped allocator" support that has recently been accepted into the WP?
Yes, thanks. I participated in some discussions with Pablo Halpern and John Lakos to make scoped allocators work with non-scoped ones (basically shared memory). Scoped allocators are mainly related to allocator propagation and polymorphic allocators (plus the machinery to implement this). My proposal has 3 different parts:

1) Realloc (expand in place) and shrink-to-fit for std::vector. This means reusing the old buffer (and avoiding moving or copying data) if the allocator supports this.

2) Burst allocation: some node containers (list, map...) just implement range insertion with a loop:

for(/*every input element*/){ this->insert(pos, *inpit); }

The idea here is to make a single call to the allocator to allocate N nodes. The allocator can allocate a single big chunk (thus avoiding N lookups) and divide the big chunk into individually deallocatable nodes.

3) Adaptive pools: some allocators (like Boost.Pool) use simple segregated storage to minimize node overhead. Sadly, they don't implicitly return memory to the general-purpose allocator because the operation is costly (Boost.Pool offers a purge() or release_memory() function to explicitly trim allocated nodes). Adaptive pools are similar to simple segregated storage allocators, but they use alignment to automatically deallocate memory while maintaining zero per-node payload. These allocators don't need any allocator extension, but they work extremely well with burst allocation because we can simplify bookkeeping operations that have some impact when using a single-node allocation scheme.

In other words, Scoped Allocators and my proposal are completely complementary. Regards, Ion
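[Editorial note: the burst-allocation idea in point 2 can be sketched in isolation. The sketch below is not the library's actual interface (the names burst_pool, allocate_many, chunk_header are invented for illustration), and unlike the adaptive pools described above it pays a per-node back-pointer; the article's alignment trick is precisely what removes that cost. It only shows how one big request can be carved into individually deallocatable nodes.]

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>
#include <vector>

// Sketch of burst allocation: N node requests served by one big malloc.
// The chunk keeps a live-node count so individual deallocations remain
// possible; the chunk goes back to the heap when the count hits zero.
class burst_pool {
    struct chunk_header { std::size_t live_nodes; };

    std::size_t node_size_;

    // Per-node slot: back-pointer to the chunk header + user payload,
    // rounded up so every slot stays suitably aligned.
    std::size_t slot_size() const {
        const std::size_t a = alignof(std::max_align_t);
        return (sizeof(chunk_header*) + node_size_ + a - 1) / a * a;
    }

public:
    explicit burst_pool(std::size_t node_size) : node_size_(node_size) {}

    // One underlying allocation instead of n separate heap lookups.
    std::vector<void*> allocate_many(std::size_t n) {
        char* raw = static_cast<char*>(
            std::malloc(sizeof(chunk_header) + n * slot_size()));
        if (!raw) throw std::bad_alloc();
        chunk_header* hdr = new (raw) chunk_header{n};
        char* slots = raw + sizeof(chunk_header);
        std::vector<void*> nodes;
        nodes.reserve(n);
        for (std::size_t i = 0; i != n; ++i) {
            char* slot = slots + i * slot_size();
            *reinterpret_cast<chunk_header**>(slot) = hdr;  // back-pointer
            nodes.push_back(slot + sizeof(chunk_header*));
        }
        return nodes;
    }

    // Individual deallocation: decrement the chunk's live count and
    // free the whole chunk once every node has been returned.
    void deallocate(void* node) {
        char* slot = static_cast<char*>(node) - sizeof(chunk_header*);
        chunk_header* hdr = *reinterpret_cast<chunk_header**>(slot);
        if (--hdr->live_nodes == 0)
            std::free(hdr);
    }
};
```

A range insertion in a node container would then call allocate_many(N) once, link the returned nodes in, and still be free to erase elements one by one later.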

Ion Gaztañaga wrote:
> In other words, Scoped Allocators and my proposal are completely complementary

To clarify, "Scoped Allocators" is somewhat of a misnomer. It does not really address allocator implementation. To presume to speak for the authors (and I had years of collaboration with them when we worked together), the scoped allocator proposal intended two things:

1. The type of something does not change according to what "policy" is used to implement it - especially allocators. In particular, one should not be able to programmatically determine differences between two types outside of public accessors, such that operator== is equivalent to the logical conjunction of all public members (and accessors) compared for equality. Furthermore, the equality test is the sole determiner of substitutability (this was later "relaxed" to refer only to "salient" properties of a type). Also, any identical sequence of modifications applied to two "equal" instances should end up "equal" (in other words, the serializable property).

2. The allocator in a container can be thought of as modeling a storage class. So if I have a container that is a member of a struct, and that struct is instanced in storage class A (heap, stack, etc.), then each member should also get its memory from that same region. This is what the "scoped" part means. To ostensibly enhance performance (by increasing locality of reference, perhaps), each object should be able to take its own allocator (arena, region, whatever) and pass it along to its members. So instead of "class specific" allocators we should have "object specific" allocators (hence the runtime polymorphism). (See Large Scale C++, chapter 10, for a sketch of this.)

In summary, the approach is a generalization of how C types interact with the UNIX process memory model in a sequential program. For instance, in C operator== for a struct is implemented as a bitwise compare, and this has no dependence on which storage class is used to hold the "value."
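[Editorial note: the "object specific" allocator propagation described above eventually landed in the standard; a minimal illustration, written with the C++17 std::pmr descendants of that design rather than anything from this 2008 thread, follows. The function name make_tagged is invented for the example.]

```cpp
#include <memory_resource>
#include <string>
#include <vector>

// A pmr::vector and the pmr::strings inside it all draw from the same
// memory resource: elements constructed inside the vector automatically
// receive the vector's allocator (uses-allocator construction), so the
// allocator is per-object, not baked into the element's type.
std::pmr::vector<std::pmr::string>
make_tagged(std::pmr::memory_resource* arena) {
    std::pmr::vector<std::pmr::string> v(arena);
    // Long enough to defeat the small-string optimization, forcing a
    // real allocation from the arena.
    v.emplace_back("a string long enough to defeat the small-string optimization");
    return v;
}
```

Note that nothing in the element type names the arena; the same std::pmr::string works on the heap, on the stack via a monotonic buffer, or anywhere else — which is exactly the "storage class" view described above.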
So one could argue (as the authors do) that policies that change the type (as in policies that are template parameters) are counterintuitive, and even bad design. Why? Because for otherwise "equivalent" things, operator== is inaccessible because the types do not allow it -- and policies do not change the "value" of a type, vis-à-vis what the type represents in the domain model (think OO programming). Indeed, for an allocator to be "scopeable" it cannot know anything about the type it allocates. Therefore, it has a particularly hard time taking advantage of any optimizations that are possible when you do know the type. For example, it would be nearly impossible to make a burst allocator scopeable, and pretty hard to make a shared memory allocator scopeable. What did eventually make it into the WP was a design that allowed all kinds of options for how allocators are propagated via copy constructors, move constructors, and assignments.

Lance

Hi Ion! On Mon, Jun 23, 2008 at 2:13 AM, Ion Gaztañaga <igaztanaga@gmail.com> wrote:
The last few months, I've taken another step forward and I've developed a small Boost library based on a modified DLmalloc code. I've developed some allocators above the modified DLmalloc and I've measured the performance benefits that can be obtained with optimized allocators that offer some new features like buffer expansion, burst allocation, etc... Test were performed using Interprocess containers, but instead of shared memory allocators, I've compared DLmalloc based heap allocators against the standard std::allocator<>.
The DLmalloc implementation you compared against was written in C, which you adapted to C++, correct? I for one have been looking for a better alternative to both the standard allocator and the Boost.Pool allocator. A DLmalloc-based implementation would be interesting to see, if not as a Boost-provided/included implementation then one day as an alternative standard allocator implementation.
I've written a long article with my explanations and conclusions:
Wow, it is long and very well written. I do have some questions off the top of my head:

- Have you tested how your implementation performs on multiple threads?
- Have you tried measuring the direct effect of random-sized, random-timed, random-ordered allocations/deallocations?
- I notice that you were using VMware; admittedly, running Linux as a guest in a VMware instance already causes performance degradation. Have you considered using something else that takes better advantage of processor virtualization features (like VirtualBox)? Or better yet, have you tried running it on a native Linux installation?
The library (includes the article) can be downloaded here:
http://www.drivehq.com/web/igaztanaga/allocplus.zip
The library has no docs (the article explains all the basics) and does not surely meet Boost standards. The only goal of this article, the library and this post is to share information with other Boost container authors and discuss possible improvements that maybe could be applied to Boost containers. If those improvements are considered interesting, it would be great if we could write a library with allocators and develop some "existing practice" with boost containers before writing a future proposal to improve allocators.
I plan to publish the article somewhere (supposing someone is crazy enough to publish it ;-) ) but first I wanted to share this with boosters to try to improve the article/library and know if developers are interested in allocator issues. Comments are welcome!
I'm looking forward to the answers to my questions and the further improvement of this article. Thanks very much for sharing this! -- Dean Michael C. Berris Software Engineer, Friendster, Inc.

Dean Michael Berris wrote:
The DLmalloc implementation you compared against was written in C which you adapted to C++, correct?
I haven't really touched the DLmalloc source. I've taken the 2.8.4 version of DLmalloc (dlmalloc_pre_2_8_4.c in the archive) and included that C file in another file, dlmalloc_ext.c, which implements some new functions (just the ones going to the global heap; I haven't implemented mspaces) for expand in place, burst allocation and burst deallocation, using some internal structures and functions of DLmalloc. The "modified" DLmalloc (dlmalloc_ext.c) is (or should be) compilable with a C compiler and has a C interface through functions and macros, so a C programmer could take advantage of burst allocation. The extension is not clean code, but digging into DLmalloc internals is not easy. Then I just built STL-like allocators wrapping those functions. See the http://www.drivehq.com/web/igaztanaga/allocplus.zip source for more details.
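[Editorial note: the "STL-like allocators wrapping those functions" step can be sketched as below. The real entry points live in dlmalloc_ext.c in the downloadable source; here std::malloc / std::free stand in for them, and the name dl_allocator is invented for the example.]

```cpp
#include <cassert>
#include <cstdlib>
#include <limits>
#include <new>
#include <vector>

// Minimal STL-conforming allocator forwarding to C-level heap
// functions, as one would do to expose a C heap to std:: containers.
template <class T>
class dl_allocator {
public:
    using value_type = T;

    dl_allocator() noexcept = default;
    template <class U>
    dl_allocator(const dl_allocator<U>&) noexcept {}

    T* allocate(std::size_t n) {
        if (n > std::numeric_limits<std::size_t>::max() / sizeof(T))
            throw std::bad_alloc();
        if (void* p = std::malloc(n * sizeof(T)))  // -> dlmalloc entry point
            return static_cast<T*>(p);
        throw std::bad_alloc();
    }
    void deallocate(T* p, std::size_t) noexcept {
        std::free(p);  // -> dlmalloc entry point
    }
};

// Stateless: any two instances are interchangeable.
template <class T, class U>
bool operator==(const dl_allocator<T>&, const dl_allocator<U>&) noexcept { return true; }
template <class T, class U>
bool operator!=(const dl_allocator<T>&, const dl_allocator<U>&) noexcept { return false; }
```

Usage is then just std::vector<int, dl_allocator<int>>; the extension functions (expand in place, burst allocation) would be exposed as additional member functions beyond this minimal interface.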
I for one have been looking for a better alternative to both the standard allocator and the Boost.Pool allocator. A DLmalloc based implementation would be interesting to see if not as a Boost-provided/included implementation or one day a standard allocator alternative implementation.
I think you will find adaptive allocators very interesting. You also have a Boost.Pool-like allocator (simple segregated storage, supporting burst allocation) in the library. It was used to test adaptive pools against classic node allocators.
- have you tested how your implementation performs on multiple threads?
No. I've tested the library *without* locks and burst allocation is still faster (this means that the speedup does not come from lock minimization alone). I think this means that multithreaded allocators based on DLmalloc, like ptmalloc or nedmalloc, could also benefit from these features. The idea comes from Interprocess, where one just can't have per-thread pools. That's why I implemented some classic strategies: buffer reuse (realloc) and packed allocations (burst). These mechanisms are easy to implement and understand, and offer real speed benefits for many applications. Not every application is heavily multi-threaded.
- have you tried measuring the direct effect of random-sized, random-timed, random-ordered allocations/deallocations?
No, I just ran out of time ;-). If anyone tests this, I would be glad.
- I notice that you were using VMware; admittedly, running Linux as a guest in a VMware instance already causes performance degradation. Have you considered using something else that takes better advantage of processor virtualization features (like VirtualBox)? Or better yet, have you tried running it on a native Linux installation?
Last time GRUB broke my startup configuration I decided to erase my Linux partition ;-). I know there is some performance impact, but it should affect both the standard allocator and DLmalloc.
I'm looking forward to the answers to my questions and the further improvement of this article. Thanks very much for sharing this!
After so much work implementing and testing, I needed to share this, if only to feel it was worth the effort. It has already been useful for improving Interprocess containers, but if this proves useful with heap allocators too, that would be nice. Regards, Ion

On Sun, Jun 22, 2008 at 11:13 AM, Ion Gaztañaga <igaztanaga@gmail.com> wrote:
Hi,
While developing Interprocess, I started to think about optimizations for STL-like allocators. I even wrote a proposal about some new features for standard containers (N2045, http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2006/n2045.html) that was not received with much interest by the committee.
I refined those improvements for shared memory containers and I measured performance benefits when containers take advantage of more advanced allocators.
In the last few months I've taken another step forward and developed a small Boost library based on modified DLmalloc code. I've built some allocators on top of the modified DLmalloc and measured the performance benefits that can be obtained with optimized allocators offering new features like buffer expansion, burst allocation, etc. Tests were performed using Interprocess containers, but instead of shared memory allocators I compared DLmalloc-based heap allocators against the standard std::allocator<>.
I've written a long article with my explanations and conclusions:
http://www.drivehq.com/web/igaztanaga/allocplus/
The library (includes the article) can be downloaded here:
http://www.drivehq.com/web/igaztanaga/allocplus.zip
The library has no docs (the article explains all the basics) and surely does not meet Boost standards. The only goal of this article, the library and this post is to share information with other Boost container authors and discuss possible improvements that could perhaps be applied to Boost containers. If those improvements are considered interesting, it would be great if we could write a library with allocators and develop some "existing practice" with Boost containers before writing a future proposal to improve allocators.
I plan to publish the article somewhere (supposing someone is crazy enough to publish it ;-) ) but first I wanted to share this with boosters to try to improve the article/library and know if developers are interested in allocator issues. Comments are welcome!
Looks very nice! I have some code that could make good use of this - I'll be playing around with it for the next few days to see how it goes. One thing I'd be interested in is a zero-overhead scoped arena. Sometimes you just need a read-only collection and don't want a pointer to a common arena (the overhead) being stored in all the dynamically sized objects. I've been using Intrusive for this purpose, but it doesn't call destructors on its own and there's no vector that lets you do one-time sizing. -- Cory Nelson

Cory Nelson wrote:
Looks very nice! I have some code that could make good use of this - I'll be playing around with it for the next few days to see how it goes.
Thanks!
One thing I'd be interested in is a zero-overhead scoped arena. Sometimes you just need a read-only collection and don't want a pointer to a common arena (the overhead) being stored in all the dynamically sized objects. I've been using Intrusive for this purpose, but it doesn't call destructors on its own and there's no vector that lets you do one-time sizing.
I don't fully understand what you mean. Are you talking about an allocator that just allocates some big chunks of memory and increments a pointer to allocate memory (deallocation being a no-op)? Could you elaborate a bit more? Regards, Ion

On Mon, Jun 23, 2008 at 9:35 AM, Ion Gaztañaga <igaztanaga@gmail.com> wrote:
Cory Nelson wrote:
Looks very nice! I have some code that could make good use of this - I'll be playing around with it for the next few days to see how it goes.
Thanks!
One thing I'd be interested in is a zero-overhead scoped arena. Sometimes you just need a read-only collection and don't want a pointer to a common arena (the overhead) being stored in all the dynamically sized objects. I've been using Intrusive for this purpose, but it doesn't call destructors on its own and there's no vector that lets you do one-time sizing.
I don't fully understand what you mean. Are you talking about an allocator that just allocates some big chunks of memory and increases a pointer to allocate memory (deallocation is no-op)? Could you elaborate a bit more?
Say you have a large structure that you only need to build once, and won't modify after that except to delete it. Ideally:

- One arena (what you described) could service all the objects in the entire structure. This gives you high locality and no bookkeeping overhead.
- The objects wouldn't hold an allocator (a reference to the arena). They only need to be built once, so keeping a reference around for future allocations is needless overhead.
- The arena would not be in global/static storage.
- The objects would all still have their destructors called.

This is impossible with std:: containers. I've been using C strings, wrappers around Intrusive (to call destructors), and a tr1::array-ish class that lets me set pointer & size. A lot of the time the overhead of std:: containers is acceptable for the convenience they give, but in my situation (Windows Mobile, where I need to be very tight on usage) it isn't. I think the problem is fairly common and even desktop apps would benefit from such a design IF it were made convenient. I admit I can't think of a convenient way to do this (though Intrusive comes quite close). -- Cory Nelson
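[Editorial note: a bump arena covering the first, third and fourth points above can be sketched as follows. The names arena and arena_allocator are invented for the example. It deliberately falls short of the second point: each container still carries one arena pointer, which is exactly the overhead being discussed.]

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Bump allocator: memory is handed out from big blocks by advancing an
// offset; individual deallocation is a no-op and everything is released
// when the arena is destroyed. The arena itself can live on the stack.
class arena {
public:
    explicit arena(std::size_t block_size = 4096) : block_size_(block_size) {}
    ~arena() { for (char* b : blocks_) delete[] b; }
    arena(const arena&) = delete;
    arena& operator=(const arena&) = delete;

    void* allocate(std::size_t n, std::size_t align) {
        std::size_t p = (used_ + align - 1) & ~(align - 1);
        if (blocks_.empty() || p + n > block_size_) {
            blocks_.push_back(new char[block_size_ < n ? n : block_size_]);
            used_ = 0;
            p = 0;
        }
        used_ = p + n;
        return blocks_.back() + p;
    }
    void deallocate(void*, std::size_t) noexcept {}  // deliberately a no-op

private:
    std::size_t block_size_, used_ = 0;
    std::vector<char*> blocks_;
};

// STL shim; destructors still run because the containers run them --
// only the raw memory reclamation is deferred to the arena.
template <class T>
struct arena_allocator {
    using value_type = T;
    explicit arena_allocator(arena& a) noexcept : a_(&a) {}
    template <class U>
    arena_allocator(const arena_allocator<U>& o) noexcept : a_(o.a_) {}
    T* allocate(std::size_t n) {
        return static_cast<T*>(a_->allocate(n * sizeof(T), alignof(T)));
    }
    void deallocate(T* p, std::size_t n) noexcept {
        a_->deallocate(p, n * sizeof(T));
    }
    arena* a_;
};
template <class T, class U>
bool operator==(const arena_allocator<T>& x, const arena_allocator<U>& y) noexcept { return x.a_ == y.a_; }
template <class T, class U>
bool operator!=(const arena_allocator<T>& x, const arena_allocator<U>& y) noexcept { return x.a_ != y.a_; }
```

Dropping the per-container arena pointer while still having destructors called is the part no std:: container can express, which is the open problem in this subthread.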

Cory Nelson wrote:
Say you have a large structure that you only need to build once, and won't modify after that except to delete it. Ideally:
- One arena (what you described) could service all the objects in the entire structure. This gives you high locality and no bookkeeping overhead.
- The objects wouldn't hold an allocator (reference to the arena). They only need to be built once, so keeping a reference around for future allocations is needless overhead.
- The arena would not be in global/static storage.
- The objects would all still have their destructors called.
This is impossible with std:: containers. I've been using C strings, wrappers around intrusive (to call destructors), and a tr1::array-ish class that lets me set pointer & size.
A lot of the time the overhead of std:: containers is acceptable for the convenience they give, but in my situation (Windows Mobile, need to be very tight on usage) it isn't.
I think the problem is fairly common and even desktop apps would benefit from such a design IF it were made convenient. I admit I can't think of a convenient way to do this (though Intrusive comes quite close).
It does not sound easy to solve, especially if you don't want the arena to be global/static. Interesting, though. Regards, Ion

On Sun, Jun 22, 2008 at 11:13 AM, Ion Gaztañaga <igaztanaga@gmail.com> wrote:
Hi,
While developing of Interprocess, I started to think about optimizations for STL-like allocators. I even wrote a proposal about some new features for standard containers (N2045, http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2006/n2045.html) that was not received with much interest by the committee.
.. snip...
The library (includes the article) can be downloaded here:
I was just about to try it out, but the zip doesn't seem to include the headers - the boost/ folder appears the same as the libs/ folder? -- Cory Nelson
participants (5): Cory Nelson, David Abrahams, Dean Michael Berris, Ion Gaztañaga, Lance.Diduck@ubs.com