[gsoc16] Re: Potential contribution scenarios?
It is not hard to find either myself or Antony. Both of us are here, and both of us are at the top of Google search results for our names.
Probably my bad; I only came across LinkedIn references. I'll wait for a reply from Antony as well, thanks, and in the meantime I'll explore the other ideas.
1. A thing Boost really could do with is a decent (optionally mmappable) dense hash map implementation, so the hash map could contain billions of items and mostly reside on disk yet lookups would still be very fast.
I have some questions regarding this first idea. I believe a concept and implementation already exist and are actually in use at Google:

http://goog-sparsehash.sourceforge.net/doc/dense_hash_map.html

Quoting from that page:

"dense_hash_map is distinguished from other hash-map implementations by its speed and by the ability to save and restore contents to disk. On the other hand, this hash-map implementation can use significantly more space than other hash-map implementations, and it also has requirements -- for instance, for a distinguished "empty key" -- that may not be easy for all applications to satisfy. This class is appropriate for applications that need speedy access to relatively small "dictionaries" stored in memory, and for applications that need these dictionaries to be persistent."

So what you want is something like this, but able to "contain billions of items and mostly reside on disk yet lookups would still be very fast"?

I assume the "billions of items" is what is lacking from that dense_hash_map already in use by Google? Is this the difference between that and your idea?

Assuming this is indeed what you were proposing, would the work encompass defining an API and also exploring mmap for lazy iteration, for performance gains in the context of handling billions of items?

Kind regards,
Miguel E. Coimbra
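To make the mmap part of the question concrete, here is a rough illustrative sketch (my own, not any existing Boost or sparsehash API; the names `Slot`, `map_slots`, and `probe` are invented) of probing fixed-size slots directly in an mmapped file. The point is that the kernel pages slots in on demand, so a lookup over a file far larger than RAM only faults in the pages its probe sequence actually touches:

```cpp
// Illustrative sketch only: Slot, map_slots and probe are hypothetical
// names, not part of any existing library. POSIX-only (open/mmap).
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

struct Slot {
  std::uint64_t key;    // 0 serves as the distinguished "empty key"
  std::uint64_t value;
};

constexpr std::size_t kSlots = 1024;  // power of two, so we can mask

// Create (or truncate) a file big enough for kSlots slots and map it.
// ftruncate() gives a zero-filled file, so every slot starts "empty".
inline Slot *map_slots(const char *path) {
  int fd = ::open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
  if (fd < 0) return nullptr;
  if (::ftruncate(fd, kSlots * sizeof(Slot)) != 0) { ::close(fd); return nullptr; }
  void *p = ::mmap(nullptr, kSlots * sizeof(Slot),
                   PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
  ::close(fd);  // the mapping keeps the file contents accessible
  return p == MAP_FAILED ? nullptr : static_cast<Slot *>(p);
}

// Linear probing; identical whether the slot array is heap- or file-backed.
inline Slot *probe(Slot *slots, std::uint64_t key) {
  for (std::size_t i = 0; i < kSlots; ++i) {
    Slot *s = &slots[(key + i) & (kSlots - 1)];
    if (s->key == key || s->key == 0) return s;
  }
  return nullptr;  // table full
}
```

A real design would of course need growth/rehash, a persistent header, and care about endianness and torn writes; this only shows why open addressing composes so naturally with memory mapping.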
On 5 Jan 2016 at 17:52, Miguel Coimbra wrote:
1. A thing Boost really could do with is a decent (optionally mmappable) dense hash map implementation, so the hash map could contain billions of items and mostly reside on disk yet lookups would still be very fast.
I have some questions regarding this first idea. I believe a concept and implementation already exist and are actually in use at Google:
http://goog-sparsehash.sourceforge.net/doc/dense_hash_map.html
[snip]
So what you want is something like this, but able to "contain billions of items and mostly reside on disk yet lookups would still be very fast"?
I assume the "billions of items" is what is lacking from that dense_hash_map already in use by Google? Is this the difference between that and your idea?
Not exactly. Google's dense hash map just happens to be at the top of the search results, which reveal many implementations of dense hash maps, but the algorithm itself is much older and could even be considered widely used. I've certainly written a few of them in my professional career, and I would assume so have most people here, because they are the easiest quick-and-dirty map implementation.

The reason a dense hash map isn't already in Boost, I would suggest, is probably lack of genericity. Or rather, once you implement a dense hash map of the form:

  template<class Key, class T, class Hash, class Pred, class Allocator>
  class dense_hash_map;

... where Key and T can be any type, you end up with something with very similar characteristics to std::unordered_map<>, and therefore adding it to Boost is not compelling compared to, say, a bitwise or Judy trie STL container, which has at least a few compelling differences in characteristics from a hash table.

Anyone can make a container which beats the STL containers for some limited use case. Few can make a totally generic, general-purpose container which clearly beats the STL containers in a majority of use cases unless you radically rethink semantics (as, say, Eric is doing in Ranges v3, or as Boost.Intrusive does).
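For readers unfamiliar with the algorithm being discussed, here is a deliberately simplified sketch of it (my own illustration, not Niall's or Google's code): one flat array of slots, a caller-supplied distinguished empty key, and linear probing. Fixed capacity, no erase, no rehash, allocator omitted; the name `dense_hash_map_sketch` is hypothetical:

```cpp
// Minimal dense hash map sketch: flat slot array + empty-key sentinel
// + linear probing. Illustrative only; not a production container.
#include <cassert>
#include <cstddef>
#include <functional>
#include <stdexcept>
#include <utility>
#include <vector>

template <class Key, class T, class Hash = std::hash<Key>,
          class Pred = std::equal_to<Key>>
class dense_hash_map_sketch {
  std::vector<std::pair<Key, T>> slots_;
  Key empty_;
  std::size_t mask_;

public:
  // The caller must supply a key value that will never be inserted --
  // the same requirement Google's dense_hash_map::set_empty_key() has.
  explicit dense_hash_map_sketch(Key empty_key, std::size_t capacity_pow2 = 1024)
      : slots_(capacity_pow2, std::pair<Key, T>(empty_key, T{})),
        empty_(empty_key),
        mask_(capacity_pow2 - 1) {}

  T &operator[](const Key &k) {
    Pred eq;
    for (std::size_t i = 0, h = Hash{}(k);; ++i) {
      if (i > mask_) throw std::length_error("table full");
      auto &slot = slots_[(h + i) & mask_];
      if (eq(slot.first, k)) return slot.second;
      if (eq(slot.first, empty_)) {  // first free slot: claim it
        slot.first = k;
        return slot.second;
      }
    }
  }

  bool contains(const Key &k) const {
    Pred eq;
    for (std::size_t i = 0, h = Hash{}(k); i <= mask_; ++i) {
      const auto &slot = slots_[(h + i) & mask_];
      if (eq(slot.first, k)) return true;
      if (eq(slot.first, empty_)) return false;
    }
    return false;
  }
};
```

Because every entry lives contiguously in one array, probes are cache-friendly and the whole table can in principle be dumped to or mapped from disk, which is exactly what makes the algorithm interesting beyond what std::unordered_map offers.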
Assuming this is indeed what you were proposing, would the work encompass defining an API and also exploring mmap for lazy iteration for performance gains in the context of handling billions of items?
Personally speaking - and please, anyone reading who has an opinion, chime in here - I don't find a straight swap of unordered_map for a dense_hash_map compelling. I *do* find the dense hash map algorithm applied to something it is uniquely good at much more compelling.

One idea is really, really huge maps which greatly exceed physical RAM. Another idea is a compile-time generated static map which is also a dense hash map, built via C++ 14 constexpr programming. There are other interesting applications of the dense hash map algorithm which list members might suggest here.

Niall

---
Boost C++ Libraries Google Summer of Code 2016 admin
https://svn.boost.org/trac/boost/wiki/SoC2016
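As one possible reading of the constexpr idea, here is a hedged sketch of what a compile-time generated dense hash map could look like under C++14 relaxed constexpr rules (loops and mutation allowed in constant evaluation). All names are invented, key 0 is used as the empty sentinel, and growth/quality-of-hash concerns are ignored:

```cpp
// Hypothetical C++14 constexpr sketch: the dense hash map is built and
// probed entirely at compile time. Key 0 is the empty-key sentinel.
#include <cassert>
#include <cstddef>

template <std::size_t N>
struct static_map {
  unsigned long keys[N];
  unsigned long vals[N];

  constexpr static_map() : keys{}, vals{} {}

  // Linear probing insert; runs during constant evaluation.
  constexpr void insert(unsigned long k, unsigned long v) {
    for (std::size_t i = 0; i < N; ++i) {
      std::size_t slot = (k + i) % N;
      if (keys[slot] == 0 || keys[slot] == k) {
        keys[slot] = k;
        vals[slot] = v;
        return;
      }
    }
  }

  constexpr unsigned long at(unsigned long k) const {
    for (std::size_t i = 0; i < N; ++i) {
      std::size_t slot = (k + i) % N;
      if (keys[slot] == k) return vals[slot];
      if (keys[slot] == 0) return 0;  // not found
    }
    return 0;
  }
};

constexpr static_map<8> make_demo_map() {
  static_map<8> m;
  m.insert(10, 100);
  m.insert(18, 180);  // 18 collides with 10 modulo 8; probing resolves it
  return m;
}

constexpr static_map<8> demo = make_demo_map();
static_assert(demo.at(10) == 100, "lookup resolved at compile time");
static_assert(demo.at(18) == 180, "collision handled at compile time");
```

The static_asserts are the point: the whole table, including collision resolution, is baked into the binary with zero runtime construction cost, which is something unordered_map cannot do.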
participants (2)
- Miguel Coimbra
- Niall Douglas