
Dear All, It's a pleasure to announce that the review of Daniel James' hash functions library starts today (8th of March) and ends on the 12th of March. The library is fairly small and can be seen as an extension to Boost.Functional. Therefore the review is a fast-track review.
From the introduction of the library:
"boost::hash is an implementation of the hash function object specified by the C++ Standard Library Technical Report. It is being proposed as an addition to Boost.Functional." "The hash functions have several possible uses, but the main intention of boost::hash is for indexing a hash table such as the unordered associative containers in the Technical Report, or the hash index for the Boost Multi-Index Containers Library. A hash table stores its values in buckets which are numbered by their keys' hash values. Given a key, the bucket the object lies in can be quickly found by calculating its hash value and looking in the corresponding bucket. This requires the hash function to be fast - since the point is to find the object quickly. Collisions need to be minimised, to reduce the number of objects in each bucket, and thus the number of comparisons required to find objects in a bucket." If you decide to do a review, please follow the guidelines described in http://www.boost.org/more/formal_review_process.htm#Comments The library can be downloaded from http://boost-sandbox.sourceforge.net/vault/index.php?directory=Hash best regards Thorsten, review manager -- Thorsten Ottosen ---------------------------- www.dezide.com www.cs.aau.dk/index2.php?content=Research/bss www.boost.org www.open-std.org/JTC1/SC22/WG21/

Documentation mentions two overloaded functions: std::size_t hash_value(std::string); std::size_t hash_value(std::wstring); while hash.hpp actually delivers one template function: template <class E, class T, class A> std::size_t hash_value(std::basic_string<E, T, A> const& v); I'd prefer the documentation updated. Overall I consider this functionality desirable in Boost; I also like the design (a single template class hash calling the overloaded hash_value). I implemented similar functionality in the past (as many of us did, I'm pretty sure), and this design is something I would not be ashamed of, if I wrote it myself. I would however consider implementing most overloads of hash_value inline; I know that from a technical perspective the compiler is free to inline a function call anyway, but putting the implementation in the header will make the optimizer's job easier and more obvious to the user. B. -- Bronek Kozicki brok@rubikon.pl http://b.kozicki.pl/
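The design Bronek praises (a single class template hash whose call operator forwards to an overloaded free function found by ADL) can be sketched roughly as follows. This is a simplified illustration in a made-up namespace, not the library's actual code, and the combining step is only a stand-in:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

namespace sketch {

// Overloads of hash_value for the supported types (illustrative only).
inline std::size_t hash_value(int v) { return static_cast<std::size_t>(v); }

template <class E, class T, class A>
std::size_t hash_value(std::basic_string<E, T, A> const& v) {
    std::size_t seed = 0;
    for (E c : v)
        seed ^= static_cast<std::size_t>(c) + (seed << 6) + (seed >> 2);
    return seed;
}

// The single class template: its call operator just forwards to
// hash_value, so user-defined types can opt in by providing their own
// overload, found by argument-dependent lookup.
template <class T>
struct hash {
    std::size_t operator()(T const& v) const { return hash_value(v); }
};

} // namespace sketch
```

The point of the design is that sketch::hash<T> never needs to be specialised; adding support for a new type means adding one free function.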

[blush] it is inlined ... now I see. My vote : yes B. -- Bronek Kozicki brok@rubikon.pl http://b.kozicki.pl/

Peter Dimov wrote:
That's issue 6.17 from n1756. I think you can define char_traits to create a case-insensitive string, so you'd want to get the same hash values for "foo" as "FOO". I was thinking of adding overloads for hash_combine and hash_range to use custom hash functions to deal with this sort of thing, but I thought it'd be best to stick with your design for the initial version (I might propose such changes as part of the review submission of the unordered containers). Daniel

Daniel James wrote:
Do not give much weight to such char_traits definitions, issue 6.17 notwithstanding. The proposed extensions treat std::string as a container, not as... whatever the other aspect of std::string is supposed to be. That is, the hash value of a string s is defined as the hash value of the range s.begin(), s.end(). If you look at the container requirements table, you'll see that container equality is defined in terms of operator== on the elements. The (bold enough) author of a string that is not a container can provide an overload of hash_value (in the namespace of the corresponding traits class) with the appropriate semantics.
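Peter's definition, that the hash value of a string s is the hash value of the range s.begin(), s.end(), can be checked directly with a minimal sketch. The hash_combine step below follows the formula quoted later in the thread; the real definitions are specified in n1756:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Minimal combining step (illustrative, following the formula quoted
// elsewhere in this thread).
inline void hash_combine(std::size_t& seed, std::size_t h) {
    seed ^= h + (seed << 6) + (seed >> 2);
}

// Hash of a range: combine the hash of each element in order.
template <class It>
std::size_t hash_range(It first, It last) {
    std::size_t seed = 0;
    for (; first != last; ++first)
        hash_combine(seed, static_cast<std::size_t>(*first));
    return seed;
}

// The container view of std::string: its hash is the hash of its range.
inline std::size_t hash_value(std::string const& s) {
    return hash_range(s.begin(), s.end());
}
```

Because the hash is defined over the range, a std::vector<char> holding the same elements hashes equal to the string, which is exactly the container semantics Peter describes.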

Peter Dimov wrote:
Well, partly I just feel like I should go along with the current opinion of the standards committee. But here I do agree with them. 'std::equal_to<std::string>' is defined as using char_traits. And if the hash function doesn't match std::equal_to, then I believe it's broken. Regardless of whether using char_traits was a bad idea in the first place. Having said that, it is your design, so your opinion is very important.

Daniel James wrote:
Do not give much weight to the opinion of the committee, either. The resolution of the hash<string> issue follows the path of the least resistance. It's not that other basic_string instances should not have a hash function, it's that defining this function exposes the problems with std::string and unordered_*.
'std::equal_to<std::string>' is defined as using char_traits. And if the hash function doesn't match std::equal_to, then I believe it's broken.
Example of the above. The hash function isn't broken. Using this hash function in combination with std::equal_to is broken. Unfortunately, the unordered containers do not give us a way to replace the default equality relation. shared_ptr and weak_ptr encounter the same problem.
Regardless of whether using char_traits was a bad idea in the first place.
Nobody uses nonstandard character traits. They are fundamentally broken for a variety of reasons. Their only purpose is to generate discussions such as this one. ;-) But if this will make you feel better, define the overload as taking basic_string<Ch, char_traits<Ch>, A>. Nobody will spot the difference.

Daniel James wrote:
Not really. You have to keep in mind the context: the TR is being finalized; this is not the time or place to introduce major changes. The idea is that the TR is a work in progress and is still open for improvement before it becomes part of the official standard.

"Thorsten Ottosen" wrote:
It's a pleasure to announce that the review of Daniel James' hash functions library...
Just a few comments and questions from quick playing. /Pavel ______________________________________________ 1. Wouldn't it be better to have separate headers such as hash_for_sets.hpp, hash_for_maps.hpp etc. instead of one header including the whole world? This way other Boost containers could be added w/o affecting compile time. I remember long debates about this during the Serialization review. These resulted in a solution with separate headers + a common header for all. And if possible, other containers (slist, a few Boost ones) should be added. It may be considered whether support for shared_ptr etc. could be added. ______________________________________________ 2. The reference documentation says string and wstring are passed by value, while they are passed by const&. ______________________________________________ 3. __int64/long long should be supported. ______________________________________________ 4. What if I would like a hash result other than size_t, for example unsigned char? I may be memory constrained or may want to save data on disk. ______________________________________________ 5. namespace detail would better be hash_detail to avoid possible clashes. ______________________________________________ 6. Maybe hash::operator() could use boost::call_traits<T>::param_type to optimize value passing for primitive types. ______________________________________________ 7. 
compiling hash_float_test.cpp with Intel C++ 7.0 plugged in VC6 IDE (with old Dinkumware) I get: hash_float_test.cpp C:\work\reviews\hash\hash\libs\functional\hash\test\hash_float_test.cpp(56): error: identifier "denorm_present" is undefined if(std::numeric_limits<T>::has_denorm == denorm_present) { BCB 6.4 with STLport 4.5.1 says: [Linker Error] Unresolved external '_STL::_LimG<bool>::_F_inf' referenced from C:\WORK\REVIEWS\HASH\HASH\LIBS\FUNCTIONAL\HASH\TEST\HASH_FLOAT_TEST.OBJ [Linker Error] Unresolved external '_STL::_LimG<bool>::_F_qNaN' referenced from C:\WORK\REVIEWS\HASH\HASH\LIBS\FUNCTIONAL\HASH\TEST\HASH_FLOAT_TEST.OBJ [Linker Error] Unresolved external '_STL::_LimG<bool>::_D_inf' referenced from C:\WORK\REVIEWS\HASH\HASH\LIBS\FUNCTIONAL\HASH\TEST\HASH_FLOAT_TEST.OBJ [Linker Error] Unresolved external '_STL::_LimG<bool>::_D_qNaN' referenced from C:\WORK\REVIEWS\HASH\HASH\LIBS\FUNCTIONAL\HASH\TEST\HASH_FLOAT_TEST.OBJ [Linker Error] Unresolved external '_STL::_LimG<bool>::_L_inf' referenced from C:\WORK\REVIEWS\HASH\HASH\LIBS\FUNCTIONAL\HASH\TEST\HASH_FLOAT_TEST.OBJ [Linker Error] Unresolved external '_STL::_LimG<bool>::_L_qNaN' referenced from C:\WORK\REVIEWS\HASH\HASH\LIBS\FUNCTIONAL\HASH\TEST\HASH_FLOAT_TEST.OBJ ______________________________________________ 8. old Dinkumware fix: hash_custom_test.cpp: return hasher(std::tolower(x)); ==>> using namespace std; return hasher(tolower(x)); ______________________________________________ 9. Win32 Intel fix: by default I get error: C:\work\reviews\hash\hash\boost/functional/hash.hpp(226): error: no instance of overloaded function "boost::hash_value" matches the argument list argument types are: (const test::custom) return hash_value(val); one needs to add compiler option: /Qoption,c,--arg_dep_lookup to enable ADL with Intel when emulating VC6. ______________________________________________ 10. 
performance: wouldn't it be better to provide a hand-written specialization for std::string/wstring instead of relying on hash_range/combine? The compiler may not optimize the chain of functions out, pass parameters in the most efficient way, or keep most data in registers. It's just a feeling, I didn't do any measurements. ______________________________________________ 11. VC6 warning: just for completeness hash_number_test.cpp c:\boost\boost_1_32_0\boost\test\test_tools.hpp(428) : warning C4805: '==' : unsafe mix of type 'const int' and type 'const bool' in operation c:\work\reviews\hash\hash\libs\functional\hash\test\hash_number_test.cpp(48) : see reference to function template instantiation 'bool __cdecl boost::test_tools::tt_detail::equal_and_continue_impl(const int &,const bool &,class boost::basic_wrap_stringstream<char> &,class boost::unit_test::basic_cstring<char const
,unsigned int,enum boost::unit_test::log_level,unsigned int)' being compiled
______________________________________________ 12. BCB compiles and runs all tests (w/o floats). Intel 7 ditto (with the fix from [9] for the custom test). VC6 doesn't pass the float and custom tests (due to lack of ADL). ______________________________________________ 13. The headers may have #if (defined _MSC_VER) && (_MSC_VER >= 1200) # pragma once #endif at the top. ______________________________________________ 14. Some rationale should be given for the hash function seed ^= hash_value(v) + (seed<<6) + (seed>>2); - a link to where it comes from, or similar. If one creates a different algorithm, the code could be prepared for that situation by: #ifdef BOOST_HASH_USER_HASH_COMBINE_OFFERED BOOST_HASH_USER_HASH_COMBINE_OFFERED; #else seed ^= hash_value(v) + (seed<<6) + (seed>>2); #endif or something like that. ______________________________________________ 15. A complete example of how to write e.g. a case-insensitive hash and how to plug it in should be shown in the docs. ______________________________________________ EOF
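Pavel's item 14 amounts to something like the sketch below. BOOST_HASH_USER_HASH_COMBINE is a hypothetical, slightly adapted form of his macro (taking arguments so it can actually be substituted); the reviewed library provides no such hook, and hash_value is stubbed out for arithmetic types so the snippet stands alone:

```cpp
#include <cassert>
#include <cstddef>

// Stub hash_value for arithmetic types, so the snippet is self-contained.
template <class T>
std::size_t hash_value(T const& v) { return static_cast<std::size_t>(v); }

// Combining step, replaceable by a user-defined macro.
// BOOST_HASH_USER_HASH_COMBINE is hypothetical; the library has no such hook.
template <class T>
void hash_combine(std::size_t& seed, T const& v) {
#ifdef BOOST_HASH_USER_HASH_COMBINE
    BOOST_HASH_USER_HASH_COMBINE(seed, v);
#else
    seed ^= hash_value(v) + (seed << 6) + (seed >> 2);
#endif
}
```

As Daniel notes below, the alternative to a macro hook is simply writing your own combining function.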

Pavel Vozenilek wrote:
This is a popular view. I'll see what people say in response to other mails.
And if possible, other containers (slist, few Boost ones) should be added.
Well, this is possibly beyond the scope of the library (as it's a TR1 implementation) and, thanks to Peter's design, it's very easy to do this yourself. But, extra headers could be made available.
It may be considered whether support for shared_ptr etc could be added.
I'm not sure what the semantics would be for shared_ptr; I think if it were based on the pointer this could surprise some people.
I am obviously a very clumsy user of boost book. Thanks.
______________________________________________ 3. __int64/long long should be supported.
I want to put this off for now. I'm sticking to what's specified by the standard and Peter's proposal. The idea is to get this into boost as quickly as possible, so that it can be used by MultiIndex. I'm trying to keep things simple. Also, it will need a different hash function from the other integer types, as the current one won't take account of the higher bits. I want to take my time over things like this and get it right.
Then you'll probably want to use your own hash function ;) This library is not going to be appropriate for every possible use of hash functions. It is only intended to meet the needs specified by TR1.
OK. Will do something like this.
Maybe, I'll think about that.
Getting the float functions to work on different compilers is not going to be easy; they vary a lot. Borland seems to have particularly bad support for floats. That STLport link error is a little odd - it looks like it might be an STLport error. The best thing to do is probably to add appropriate macros and tests to config.
Will do.
The custom test is only going to work on compilers with ADL - that's deliberate so that the test results will show what works and what doesn't. On compilers without ADL, you'll have to add the overload to the boost namespace.
Maybe. I would want to do some profiling before doing that sort of optimisation.
I think VC6 worked for floats on my computer, but I'll double check that.
OK.
There's a link in Peter's proposal, I'll add it.
If you want to use a different algorithm, I think you should just use your own function.
That would be an inappropriate use for this library. To do something like this you should use your own hash function. I've written an example of this for the unordered associative container library where it is appropriate. What might be appropriate here is to write versions of hash_range & hash_combine which let you use a different hash function. thanks for your review, Daniel
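The extension Daniel mentions at the end (versions of hash_range and hash_combine that let you use a different hash function) might look like the sketch below, which also doubles as the kind of case-insensitive example Pavel asked for. The signatures are hypothetical, not part of the reviewed library:

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical overload of hash_range taking an explicit per-element
// hasher, so users can plug in their own equivalence-respecting hash.
template <class It, class Hasher>
std::size_t hash_range(It first, It last, Hasher h) {
    std::size_t seed = 0;
    for (; first != last; ++first)
        seed ^= h(*first) + (seed << 6) + (seed >> 2);
    return seed;
}

// Per-character hasher that folds case, for a case-insensitive string hash.
struct lowercase_char_hash {
    std::size_t operator()(char c) const {
        return static_cast<std::size_t>(
            std::tolower(static_cast<unsigned char>(c)));
    }
};

inline std::size_t ci_hash(std::string const& s) {
    return hash_range(s.begin(), s.end(), lowercase_char_hash());
}
```

A hash built this way matches a case-insensitive equality predicate, which is exactly the consistency requirement discussed earlier in the thread.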

Thorsten Ottosen ha escrito:
Here's my review: * What is your evaluation of the design? The design follows Peter Dimov's proposal at http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2005/n1756.pdf (item 6.18.) I think the design is fine, and Peter explains its advantages over the laxer TR1 hash<> spec better than I would do. * What is your evaluation of the implementation? The implementation follows the aforementioned spec and it's up to Boost coding standards. I've got some comments on it, later. * What is your evaluation of the documentation? Fine in general, though I've got some comments below. * What is your evaluation of the potential usefulness of the library? I need it! I think this is a general purpose utility that must be in Boost, perhaps most importantly with regard to the ongoing effort to provide a Boost.TR1 implementation. * Did you try to use the library? With what compiler? Did you have any problems? Yes, with GCC 3.2, MSVC++6.5 (+STLport), ICC 7.1. No problems. * How much effort did you put into your evaluation? A glance? A quick reading? In-depth study? I've been following this lib closely for at least a month. * Are you knowledgeable about the problem domain? Not really an expert in hashing algorithms, but I'm familiar with hashed containers, the TR1 hash<> proposal and Dimov's extension proposal. * Do you think the library should be accepted as a Boost library? Yes, definitely. My comments on impl and docs do not change my voting (though I'd like them to be addressed, of course.) Comments on the implementation. 1. The overloads of hash_value for STL containers force the implementation to include <vector>, <list>, etc., which makes the header a little heavier than desired. Of course, this could be avoided if only there existed <vector_fwd> style headers :( I propose that, instead of including the STL container headers, they are forward declared: namespace std{ template<typename T, typename A> class vector; ... 
} As is known, stdlib implementors are free to add extra template params, so this is not 100% safe, but: * All stdlibs that I'm aware of do not include extra template params * The forward declarations can be tweaked for those stdlibs for which they don't match. 2. It is said in hash.cpp that BCB has problems with the overload of hash_value for bool. This should be fixed in one way or another (possibly by not defining it for this compiler, I guess.) 3. Possible future enhancement: compatibility of hash_range with Boost.Range. Comments on the documentation. 1. If this lib is going to claim adherence to Peter Dimov's proposal, the reference should state what the algorithms are for overloads of hash_value for: int unsigned int long unsigned long std::pair std::basic_string STL containers Currently all these overloads appear without a mention of what their exact behavior is. A reference to Peter's proposal is included, but I think it'd be better to replicate the spec entirely rather than rely on an external paper. 2. The overload of hash_value for bool is not documented. Its exact behavior should probably be documented, also. 3. The navigation buttons look cluttered (at least in my browser, IE 5.0.) 4. Introduction: the "It is being proposed" thing should be removed if the lib is accepted. 5. The docs talk of overloads for string and wstring, yet Peter's proposal and the implementation itself overload for basic_string<E,T,A>. 6. hash_combine is incorrectly documented to return a std::size_t (it doesn't return anything.) Thanks to Daniel for his contribution! Best, Joaquín M López Muñoz Telefónica, Investigación y Desarrollo
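Joaquín's forward-declaration idea, sketched out. Note his own caveat: declaring names in namespace std is not strictly conforming, and the declarations have to be tweaked per stdlib (the guard below is one hypothetical way to do that; libstdc++ matches the plain form, while implementations that put containers in an inline namespace need the real header):

```cpp
#include <cassert>
#include <cstddef> // also identifies the stdlib (__GLIBCXX__ on libstdc++)

// Forward declarations instead of the full container headers. Formally
// this is not conforming, so it is guarded per stdlib, as Joaquín says.
#if defined(__GLIBCXX__)
namespace std {
    template <class T, class A> class vector;
}
#else
#include <vector> // stdlibs with inline namespaces need the real header
#endif

// The overload can now be *declared* without pulling in <vector>.
template <class T, class A>
std::size_t hash_value(std::vector<T, A> const& v);

// A client that actually instantiates it includes the real header as usual.
#include <vector>

template <class T, class A>
std::size_t hash_value(std::vector<T, A> const& v) {
    std::size_t seed = 0;
    for (std::size_t i = 0; i < v.size(); ++i)
        seed ^= static_cast<std::size_t>(v[i]) + (seed << 6) + (seed >> 2);
    return seed;
}
```

The saving is that translation units which only declare hash support do not pay for the full container headers; only clients that call the function do.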

Joaquín Mª López Muñoz wrote:
I'm not happy with this either. Another possible solution is to just use Boost.Range on anything that 'looks' like a container. But this will break on unordered containers (as equivalent containers can have different sequences).
It actually says that BCB has problems without it. There was an overload ambiguity error for boost::hash<bool> (although, calling boost::hash<bool> would be a very odd thing to do) and that was added to work around it.
3. Possible future enhancement: compatibility of hash_range with Boost.Range.
Yes, Thorsten asked for this as well. I think it's a good idea.
I didn't document these because I might change them in the future (I'll consult the list for opinions before I make any such change). But based on the feedback I've received so far, I probably should. Thorsten pointed out that it's important to guarantee that hash_range will give the same result as hashing a string, so that it can be used with equivalent strings of different types. The same goes for sub-ranges of containers.
2. The overload of hash_value for bool is not documented. Its exact behavior should probably be documented, also.
That's because it's just there as a workaround. I'll change it so that it's only defined for Borland.
3. The navigation buttons look cluttered (at least in my browser, IE 5.0.)
When added to boost it'll have the standard BoostBook look and feel.
4. Introduction: the "It is being proposed" thing should be removed if the lib is accepted.
Of course.
5. The docs talk of overloads for string and wstring, yet Peter's proposal and the implementation itself overload for basic_string<E,T,A>.
I'm following the resolution from issue 6.17 in the same paper.
6. hash_combine is incorrecly documented to return a std::size_t (it doesn't return anything.)
OK, I'll fix that. It would have been good to use Doxygen and avoid such documentation bugs, but it doesn't work very well with overloaded functions.
Thanks to Daniel for his contribution! Best,
Thanks for your review, and for helping with the library.

Daniel James wrote:
It is a deliberate design decision that the algorithms are fully specified. The goal is precisely to not allow you (as an implementor) to change them. ;-) As usual, this is a tradeoff between giving more freedom to implementors to improve the behavior and giving more portability guarantees to users. Varying the hash function can lead to dramatic changes in performance and not specifying the exact algorithm will mean that an application using the default hash functions would need to be benchmarked again after every change of the standard library. Of course the fact that hash_range gives the same result as hash_value on the corresponding container is no accident, either.

Peter Dimov wrote:
Yes, I agree for the final standard, but we're not there yet. I'm also a little worried about how the hash functions will vary with different platforms. For example, if sizeof(long) > sizeof(std::size_t), then hash_value(long) might be weak for values which only vary in the higher bits. But I do want to stay close to your design for now. It's far better than what I would have come up with and it'll be good to get more experience of how well it works before considering any changes.
Of course the fact that hash_range gives the same result as hash_value on the corresponding container is no accident, either.
Yes, the reason that this came up is that it wasn't true for Jeremy's original implementation. But I think his hash function for strings was faster, which might be desirable. Daniel

"Daniel James" <daniel@calamity.org.uk> wrote in message news:d0mtqf$98h$1@sea.gmane.org... | Peter Dimov wrote: | | > It is a deliberate design decision that the algorithms are fully | > specified. The goal is precisely to not allow you (as an implementor) to | > change them. ;-) | > Of course the fact that hash_range gives the same result as hash_value | > on the corresponding container is no accident, either. | | Yes, the reason that this came up, is because this wasn't true for | Jeremy's original implementation. But I think his hash function for | strings was faster, which might be desirable. Peter, I don't like that hashing a string is suboptimal. Likewise, I would like hashing of sub_range<string> or sub_range<vector<char>> to be really fast. So if the goal is portability, then why don't we just find a suitable *fast* way to do it whenever the value-type is char or wchar_t and specify all the details of that function? -Thorsten

Thorsten Ottosen wrote:
See my other post. When talking about hash functions, you really need to have a clear understanding of "suboptimal" in mind. I don't claim to be an expert, but at least I have tried to familiarize myself with the research done in this area. My advice is: if you are going to mess with a (standard) hash function, you better (a) understand what you are doing, (b) have an extensive test suite that benchmarks an unordered_map with a wide range of real-world key sets.

Peter Dimov wrote:
Well, these two pages describe algorithms for hashing strings, which apparently are faster with good (or better?) quality (I haven't tested them yet): http://burtleburtle.net/bob/hash/doobs.html http://www.azillionmonkeys.com/qed/hash.html They are faster because they process several characters at a time. The problem is that while it would be fairly easy to specialise hash_range for this, it would be hard to get hash_combine to support them. It would probably have to be replaced with an object that maintains hashing state. Something like: struct hash_combiner { hash_combiner(); template <class T> void combine(T const&); std::size_t value() const; }; And the internal state would be complex, especially since it has to deal with arbitrary types. So it'd probably be slow, and lose its simple definition, which is an important part of the design. An alternative would be to drop the requirement that hash_range is defined as returning the same result as using hash_combine. I think that would be a bad idea. A hash function that is optimised for specific types will probably require reducing the library's generic functionality, and make it less transparent. This could be done within the current draft standard, but not Peter's proposal. And I think that Peter's extensions are more useful than the extra speed, which in practice might not be that great (I haven't tested this, so I could be wrong). I would like to investigate this kind of thing further. But it'll take time. Daniel
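Daniel's hash_combiner sketch, fleshed out into compilable form. This is purely illustrative: the internal state here is a trivial seed using the same combining step as elsewhere in the thread, whereas a real block-based algorithm would have to buffer input bytes, which is exactly the complexity he is worried about:

```cpp
#include <cassert>
#include <cstddef>

// A stateful combiner, as sketched above: an object accumulating hash
// state instead of a free hash_combine function. A block hash would
// buffer bytes here; this sketch only keeps a running seed to show the
// interface.
struct hash_combiner {
    hash_combiner() : seed_(0) {}

    template <class T>
    void combine(T const& v) {
        seed_ ^= static_cast<std::size_t>(v) + (seed_ << 6) + (seed_ >> 2);
    }

    std::size_t value() const { return seed_; }

private:
    std::size_t seed_;
};
```

With a genuinely stateful algorithm behind it, value() would also need a finalisation step, so the simple "combine then read" definition would be lost, as the message says.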

Daniel James wrote:
These pages are good. But as I said, benchmarking hash functions is pointless. Real programs do not compute the hash value of a random 256 byte buffer five million times. They usually need to do something with the hash value. This something typically involves a division by a prime to obtain a modulus, several pointer chasings, and at least one equality comparison (which on a 256 byte buffer will have considerable impact).

"Peter Dimov" <pdimov@mmltd.net> wrote in message news:019c01c5256a$561efe40$6601a8c0@pdimov... | Peter Dimov wrote: | > But as I said, benchmarking hash functions is pointless. Real programs | > do not compute the hash value of a random 256 byte buffer five million | > times. | | ... or if they do, they use different 256 byte buffers and do not operate | from the L1 cache. ;-) no, but AFAICT we cannot exclude that a certain hash-function is much faster *and* otherwise behaves as good as the other "naive" ones. Can anybody answer this question: is it always optimal to use the whole range to compute the hash value? E.g. for strings, if I know the average length of my strings, I can't see that it would make sense to process much more than the average length (or perhaps even shorter). Maybe hash_range should be specified as template< unsigned max, /* max length of range to consider */ class Iter > std::size_t hash_range( Iter, Iter ); ? -Thorsten
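Thorsten's bounded hash_range, made compilable as a sketch (hypothetical name and signature, not part of the library). It also demonstrates the pathological case Peter warns about below: keys that differ only past the cutoff collide:

```cpp
#include <cassert>
#include <cstddef>

// Hash at most Max elements of the range, as suggested above.
// Keys that only differ beyond position Max all collide.
template <unsigned Max, class It>
std::size_t bounded_hash_range(It first, It last) {
    std::size_t seed = 0;
    unsigned n = 0;
    for (; first != last && n < Max; ++first, ++n)
        seed ^= static_cast<std::size_t>(*first) + (seed << 6) + (seed >> 2);
    return seed;
}
```

With Max = 4, "abcdefgh" and "abcdXXXX" hash identically, so a key set that shares long prefixes degenerates to a linear scan of one bucket.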

Thorsten Ottosen wrote:
Simplistic benchmarks can't be used to evaluate the performance of a component; it should be benchmarked in its usual role. In most real world scenarios, the performance of a hash function is memory access bound, because the hashed data isn't already in the L1 cache. The proper way to evaluate the performance of a hash function is to benchmark unordered_map lookups. Which certain hash function is much faster and otherwise has all the desirable characteristics of the proposed design and implementation?
The default hash function must be good enough if no extra information about the properties of the key set is available. When extra information _is_ available, you shouldn't use the default. Excluding some parts of the range leads to pathological cases when the key set only differs in the excluded parts. This leads to orders of magnitude slowdowns. (This is not speculation, it is a fact.) To add insult to injury, because CPU reads are typically a cache line wide (16/32 bytes), it can turn out that the "optimized" hash function isn't even much faster.

"Peter Dimov" <pdimov@mmltd.net> wrote in message news:000901c52580$359f96a0$6601a8c0@pdimov... | Thorsten Ottosen wrote: | > "Peter Dimov" <pdimov@mmltd.net> wrote in message | > news:019c01c5256a$561efe40$6601a8c0@pdimov... | >> Peter Dimov wrote: | >>> But as I said, benchmarking hash functions is pointless. Real | >>> programs do not compute the hash value of a random 256 byte buffer | >>> five million times. | >> | >> ... or if they do, they use different 256 byte buffers and do not | >> operate from the L1 cache. ;-) | > | > no, but AFAICT we cannot exclude that a certain hash-function is much | > faster *and* otherwise behaves as good as the other "naive" ones. | | Simplistic benchmarks can't be used to evaluate the performance of a | component; it should be benchmarked in its usual role. In most real world | scenarios, the performance of a hash function is memory access bound, | because the hashed data isn't already in the L1 cache. The proper way to | evaluate the performance of a hash function is to benchmark unordered_map | lookups. right; then maybe somebody should run those tests? | Which certain hash function is much faster and otherwise has all the | desirable characteristics of the proposed design and implementation? one of the two links that James mentioned said something about the distribution being as good as normal. | > Can anybody answer this question: is it always optimal to use the | > whole range to compute the hash value; for eg. strings, if I know the | > average length of my strings, I can't see that it would make sense to | > process much more the the average length (or perhaps even shorter). | | The default hash function must be good enough if no extra information about | the properties of the key set is available. When extra information _is_ | available, you shouldn't use the default. | | Excluding some parts of the range leads to pathological cases when the key | set only differs in the excluded parts. 
This leads to orders of magnitude | slowdowns. (This is not speculation, it is a fact.) So basically you're saying that any collision is much worse than a slower hash function, because we must use == to go through the bucket? -Thorsten

Thorsten Ottosen wrote: | so basically you're saying that any collision is much worse than a slower | hash function because we must use == to go through the bucket ? | I would say that if you are designing your own hash function for your general use, then you can do whatever you find most convenient, and try to optimise. In general you never want to risk having too many hash collisions (or even worse, having everything hash to the same value), especially for just a constant-time speed increase, as suddenly all your operations can go from being mostly O(1) (on average) to mostly O(n). And you know someone is going to trigger that (in particular, I've made hash tables of strings where they mostly don't differ at the start, while doing genetic evolution). Chris

Thorsten Ottosen wrote:
so basically you're saying that any collision is much worse than a slower hash function because we must use == to go through the bucket ?
Something like that, yes; of course it depends on the level of slowness. Cryptographic hash functions are typically overkill. Increasing the number of collisions slightly is tolerable (but can easily negate the effect of the faster hash function). Hitting a pathological case where a significant percentage of the key set maps to the same bucket is a serious problem.

"Peter Dimov" <pdimov@mmltd.net> wrote in message news:000901c52580$359f96a0$6601a8c0@pdimov... | Simplistic benchmarks can't be used to evaluate the performance of a | component; it should be benchmarked in its usual role. right. I just stumbled upon this paragraph: " Update(1): David C. wrote: I tested your hash function against all of the popular ones, including Bob Burtles. It turns out it was not only the quickest but had the best distribution (I created histograms of the chain lengths). The architecture I work on is IBM Z/OS (S/390 mainframes). Well done mate, will be using your code from now on! Ok, not exactly RISC, but at least this demonstrates that this function is good beyond the x86 architecture. " from http://www.azillionmonkeys.com/qed/hash.html -Thorsten

Daniel James wrote:
It doesn't matter. The choice is between documented algorithms and undocumented algorithms, not between the specific algorithms I specified and something else. I argue that the algorithms need to be documented.
A hash function is only faster when the corresponding operation, built around it, is faster. Benchmarking a hash function in isolation is pointless. For example, one well-known implementation has (had?) a tremendously fast hash function which leads (lead?) to their hash_map being four times slower than a std::map for a real-world set of keys. That said, I don't see why this is an argument in favor of making hash_range return something else. The whole point of providing hash_range is to allow the user to write a hash function for their custom string (or container) type that returns the same result as the standard hash function for strings (or other containers).

Peter Dimov wrote:
Well, the algorithms are documented. They're in your proposal. I didn't put them in my documentation because I was worried that if I changed them in the future (and I don't mean the near future, I'm going to be very cautious about such changes), that there would be complaints because I'd changed from what was in my documentation. But I have changed my mind on this (as I mentioned earlier). Although, I'm still not sure if it's appropriate to fully specify the functions for integers.
It wasn't; I was just trying to give some background on discussions that had happened off-list.

Hi Daniel, Daniel James ha escrito:
Why not? Sure, it has the overhead of checking validity against every new stdlib that comes around, but you've already committed to this kind of activity with the workarounds in float_functions.hpp. Besides, the amount of work needed to accommodate a new library is zero in most cases and negligible in the rare cases where extra template parameters are used.
Too broad IMHO.
Oh, sorry, I misread it. I agree with you hashing a bool makes little sense, and agree that it probably shouldn't be made part of the public interface (for the same reasons, for instance, that there's no overload for char.) Joaquín M López Muñoz Telefónica, Investigación y Desarrollo

From: Daniel James <daniel@calamity.org.uk>
Such forwarding headers would be a real boon as a separate Boost library, if avoiding the unnecessary inclusion of standard library headers is a compile-time performance win. Peter suggests that it really doesn't matter much. I guess you'd have to create them and run tests against a large body of code to see what impact it actually has on build times. -- Rob Stewart stewart@sig.com Software Engineer http://www.sig.com Susquehanna International Group, LLP using std::disclaimer;

________________________________________________________________________
- What is your evaluation of the design?

There is not much to say about the class boost::hash itself, as there is not much to say about functors beyond the fact that they are common. On top of that, this very functor has been thought out by several smart minds. The extensions to the plain TR1 requirements, hash_range and ADL-based support for user-defined hashing, are most useful because they allow arbitrary data to be hashed (e.g. a std::string may not be too well suited to implement LZW coding for binary data). I am missing a way to parametrize how hash values are combined. This is especially desirable because the skeleton could then be exploited to implement double hashing for compound data (a common technique to avoid collisions by varying the hash function) and can be useful in other situations (e.g. a small hash table in memory vs. a large one on disk). If we wanted to allow this, however, it is questionable whether the proposed extensions are already optimal.

________________________________________________________________________
- What is your evaluation of the documentation?

Good. Easily understandable. Nitpicking: a hash table is not required to store its elements in buckets in general - this is a property of the most common way to implement hash tables.

________________________________________________________________________
- What is your evaluation of the implementation?

Here are my notes from reading the source: The implementation of hash_combine does not need conditional jumps, unlike a probably slightly slower variant I have seen quite often. I know that variant works well empirically in practice; I figure this one will too. However, I believe that hashing speed is generally overrated and that there are numerous situations where it is worth trading some of it for fewer collisions (as memory access generally gives more room for performance optimization than CPU instructions do).
--

One idea that immediately came to my mind was some "syntactic sugar" to ease hashing of class members:

    std::size_t hash_value(point const& p)
    {
        return (boost::hash_value = p.x, p.y, p.z);
    }

Note: It's just a spontaneous idea I am not convinced of myself...

--

The for loop in hash_float_value is a nice one to be unrolled via metaprogramming.

________________________________________________________________________
- What is your evaluation of the potential usefulness of the library?

Very useful. It's surprising it has been missing from both C++ and Boost for so long.

________________________________________________________________________
- Did you try to use the library?

Not yet.

________________________________________________________________________
- How much effort did you put into your evaluation? A glance? A quick reading? In-depth study?

Read the documentation and the code. About 2 hours total time.

________________________________________________________________________
- Are you knowledgeable about the problem domain?

Not more than a generally educated and curious programmer should be...

________________________________________________________________________
- Do you think the library should be accepted as a Boost library?

Yes. Though there should be some research into whether it is possible to add to its genericity, as mentioned in the first paragraph (I just saw the latest discussions, and it seems we're already at it).

Regards, Tobias

Tobias Schwinger wrote:
I think if you want to combine hash values in a different manner, it's best to just write your own version of hash_combine. It could save a little code if we supplied a version of hash_range that works for different combining functions, but I don't think it would be that useful (and it would require your hash_combine function to have the same signature as ours; you might want to do it differently).
Good point. I've been spending too much time looking at the draft standard for hash based containers (which require buckets).
I do need to spend more time looking into alternative hashing algorithms for a future version.
A hash function for boost::tuple would be a good way to do this, although something along those lines might be more efficient.
The for loop in hash_float_value is a nice one to be unrolled via metaprogramming.
Yes, for anyone who hasn't looked at the code it loops a fixed number of times based on the accuracy of the float type. Compilers could unroll that loop themselves, but that's probably too much to expect.
Thanks for your review. Daniel

Daniel James wrote:
Well, this was not exactly the point I was trying to make. Example: let's say we have a disk-stored hash table and an in-memory hash table for caching. The same objects are stored, but it is desirable to have two different hashing functions. Unfortunately boost::hash can only be used for at most one of them. If I wanted to implement double hashing I'd have to duplicate the whole facility (which may include writing a lot of "hash_value2" functions for a grown project), because there can be only one 'hash_value' per type, which can only use one algorithm for combining the hash values, as the current design does not provide a way to get parametrization in there. Looking at the source underlines this impression: hash_combine seems like a forgotten hard-wired strategy in something that actually wants to be a facade.
Especially since it is very questionable whether these global 'unroll loops' options do any good in the big picture. Regards, Tobias

Tobias Schwinger wrote:
The next generalization level isn't really about hash functions anymore. You are asking for a generic visitation interface where you give a visitor to a container (or a compound type with nonambiguous state*) and the container applies the visitor to each of its elements. This facility is very useful (not only for hashing, but for serialization, stream output, and others) but I don't believe that it belongs in a utility hash function library. (*) For some types, the choice what to visit (exact private representation, virtual value representation, exact representation without transient state, or something in between) may be nontrivial and depend on the overall context in which the visitation occurs.

Peter Dimov wrote:
I agree. That is, given that the "result" (or should I rather say "context" here?) is not restricted to std::size_t. If it's generic enough to factor it out, it's even better, isn't it?
One could argue that it belongs in Boost to implement things like hashing on top of it before they are written... ...but I won't ;-). Besides, we should have hashing in ASAP, at least for political (Java) and housekeeping (so other libraries have a default for their template parameter lists) reasons. Regards, Tobias

Hi, I have a doubt about the proposed implementation of hash_value() for pointers, which is:

    template <class T> inline std::size_t hash_value(T* v)
    {
        return static_cast<std::size_t>(v - (T*)0);
    }

This code calls for undefined behaviour according to §5.7/6 each time v is a non-null pointer. IMHO, the only reasonable way to map generic pointer values to integer values is through reinterpret_cast<>. The mapping would be implementation-defined, but it would be no worse than the intent of the proposed implementation (if it worked). To be picky, in order to accommodate implementations that may have sizeof(size_t) < sizeof(T*), we should use two casts:

    template <class T> inline std::size_t hash_value(T* v)
    {
        return static_cast<std::size_t>(
            reinterpret_cast<boost::uintmax_t>(v));
    }

(The right thing would be to use a C9X-ish uintptr_t type, but boost/cstdint.hpp does not provide such a type.) However, as the double cast may result in suboptimal code if size_t is large enough, a boost::enable_if trick might be used to provide an optimized version with just a single use of reinterpret_cast. The result would be something similar to:

    #include <boost/utility.hpp>
    #include <boost/cstdint.hpp>

    namespace detail {
        template <class T> struct fits_in_size_t {
            static const bool value = sizeof(T) <= sizeof(size_t);
        };
    }

    template <class T>
    inline typename boost::enable_if<
        detail::fits_in_size_t<T*>, std::size_t>::type
    hash_value(T* v)
    {
        return reinterpret_cast<std::size_t>(v);
    }

    template <class T>
    inline typename boost::disable_if<
        detail::fits_in_size_t<T*>, std::size_t>::type
    hash_value(T* v)
    {
        return static_cast<std::size_t>(
            reinterpret_cast<boost::uintmax_t>(v));
    }

Unfortunately, there's one more thing... even if p == q, reinterpret_cast<uintmax_t>(p) is not guaranteed to be equal to reinterpret_cast<uintmax_t>(q) on implementations that have multiple internal representations for the same address. I don't think much can be done (in a portable way) for such platforms.
HTH, Alberto

Alberto Barbati <abarbati@iaanus.com> writes:
Furthermore, you can't subtract addresses that aren't in the same array. 0 isn't in the same array as anything.
Once you're into reinterpret_cast you've left the land of strict portability anyway. -- Dave Abrahams Boost Consulting www.boost-consulting.com

David Abrahams wrote:
So we agree that this code has to be fixed in some way. I don't think Boost could accept the code as it is.
I agree with that literally, but I don't fully understand what you mean by that remark, so I will re-formulate my thought hoping to make it clearer. Probably we are saying the same thing. Apart from the very special case described above, using reinterpret_cast<> as I describe will provide a viable hash function. The fact that its values are implementation-defined is irrelevant, I think (but... see below). However, *in that special case above*, the basic requirement of hash functions, "equal arguments yield the same result", is not satisfied, so even reinterpret_cast<> does not provide a viable hash function. As the standard does not provide any other portable means to map generic pointers to integers, there's *no way* we can obtain a hash for pointers in this case. Ah! And there's one more case in which we are helpless, although I'm curious to see an implementation that exploits such latitude: the wording used in the standard does not guarantee that the mapping from pointers to integers always produces the same value over time. In other words:

    char* p = new char;
    size_t i1 = reinterpret_cast<size_t>(p);
    size_t i2 = reinterpret_cast<size_t>(p);
    assert(i1 == i2); // ???
    char* q1 = reinterpret_cast<char*>(i1);
    char* q2 = reinterpret_cast<char*>(i2);
    assert(p == q1 && q1 == q2);

The standard guarantees that the second assert passes, but says nothing about the first assert.
Coming back to the hash library, I think we have only three choices:

1) use reinterpret_cast<> as I proposed, but document that it will produce a viable hash function only if the underlying implementation satisfies certain requirements (each specific implementation /should/ document the behaviour of reinterpret_cast anyway, so the user may be able to check)

2) provide the hash function (with reinterpret_cast<>) only for those platforms that we can check in advance will satisfy those requirements, and make any use of such a function produce a static assert on all other platforms

3) do not provide a hash function for pointers at all

I would vote for 1, but 2 and 3 also make sense. Is there a fourth choice? Alberto

Alberto Barbati <abarbati@iaanus.com> writes:
There are other cases too. reinterpret_cast could map all pointers to the same integer. I'm saying what Peter's saying. This is non-portable territory; don't lose sleep over it; practically speaking it will work most places; everywhere else we'll fix it. -- Dave Abrahams Boost Consulting www.boost-consulting.com

I would also go with (1), and probably a cast to ptrdiff_t is sufficient (rather than int_max_t). There is a fourth option: treat the pointer as a "bag of bytes" and hash those bytes. That's completely portable, and no reinterpret_cast is required (it's probably overkill as well). John.
participants (12)
- Alberto Barbati
- Bronek Kozicki
- Chris Jefferson
- Daniel James
- David Abrahams
- Joaquín Mª López Muñoz
- John Maddock
- Pavel Vozenilek
- Peter Dimov
- Rob Stewart
- Thorsten Ottosen
- Tobias Schwinger