[rfc] Preliminary Hash Library

I've boostified, cleaned up, and documented the Hash Library I mentioned here about a month ago. Comments, complaints, suggestions, and questions appreciated. Source code is in the sandbox at http://svn.boost.org/svn/boost/sandbox/hash/ Documentation can be read online at http://www.cs.mcgill.ca/~smcmur/hash/ I was very pleased by how the implementation came together, as templates mean that all the hash algorithms share a substantial portion of their code. Those details are all hidden from the user, though. Here's an example showing the simplest use of the library: #include <boost/hash.hpp> #include <iostream> #include <string> int main() { std::string s = "Hello World!"; typedef boost::hash::sha2<256> HashPolicy; HashPolicy::digest_type digest = boost::hash::compute_digest<HashPolicy>(s); std::cout << digest << "\n"; } Thanks, ~ Scott McMurray

_____________________ Vicente Juan Botet Escribá http://viboes.blogspot.com/ ----- Original Message ----- From: "Scott McMurray" <me22.ca+boost@gmail.com> To: "Boost Developers List" <boost@lists.boost.org> Sent: Thursday, May 06, 2010 2:55 AM Subject: [boost] [rfc] Preliminary Hash Library
I've boostified, cleaned up, and documented the Hash Library I mentioned here about a month ago. Comments, complaints, suggestions, and questions appreciated.
Source code is in the sandbox at http://svn.boost.org/svn/boost/sandbox/hash/
Documentation can be read online at http://www.cs.mcgill.ca/~smcmur/hash/
I was very pleased by how the implementation came together, as templates mean that all the hash algorithms share a substantial portion of their code.
Those details are all hidden from the user, though. Here's an example showing the simplest use of the library:
#include <boost/hash.hpp>
#include <iostream> #include <string>
int main() { std::string s = "Hello World!"; typedef boost::hash::sha2<256> HashPolicy; HashPolicy::digest_type digest = boost::hash::compute_digest<HashPolicy>(s); std::cout << digest << "\n"; }
Thanks, ~ Scott McMurray _______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
Hi, thanks for sharing this library. After reading some pages, I have two question: * Where is documented the MessageDigest concept ? * How compare the optimized and not optimized code (efficiency) showed in the examples? #ifdef BOOST_HASH_NO_OPTIMIZATION return boost::hash::compute_digest<hash_policy>( std::istreambuf_iterator<char>(sbuf), std::istreambuf_iterator<char>() ); #else hash_policy::stream_hash<8>::type hash; for (;;) { boost::array<char, 8*1024> buf; std::streamsize n = sbuf->sgetn(&buf[0], buf.size()); if (!n) break; hash.update_n(buf.begin(), n); } return hash.end_message(); #endif It would be possible to define a template function that provides the optimized code? Best, Vicente

-----Original Message----- From: boost-bounces@lists.boost.org [mailto:boost-bounces@lists.boost.org] On Behalf Of Scott McMurray Sent: Thursday, May 06, 2010 1:56 AM To: Boost Developers List Subject: [boost] [rfc] Preliminary Hash Library
I've boostified, cleaned up, and documented the Hash Library I mentioned here about a month ago. Comments, complaints, suggestions, and questions appreciated.
Source code is in the sandbox at http://svn.boost.org/svn/boost/sandbox/hash/
Documentation can be read online at http://www.cs.mcgill.ca/~smcmur/hash/
I was very pleased by how the implementation came together, as templates mean that all the hash algorithms share a substantial portion of their code.
Those details are all hidden from the user, though. Here's an example showing the simplest use of the library:
#include <boost/hash.hpp>
#include <iostream> #include <string>
int main() { std::string s = "Hello World!"; typedef boost::hash::sha2<256> HashPolicy; HashPolicy::digest_type digest = boost::hash::compute_digest<HashPolicy>(s); std::cout << digest << "\n"; }
Looks neat. This is going to be used by lots of total amateurs (it might be me!). So the start is a bit steep? That this is a *cryptographic* hash and a few words and/or links to Wikipedia etc might help? http://en.wikipedia.org/wiki/Cryptographic_hash_function say? A *default* hash policy would also be nice? So the quickstart could be even quicker ;-) MD5 is most suitable for 'amateur' use? - and it is most popular for file digests. (SHA will be chosen by anyone serious). And the quickstart is fine, but a simple hash-my-file example would be good too? Paul --- Paul A. Bristow Prizet Farmhouse Kendal, UK LA8 8AB +44 1539 561830, mobile +44 7714330204 pbristow@hetp.u-net.com

On 6 May 2010 05:46, Paul A. Bristow <pbristow@hetp.u-net.com> wrote:
Looks neat.
Thanks.
This is going to be used by lots of total amateurs (it might be me!).
So the start is a bit steep?
Good point. I've added a short introduction <http://www.cs.mcgill.ca/~smcmur/hash/doc/html/introduction.html>; What else would you like to see there?
A *default* hash policy would also be nice? So the quickstart could be even quicker ;-)
Because of the range of possible uses, I'd rather not provide a default, especially since future papers will likely necessitate changing it. I've added some guidance below the table with your SHA/MD5 suggestions (which I agree with).
And the quickstart is fine, but a simple hash-my-file example would be good too?
You're in luck; That's the other example included. I added a link to it from the quickstart to make it easier to find. ~ Scott McMurray

Scott,
I've boostified, cleaned up, and documented the Hash Library I mentioned here about a month ago. Comments, complaints, suggestions, and questions appreciated.
This looks good! One question: I think all the hash functions you have so far are *cryptographic* hashes, right? I'd be interested in seeing non-cryptographic hashes like Bob Jenkin's hashlittle2 in the same framework. It looks like that would be possible? Best regards, Gareth ************************************************************************ The information contained in this message or any of its attachments may be confidential and is intended for the exclusive use of the addressee(s). Any disclosure, reproduction, distribution or other dissemination or use of this communication is strictly prohibited without the express permission of the sender. The views expressed in this email are those of the individual and not necessarily those of Sony or Sony affiliated companies. Sony email is for business use only. This email and any response may be monitored by Sony to be in compliance with Sony's global policies and standards

On 6 May 2010 05:52, Sylvester-Bradley, Gareth <Gareth.Sylvester-Bradley@eu.sony.com> wrote:
One question: I think all the hash functions you have so far are *cryptographic* hashes, right? I'd be interested in seeing non-cryptographic hashes like Bob Jenkin's hashlittle2 in the same framework. It looks like that would be possible?
Yes, they're all cryptographic hash functions for now. The concepts would definitely allow models that are not cryptographic. One problem I foresee there is that I didn't have enough models of a message digest to distill a concept, so they currently return an instance of a specific digest template. Because that digest template is algorithm-independent, there is a slight cost involved in the conversion from the internal state. That cost is lost in the noise for the more expensive cryptographic hashes, but might be a problem for the much faster non-cryptographic hashes. I'd love to generalize it further; Do you have a list of non-cryptographic hashes you'd like to see? Thanks, ~ Scott McMurray

On 6 May 2010 06:01, Daniel James <dnljms@gmail.com> wrote:
This will collide with the existing 'boost::hash' (an implementation of 'std::hash').
Good point. unordered_map's hash definitely has stronger claim. The simplest fix would be just changing the namespace to boost::hashes. Alternatively, the library could be renamed to Cryptographic_Hash and use boost::cryptographic_hash as its namespace, though that would cut off the possibility of following Paul Bristow's suggestion. Any preferences or other ideas? ~ Scott McMurray

Scott McMurray wrote:
On 6 May 2010 06:01, Daniel James <dnljms@gmail.com> wrote:
This will collide with the existing 'boost::hash' (an implementation of 'std::hash').
The simplest fix would be just changing the namespace to boost::hashes. Alternatively, the library could be renamed to Cryptographic_Hash and use boost::cryptographic_hash as its namespace, though that would cut off the possibility of following Paul Bristow's suggestion.
Adding just one letter at the end makes for a rather slim differentiator, but I don't think it will hurt. We already have boost::tuple and boost::tuples, right? Unless you intend to create the Boost Cryptographic Hashing Library versus one that supports a wider range of hashing algorithms, don't include "cryptographic" in the name or namespace. What about Boost Hashing Algorithms Library with namespace boost::hashing_algorithms? A handy namespace alias would be "hal" or "ha." ;) _____ Rob Stewart robert.stewart@sig.com Software Engineer, Core Software using std::disclaimer; Susquehanna International Group, LLP http://www.sig.com IMPORTANT: The information contained in this email and/or its attachments is confidential. If you are not the intended recipient, please notify the sender immediately by reply and immediately delete this message and all its attachments. Any review, use, reproduction, disclosure or dissemination of this message or any attachment by an unintended recipient is strictly prohibited. Neither this message nor any attachment is intended as or should be construed as an offer, solicitation or recommendation to buy or sell any security or other financial instrument. Neither the sender, his or her employer nor any of their respective affiliates makes any warranties as to the completeness or accuracy of any of the information contained herein or that this message or any of its attachments is free of viruses.

On 7 May 2010 23:37, Scott McMurray <me22.ca+boost@gmail.com> wrote:
On 6 May 2010 06:01, Daniel James <dnljms@gmail.com> wrote:
This will collide with the existing 'boost::hash' (an implementation of 'std::hash').
Good point. unordered_map's hash definitely has stronger claim.
Only because it came first. If we were making the choice today, something like 'boost::container::hash' would be fine. I'm now wondering if it might be a good idea to move it - I think it's rarely used directly, so most wouldn't even notice. How long should the gap be between moving 'boost::hash', and allowing something else to use the name? I think it'd have to be a least a year.
The simplest fix would be just changing the namespace to boost::hashes. Alternatively, the library could be renamed to Cryptographic_Hash and use boost::cryptographic_hash as its namespace, though that would cut off the possibility of following Paul Bristow's suggestion.
Any preferences or other ideas?
I dislike plural namespaces as they can be confusing (I can never remember when to use 'iostream' vs. 'iostreams'), but I can't think of anything better. Daniel

Scott McMurray wrote:
I've boostified, cleaned up, and documented the Hash Library I mentioned here about a month ago. Comments, complaints, suggestions, and questions appreciated.
Source code is in the sandbox at http://svn.boost.org/svn/boost/sandbox/hash/
Documentation can be read online at http://www.cs.mcgill.ca/~smcmur/hash/
I won't point out the various typos I noticed during my reading, but will focus on bigger issues at this time. Links from one section to the next (and the reverse) would be helpful. The first page of the documentation refers to "message digests" without defining the term. That's off-putting. Furthermore, the association between "hash," which is in the library name, and "message digest" is weak. Not everyone considering your library knows about cryptographic hashes and the terminology relating to them. __________ Objectives This section is, apparently, a Motivation section, but doesn't serve that role well. You need to focus on the person just considering your library. Why should they use it? What makes it better, easier, faster than something else? Listing reasons for having created the library should be part of a Rationale section, or similar. Thus, you might include something more like the following: "Computing hashes (message digests?) is useful for verifying the integrity of data. The creator of the data can compute the hash and the user of the data can, upon receipt of the data, compute a new hash and compare it with the creator's hash, to ensure the data received is the data created originally. "Hashes can be computed using various algorithms of differing resistance to compromise from random data permutations and malicious alterations. The increased resistance usually comes at a runtime cost of computation or of hash size. A common alternative for this sort of checking is a cyclic redundancy check (CRC), such as provided by Boost.CRC. The hashes in MaybeBoost.Hash, however, are stronger and, thus, more secure. "The Boost.UUID library needed a random number generator based upon a strong hashing function to ensure ID uniqueness. While Boost.UUID provided one hash that served well, it didn't include support for an important alternative. MaybeBoost.Hash provides the original hash, the one Boost.UUID was missing, and more, all in a reusable and extensible framework. "MaybeBoost.Hash uses generic programming to reuse common algorithms with hash-specific policies to account for differences in input sizes, hashing algorithms, etc. The library provides flexible access to the hashing process and provides convenient output options." ___________ Quickstart In the bullet 2 table, the SHA-256 policy should be sha2<256>, right? Why include CubeHash when it is merely a candidate and you recommend not using it? Including some of the others is useful merely because there are existing demands for their use. Is the same true of CubeHash? ___________ Functions To what bits does, "All the bits in the provided values are used," refer? Did you mean something like, "All bits in the input range values are used?" Bullet 1: "As a pair on input iterators" Bullet 2: "As an input iterator and a length" Do you support Boost.Range/RangeEx? I don't understand to what the Portability Note applies. Where is the uint_t<N>::exact to be used? Do you mean the data type in the input range? The Alternative may make more sense when you clarify the "all bits" part I raised above. However, if I'm right, then this text could be clearer as, "If you need to use a subset of bits from the input range's values, then use StreamHash. The equivalent, albeit verbose, expression would be HashPolicy::stream_hash<N>::type().update(b, e).end_message()." Also, this should be paired with the previous "all bits" sentence by making it a footnote. __________ Concepts HashPolicy Concept Is "HashPolicy" a good name? Is "Hash" not enough? That is, the Hash Concept? You don't have "MessageDigestPolicy" or "StreamHashPolicy," so why "HashPolicy?" T::stream_hash<N>::type - A model of the StreamHash concept (link to that concept section) - ... T::digest_type - A model of the MessageDigest concept (link to that concept section) - The same type as T::stream_hash<N>::type::digest_type for all allowed values of N StreamHash Concept Notice that I removed the "<ValueBits>" part. I don't think that should be part of the concept name. T::digest_type - A model of the MessageDigest concept (link to that concept section) Rather than spelling out the operations that must be supported, why not list the concepts that must be supported, when reasonable: DefaultConstructible, CopyConstructible, CopyAssignable? What does, "Provided models are accessible through HashPolicy models," mean? _____ Rob Stewart robert.stewart@sig.com Software Engineer, Core Software using std::disclaimer; Susquehanna International Group, LLP http://www.sig.com IMPORTANT: The information contained in this email and/or its attachments is confidential. If you are not the intended recipient, please notify the sender immediately by reply and immediately delete this message and all its attachments. Any review, use, reproduction, disclosure or dissemination of this message or any attachment by an unintended recipient is strictly prohibited. Neither this message nor any attachment is intended as or should be construed as an offer, solicitation or recommendation to buy or sell any security or other financial instrument. Neither the sender, his or her employer nor any of their respective affiliates makes any warranties as to the completeness or accuracy of any of the information contained herein or that this message or any of its attachments is free of viruses.

On 6 May 2010 07:54, Stewart, Robert <Robert.Stewart@sig.com> wrote:
Links from one section to the next (and the reverse) would be helpful.
Added.
Not everyone considering your library knows about cryptographic hashes and the terminology relating to them. [...] Listing reasons for having created the library should be part of a Rationale section, or similar.
I've added an Introduction defining terms and suggesting usages, and moved the history to a Rationale.
In the bullet 2 table, the SHA-256 policy should be sha2<256>, right?
Yes.
Why include CubeHash when it is merely a candidate and you recommend not using it? Including some of the others is useful merely because there are existing demands for their use. Is the same true of CubeHash?
I knew I needed to include MD5 and the SHAs as they seem to be the most used. They are, however, constructed in the same way out of block cyphers with similar implementations, which is part of the reason NIST is currently holding its SHA-3 competition. I included CubeHash to ensure that different styles of hashes would still fit nicely in the concepts. (CubeHash is not based around a block cypher at all.) I'd like to add more second-round SHA-3 candidates as well (such as the sponge-construction-based Keccak and the tweakable-block-cypher-based Skein), but decided to get input on what I had before implementing more models.
To what bits does, "All the bits in the provided values are used," refer? Did you mean something like, "All bits in the input range values are used?"
Yes. I've changed it to use your phrasing.
Do you support Boost.Range/RangeEx?
Yes. The documentation now has a "single-pass range of readable iterators".
I don't understand to what the Portability Note applies. [...] The Alternative may make more sense when you clarify the "all bits" part I raised above.
I've rephrased both these sections; Let me know whether the changes are any better.
Is "HashPolicy" a good name? Is "Hash" not enough? That is, the Hash Concept? You don't have "MessageDigestPolicy" or "StreamHashPolicy," so why "HashPolicy?"
I put "Policy" in that name to emphasize that its usual usage would be as a policy template argument, rather than providing any functions. I've spelt that out explicitly now, and changed the concept name to HashAlgorithm. I'd prefer not to use just "Hash" since some people refer to digests as hashes, so I think being more explicit would be better.
[...] link to that concept section [...]
Added.
StreamHash Concept
Notice that I removed the "<ValueBits>" part. I don't think that should be part of the concept name.
The thing is that a StreamHash that looks at the bottom 4 bits of a value in update_one is semantically very different one one that looks at all 8 bits, so I considered that important enough to be part of the concept name. It also simplifies the description of HashAlgorithm, since it needs to say somehow that stream_hash<4>::type is different from stream_hash<8>::type.
Rather than spelling out the operations that must be supported, why not list the concepts that must be supported, when reasonable: DefaultConstructible, CopyConstructible, CopyAssignable?
Done.
What does, "Provided models are accessible through HashPolicy models," mean?
I've changed that part to read, "Each HashAlgorithm model provides access to all its associated StreamHash models; Those StreamHash models are generally not accessible in other ways." It's not essential though, and could be removed if it's more confusing than helpful. Thanks for the extensive feedback, ~ Scott McMurray

Scott McMurray wrote:
On 6 May 2010 07:54, Stewart, Robert <Robert.Stewart@sig.com> wrote:
Not everyone considering your library knows about cryptographic hashes and the terminology relating to them. [...] Listing reasons for having created the library should be part of a Rationale section, or similar.
I've added an Introduction defining terms and suggesting usages, and moved the history to a Rationale.
You launch right into describing cryptographic hashes rather than the library, making the Introduction an introduction to cryptographic hashes. Perhaps this would work well: "The MaybeBoost Hash library provides concept-based hashing algorithms of various kinds for use in cryptography, mostly as part of authentication schemes, plus corruption detection, digital fingerprinting, and duplication detection. For more information on cryptographic hashing, see the <Wikipedia article>. "A hashing algorithm produces a small encoded representation of a message called a _digest_ (sometimes known as a _hash_). The digest is most useful when it is unique to that message alone and cannot be used to recreate the original message. There are numerous hashing algorithms that satisfy these criteria to differing degrees, with variations in digest size and processing time required. "To be more specific, a hashing algorithm H takes a message M and produces a B-bit digest, D = H(M), in such a way that the following properties hold:" Notice that I removed the emphasis on "cryptographic" in the foregoing as I think you intend the library to include more than those algorithms. For clarity, perhaps use boldface on the terms or capitalize all words in the terms (when defining them, at least). "Second preimage resistance" and "Collision resistance" seem alike to me, based upon the explanations. Since both assert difficulty in H(M) = H(M'), where M and M' are different messages, what is different? In, "As their name suggests," the antecedent for "their" is vague. I suggest, "As the name cryptographic hash suggests, such algorithms have many uses...." As you can tell, I've reordered some of the content anyway.
To what bits does, "All the bits in the provided values are used," refer? Did you mean something like, "All bits in the input range values are used?"
Yes. I've changed it to use your phrasing.
Do you support Boost.Range/RangeEx?
Yes. The documentation now has a "single-pass range of readable iterators".
I don't understand to what the Portability Note applies. [...] The Alternative may make more sense when you clarify the "all bits" part I raised above.
I've rephrased both these sections; Let me know whether the changes are any better.
Much better.
Is "HashPolicy" a good name? Is "Hash" not enough? That is, the Hash Concept? You don't have "MessageDigestPolicy" or "StreamHashPolicy," so why "HashPolicy?"
I put "Policy" in that name to emphasize that its usual usage would be as a policy template argument, rather than providing any functions. I've spelt that out explicitly now, and changed the concept name to HashAlgorithm. I'd prefer not to use just "Hash" since some people refer to digests as hashes, so I think being more explicit would be better.
I like the new name and I used the hash <=> digest part in my Introduction rewrite.
StreamHash Concept
Notice that I removed the "<ValueBits>" part. I don't think that should be part of the concept name.
The thing is that a StreamHash that looks at the bottom 4 bits of a value in update_one is semantically very different one one that looks at all 8 bits, so I considered that important enough to be part of the concept name. It also simplifies the description of HashAlgorithm, since it needs to say somehow that stream_hash<4>::type is different from stream_hash<8>::type.
While those variations are semantically different, they all support the same concept, don't they? That is, StreamHash<2> is the same *concept* as StreamHash<4>, even if the implementation of the two differs.
What does, "Provided models are accessible through HashPolicy models," mean?
I've changed that part to read, "Each HashAlgorithm model provides access to all its associated StreamHash models; Those StreamHash models are generally not accessible in other ways." It's not essential though, and could be removed if it's more confusing than helpful.
s/models; Those/modesl; those/ That statement suggests that one shouldn't instantiate a stream_hash<N> directly, yet you don't say so. Perhaps you mean, "StreamHash models are typically generated from a HashAlgorithm by the expression ha::stream_hash<N>::type." That's a little more concrete and implies your intention of not creating some stream_hash<N> directly. _____ Rob Stewart robert.stewart@sig.com Software Engineer, Core Software using std::disclaimer; Susquehanna International Group, LLP http://www.sig.com IMPORTANT: The information contained in this email and/or its attachments is confidential. If you are not the intended recipient, please notify the sender immediately by reply and immediately delete this message and all its attachments. Any review, use, reproduction, disclosure or dissemination of this message or any attachment by an unintended recipient is strictly prohibited. Neither this message nor any attachment is intended as or should be construed as an offer, solicitation or recommendation to buy or sell any security or other financial instrument. Neither the sender, his or her employer nor any of their respective affiliates makes any warranties as to the completeness or accuracy of any of the information contained herein or that this message or any of its attachments is free of viruses.

Scott McMurray wrote:
I've boostified, cleaned up, and documented the Hash Library I mentioned here about a month ago. Comments, complaints, suggestions, and questions appreciated.
Source code is in the sandbox at http://svn.boost.org/svn/boost/sandbox/hash/
Documentation can be read online at http://www.cs.mcgill.ca/~smcmur/hash/
I was very pleased by how the implementation came together, as templates mean that all the hash algorithms share a substantial portion of their code.
Those details are all hidden from the user, though. Here's an example showing the simplest use of the library:
#include <boost/hash.hpp>
#include <iostream> #include <string>
int main() { std::string s = "Hello World!"; typedef boost::hash::sha2<256> HashPolicy; HashPolicy::digest_type digest = boost::hash::compute_digest<HashPolicy>(s); std::cout << digest << "\n"; }
Thanks, ~ Scott McMurray
To be honest, I just looked at the quickstart (very nice) and the performance: Performance is an issue, I think. You may also want to look at small entities. We hash many millions of small data entities each day (MD5). Also, adding comparison with openssl or cryptopp might be useful. Regards, Roland

On 6 May 2010 08:48, Roland Bock <rbock@eudoxos.de> wrote:
To be honest, I just looked at the quickstart (very nice)
Thanks.
Performance is an issue, I think. You may also want to look at small entities. We hash many millions of small data entities each day (MD5).
Also, adding comparison with openssl or cryptopp might be useful.
I did put a few of optimizations in. MD5, for example, used to be 300% slower than it is now. There is definitely plenty of opportunity to improve performance, however. Per-message overhead will definitely be a good thing to test. ~ Scott McMurray
participants (7)
-
Daniel James
-
Paul A. Bristow
-
Roland Bock
-
Scott McMurray
-
Stewart, Robert
-
Sylvester-Bradley, Gareth
-
vicente.botet