Message Hashing Interface (SHA-1/256/384/512, MD4/5)

I wrote up some SHA message hashers recently, which came out rather nicely using templates to handle the commonalities between SHA-1/256/384/512. The one issue that came up is how to handle retrieving the hash from the accumulator. When working with just the raw SHA compressor it's not a problem, but when hashing messages, getting the hash of the message requires appending padding and a length. That means that SHA and MD[45] hashing cannot use the same simple interface of Boost.CRC that, in essence, consists of calling process_byte() and inspecting the cheap, const member function hash() as needed. Here are a couple of options: 1) The hash() function is non-const, and it appends the passing and length, calculates the hash, and resets the accumulator to be ready for the next message. This is perhaps the most efficient interface. 2) The hash() function is const, and copies the accumulator, then pads and length-appends the copied version, getting the hash from there instead. That gives perhaps the nicest interface, but the extra copying would be somewhat expensive. That said, it's overhead per message, not dependant on the length of the message, so it may be acceptable. Also, the hash function would still be quite expensive, contrary to usual expectations for const member functions. 3) The hash() function is just a cheap const that returns the current state of the compressor. This would require an additional end_of_message() function that would apply the padding and probably things up such that the next process_byte function would start a new message by resetting the compressor. That would allow the hash() function to be cheaply called as many times as desired between calling end_of_message() and providing the data for the next message. One big problem with this one is that hash() can be called after providing data but before calling end_of_message(), at which point is return value is essentially useless. I think I like (1) best, because it allows the user to copy it and get (2) if they'd rather, and doesn't have the potential call sequencing problem of (3). Any opinions, suggestions, alternatives, or comments? ~ Scott McMurray P.S. Yes, the point here is to create Boost.MD and Boost.SHA libraries. The first client would be Boost.UUID to allow it to offer the MD5 name-based UUIDs it currently cannot. It needs a hashing concept so that it can parametrise name_generator similarly to how it currently parametrises basic_random_generator [1]. [1] http://www.boost.org/doc/libs/1_42_0/libs/uuid/uuid.html#Random%20Generator

Hi Scott, I, too, think that option 1 is best. Several other streaming message digest libraries use this pattern whereby the hash calculation can be updated with a given string of bytes as many times as required and a finalization routine produces the hash. Java's `MessageDigest` interface (http://java.sun.com/javase/6/docs/api/java/security/MessageDigest.html), for example, calls its "update" routines `update` and finalization routines `digest`. Using option 1 would probably make the library more intuitive for those who are used to this scheme and you can still accommodate users who would like to compute hashes at several points along the stream by ensuring that the copy constructor clones the state. Daniel On Sun, 04 Apr 2010 10:09:09 -0600, Scott McMurray <me22.ca+boost@gmail.com> wrote:
I wrote up some SHA message hashers recently, which came out rather nicely using templates to handle the commonalities between SHA-1/256/384/512.
The one issue that came up is how to handle retrieving the hash from the accumulator. When working with just the raw SHA compressor it's not a problem, but when hashing messages, getting the hash of the message requires appending padding and a length. That means that SHA and MD[45] hashing cannot use the same simple interface of Boost.CRC that, in essence, consists of calling process_byte() and inspecting the cheap, const member function hash() as needed.
Here are a couple of options:
1) The hash() function is non-const, and it appends the passing and length, calculates the hash, and resets the accumulator to be ready for the next message. This is perhaps the most efficient interface.
2) The hash() function is const, and copies the accumulator, then pads and length-appends the copied version, getting the hash from there instead. That gives perhaps the nicest interface, but the extra copying would be somewhat expensive. That said, it's overhead per message, not dependant on the length of the message, so it may be acceptable. Also, the hash function would still be quite expensive, contrary to usual expectations for const member functions.
3) The hash() function is just a cheap const that returns the current state of the compressor. This would require an additional end_of_message() function that would apply the padding and probably things up such that the next process_byte function would start a new message by resetting the compressor. That would allow the hash() function to be cheaply called as many times as desired between calling end_of_message() and providing the data for the next message. One big problem with this one is that hash() can be called after providing data but before calling end_of_message(), at which point is return value is essentially useless.
I think I like (1) best, because it allows the user to copy it and get (2) if they'd rather, and doesn't have the potential call sequencing problem of (3).
Any opinions, suggestions, alternatives, or comments?
~ Scott McMurray
P.S. Yes, the point here is to create Boost.MD and Boost.SHA libraries. The first client would be Boost.UUID to allow it to offer the MD5 name-based UUIDs it currently cannot. It needs a hashing concept so that it can parametrise name_generator similarly to how it currently parametrises basic_random_generator [1].
[1] http://www.boost.org/doc/libs/1_42_0/libs/uuid/uuid.html#Random%20Generator _______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost

On 04/04/2010 09:40 AM, Daniel Trebbien wrote:
Hi Scott,
I, too, think that option 1 is best. Several other streaming message digest libraries use this pattern whereby the hash calculation can be updated with a given string of bytes as many times as required and a finalization routine produces the hash. Java's `MessageDigest` interface (http://java.sun.com/javase/6/docs/api/java/security/MessageDigest.html), for example, calls its "update" routines `update` and finalization routines `digest`.
Using option 1 would probably make the library more intuitive for those who are used to this scheme and you can still accommodate users who would like to compute hashes at several points along the stream by ensuring that the copy constructor clones the state.
This is also the scheme used by crypto++ (C++), the Perl API, PyCrypto (Python) and openssl (C). I think the Boost crypto hash API should more closely mimic the existing de facto standard interfaces as much as possible. This will make it easier for those that program in multiple languages to make use of the Boost library. The MessageDigest interface (adapted for C++ & Boost) is the one I recommend be targeted for crypto hash functions rather than Boost CRC. The CRC interface is not robust enough to properly handle cryptographic hashes without a lot of extension in any case. References: http://www.dlitz.net/software/pycrypto/doc/#crypto-hash-hash-functions https://www.hasustorm.com/books/English/OReilly.Secure.Programming.Cookbook.... http://www.cryptopp.com/docs/ref/class_s_h_a1.html Rob

On 6 April 2010 10:55, Rob Riggs <rob@pangalactic.org> wrote:
This is also the scheme used by crypto++ (C++), the Perl API, PyCrypto (Python) and openssl (C). I think the Boost crypto hash API should more closely mimic the existing de facto standard interfaces as much as possible.
Thanks for the references. They're not completely consistent, though.
From Python: "digest(): Return the hash value of this hashing object, as a string containing 8-bit data. The object is not altered in any way by this function; you can continue updating the object after calling this function."
From Java6: "The digest method can be called once for a given number of updates. After digest has been called, the MessageDigest object is reset to its initialized state."
I do like the idea that digest() be const, since conceptually it's just inspecting the result of the previous updates. Here's a hybrid idea, with definitions to show equivalences: class hash { public: void update(byte); void reset() { *this = hash(); } digest_type digest() const { return hash(*this).end_of_message(); } digest_type end_message() { digest_type h = digest(); reset(); return h; } }; That way it provides for rolling or message-oriented digests, and adapts for the other as best it can if needed. Is that a reasonable compromise? ~ Scott P.S. Interestingly, with 64-bit processors, SHA-512 is faster for me than SHA-1.

On 04/06/2010 10:55 AM, Scott McMurray wrote:
On 6 April 2010 10:55, Rob Riggs<rob@pangalactic.org> wrote:
This is also the scheme used by crypto++ (C++), the Perl API, PyCrypto (Python) and openssl (C). I think the Boost crypto hash API should more closely mimic the existing de facto standard interfaces as much as possible.
Thanks for the references. They're not completely consistent, though.
I do like the idea that digest() be const, since conceptually it's just inspecting the result of the previous updates.
Here's a hybrid idea, with definitions to show equivalences:
class hash { public: void update(byte); void reset() { *this = hash(); } digest_type digest() const { return hash(*this).end_of_message(); } digest_type end_message() { digest_type h = digest(); reset(); return h; } };
That way it provides for rolling or message-oriented digests, and adapts for the other as best it can if needed.
Is that a reasonable compromise?
I like it! Rob

Here's a hybrid idea, with definitions to show equivalences:
class hash { public: void update(byte); void reset() { *this = hash(); } digest_type digest() const { return hash(*this).end_message(); } digest_type end_message() { digest_type h = digest(); reset(); return h; } };
That way it provides for rolling or message-oriented digests, and adapts for the other as best it can if needed.
Is that a reasonable compromise?
I like it!
Rob
+1 and one, minor suggestion: why not name `end_message` simply `end`? I am of the opinion that naming should follow the language's standard library conventions closely. And, as the STL uses short, concise member function names, it is best to mimic this.

On 10 April 2010 10:45, Daniel Trebbien <dtrebbien@gmail.com> wrote:
+1 and one, minor suggestion: why not name `end_message` simply `end`? I am of the opinion that naming should follow the language's standard library conventions closely. And, as the STL uses short, concise member function names, it is best to mimic this.
If you intend on shortening the name of this function (and I don't feel strongly about it one way or the other), I'd say pick a different word, such as finish(). end() has too many connotations connected with the STL. -- Nevin Liber <mailto:nevin@eviloverlord.com> (847) 691-1404

If you intend on shortening the name of this function (and I don't feel strongly about it one way or the other), I'd say pick a different word, such as finish(). end() has too many connotations connected with the STL.
True. `end` might be needed later for a pair of members `begin` and `end`. `finish` seems good to me.

On 10 April 2010 11:45, Daniel Trebbien <dtrebbien@gmail.com> wrote:
one, minor suggestion: why not name `end_message` simply `end`? I am of the opinion that naming should follow the language's standard library conventions closely. And, as the STL uses short, concise member function names, it is best to mimic this.
I picked that one thinking of the ASCII control codes END OF TEXT/TRANSMISSION/MEDIUM and the email convention of <eom>. And I realize it's not a member function, but STL isn't afraid of calling something lexicographical_compare if it needs to. As part of the http://en.wikipedia.org/wiki/Merkle%E2%80%93Damg%C3%A5rd_construction it's called "Finalization", and most precedents seem to use something like that. Crypto++ calls it "Final", the MD5 reference implementation uses "MD5Final", and the implementation provided for the "best understood" (according to NIST) SHA-3 candidate CubeHash (http://cubehash.cr.yp.to/simple.c) also calls it Final. To get an actual verb, I guess the most logical name would be "finalize()", but I suppose most libraries can't use that since their GC stole the name.

-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 04/04/2010 12:09 PM, Scott McMurray wrote:
I wrote up some SHA message hashers recently, which came out rather nicely using templates to handle the commonalities between SHA-1/256/384/512.
I realize I'm still new here, but having written a large library of hashing functions myself several years ago (purely for fun, never used... I was thinking of proposing it for Boost inclusion after XInt), I feel that I have a little experience to contribute.
The one issue that came up is how to handle retrieving the hash from the accumulator. [...]
Here are a couple of options:
1) The hash() function is non-const, and it appends the passing and length, calculates the hash, and resets the accumulator to be ready for the next message. This is perhaps the most efficient interface. [...]
I think I like (1) best, because it allows the user to copy it and get (2) if they'd rather, and doesn't have the potential call sequencing problem of (3).
Any opinions, suggestions, alternatives, or comments?
I agree with your opinion, with a slight variation: don't reset the accumulator until the user explicitly requests it. It seems counterintuitive, though I'm sure it could be made obvious with the proper function naming.
P.S. Yes, the point here is to create Boost.MD and Boost.SHA libraries. [...]
Is there a particular reason to separate the two (conceptually)? A Boost.CryptographicHash library, covering those and several other commonly used ones, seems more appealing. They would, of course, still be *logically* separate, they'd just be under the same umbrella name, and share the same interface. I'll contribute code for several more formats (MD2, RIPEMD 128/160/256/320, Tiger 128/160/192, Tiger2 128/160/192, and Whirlpool, all byte-oriented only), if you're interested. They probably need some work (they were written several years ago), but the core code is sound. I'm pretty sure none of them are patent-encumbered, but of course I'd make certain. - -- Chad Nelson Oak Circle Software, Inc. * * * -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.9 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAku58wkACgkQp9x9jeZ9/wT/BACcCF6oLl4NdoDfIGYT3//V9bof 5egAoJPJaZxIGBggCJlioBYLEdQeJF+R =7am8 -----END PGP SIGNATURE-----

On 5 April 2010 10:26, Chad Nelson <chad.thecomfychair@gmail.com> wrote:
I agree with your opinion, with a slight variation: don't reset the accumulator until the user explicitly requests it. It seems counterintuitive, though I'm sure it could be made obvious with the proper function naming.
Why do you prefer that method? Since get_digest adds padding, calling get_digest or process_byte again without a reset doesn't provide anything useful. Also, the reset is just a few copies, completely negligible compared to the cost of the one or two block compression, so omitting it saves nothing.
Is there a particular reason to separate the two (conceptually)? A Boost.CryptographicHash library, covering those and several other commonly used ones, seems more appealing. They would, of course, still be *logically* separate, they'd just be under the same umbrella name, and share the same interface.
They should definitely use the same interface. As mentioned, they need a Concept to which generic code can be written. I mostly just had them separate because the precedent is Boost.CRC, not Boost.ErrorDetectingCode.
I'll contribute code for several more formats (MD2, RIPEMD 128/160/256/320, Tiger 128/160/192, Tiger2 128/160/192, and Whirlpool, all byte-oriented only), if you're interested. They probably need some work (they were written several years ago), but the core code is sound.
More formats would be great, and is part of the reason for separate libraries, since I'm certain that others will want to add more formats, and shouldn't have to go through the first person to propose some, as Boost guidelines prefer for additions to existing libraries. Speaking of byte-oriented, one great bit about templating is that the preprocessing stage can be written generically and fed with different-size types (so bit, byte, and different words all work) and to different backends, so SHA-1/256/384/512 all use the same preprocessor. It's still efficient, too -- it works out the same or a few percent faster than the version Andy found for boost/uuid/sha1.hpp. Adding a few more algorithms will be an important test of this genericity, though.

-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 04/05/2010 12:44 PM, Scott McMurray wrote:
I agree with your opinion, with a slight variation: don't reset the accumulator until the user explicitly requests it. It seems counterintuitive, though I'm sure it could be made obvious with the proper function naming.
Why do you prefer that method? Since get_digest adds padding, calling get_digest or process_byte again without a reset doesn't provide anything useful. Also, the reset is just a few copies, completely negligible compared to the cost of the one or two block compression, so omitting it saves nothing.
Agreed, on all points. But unless the function is named such that it's very obvious that it resets the object as well as returns the hash value, people will be surprised that it reset itself when they didn't tell it to.
Is there a particular reason to separate the two (conceptually)? A Boost.CryptographicHash library, covering those and several other commonly used ones, seems more appealing. They would, of course, still be *logically* separate, they'd just be under the same umbrella name, and share the same interface.
They should definitely use the same interface. As mentioned, they need a Concept to which generic code can be written. I mostly just had them separate because the precedent is Boost.CRC, not Boost.ErrorDetectingCode.
Off the top of my head, I think Boost.CRC supports my argument, since I believe it includes Adler codes. They're a form of CRC, but aren't what most programmers think of when you say CRC.
I'll contribute code for several more formats (MD2, RIPEMD 128/160/256/320, Tiger 128/160/192, Tiger2 128/160/192, and Whirlpool, all byte-oriented only), if you're interested. They probably need some work (they were written several years ago), but the core code is sound.
More formats would be great, and is part of the reason for separate libraries, since I'm certain that others will want to add more formats, and shouldn't have to go through the first person to propose some, as Boost guidelines prefer for additions to existing libraries.
I don't know the details, but there may be precedent for an umbrella library name with different but related parts by different authors: Boost.Math.
Speaking of byte-oriented, one great bit about templating is that the preprocessing stage can be written generically and fed with different-size types (so bit, byte, and different words all work) and to different backends, so SHA-1/256/384/512 all use the same preprocessor. It's still efficient, too -- it works out the same or a few percent faster than the version Andy found for boost/uuid/sha1.hpp.
Adding a few more algorithms will be an important test of this genericity, though.
Just let me know if you'd like to see the code... and don't laugh too hard when you do, I was young(er) and foolish when I wrote it. :-) - -- Chad Nelson Oak Circle Software, Inc. * * * -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.9 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAku6WeMACgkQp9x9jeZ9/wSd0gCfSn/pvVjHfuCws1dtAWErhPCS wGcAn2l89EdKajIUiaE6s8saKlTXtO04 =Pc0g -----END PGP SIGNATURE-----

On 4 April 2010 11:09, Scott McMurray <me22.ca+boost@gmail.com> wrote:
2) The hash() function is const, and copies the accumulator, then pads and length-appends the copied version, getting the hash from there instead. That gives perhaps the nicest interface, but the extra copying would be somewhat expensive.
I prefer this interface. This is the same as Boost.CRC (and heck, std::string::c_str() for that matter). If you still wish to provide a faster but destructive one, just use another function. -- Nevin Liber <mailto:nevin@eviloverlord.com> (847) 691-1404

On 5 April 2010 20:01, Nevin Liber <nevin@eviloverlord.com> wrote:
On 4 April 2010 11:09, Scott McMurray <me22.ca+boost@gmail.com> wrote:
2) The hash() function is const, and copies the accumulator, then pads and length-appends the copied version, getting the hash from there instead. That gives perhaps the nicest interface, but the extra copying would be somewhat expensive.
I prefer this interface. This is the same as Boost.CRC (and heck, std::string::c_str() for that matter). If you still wish to provide a faster but destructive one, just use another function.
So x.hash() would then be copy(x).end_of_message()? I could go for that.
participants (5)
-
Chad Nelson
-
Daniel Trebbien
-
Nevin Liber
-
Rob Riggs
-
Scott McMurray