Data-processing and serialization design questions

If no one has noticed, I'm trying out an MD5 system on our Subversion server under "$ROOT/sandbox/md5/", hopefully to succeed Boost.CRC. (Let's use the skills & experience gained over the past... 7 years.) I'm sharing my design ideas to make sure I'm not missing anything.

1. The results of an MD5 run are encapsulated in the md5_digest class. It's a POD type, and the only supported operations are equality comparison and standard stream I/O. I couldn't think of anything else you would, or could, want to do with this type.

2. Although "everyone" writes MD5 as a byte-oriented algorithm, it is actually defined bit-wise. I've built my classes around bit-oriented operation, so maybe this will bring a new perspective. I also use 64-bit integer types directly, instead of a pair of 32-bit integers like the "standard" implementations. (All the recent work on Boost.Integer was done to enable this library.)

3. The computation of MD5 runs was originally encapsulated entirely in the md5_computer class. Like Boost.CRC's design, this lets a run be done piecemeal. There's also a function that does a single run over a buffer, which internally uses the computation class. Besides the actual computation work, the class had accessors for the current state of the computation. I noticed that a lot of this code could be reused for other coding schemes, so I tried to create a hierarchy that spread out the functionality, but it got too unwieldy. I decided the problem was that the presentation and computation parts of the class burden each other, so I split the class into the current md5_computer (presentation) and md5_context (computation) classes. The presentation class has a different hierarchy behind it, favoring generics over OOP, but still contains a computation object. The computation class publicly has a producer-generator & consumer-functor interface, and its attributes cannot be accessed except by the associated presentation class, which is a friend. The system is like the I/O streams' separation into stream and stream-buffer classes. (A rough sketch of this split follows after item 7.) Is this separation a good idea?

4. The presentation and computation classes have Boost.Serialization support. I originally planned to have serialization for the back-up classes too, but I read a thread from May 2007 suggesting that the serialization model should match the user's model, not the implementation details, so those were skipped.

5. Note that the presentation and computation types only support Boost.S11n, while the digest type only supports standard streams. Besides keeping the digest type POD, I didn't see a need to make the presentation/computation types printable. (Now I've just realized that you may want to save an MD5 digest in a data file. Maybe I'll work on serializing digests.)

6. Should the serialization routines stay in the classes' headers, or move to a separate implementation somewhere within Boost.Serialization (assuming this gets accepted, of course)? I think Boost.MultiIndex puts its s11n support in its own headers, but that's because its design must be intrusive. Serialization could be non-intrusive for the digest and presentation classes, but probably not for the computation class.

7. I'll probably try out the framework on at least one more coding type. (I've read that a framework with only one concrete class is probably locked to just that class, because the programmer never has to confirm separation of concerns during testing.)
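To make [1] and [3] concrete, here is a rough, minimal sketch of the interface shapes I have in mind. It is illustrative only: apart from the md5_digest, md5_computer, and md5_context names, the members and signatures below are my guesses written out for this post, not the actual sandbox code.

    #include <boost/cstdint.hpp>
    #include <cstddef>
    #include <ostream>

    // [1] The digest: a POD aggregate; only equality and stream I/O.
    struct md5_digest
    {
        boost::uint32_t  hash[ 4 ];   // the 128-bit result as four 32-bit words
    };

    bool            operator ==( md5_digest const &l, md5_digest const &r );
    bool            operator !=( md5_digest const &l, md5_digest const &r );
    std::ostream &  operator <<( std::ostream &o, md5_digest const &d );

    // [3] The computation core: bit-oriented, its state hidden from everything
    // except the presentation class (like basic_streambuf behind basic_ostream).
    class md5_context
    {
        friend class md5_computer;   // only the presentation class may peek

    public:
        void        operator ()( bool bit );   // consumer-functor: submit one bit
        md5_digest  operator ()() const;       // producer-generator: current digest

    private:
        boost::uint64_t  length_;             // message length as one 64-bit value [2]
        bool             queued_bits_[ 512 ]; // the pending block (see item 8)
        boost::uint32_t  state_[ 4 ];
    };

    // [3] The presentation wrapper: piecemeal submission plus state accessors,
    // containing (not inheriting from) a computation object.
    class md5_computer
    {
    public:
        void        process_bit( bool bit )  { context_( bit ); }
        void        process_byte( unsigned char byte );            // CHAR_BIT bit calls
        void        process_bytes( void const *p, std::size_t n ); // loops over bytes
        md5_digest  checksum() const         { return context_(); }

    private:
        md5_context  context_;
    };

    // The single-run convenience function over a whole buffer.
    md5_digest  md5( void const *buffer, std::size_t byte_count );

The intent is that md5_computer owns only how callers feed data in and read state out, while md5_context owns the actual hashing state; whether that division carries its weight is exactly the question in [3].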
8. The usual byte-wise MD5 implementation probably does have a speed advantage over this library, since it can just dump bytes directly into a buffer until hashing time and then compute everything a byte at a time. This library forces CHAR_BIT bit submissions for each byte submitted, and it gets worse for buffer submissions. The library also currently wastes space in the computation class, since it stores a 512-element Boolean array. (A "bool" may be implemented as an unsigned char, wasting CHAR_BIT - 1 bits, or even as an "int", wasting more space! This is probably why std::vector<bool> was invented.) Maybe switching to an unsigned-char array can fix both problems. (A rough sketch of that packing follows after item 9.)

8a. After the switch at the end of [8], a byte-oriented wrapping variant of md5_context could be made that copies bytes directly into the inner object's buffer and calls the hash-updater as needed. Note that the wrapping type would still do hash updates bit-wise; could/should they be byte-wise too? A byte-wise hashing optimization only works if CHAR_BIT is exactly 8 (not even for higher integral powers of two), so is the potential effort worth it? How would I test both cases (octet-sized bytes and not)? Should I test both cases? Note that the direct byte-copying part also has problems if 512 % CHAR_BIT != 0.

9. There are only Doxygen comments (which take up most of each file) for documentation. I guess that any QuickBook files would be more like general user guides.
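To make the suggestion at the end of [8] concrete, here is a minimal sketch of the packing idea, assuming the queued block moves from a 512-element bool array to a packed unsigned-char array. The class and member names (packed_block_sketch, push_bit, hash_block) are placeholders for this post, not sandbox code.

    #include <climits>   // CHAR_BIT

    // Sketch: keep the pending 512-bit block packed into unsigned chars.
    // The array size rounds up in case 512 is not a multiple of CHAR_BIT.
    class packed_block_sketch
    {
    public:
        packed_block_sketch()  : bits_(), used_( 0u )  {}

        void  push_bit( bool bit )
        {
            unsigned char &  byte = bits_[ used_ / CHAR_BIT ];
            unsigned const   pos  = used_ % CHAR_BIT;
            unsigned const   mask = 1u << ( CHAR_BIT - 1 - pos );  // MSB-first

            if ( bit )
                byte |= static_cast<unsigned char>( mask );
            else
                byte &= static_cast<unsigned char>( ~mask );

            if ( ++used_ == 512u )
            {
                // hash_block();   // the real class would update the MD state here
                used_ = 0u;
            }
        }

    private:
        unsigned char  bits_[ (512 + CHAR_BIT - 1) / CHAR_BIT ];
        unsigned       used_;   // number of bits currently queued
    };

A byte-oriented wrapper as in [8a] could then copy whole bytes straight into bits_ whenever used_ lands on a byte boundary, falling back to push_bit otherwise.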
--
Daryle Walker
Mac, Internet, and Video Game Junkie
darylew AT hotmail DOT com