Data-processing and serialization design questions

If no one has noticed, I'm trying out an MD5 system on our Subversion server under "$ROOT/sandbox/md5/", hopefully to succeed Boost.CRC. (Let's use the skills & experience gained over the past... 7 years.) I'm sharing my design ideas to make sure I'm not missing anything.

1. The results of an MD5 run are encapsulated in the md5_digest class. It's a POD type, and the only supported operations are equality comparison and standard stream I/O. I couldn't think of anything else you would, or could, want to do with this type.

2. Although "everyone" writes MD5 as a byte-oriented algorithm, it is actually defined bit-wise. I've built my classes around bit-oriented operation, so maybe this will bring a new perspective. I also use 64-bit integer types directly, instead of a pair of 32-bit integers like the "standard" implementations. (All the recent work on Boost.Integer was done to enable this library.)

3. The computation of MD5 runs was originally encapsulated entirely in the md5_computer class. Like Boost.CRC's design, this lets a run be done piecemeal. There's also a function that does a single run over a buffer, which internally uses the computation class. Besides the actual computation work, the class had accessors for the current state of the computation. I noticed that a lot of this code could be reused for other coding schemes, so I tried to create a hierarchy that spread out the functionality, but it got too unwieldy. I decided the problem was that the presentation and computation parts of the class burden each other, so I split the class into the current md5_computer (presentation) and md5_context (computation) classes. The presentation class has a different hierarchy behind it, favoring generics over OOP, but still contains a computation object. The computation class publicly has a producer-generator & consumer-functor interface, and its attributes cannot be accessed except by the associated presentation class, which is a friend. The system is like the I/O streams' separation into stream and stream-buffer classes. (A rough sketch of this split follows after item 7.) Is this separation a good idea?

4. The presentation and computation classes have Boost.Serialization support. I originally planned to have serialization for the back-up classes too, but I read a thread from May 2007 suggesting that the serialization model should match the user's model, not the implementation details, so those were skipped.

5. Note that the presentation and computation types only support Boost.S11n, while the digest type only supports standard streams. Besides keeping the digest type POD, I didn't see a need to make the presentation/computation types printable. (Now I've just realized that you may want to save an MD5 digest in a data file. Maybe I'll work on serializing digests.)

6. Should the serialization routines stay in the classes' headers, or move to a separate implementation somewhere within Boost.Serialization (assuming this gets accepted, of course)? I think Boost.MultiIndex puts its s11n support in its own headers, but that's because its design must be intrusive. Serialization could be non-intrusive for the digest and presentation classes, but probably not for the computation class.

7. I'll probably try out the framework on at least one more coding type. (I've read that a framework with only one concrete class is probably locked to just that class, because the programmer never has to confirm separation of concerns during testing.)
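To make [1] and [3] concrete, here is a rough, minimal sketch of the interface shapes I have in mind. It is illustrative only: apart from the md5_digest, md5_computer, and md5_context names, the members and signatures below are my guesses written out for this post, not the actual sandbox code.

    #include <boost/cstdint.hpp>
    #include <cstddef>
    #include <ostream>

    // [1] The digest: a POD aggregate; only equality and stream I/O.
    struct md5_digest
    {
        boost::uint32_t  hash[ 4 ];   // the 128-bit result as four 32-bit words
    };

    bool            operator ==( md5_digest const &l, md5_digest const &r );
    bool            operator !=( md5_digest const &l, md5_digest const &r );
    std::ostream &  operator <<( std::ostream &o, md5_digest const &d );

    // [3] The computation core: bit-oriented, its state hidden from everything
    // except the presentation class (like basic_streambuf behind basic_ostream).
    class md5_context
    {
        friend class md5_computer;   // only the presentation class may peek

    public:
        void        operator ()( bool bit );   // consumer-functor: submit one bit
        md5_digest  operator ()() const;       // producer-generator: current digest

    private:
        boost::uint64_t  length_;             // message length as one 64-bit value [2]
        bool             queued_bits_[ 512 ]; // the pending block (see item 8)
        boost::uint32_t  state_[ 4 ];
    };

    // [3] The presentation wrapper: piecemeal submission plus state accessors,
    // containing (not inheriting from) a computation object.
    class md5_computer
    {
    public:
        void        process_bit( bool bit )  { context_( bit ); }
        void        process_byte( unsigned char byte );            // CHAR_BIT bit calls
        void        process_bytes( void const *p, std::size_t n ); // loops over bytes
        md5_digest  checksum() const         { return context_(); }

    private:
        md5_context  context_;
    };

    // The single-run convenience function over a whole buffer.
    md5_digest  md5( void const *buffer, std::size_t byte_count );

The intent is that md5_computer owns only how callers feed data in and read state out, while md5_context owns the actual hashing state; whether that division carries its weight is exactly the question in [3].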
8. The usual byte-wise MD5 implementation probably does have a speed advantage over this library, since it can just dump bytes directly into a buffer until hashing time and then compute everything a byte at a time. This library forces CHAR_BIT bit submissions for each byte submitted, and it gets worse for buffer submissions. The library also currently wastes space in the computation class, since it stores a 512-element Boolean array. (A "bool" may be implemented as an unsigned char, wasting CHAR_BIT - 1 bits, or even as an "int", wasting more space! This is probably why std::vector<bool> was invented.) Maybe switching to an unsigned-char array can fix both problems. (A rough sketch of that packing follows after item 9.)

8a. After the switch at the end of [8], a byte-oriented wrapping variant of md5_context could be made that copies bytes directly into the inner object's buffer and calls the hash-updater as needed. Note that the wrapping type would still do hash updates bit-wise; could/should they be byte-wise too? A byte-wise hashing optimization only works if CHAR_BIT is exactly 8 (not even for higher integral powers of two), so is the potential effort worth it? How would I test both cases (octet-sized bytes and not)? Should I test both cases? Note that the direct byte-copying part also has problems if 512 % CHAR_BIT != 0.

9. There are only Doxygen comments (which take up most of each file) for documentation. I guess that any QuickBook files would be more like general user guides.
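To make the suggestion at the end of [8] concrete, here is a minimal sketch of the packing idea, assuming the queued block moves from a 512-element bool array to a packed unsigned-char array. The class and member names (packed_block_sketch, push_bit, hash_block) are placeholders for this post, not sandbox code.

    #include <climits>   // CHAR_BIT

    // Sketch: keep the pending 512-bit block packed into unsigned chars.
    // The array size rounds up in case 512 is not a multiple of CHAR_BIT.
    class packed_block_sketch
    {
    public:
        packed_block_sketch()  : bits_(), used_( 0u )  {}

        void  push_bit( bool bit )
        {
            unsigned char &  byte = bits_[ used_ / CHAR_BIT ];
            unsigned const   pos  = used_ % CHAR_BIT;
            unsigned const   mask = 1u << ( CHAR_BIT - 1 - pos );  // MSB-first

            if ( bit )
                byte |= static_cast<unsigned char>( mask );
            else
                byte &= static_cast<unsigned char>( ~mask );

            if ( ++used_ == 512u )
            {
                // hash_block();   // the real class would update the MD state here
                used_ = 0u;
            }
        }

    private:
        unsigned char  bits_[ (512 + CHAR_BIT - 1) / CHAR_BIT ];
        unsigned       used_;   // number of bits currently queued
    };

A byte-oriented wrapper as in [8a] could then copy whole bytes straight into bits_ whenever used_ lands on a byte boundary, falling back to push_bit otherwise.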
--
Daryle Walker
Mac, Internet, and Video Game Junkie
darylew AT hotmail DOT com