On 19 Apr 2014, at 19:46, Steven Watanabe
AMDG
On 04/18/2014 02:00 AM, Thijs van den Berg wrote:
This algorithm is what they call the 4x64 variant. The paper also talks about a 2x64 and 4x32 variants (but e.g. not 8x32, 2x32). However I’m not sure if we need to implement all the variant the’ve studied (maybe yes, maybe no). They recommend 4x64 with 20 rounds as the best choice for CPU’s and that’s the main one I want to provide.
If the algorithm is basically the same and only differs by some constants in the mixing function, then these constants can be turned into template parameters. If the differences are more significant, then it should be handled by a separate engine template, and there's no need to worry about it at this point. Right now I’m having performance issues with just this 4x64 version & an arbitrary rounds extension.
* I have a un-rolled version “encrypt_counter()” that handles up to 20 rounds (or less), and which has all the constants hard coded in the code. This version times 13.8 CPU cycles for 20 rounds. * I have an arbitrary round version “encrypt_counter_0()” which looks much nicer from a code point of view, it has a loop and static const arrays containing the rotation constants (that repeat every 8 round). However, it takes 42 CPU cycles for the same 20 rounds. I’ve tried many variants and I just can’t make the loop version at fast as the un-rolled one… Perhaps someone can explain why that is? https://github.com/sitmo/threefry/blob/master/random/include/boost/random/th... * initially I had a 13 and 20 rounds version. The 13 round version was supposed to be the lowest rounds version to pass all the BigCrush test, but in my tests it failed one out of the 160 tests. That’s perhaps still better than MT which failed 2 or 3 tests (depending on the version of MT) but "failing only 1 test" is not a perfect selling point. Perhaps 14 round would solve that? The 20 rounds version advised in the paper has 7 extra round as "a big safety margin" and that one passes indeed all BigCrunch tests. My view now is: * I think we should just provide the 20 rounds version by default and allow developers to pick less rounds -and gain a little speed- if they know what they are doing. * I like the "big safety margin” for a default option, I gives you trust that you can pick this engine if you want the “the highest quality random numbers”. I think we shouldn’t predefine the 13 rounds version with a typedef. * The factor 2.5 in speed drop for writing an arbitrary round loop version is not worth it IMO. So I propose: * to only typedef the 20 rounds version, allow for less round via a template argument: * the predefined typedef engines could be named threefry4x64_20 (20 rounds, 32 bit integers result_type) and threefry4x64_20_64 (20 rounds, 64 bit integers result_type). This naming conventions allows all the other variant to be added later if we want to. * have an upper bound of 20 rounds instead of arbitrary rounds because of performance issues I can’t solve. ..this give an engine with the follow properties.. * It is the author’s pick * it is very fast, 13.8 CPU cycles, I don’t expect I can improve the speed of the un-rolled code any further * it has the highest random quality * it has great features for large scale parallel computing What’s your view on limiting the round to <=20 for the template and providing only the 20 round as a typedef? I have addressed most other points you’ve mentioned, but the performance issue of a generic rounds version has failed me.
AFAIK the state and the return_type need not be the same. Any 32 bit engine can be turned into a 64 engine by patching two returns together.
This can be done outside of threefry by independent_bits_engine.
In Christ, Steven Watanabe
_______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost