There is now a minimal PoC at https://github.com/Ulfgard/aBLAS . Minimal in the sense that I took my existing LinAlg, ripped it apart, and partially rewrote it to fit the new needs of the library. Only the CPU part is implemented yet. I am open to suggestions, helpful advice, and of course people who are interested in working on it. I am not too happy with the scheduling interface right now, and its implementation looks a bit slower than necessary, but I think this will evolve over time.
The scheduling should not be part of your library; it's a pure implementation detail.
Two basic examples showing what the library can already do are given in examples/. For uBLAS users this should not look too foreign.
I don't think you should make it explicit to the user whether a certain operation is async or not. Just make every operation async by default and let the library decide when to synchronize (i.e. whenever the actual matrix data is accessed). Any expression (like r += 2*x+1 + prod(x,y); btw, why prod(x,y) and not x * y?) should not execute any code directly, just construct an expression dependency tree which starts scheduling the asynchronous sub-tasks. You need to wait for the result only when somebody accesses an element of r. The easiest would be to let your matrix type wrap a future which references the actual matrix. Additional transformations of the expression tree could be used to merge operations, if needed.
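A minimal sketch of what I mean, assuming an invented async_matrix type (none of these names are from aBLAS):

    #include <cstddef>
    #include <future>
    #include <vector>

    // hypothetical sketch: a matrix wrapping a shared_future that
    // references the actual storage; expressions only schedule tasks,
    // element access is the synchronization point
    template<class T>
    class async_matrix {
    public:
        async_matrix(std::size_t rows, std::size_t cols,
                     std::shared_future<std::vector<T>> data)
        : m_rows(rows), m_cols(cols), m_data(std::move(data)) {}

        // accessing an element forces the pending computation to finish
        T operator()(std::size_t i, std::size_t j) const {
            return m_data.get()[i * m_cols + j]; // get() blocks until ready
        }

        std::size_t rows() const { return m_rows; }
        std::size_t cols() const { return m_cols; }
    private:
        std::size_t m_rows, m_cols;
        std::shared_future<std::vector<T>> m_data;
    };

An expression like r += 2*x+1 would then merely enqueue the task that produces the future's value; the first call to r(i,j) is where the caller actually waits.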
For all who are interested, here is the basic design and the locations in include/aBLAS/:
1. Computational kernels are implemented in kernels/ and represent the typical bindings to BLAS1-3 functionality as well as a default implementation (currently only dot, gemv, gemm, and assignment are tested and working; no explicit bindings are included yet). Kernels are enqueued in the scheduler via the expression template mechanism, and kernels are not allowed to enqueue other kernels recursively. A rough sketch of the kernel/binding split is given below.
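For readers unfamiliar with the pattern, the kernel/binding split usually looks something like this (a sketch, not the actual aBLAS code); a real binding would provide an overload forwarding to e.g. cblas_dgemm for dense double matrices:

    #include <cstddef>

    namespace kernels {

    // default fallback kernel computing C += A * B; an explicit BLAS
    // binding would replace this triple loop for supported types
    template<class MatA, class MatB, class MatC>
    void gemm(MatA const& A, MatB const& B, MatC& C) {
        for (std::size_t i = 0; i != C.rows(); ++i)
            for (std::size_t k = 0; k != A.cols(); ++k)
                for (std::size_t j = 0; j != C.cols(); ++j)
                    C(i, j) += A(i, k) * B(k, j);
    }

    } // namespace kernels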
Why do you implement your own BLAS functions?
2. A simple PoC scheduler is implemented in scheduling/scheduling.hpp. It implements a dependency graph between work packages, and work is enqueued into a boost::thread::basic_thread_pool once all its dependencies are resolved. A kernel is enqueued together with a set of dependency_node objects which encapsulate the dependencies of the variables the kernel uses (i.e. every variable keeps track of what its latest dependencies are and whether those dependencies read from it or write to it). The current interface should be abstracted enough to allow implementations using different technologies (e.g. it should be possible to implement the scheduler in terms of HPX).
One of the tasks of the scheduler is to allow the creation of closures where variables are guaranteed to exist until all kernels using them have finished, as well as moving a variable into a closure. This is used to prevent an issue similar to the blocking destructor of std::future<T>: instead of blocking, the variable is moved into the scheduler, which then ensures its lifetime until all kernels are finished. This of course requires the kernels to be called in a way that lets them cope with types being moved. A small illustration follows below.
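To illustrate the lifetime problem with an invented toy example (this is not the aBLAS interface): if the kernel closure co-owns the storage, destroying the user-side handle never has to block the way a std::future destructor would:

    #include <memory>
    #include <thread>
    #include <vector>

    // toy example: the closure keeps the data alive, so the caller's
    // handle can be destroyed while the kernel is still running
    template<class T>
    void enqueue_scale(std::shared_ptr<std::vector<T>> data, T factor) {
        std::thread([data, factor] {        // closure holds its own owner
            for (auto& x : *data) x *= factor;
        }).detach();                        // no blocking destructor here
    }

In aBLAS the scheduler plays the role of that closure: ownership migrates into it, and the storage dies once the last kernel referencing it has run.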
As said, the scheduler should be a pure implementation detail and not exposed to the user. I'd even suggest relying on executors for the scheduling needs, leaving the scheduler out of the picture completely. Also, if you need dependencies, those shouldn't be handled by the executor but rather imposed by the code using it (via futures or similar).
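A sketch of that style with plain std futures (HPX provides the same pattern with hpx::async plus executors; Matrix, make_input, and multiply are placeholders):

    #include <future>

    struct Matrix {};
    Matrix make_input()             { return {}; }
    Matrix multiply(Matrix, Matrix) { return {}; }

    int main() {
        auto a = std::async(std::launch::async, make_input).share();
        auto b = std::async(std::launch::async, make_input).share();

        // c depends on a and b; the dependency is expressed by the task
        // itself waiting on its input futures, not by a library scheduler
        auto c = std::async(std::launch::async,
                            [a, b] { return multiply(a.get(), b.get()); });
        c.wait();
    }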
What is currently missing is "user-created dependencies" to be used in conjunction with the GPU (as GPUs are fully asynchronous, we have to register a callback that notifies the scheduler when the GPU is done with its computations, just as the worker threads do).
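Such a user-created dependency could look roughly like this (names invented; a real node would also wake the scheduler rather than just flip a flag). A CUDA backend, for instance, could register on_gpu_done as a host callback through cudaLaunchHostFunc:

    #include <atomic>

    struct dependency_node {
        std::atomic<bool> done{false};
        void resolve() { done.store(true, std::memory_order_release); }
    };

    // host callback with the usual void* signature expected by driver APIs
    void on_gpu_done(void* user_data) {
        static_cast<dependency_node*>(user_data)->resolve();
    }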
3. Basic matrix/vector classes are implemented in matrix.hpp and vector.hpp. The implementation is a bit convoluted for the "move into closure" mechanism to work. Basically, they introduce another level of indirection: when a kernel is created, it references a special closure type of the variable (vector<T>::closure_type), which in turn references that indirection.
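A heavily simplified guess at the shape of that indirection (not the actual code):

    #include <memory>
    #include <vector>

    // simplified sketch: kernels capture closure_type, which references
    // the indirection rather than the vector object itself, so the
    // vector can be moved into the scheduler without invalidating
    // kernels that are already enqueued
    template<class T>
    class vector {
    public:
        using closure_type = std::shared_ptr<std::vector<T>>;
        closure_type closure() { return m_storage; }
    private:
        std::shared_ptr<std::vector<T>> m_storage =
            std::make_shared<std::vector<T>>();
    };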
4. The remaining *.hpp files in include/aBLAS/ implement the expression templates, which are similar to uBLAS. There are two kinds of expressions, distinguished using the CRTP classes matrix_expression<Type, Device> and vector_expression<Type, Device>. Type is the exact type of the expression; Device marks the device it is working on (cpu_tag or gpu_tag). The second template parameter ensures that one cannot mix GPU and CPU expressions unless the operation explicitly allows it. While this looks clumsy, it removes code duplication between the different device types, as most of the code can be used for both implementations (a condensed sketch is given after this item). assignment.hpp implements the basic assignment operators (except op=). A += B either calls kernels::assign<>(A,B) if B can be evaluated efficiently elementwise (e.g. A += 2*A1+A2-A3), or calls B.plus_assign_to(A), which then assigns the terms one by one using their specialized kernels (e.g. A += prod(B,C)+D is evaluated as kernels::gemm(B,C,A); kernels::assign<...>(A,D);).
matrix/vector_proxy.hpp implement subranges, rows-of-matrix operations, etc., and matrix/vector_expression.hpp the algebraic operations.
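Condensed, the device-tag pattern described above looks like this (a simplified sketch, not the actual aBLAS headers):

    struct cpu_tag {};
    struct gpu_tag {};

    // CRTP base: Type is the concrete expression, Device its device tag
    template<class Type, class Device>
    struct matrix_expression {
        Type const& operator()() const {
            return static_cast<Type const&>(*this);
        }
    };

    template<class E1, class E2>
    struct matrix_addition;  // expression node, definition omitted

    // matches only operands with the same Device, so mixing cpu and
    // gpu expressions is rejected at compile time
    template<class E1, class E2, class Device>
    matrix_addition<E1, E2> operator+(matrix_expression<E1, Device> const&,
                                      matrix_expression<E2, Device> const&);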
Can I use other BLAS implementations, like for instance Intel MKL?

General note: I have the impression that, just like any decently sized system eventually embeds a half-baked and bug-ridden implementation of LISP, any asynchronous library nowadays contains its own half-assed implementation of a parallel runtime system. So let me ask again: why not use a full-blown, mature, and thoroughly tested runtime system for your needs instead of worrying about how to schedule the 5 dependent tasks your library creates?

Regards Hartmut
---------------
http://boost-spirit.com
http://stellar.cct.lsu.edu