
Hi, a small heads-up: there is now a minimal PoC at https://github.com/Ulfgard/aBLAS . Minimal in the sense that I took my existing LinAlg, ripped it apart and partially rewrote it to fit the new needs of the library. Only the CPU backend is implemented so far. I am open to suggestions, helpful advice and, of course, people who are interested in working on it. I am not too happy with the scheduling interface right now, and its implementation looks a bit slower than necessary, but I think this will evolve over time. Two basic examples showing what the library can already do are given in examples/. For uBLAS users this should not look too foreign.

For everyone who is interested, here is the basic design and where things live in include/aBLAS/:

1. Computational kernels are implemented in kernels/ and represent the typical bindings to the BLAS1-3 functionality as well as a default implementation (currently only dot, gemv, gemm and assignment are tested and working; no explicit bindings are included yet). Kernels are enqueued in the scheduler via the expression template mechanism, and kernels are not allowed to enqueue other kernels recursively. (A toy sketch of a default kernel follows below the list.)

2. A simple PoC scheduler is implemented in scheduling/scheduling.hpp. It maintains a dependency graph between work packages, and work is enqueued into a boost::thread::basic_thread_pool once all of its dependencies are resolved. A kernel is enqueued together with a set of dependency_node objects which encapsulate the dependencies of the variables the kernel uses (i.e. every variable keeps track of its latest dependencies and whether those dependencies read from it or write to it); the second sketch below illustrates this bookkeeping. The current interface should be abstract enough to allow implementations based on different technologies (e.g. it should be possible to implement the scheduler in terms of HPX). One task of the scheduler is to allow the creation of closures in which variables are guaranteed to exist until all kernels using them have finished, as well as moving a variable into such a closure. This prevents an issue similar to the blocking destructor of std::future<T>: instead of blocking, the variable is moved into the scheduler, which then guarantees its lifetime until all kernels are finished. This of course requires the kernels to be called in a way that copes with types being moved. What is currently missing are user-created dependencies, to be used in conjunction with the GPU (as GPUs are fully asynchronous, we have to register a callback that notifies the scheduler when the GPU is done with its computations, just as the worker threads do).

3. Basic matrix/vector classes are implemented in matrix.hpp and vector.hpp. The implementation is a bit convoluted to make the "move into closure" work: basically, the classes introduce another indirection. When a kernel is created, it references a special closure type of the variable (vector<T>::closure_type), which references that indirection. (The third sketch below shows the idea.)

4. The remaining *.hpp files in include/aBLAS/ implement the expression templates, which are similar to uBLAS. Two kinds of expressions are distinguished using the CRTP classes matrix_expression<Type,Device> and vector_expression<Type,Device>, where Type is the exact type of the expression and Device marks the device the expression runs on (cpu_tag or gpu_tag). The second template parameter ensures that one cannot mix GPU and CPU expressions unless an operation explicitly allows this. While this looks clumsy, it removes code duplication between the different device types, as most of the code can be shared by both implementations. (The fourth sketch below shows the pattern.)
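To make point 1 concrete, here is a toy version of what a default (non-bound) gemv kernel does. Names and signatures are invented for illustration and do not match the actual interface in kernels/:

#include <cstddef>
#include <vector>

// toy dense types; the real matrix/vector classes are richer
using vec = std::vector<double>;
using mat = std::vector<vec>;

namespace kernels{

// default fallback for y += alpha * A x; an explicit BLAS binding
// (e.g. cblas_dgemv) would take over this role when available
void gemv(mat const& A, vec const& x, vec& y, double alpha){
    for(std::size_t i = 0; i != A.size(); ++i){
        double sum = 0;
        for(std::size_t j = 0; j != x.size(); ++j)
            sum += A[i][j] * x[j];
        y[i] += alpha * sum;
    }
}

}

int main(){
    mat A = {{1, 2}, {3, 4}};
    vec x = {1, 1}, y = {0, 0};
    kernels::gemv(A, x, y, 1.0); // y becomes {3, 7}
}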
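For point 2, here is a minimal, invented sketch of the read/write dependency tracking. It cheats in one respect: a task simply waits on its dependencies inside std::async instead of being enqueued into the thread pool only once they are resolved, and the real dependency_node looks different:

#include <future>
#include <iostream>
#include <vector>

// every variable owns one of these: its latest pending writer plus all
// readers that came after that write
struct dependency_node{
    std::shared_future<void> last_write;
    std::vector<std::shared_future<void>> reads;
};

// enqueue a kernel that reads srcs and writes dst
template<class Kernel>
void enqueue(dependency_node& dst, std::vector<dependency_node*> srcs, Kernel k){
    // collect everything the kernel has to wait for
    std::vector<std::shared_future<void>> deps = dst.reads;      // write-after-read
    if(dst.last_write.valid()) deps.push_back(dst.last_write);   // write-after-write
    for(auto* s: srcs)
        if(s->last_write.valid()) deps.push_back(s->last_write); // read-after-write
    auto task = std::async(std::launch::async,
        [deps = std::move(deps), k]{
            for(auto const& d: deps) d.wait(); // resolve dependencies first
            k();                               // then run the kernel
        }).share();
    // the kernel is now the latest writer of dst and a reader of each src
    dst.last_write = task;
    dst.reads.clear();
    for(auto* s: srcs) s->reads.push_back(task);
}

int main(){
    dependency_node a, b;
    enqueue(a, {},   []{ std::cout << "write a\n"; });
    enqueue(b, {&a}, []{ std::cout << "b = f(a)\n"; }); // runs after "write a"
    enqueue(a, {&b}, []{ std::cout << "a = g(b)\n"; }); // runs after both
    a.last_write.wait();
}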
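For point 3, the following toy shows why the extra indirection helps. Here std::shared_ptr stands in for the actual move-into-the-scheduler mechanism, so this is only the idea, not the implementation:

#include <functional>
#include <iostream>
#include <memory>
#include <vector>

template<class T>
class vector{
    // the storage sits behind one indirection...
    std::shared_ptr<std::vector<T>> m_storage = std::make_shared<std::vector<T>>();
public:
    // ...so a kernel references this closure instead of the variable itself
    struct closure_type{
        std::shared_ptr<std::vector<T>> storage; // keeps the data alive
        std::vector<T>& get() const { return *storage; }
    };
    closure_type closure() const { return {m_storage}; }
};

int main(){
    std::function<void()> kernel;
    {
        vector<double> v;
        v.closure().get() = {1, 2, 3};
        auto c = v.closure();
        kernel = [c]{
            double s = 0;
            for(double x: c.get()) s += x;
            std::cout << s << "\n"; // 6
        };
    } // v is gone; no blocking, the closure keeps the data alive
    kernel(); // still safe to run, e.g. from a worker thread
}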
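And for point 4, the device-tag pattern itself in stand-alone form; everything apart from the vector_expression idea is heavily simplified:

#include <cstddef>
#include <vector>

struct cpu_tag{};
struct gpu_tag{};

// CRTP base: E is the exact expression type, Device is cpu_tag or gpu_tag
template<class E, class Device>
struct vector_expression{
    E const& operator()() const { return static_cast<E const&>(*this); }
};

template<class T>
struct dense_vector: vector_expression<dense_vector<T>, cpu_tag>{
    std::vector<T> data;
    std::size_t size() const { return data.size(); }
    T operator[](std::size_t i) const { return data[i]; }
};

// lazy elementwise sum of two expressions living on the same device
template<class E1, class E2, class Device>
struct vector_sum: vector_expression<vector_sum<E1, E2, Device>, Device>{
    E1 const& lhs; E2 const& rhs;
    vector_sum(E1 const& l, E2 const& r): lhs(l), rhs(r){}
    std::size_t size() const { return lhs.size(); }
    auto operator[](std::size_t i) const { return lhs[i] + rhs[i]; }
};

// only defined for a common Device: adding a cpu expression to a gpu
// expression fails to compile because no Device can be deduced
template<class E1, class E2, class Device>
vector_sum<E1, E2, Device> operator+(
    vector_expression<E1, Device> const& a,
    vector_expression<E2, Device> const& b){
    return vector_sum<E1, E2, Device>(a(), b());
}

int main(){
    dense_vector<double> x, y;
    x.data = {1, 2, 3};
    y.data = {4, 5, 6};
    auto e = x + y;  // lazy: nothing is computed yet
    double v = e[1]; // evaluates 2 + 5 on demand
    (void)v;
}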
assignment.hpp implements the basic assignment operators (except op=). A += B either calls kernels::assign<>(A,B) if B can be evaluated efficiently elementwise (e.g. A += 2*A1+A2-A3), or it calls B.plus_assign_to(A), which assigns the terms one by one using their specialized kernels (e.g. A += prod(B,C)+D is evaluated as kernels::gemm(B,C,A); kernels::assign<...>(A,D);). matrix/vector_proxy.hpp implement subranges, rows-of-matrix operations, etc., and matrix/vector_expression.hpp the algebraic operations. A sketch of the assignment dispatch follows below.
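Again as an invented sketch, the dispatch in assignment.hpp boils down to something like this, with an evaluation-category tag standing in for however the library actually detects "cheap elementwise":

#include <cstddef>
#include <iostream>
#include <type_traits>
#include <vector>

struct elementwise_tag{}; // cheap per-element access, e.g. 2*A1+A2-A3
struct blockwise_tag{};   // needs its own kernel, e.g. prod(B,C)

using vec = std::vector<double>;

// an elementwise expression: alpha * x
struct scaled{
    using evaluation_category = elementwise_tag;
    double alpha; vec const& x;
    double operator[](std::size_t i) const { return alpha * x[i]; }
};

// a product-like expression that cannot be read elementwise cheaply;
// it adds itself to the target using its specialized kernel instead
struct prod_like{
    using evaluation_category = blockwise_tag;
    vec const& b; vec const& c;
    void plus_assign_to(vec& a) const {
        // stand-in for kernels::gemm(b, c, a)
        for(std::size_t i = 0; i != a.size(); ++i) a[i] += b[i] * c[i];
    }
};

namespace kernels{
// stand-in for kernels::assign<...>(A, B): one fused elementwise pass
template<class E>
void plus_assign(vec& a, E const& b){
    for(std::size_t i = 0; i != a.size(); ++i) a[i] += b[i];
}
}

// the dispatch itself
template<class E>
vec& operator+=(vec& a, E const& b){
    if constexpr(std::is_same_v<typename E::evaluation_category, elementwise_tag>)
        kernels::plus_assign(a, b); // evaluate elementwise in one pass
    else
        b.plus_assign_to(a);        // the expression picks its own kernels
    return a;
}

int main(){
    vec A = {1, 1, 1}, B = {1, 2, 3}, C = {2, 2, 2};
    A += scaled{2.0, B};       // elementwise path
    A += prod_like{B, C};      // specialized-kernel path
    std::cout << A[0] << "\n"; // 1 + 2*1 + 1*2 = 5
}

Best,
Oswin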