
Hi, a small heads-up: there is now a minimal PoC at https://github.com/Ulfgard/aBLAS . Minimal in the sense that I took my existing LinAlg, ripped it apart and partially rewrote it to fit the new needs of the library. Only the CPU backend is implemented so far. I am open to suggestions, helpful advice and, of course, people who are interested in working on it. I am not too happy with the scheduling interface right now, and its implementation looks a bit slower than necessary, but I think this will evolve over time. Two basic examples showing what the library can already do are given in examples/. For uBLAS users this should not look too foreign.

For everyone who is interested, here is the basic design and where things live in include/aBLAS/:

1. Computational kernels are implemented in kernels/ and represent the typical bindings to the BLAS1-3 functionality as well as a default implementation (currently only dot, gemv, gemm and assignment are tested and working; no explicit bindings are included yet). Kernels are enqueued in the scheduler via the expression template mechanism, and kernels are not allowed to enqueue other kernels recursively. (A toy sketch of a default kernel follows below the list.)

2. A simple PoC scheduler is implemented in scheduling/scheduling.hpp. It maintains a dependency graph between work packages, and work is enqueued into a boost::thread::basic_thread_pool once all of its dependencies are resolved. A kernel is enqueued together with a set of dependency_node objects which encapsulate the dependencies of the variables the kernel uses (i.e. every variable keeps track of its latest dependencies and whether those dependencies read from it or write to it); the second sketch below illustrates this bookkeeping. The current interface should be abstract enough to allow implementations based on different technologies (e.g. it should be possible to implement the scheduler in terms of HPX). One task of the scheduler is to allow the creation of closures in which variables are guaranteed to exist until all kernels using them have finished, as well as moving a variable into such a closure. This prevents an issue similar to the blocking destructor of std::future<T>: instead of blocking, the variable is moved into the scheduler, which then guarantees its lifetime until all kernels are finished. This of course requires the kernels to be called in a way that copes with types being moved. What is currently missing are user-created dependencies, to be used in conjunction with the GPU (as GPUs are fully asynchronous, we have to register a callback that notifies the scheduler when the GPU is done with its computations, just as the worker threads do).

3. Basic matrix/vector classes are implemented in matrix.hpp and vector.hpp. The implementation is a bit convoluted to make the "move into closure" work: basically, the classes introduce another indirection. When a kernel is created, it references a special closure type of the variable (vector<T>::closure_type), which references that indirection. (The third sketch below shows the idea.)

4. The remaining *.hpp files in include/aBLAS/ implement the expression templates, which are similar to uBLAS. Two kinds of expressions are distinguished using the CRTP classes matrix_expression<Type,Device> and vector_expression<Type,Device>, where Type is the exact type of the expression and Device marks the device the expression runs on (cpu_tag or gpu_tag). The second template parameter ensures that one cannot mix GPU and CPU expressions unless an operation explicitly allows this. While this looks clumsy, it removes code duplication between the different device types, as most of the code can be shared by both implementations. (The fourth sketch below shows the pattern.)
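To make point 1 concrete, here is a toy version of what a default (non-bound) gemv kernel does. Names and signatures are invented for illustration and do not match the actual interface in kernels/:

#include <cstddef>
#include <vector>

// toy dense types; the real matrix/vector classes are richer
using vec = std::vector<double>;
using mat = std::vector<vec>;

namespace kernels{

// default fallback for y += alpha * A x; an explicit BLAS binding
// (e.g. cblas_dgemv) would take over this role when available
void gemv(mat const& A, vec const& x, vec& y, double alpha){
    for(std::size_t i = 0; i != A.size(); ++i){
        double sum = 0;
        for(std::size_t j = 0; j != x.size(); ++j)
            sum += A[i][j] * x[j];
        y[i] += alpha * sum;
    }
}

}

int main(){
    mat A = {{1, 2}, {3, 4}};
    vec x = {1, 1}, y = {0, 0};
    kernels::gemv(A, x, y, 1.0); // y becomes {3, 7}
}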
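For point 2, here is a minimal, invented sketch of the read/write dependency tracking. It cheats in one respect: a task simply waits on its dependencies inside std::async instead of being enqueued into the thread pool only once they are resolved, and the real dependency_node looks different:

#include <future>
#include <iostream>
#include <vector>

// every variable owns one of these: its latest pending writer plus all
// readers that came after that write
struct dependency_node{
    std::shared_future<void> last_write;
    std::vector<std::shared_future<void>> reads;
};

// enqueue a kernel that reads srcs and writes dst
template<class Kernel>
void enqueue(dependency_node& dst, std::vector<dependency_node*> srcs, Kernel k){
    // collect everything the kernel has to wait for
    std::vector<std::shared_future<void>> deps = dst.reads;      // write-after-read
    if(dst.last_write.valid()) deps.push_back(dst.last_write);   // write-after-write
    for(auto* s: srcs)
        if(s->last_write.valid()) deps.push_back(s->last_write); // read-after-write
    auto task = std::async(std::launch::async,
        [deps = std::move(deps), k]{
            for(auto const& d: deps) d.wait(); // resolve dependencies first
            k();                               // then run the kernel
        }).share();
    // the kernel is now the latest writer of dst and a reader of each src
    dst.last_write = task;
    dst.reads.clear();
    for(auto* s: srcs) s->reads.push_back(task);
}

int main(){
    dependency_node a, b;
    enqueue(a, {},   []{ std::cout << "write a\n"; });
    enqueue(b, {&a}, []{ std::cout << "b = f(a)\n"; }); // runs after "write a"
    enqueue(a, {&b}, []{ std::cout << "a = g(b)\n"; }); // runs after both
    a.last_write.wait();
}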
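For point 3, the following toy shows why the extra indirection helps. Here std::shared_ptr stands in for the actual move-into-the-scheduler mechanism, so this is only the idea, not the implementation:

#include <functional>
#include <iostream>
#include <memory>
#include <vector>

template<class T>
class vector{
    // the storage sits behind one indirection...
    std::shared_ptr<std::vector<T>> m_storage = std::make_shared<std::vector<T>>();
public:
    // ...so a kernel references this closure instead of the variable itself
    struct closure_type{
        std::shared_ptr<std::vector<T>> storage; // keeps the data alive
        std::vector<T>& get() const { return *storage; }
    };
    closure_type closure() const { return {m_storage}; }
};

int main(){
    std::function<void()> kernel;
    {
        vector<double> v;
        v.closure().get() = {1, 2, 3};
        auto c = v.closure();
        kernel = [c]{
            double s = 0;
            for(double x: c.get()) s += x;
            std::cout << s << "\n"; // 6
        };
    } // v is gone; no blocking, the closure keeps the data alive
    kernel(); // still safe to run, e.g. from a worker thread
}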
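And for point 4, the device-tag pattern itself in stand-alone form; everything apart from the vector_expression idea is heavily simplified:

#include <cstddef>
#include <vector>

struct cpu_tag{};
struct gpu_tag{};

// CRTP base: E is the exact expression type, Device is cpu_tag or gpu_tag
template<class E, class Device>
struct vector_expression{
    E const& operator()() const { return static_cast<E const&>(*this); }
};

template<class T>
struct dense_vector: vector_expression<dense_vector<T>, cpu_tag>{
    std::vector<T> data;
    std::size_t size() const { return data.size(); }
    T operator[](std::size_t i) const { return data[i]; }
};

// lazy elementwise sum of two expressions living on the same device
template<class E1, class E2, class Device>
struct vector_sum: vector_expression<vector_sum<E1, E2, Device>, Device>{
    E1 const& lhs; E2 const& rhs;
    vector_sum(E1 const& l, E2 const& r): lhs(l), rhs(r){}
    std::size_t size() const { return lhs.size(); }
    auto operator[](std::size_t i) const { return lhs[i] + rhs[i]; }
};

// only defined for a common Device: adding a cpu expression to a gpu
// expression fails to compile because no Device can be deduced
template<class E1, class E2, class Device>
vector_sum<E1, E2, Device> operator+(
    vector_expression<E1, Device> const& a,
    vector_expression<E2, Device> const& b){
    return vector_sum<E1, E2, Device>(a(), b());
}

int main(){
    dense_vector<double> x, y;
    x.data = {1, 2, 3};
    y.data = {4, 5, 6};
    auto e = x + y;  // lazy: nothing is computed yet
    double v = e[1]; // evaluates 2 + 5 on demand
    (void)v;
}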
assignment.hpp implements the basic assignment operators (except op=). A += B either calls kernels::assign<>(A,B) if B can be evaluated efficiently elementwise (e.g. A += 2*A1+A2-A3), or it calls B.plus_assign_to(A), which assigns the terms one by one using their specialized kernels (e.g. A += prod(B,C)+D is evaluated as kernels::gemm(B,C,A); kernels::assign<...>(A,D);). matrix/vector_proxy.hpp implement subranges, rows-of-matrix operations, etc., and matrix/vector_expression.hpp the algebraic operations. A sketch of the assignment dispatch follows below.
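Again as an invented sketch, the dispatch in assignment.hpp boils down to something like this, with an evaluation-category tag standing in for however the library actually detects "cheap elementwise":

#include <cstddef>
#include <iostream>
#include <type_traits>
#include <vector>

struct elementwise_tag{}; // cheap per-element access, e.g. 2*A1+A2-A3
struct blockwise_tag{};   // needs its own kernel, e.g. prod(B,C)

using vec = std::vector<double>;

// an elementwise expression: alpha * x
struct scaled{
    using evaluation_category = elementwise_tag;
    double alpha; vec const& x;
    double operator[](std::size_t i) const { return alpha * x[i]; }
};

// a product-like expression that cannot be read elementwise cheaply;
// it adds itself to the target using its specialized kernel instead
struct prod_like{
    using evaluation_category = blockwise_tag;
    vec const& b; vec const& c;
    void plus_assign_to(vec& a) const {
        // stand-in for kernels::gemm(b, c, a)
        for(std::size_t i = 0; i != a.size(); ++i) a[i] += b[i] * c[i];
    }
};

namespace kernels{
// stand-in for kernels::assign<...>(A, B): one fused elementwise pass
template<class E>
void plus_assign(vec& a, E const& b){
    for(std::size_t i = 0; i != a.size(); ++i) a[i] += b[i];
}
}

// the dispatch itself
template<class E>
vec& operator+=(vec& a, E const& b){
    if constexpr(std::is_same_v<typename E::evaluation_category, elementwise_tag>)
        kernels::plus_assign(a, b); // evaluate elementwise in one pass
    else
        b.plus_assign_to(a);        // the expression picks its own kernels
    return a;
}

int main(){
    vec A = {1, 1, 1}, B = {1, 2, 3}, C = {2, 2, 2};
    A += scaled{2.0, B};       // elementwise path
    A += prod_like{B, C};      // specialized-kernel path
    std::cout << A[0] << "\n"; // 1 + 2*1 + 1*2 = 5
}

Best,
Oswin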