There is now a minimal PoC at https://github.com/Ulfgard/aBLAS . Minimal in the sense that I took my existing LinAlg, ripped it apart, and partially rewrote it to fit the new needs of the library. Only the CPU part is implemented yet. I am open to suggestions, helpful advice, and of course people who are interested in working on it. I am not too happy with the scheduling interface right now, and its implementation looks a bit slower than necessary, but I think this will evolve over time.
The scheduling should not be part of your library; it's a pure implementation detail.
Two basic examples showing what the library can already do are given in examples/. For uBLAS users this should not look too foreign.
I don't think you should make it explicit to the user whether a certain operation is async or not. Just make every operation async by default and let the library decide when to synchronize (i.e. whenever the actual matrix data is accessed). Any expression (like r += 2*x+1 + prod(x,y); btw, why prod(x,y) and not x * y?) should not execute any code directly, just construct an expression dependency tree which starts scheduling the asynchronous sub-tasks. You need to wait for the result only when somebody accesses an element of r. The easiest would be to let your matrix type wrap a future which references the actual matrix. Additional transformations of the expression tree could be used to merge operations, if needed.
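A minimal sketch of what I mean, assuming an invented async_matrix type (none of these names are from aBLAS):

    #include <cstddef>
    #include <future>
    #include <vector>

    // hypothetical sketch: a matrix wrapping a shared_future that
    // references the actual storage; expressions only schedule tasks,
    // element access is the synchronization point
    template<class T>
    class async_matrix {
    public:
        async_matrix(std::size_t rows, std::size_t cols,
                     std::shared_future<std::vector<T>> data)
        : m_rows(rows), m_cols(cols), m_data(std::move(data)) {}

        // accessing an element forces the pending computation to finish
        T operator()(std::size_t i, std::size_t j) const {
            return m_data.get()[i * m_cols + j]; // get() blocks until ready
        }

        std::size_t rows() const { return m_rows; }
        std::size_t cols() const { return m_cols; }
    private:
        std::size_t m_rows, m_cols;
        std::shared_future<std::vector<T>> m_data;
    };

An expression like r += 2*x+1 would then merely enqueue the task that produces the future's value; the first call to r(i,j) is where the caller actually waits.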
For all who are interested, here is the basic design and the locations in include/aBLAS/:
1. Computational kernels are implemented in kernels/ and represent the typical bindings to BLAS1-3 functionality as well as a default implementation (currently only dot, gemv, gemm, and assignment are tested and working; no explicit bindings are included yet). Kernels are enqueued in the scheduler via the expression template mechanism, and kernels are not allowed to enqueue other kernels recursively. A rough sketch of the kernel/binding split is given below.
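For readers unfamiliar with the pattern, the kernel/binding split usually looks something like this (a sketch, not the actual aBLAS code); a real binding would provide an overload forwarding to e.g. cblas_dgemm for dense double matrices:

    #include <cstddef>

    namespace kernels {

    // default fallback kernel computing C += A * B; an explicit BLAS
    // binding would replace this triple loop for supported types
    template<class MatA, class MatB, class MatC>
    void gemm(MatA const& A, MatB const& B, MatC& C) {
        for (std::size_t i = 0; i != C.rows(); ++i)
            for (std::size_t k = 0; k != A.cols(); ++k)
                for (std::size_t j = 0; j != C.cols(); ++j)
                    C(i, j) += A(i, k) * B(k, j);
    }

    } // namespace kernels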
Why do you implement your own BLAS functions?
2. A simple PoC scheduler is implemented in scheduling/scheduling.hpp. It implements a dependency graph between work packages, and work is enqueued into a boost::thread::basic_thread_pool once all its dependencies are resolved. A kernel is enqueued together with a set of dependency_node objects which encapsulate the dependencies of the variables the kernel uses (i.e. every variable keeps track of what its latest dependencies are and whether those dependencies read from it or write to it). The current interface should be abstracted enough to allow implementations using different technologies (e.g. it should be possible to implement the scheduler in terms of HPX).
One of the tasks of the scheduler is to allow the creation of closures where variables are guaranteed to exist until all kernels using them have finished, as well as moving a variable into a closure. This is used to prevent an issue similar to the blocking destructor of std::future<T>: instead of blocking, the variable is moved into the scheduler, which then ensures its lifetime until all kernels are finished. This of course requires the kernels to be called in a way that lets them cope with types being moved. A small illustration follows below.
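To illustrate the lifetime problem with an invented toy example (this is not the aBLAS interface): if the kernel closure co-owns the storage, destroying the user-side handle never has to block the way a std::future destructor would:

    #include <memory>
    #include <thread>
    #include <vector>

    // toy example: the closure keeps the data alive, so the caller's
    // handle can be destroyed while the kernel is still running
    template<class T>
    void enqueue_scale(std::shared_ptr<std::vector<T>> data, T factor) {
        std::thread([data, factor] {        // closure holds its own owner
            for (auto& x : *data) x *= factor;
        }).detach();                        // no blocking destructor here
    }

In aBLAS the scheduler plays the role of that closure: ownership migrates into it, and the storage dies once the last kernel referencing it has run.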
As said, the scheduler should be a pure implementation detail and not exposed to the user. I'd even suggest relying on executors for the scheduling needs, leaving the scheduler out of the picture completely. Also, if you need dependencies, those shouldn't be handled by the executor but rather imposed by the code using it (via futures or similar).
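A sketch of that style with plain std futures (HPX provides the same pattern with hpx::async plus executors; Matrix, make_input, and multiply are placeholders):

    #include <future>

    struct Matrix {};
    Matrix make_input()             { return {}; }
    Matrix multiply(Matrix, Matrix) { return {}; }

    int main() {
        auto a = std::async(std::launch::async, make_input).share();
        auto b = std::async(std::launch::async, make_input).share();

        // c depends on a and b; the dependency is expressed by the task
        // itself waiting on its input futures, not by a library scheduler
        auto c = std::async(std::launch::async,
                            [a, b] { return multiply(a.get(), b.get()); });
        c.wait();
    }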
What is currently missing is "user-created dependencies" to be used in conjunction with the GPU (as GPUs are fully asynchronous, we have to register a callback that notifies the scheduler when the GPU is done with its computations, just as the worker threads do).
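Such a user-created dependency could look roughly like this (names invented; a real node would also wake the scheduler rather than just flip a flag). A CUDA backend, for instance, could register on_gpu_done as a host callback through cudaLaunchHostFunc:

    #include <atomic>

    struct dependency_node {
        std::atomic<bool> done{false};
        void resolve() { done.store(true, std::memory_order_release); }
    };

    // host callback with the usual void* signature expected by driver APIs
    void on_gpu_done(void* user_data) {
        static_cast<dependency_node*>(user_data)->resolve();
    }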
3. Basic matrix/vector classes are implemented in matrix.hpp and vector.hpp. The implementation is a bit convoluted for the "move into closure" mechanism to work. Basically, they introduce another level of indirection: when a kernel is created, it references a special closure type of the variable (vector<T>::closure_type), which in turn references that indirection.
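A heavily simplified guess at the shape of that indirection (not the actual code):

    #include <memory>
    #include <vector>

    // simplified sketch: kernels capture closure_type, which references
    // the indirection rather than the vector object itself, so the
    // vector can be moved into the scheduler without invalidating
    // kernels that are already enqueued
    template<class T>
    class vector {
    public:
        using closure_type = std::shared_ptr<std::vector<T>>;
        closure_type closure() { return m_storage; }
    private:
        std::shared_ptr<std::vector<T>> m_storage =
            std::make_shared<std::vector<T>>();
    };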
4. The remaining *.hpp files in include/aBLAS/ implement the expression templates, which are similar to uBLAS. There are two kinds of expressions, distinguished using the CRTP classes matrix_expression<Type, Device> and vector_expression<Type, Device>. Type is the exact type of the expression; Device marks the device it is working on (cpu_tag or gpu_tag). The second template parameter ensures that one cannot mix GPU and CPU expressions unless the operation explicitly allows it. While this looks clumsy, it removes code duplication between the different device types, as most of the code can be used for both implementations (a condensed sketch is given after this item). assignment.hpp implements the basic assignment operators (except op=). A += B either calls kernels::assign<>(A,B) if B can be evaluated efficiently elementwise (e.g. A += 2*A1+A2-A3), or calls B.plus_assign_to(A), which then assigns the terms one by one using their specialized kernels (e.g. A += prod(B,C)+D is evaluated as kernels::gemm(B,C,A); kernels::assign<...>(A,D);).
matrix/vector_proxy.hpp implement subranges, rows-of-matrix operations, etc., and matrix/vector_expression.hpp the algebraic operations.
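Condensed, the device-tag pattern described above looks like this (a simplified sketch, not the actual aBLAS headers):

    struct cpu_tag {};
    struct gpu_tag {};

    // CRTP base: Type is the concrete expression, Device its device tag
    template<class Type, class Device>
    struct matrix_expression {
        Type const& operator()() const {
            return static_cast<Type const&>(*this);
        }
    };

    template<class E1, class E2>
    struct matrix_addition;  // expression node, definition omitted

    // matches only operands with the same Device, so mixing cpu and
    // gpu expressions is rejected at compile time
    template<class E1, class E2, class Device>
    matrix_addition<E1, E2> operator+(matrix_expression<E1, Device> const&,
                                      matrix_expression<E2, Device> const&);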
Can I use other BLAS implementations, like for instance Intel MKL?

General note: I have the impression that, just like any decently sized system eventually embeds a half-baked and bug-ridden implementation of LISP, any asynchronous library nowadays contains its own half-assed implementation of a parallel runtime system. So let me ask again: why not use a full-blown, mature, and thoroughly tested runtime system for your needs instead of worrying about how to schedule the 5 dependent tasks your library creates?

Regards Hartmut
---------------
http://boost-spirit.com
http://stellar.cct.lsu.edu