I've written a small package, einsum2, for parallelizing some of the functionality
of the einsum function of numpy.
einsum is a fantastic function for manipulating multidimensional arrays.
In a nutshell, it's a generalization of matrix multiplication that allows for easy and readable vectorization
for all sorts of operations.
Unfortunately, it is entirely single-threaded, even though many einsum operations
are embarrassingly parallel.
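To give a flavor of what einsum does, here are a few standard expressions (the arrays and shapes are just made-up examples):

```python
import numpy as np

a = np.random.rand(100, 50)
b = np.random.rand(50, 30)
m = np.random.rand(50, 50)

# ordinary matrix multiplication: c[i, j] = sum_k a[i, k] * b[k, j]
c = np.einsum("ik,kj->ij", a, b)

# trace of a square matrix: sum_i m[i, i]
t = np.einsum("ii->", m)

# row sums, a readable alternative to a.sum(axis=1)
row_sums = np.einsum("ik->i", a)
```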
By contrast, vanilla matrix or tensor operations (e.g. numpy.dot or
numpy.tensordot) will be automatically
parallel if you link
numpy against a parallel implementation of BLAS (such as Intel MKL or OpenBLAS).
However, rewriting einsum in terms of
tensordot can lead to less readable
and more verbose code, and in many cases is not even possible.
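When the rewrite is possible, it amounts to matching einsum subscripts to tensordot's axes argument, which is easy to get wrong. A small sketch of one case that does work:

```python
import numpy as np

a = np.random.rand(10, 20, 30)
b = np.random.rand(30, 40)

# contract the last axis of a against the first axis of b
c1 = np.einsum("ijk,kl->ijl", a, b)
c2 = np.tensordot(a, b, axes=(2, 0))  # sum over a's axis 2 and b's axis 0

assert np.allclose(c1, c2)
```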
One common example is "batched matrix multiplication", such as
np.einsum("ijk,ikl->ijl", a, b),
where we perform a matrix multiplication for each value of the index i.
For example, i may be an index ranging over each datapoint in a dataset,
and we are outputting a value for each point.
Batched matrix multiplication cannot be expressed as a tensordot.
It can be expressed in terms of another function,
numpy.matmul, but that operation is still single-threaded (though this may eventually change:
last year, Intel introduced batched GEMM operations).
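The equivalence between the einsum expression and matmul is easy to check directly (illustrative shapes only):

```python
import numpy as np

# one 3x4 and one 4x5 matrix for each of 1000 "datapoints"
a = np.random.rand(1000, 3, 4)
b = np.random.rand(1000, 4, 5)

c1 = np.einsum("ijk,ikl->ijl", a, b)
c2 = np.matmul(a, b)  # matmul broadcasts over the leading batch axis

assert np.allclose(c1, c2)
```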
Under the hood,
einsum2 is just a parallel implementation of batched matrix multiplication,
together with some logic for rewriting einsum expressions in terms of batched matrix multiplies.
einsum2 can handle
einsum expressions that can be rewritten in terms of
matmul, but cannot handle
einsum operations that involve repeated subscripts
on the same array (these correspond to summing over diagonals in einsum).
It also only works for two input arrays at a time.
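For supported expressions, einsum2 is meant to be called just like the two-argument einsum. A sketch of the intended usage (the import path below is my shorthand; see the repository for the actual API):

```python
import numpy as np
from einsum2 import einsum2  # assumed import path

a = np.random.rand(1000, 3, 4)
b = np.random.rand(1000, 4, 5)

# same result as np.einsum("ijk,ikl->ijl", a, b), but multithreaded
c = einsum2("ijk,ikl->ijl", a, b)
```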
I just wrote
einsum2 today and it is still rather rough.
Most notably, compiling is a bit awkward, especially on OSX.
Any feedback or comments on how to improve would be appreciated!