A parallel einsum

I've written a small package, einsum2, for parallelizing some of the functionality in the einsum function of numpy.

einsum is a fantastic function for manipulating multidimensional arrays. In a nutshell, it's a generalization of matrix multiplication that allows for easy and readable vectorization for all sorts of operations. Unfortunately, it is entirely single-threaded, even though many einsum operations are embarrassingly parallel. By contrast, vanilla matrix or tensor operation (numpy.dot and numpy.tensordot) will be automatically parallel if you link numpy against a parallel implementation of BLAS (such as Intel MKL or OpenBLAS).

Unfortunately, expressing einsum in terms of dot or tensordot can lead to less readable and more verbose code, and in many cases is not even possible. One common example is "batched matrix multiplication", such as np.einsum("ijk,ikl->ijl", a, b), where we perform a matrix multiplication for each value of the index i. For example, i may be an index ranging over each datapoint in a dataset, and we are outputting a value for each point.

Batched matrix multiplication cannot be expressed as a dot or tensordot. It can be expressed in terms of another function, numpy.matmul, but this operation is still single-threaded (though this may eventually change – last year, Intel introduced batched GEMM operations).

With that in mind, I've written einsum2 to parallelize a subset of the functionality in einsum. It is also compatible with the awesome autograd package, and allows for automatic differentiation.

Under the hood, einsum2 is just a parallel implementation of batched matrix multiplication, and rearranges einsum expressions in terms of batched matrix multiplies. einsum2 can handle einsum expressions that can be rewritten as dot, tensordot, or matmul, but cannot handle einsum operations that involve repeated subscripts on the same array (these correspond to summing over diagonals in einsum). It also only works for two input arrays at a time.

I just wrote einsum2 today and it is still rather rough. Most notably, compiling is a bit awkward, especially on OSX. Any feedback or comments on how to improve would be appreciated!


Comments powered by Disqus