Hsin-Po's Website

Distributed Computation Paper

I have once worked on distributed computation of matrix-matrix multiplication.

[PlutoCharon20] deals distributed matrix multiplication (DMM), where the workers might straggle or crash, by combining ideas from fast matrix multiplication (FMM). By MM we mean the computation of $C = A \times B$, where $A, B$ are huge matrices with compatible dimensions. (Do not confuse MM with matrix-vector multiplication, as the latter does not have a fast version.) By distributed we mean that there will be several workers, each computes a single entry multiplication $A_{ij} \times B_{jk}$. By straggling and crashing we mean that some workers might not respond timely, or they might just hang there indefinitely.

Straggling and crashing is a real issue in real world because, spontaneously, the network may be busy, the CPU may be overheat, or the circuit board may be hit by cosmic radiation and cannot recover from it. This makes the overall computation slow because we have to wait for the last worker to tell us the product $A_{ij}B_{jk}$ it is responsible for.

To compensate, we can hire more workers and ask them to carry out redundant computations. A possible way to create redundancy is to draw random row vector $g$ and random column vector $h$ and then ask extra workers to compute $(gA) \times (Bh)$ on top of the usual routine that computes $A \times B$. Once this is done, the associativity equation $(gA)(Bh) = gCh$ will give us some parity checks that help recover the missing entries of $C$. The overhead of using parity checks to recover missing entries of $C$ is usually faster than waiting for the straggling workers to recover. So we can actually save time by paying for more CPU times.

The contribution of [PlutoCharon20] is three-fold.

The smallest working example of the Charon construction is when $g$, $A$, $B$, and $h$ are $2 \times 4$, $4 \times 4$, $4 \times 4$, and $4 \times 2$, respectively. The computation of $A \times B$ costs 49 workers, the computation of $(gA) \times (Bh)$ costs 14 workers. Together we need 63 workers, one fewer than the naive algorithm uses (64 workers). So we are using less workers, yet we can recover from four erasures with high probability.

Here is a figure I made to explain the tensor structure of Pluto.

A 3-dimensional arraay of cubes, each representing a CPU core