2 Introduction and problem setup
An assignment matrix for \(N\) data points distributed over \(m\) worker machines is a matrix \(A \in \mathbb {R}^{N \times m}\) with \(A_{ij} \neq 0\) if and only if the \(i\)th data point is held by machine \(j\). Writing \(f_i(\theta ) := L(x_i, y_i; \theta )\) for the per-point loss, each non-straggling machine \(j\) returns the single vector \(g_j := \sum _{i=1}^N A_{ij} \nabla f_i(\theta )\) to the parameter server.
The replication factor of an assignment matrix \(A \in \mathbb {R}^{N \times m}\) is the average number of times a data point is replicated, that is, the number of non-zero entries of \(A\) divided by \(N\).
At each round \(t\) the parameter server broadcasts \(\theta _t\), each non-straggling machine \(j\) returns \(g_j\), and the server picks a decoding coefficient vector \(w \in \mathbb {R}^m\) with \(w_j = 0\) whenever machine \(j\) straggles, and updates
If a scheme can always achieve \(\alpha = \mathbb {1}\) (the all-ones vector) it recovers full-batch gradient descent; otherwise we are in the regime of approximate gradient coding.
Given an assignment matrix \(A\) and a set of straggling machines, the optimal decoding coefficient vector is any
and we write \(\alpha ^* := Aw^*\). Here \(|\cdot |_2\) is the Euclidean norm. The vector \(w^*\) need not be unique but \(\alpha ^*\) is.
For a straggler rate \(p\), let \(A(p)\) be the random matrix obtained from \(A\) by replacing each column independently with the zero column with probability \(p\) (a straggling machine). Then \(\alpha ^*\) is the orthogonal projection of \(\mathbb {1}\) onto the column space of \(A(p)\),
where \(M^{\dagger }\) denotes the Moore–Penrose pseudoinverse of \(M\).
Given an assignment matrix \(A \in \mathbb {R}^{N \times m}\), the random decoding error under a \(p\) fraction of random stragglers is
where \(S \subseteq [m]\) is a random subset including each index independently with probability \(p\). We drop the subscript and write \(\mathbb {E}\) for the expectation over the random straggler set.
Given \(A \in \mathbb {R}^{N \times m}\), the adversarial decoding error under a \(p\) fraction of adversarial stragglers is
A coding scheme is unbiased if \(\mathbb {E}[\alpha ] = c\mathbb {1}\) for some constant \(c\) (so that \(\mathbb {E}\sum _i \alpha _i \nabla f_i(\theta ) = c\nabla f(\theta )\) where \(f := \sum _i f_i\)). For an unbiased scheme we define the normalized vector \(\overline{\alpha } := \alpha /c\), so that \(\mathbb {E}[\overline{\alpha }] = \mathbb {1}\).