Approximate Gradient Coding with Optimal Decoding

2 Introduction and problem setup

Definition 10 Assignment matrix
#

An assignment matrix for \(N\) data points distributed over \(m\) worker machines is a matrix \(A \in \mathbb {R}^{N \times m}\) with \(A_{ij} \neq 0\) if and only if the \(i\)th data point is held by machine \(j\). Writing \(f_i(\theta ) := L(x_i, y_i; \theta )\) for the per-point loss, each non-straggling machine \(j\) returns the single vector \(g_j := \sum _{i=1}^N A_{ij} \nabla f_i(\theta )\) to the parameter server.

Definition 11 Replication factor [Definition I.1]

The replication factor of an assignment matrix \(A \in \mathbb {R}^{N \times m}\) is the average number of times a data point is replicated, that is, the number of non-zero entries of \(A\) divided by \(N\).

Definition 12 Coded gradient descent update

At each round \(t\) the parameter server broadcasts \(\theta _t\), each non-straggling machine \(j\) returns \(g_j\), and the server picks a decoding coefficient vector \(w \in \mathbb {R}^m\) with \(w_j = 0\) whenever machine \(j\) straggles, and updates

\[ \theta _{t+1} \leftarrow \theta _t - \gamma \sum _{j=1}^m w_j g_j = \theta _t - \gamma \sum _{i=1}^N \alpha _i \nabla f_i(\theta _t), \qquad \alpha := Aw . \]

If a scheme can always achieve \(\alpha = \mathbb {1}\) (the all-ones vector) it recovers full-batch gradient descent; otherwise we are in the regime of approximate gradient coding.

Definition 13 Optimal decoding coefficients

Given an assignment matrix \(A\) and a set of straggling machines, the optimal decoding coefficient vector is any

\[ w^* \in \arg \min _{w :\ w_j = 0 \text{ if machine } j \text{ straggles}} \ |Aw - \mathbb {1}|_2 , \]

and we write \(\alpha ^* := Aw^*\). Here \(|\cdot |_2\) is the Euclidean norm. The vector \(w^*\) need not be unique but \(\alpha ^*\) is.

Definition 14 Moore–Penrose form of \(\alpha ^*\)

For a straggler rate \(p\), let \(A(p)\) be the random matrix obtained from \(A\) by replacing each column independently with the zero column with probability \(p\) (a straggling machine). Then \(\alpha ^*\) is the orthogonal projection of \(\mathbb {1}\) onto the column space of \(A(p)\),

\[ \alpha ^* = A(p)\big(A(p)^T A(p)\big)^{\dagger } A(p)^T \mathbb {1}, \]

where \(M^{\dagger }\) denotes the Moore–Penrose pseudoinverse of \(M\).

Definition 15 Decoding error under random stragglers [Definition I.2]

Given an assignment matrix \(A \in \mathbb {R}^{N \times m}\), the random decoding error under a \(p\) fraction of random stragglers is

\[ \mathbb {E}_S\! \left[\, |\alpha ^* - \mathbb {1}|_2^2\, \right] = \mathbb {E}_S\! \left[\ \min _{w \in \mathbb {R}^m,\ w_j = 0\ \forall j \in S} |Aw - \mathbb {1}|_2^2\ \right], \]

where \(S \subseteq [m]\) is a random subset including each index independently with probability \(p\). We drop the subscript and write \(\mathbb {E}\) for the expectation over the random straggler set.

Definition 16 Decoding error under adversarial stragglers [Definition I.3]

Given \(A \in \mathbb {R}^{N \times m}\), the adversarial decoding error under a \(p\) fraction of adversarial stragglers is

\[ \max _{S \subset [m] :\ |S| \le pm}\ \big[\, |\alpha ^* - \mathbb {1}|_2^2\, \big] = \max _{S \subset [m] :\ |S| \le pm} \ \left[\ \min _{w \in \mathbb {R}^m,\ w_j = 0\ \forall j \in S} |Aw - \mathbb {1}|_2^2\ \right]. \]
Definition 17 Unbiased coding scheme and the normalization \(\overline{\alpha }\)

A coding scheme is unbiased if \(\mathbb {E}[\alpha ] = c\mathbb {1}\) for some constant \(c\) (so that \(\mathbb {E}\sum _i \alpha _i \nabla f_i(\theta ) = c\nabla f(\theta )\) where \(f := \sum _i f_i\)). For an unbiased scheme we define the normalized vector \(\overline{\alpha } := \alpha /c\), so that \(\mathbb {E}[\overline{\alpha }] = \mathbb {1}\).