Approximate Gradient Coding with Optimal Decoding

10 Appendix B: Convergence with biased schemes

Proposition 59 Debiasing a biased scheme [Proposition B.1]

Suppose there is an assignment matrix \(A\) with computational load \(\ell \) on \(m\) machines and \(N\) data blocks, and a decoding strategy \(w\) with \(\alpha \) such that \(\mathbb {E}[|\alpha - \mathbb {1}|_2^2] \le \epsilon N\). Then there exists an assignment matrix \(\hat A\) with computational load at most \(2\ell \) on \(m\) machines and \(N\) data blocks, and a decoding strategy \(\hat w\) with \(\hat\alpha \) such that

\[ \mathbb {E}[|\hat\alpha - \mathbb {1}|_2^2] \le \tfrac {2\epsilon }{(1 - \sqrt{2\epsilon })^2}N \qquad \text{and}\qquad \mathbb {E}[\hat\alpha ] = \mathbb {1}. \]
Proof

Let \(\delta = 1 - \sqrt{2\epsilon }\) and \(S = \{ i \in [N] : \mathbb {E}[\alpha _i] \ge \delta \} \). By \(\mathbb {E}[|\alpha - \mathbb {1}|_2^2] \le \epsilon N\) and Markov-type counting, \(|S| \ge N(1 - \tfrac {\epsilon }{(1-\delta )^2}) = \tfrac N2\). Let \(s = |S|\), \(t = N - s\), WLOG \(S = [s]\). Let \(A_S\) be the rows of \(A\) in \(S\), \(D\) the \(s \times s\) diagonal matrix with \(D_{ii} = \tfrac 1{\mathbb {E}[A_S w]_i}\), and define \(\hat A := [(DA_S)^T \mid (DA_S)^T[:,1{:}t]]^T\), i.e. \(DA_S\) with its first \(t\) rows duplicated, so \(\mathbb {E}[\hat A w] = \mathbb {1}\). The replication factor of \(\hat A\) is at most \(d\), but duplication at most doubles the load to \(2\ell \). For each straggler pattern set \(\hat w = w\), \(\hat\alpha = \hat A\hat w\). For \(i \in S\),

\[ \mathbb {E}[(\hat\alpha _i - 1)^2] = \Big(\tfrac 1{\mathbb {E}[\alpha _i]}\Big)^2\mathbb {E}[(\alpha _i - \mathbb {E}[\alpha _i])^2] \le \tfrac 1{\delta ^2}\mathbb {E}[(\alpha _i - 1)^2], \]

so \(\sum _{i\in [N]}\mathbb {E}[(\hat\alpha _i - 1)^2] \le 2\tfrac 1{\delta ^2}N\epsilon = \tfrac {2\epsilon }{(1-\sqrt{2\epsilon })^2}N\), as desired.

Proposition 60 Convergence with a debiased scheme [Proposition B.2]

Let \(f = \sum _{i=1}^N f_i\) be \(\mu \)-strongly convex with \(L\)-Lipschitz gradient, each \(f_i\) convex with \(L'\)-Lipschitz gradient, \(\theta ^*\) the minimizer, \(\sigma ^2 = \sum _i|\nabla f_i(\theta ^*)|_2^2\). Suppose a possibly biased coding scheme with computational load \(\ell \) has \(\tfrac 1N\mathbb {E}[|\alpha - \mathbb {1}|_2^2] \le \zeta \) for some \(\zeta \le \tfrac 14\). Debias it via Proposition 59 (load \(\le 2\ell \)) and run gradient coding as in Algorithm 43 with \(w\) from the original scheme, with straggling probability \(p\). Then for any \(\epsilon \) there is a step size \(\gamma \) such that after

\[ k = 2\log (\epsilon /\epsilon _0)\Big(\tfrac {L}{\mu } + 8n\zeta \tfrac {L’}{\mu } + \tfrac {\zeta \sigma ^2}{\mu ^2\epsilon }\Big) \]

steps, \(\mathbb {E}[|\theta _k - \theta ^*|_2^2] \le \epsilon \), where \(\epsilon _0 = |\theta _0 - \theta ^*|^2\).

Proof

Apply Proposition 59 to obtain an unbiased scheme with \(\tfrac 1N\mathbb {E}[|\hat\alpha - \mathbb {1}|_2^2] = O(\zeta )\) and \(\mathbb {E}[\hat\alpha ] = \mathbb {1}\), hence \(r = O(\zeta )\) and \(s = O(n\zeta )\) in the notation of Proposition 45. Substituting into the iteration count (Corollary 48, equivalently Proposition 45) gives the stated \(k\).