Approximate Gradient Coding with Optimal Decoding

7 Convergence with random stragglers

Definition 43 GCOD: gradient coding with optimal decoding [Algorithm 2]

Algorithm GCOD\((A, p, \theta _0, \gamma , \{ f_i\} , k)\): in the distribution phase draw a uniform permutation \(\rho \sim \mathrm{Uniform}(\mathcal{S}_n)\) and send \(f_{\rho (i)}\) to all machines \(j\) with \(A_{ij} \neq 0\). In the computation phase, for \(t = 1,\dots ,k\): the server sends \(x_{t-1}\); each machine \(j\) straggles with \(B_j \sim \mathrm{Bernoulli}(p)\) and, if not straggling, returns \(g_j = \sum _i A_{ij}\nabla f_{\rho (i)}(\theta _{t-1})\); the server computes \(w^* \in \arg \min _{w :\ w_j = 0 \text{ if } B_j = 0}|Aw - \mathbb {1}|_2\) and updates \(\theta _t = \theta _{t-1} - \gamma \sum _{j : B_j = 1} w^*_j g_j\).

Definition 44 SGD-ALG: stochastically equivalent SGD [Algorithm 3]

Algorithm SGD-ALG\((P_\beta , \theta _0, \gamma , \{ f_i\} , k)\): draw a uniform permutation \(\rho \); for \(t = 1,\dots ,k\), draw \(\beta \sim P_\beta \) and update \(\theta _t = \theta _{t-1} - \gamma \sum _i \beta _i \nabla f_{\rho (i)}(\theta _{t-1})\). For an unbiased scheme with \(\alpha ^* = A(p)(A(p)^TA(p))^{\dagger }A(p)^T\mathbb {1}\) (Definition 14) and \(P_{\overline{\alpha ^*}}\) the distribution of the normalized \(\overline{\alpha ^*}\), the algorithm GCOD\((A,p,x_0,\gamma ,\{ f_i\} ,k)\) is stochastically equivalent to SGD-ALG\((P_{\overline{\alpha ^*}}, x_0, \gamma \mathbb {E}[\alpha _1], \{ f_i\} , k)\).

Proposition 45 Convergence of SGD-ALG [Proposition VI.1]

Let \(f = \sum _{i=1}^n f_i\) be \(\mu \)-strongly convex with \(L\)-Lipschitz gradient, each \(f_i\) convex with \(L'\)-Lipschitz gradient, and let \(x^*\) be the minimizer. Run SGD-ALG\((P_\beta , x_0, \gamma , \{ f_i\} , k)\) from \(x_0\) for \(k\) iterations with step size \(\gamma \le \tfrac {1}{sL’+L}\) and a distribution \(P_\beta \) with \(\mathbb {E}[\beta ] = \mathbb {1}\). Let \(\sigma ^2 = \sum _i|\nabla f_i(x^*)|_2^2\), \(r = \tfrac 1n\mathbb {E}[|\beta - \mathbb {1}|_2^2]\), and \(s = |\mathbb {E}[(\beta - \mathbb {1})(\beta - \mathbb {1})^T]|_2\). Then

\[ \mathbb {E}[|x_k - x^*|_2^2] \le (1 - 2\gamma \mu (1 - \gamma (sL'+L)))^k |x_0 - x^*|_2^2 + \frac{\gamma r(1 + \tfrac 1{n-1})\sigma ^2}{\mu (1 - \gamma (sL'+L))}, \]

the expectation being over \(\rho \) and \(\{ \beta ^{(j)} : j {\lt} k\} \).

Proof

Write \(g_i(x) = \nabla f_i(x)\), let \(G(x)\) be the matrix with \(i\)th column \(g_i(x)\), and \(y_k = x_k - x^*\). With \(\rho \) uniform and \(\beta \sim P_\beta \), expanding the update,

\[ |y_{k+1}|_2^2 = |y_k|_2^2 - 2\gamma y_k^T G(x_k)\rho ^{-1}(\beta ) + \gamma ^2|G(x_k)\rho ^{-1}(\beta )|_2^2 . \]

Adding and subtracting \(G(x^*)\) inside the last term, \(|y_{k+1}|_2^2 \le |y_k|_2^2 - 2\gamma y_k^T G(x_k)\rho ^{-1}(\beta ) + 2\gamma ^2|(G(x_k) - G(x^*))\rho ^{-1}(\beta )|_2^2 + 2\gamma ^2|G(x^*)\rho ^{-1}(\beta )|_2^2\). Taking \(\mathbb {E}_\beta \) (with \(\mathbb {E}[\beta ] = \mathbb {1}\)) and using Claim 46 for the third term and strong convexity (\(y_k^T\nabla f(x_k) \ge \mu |y_k|^2\)) plus \(\gamma \le \tfrac 1{sL’+L}\),

\[ \mathbb {E}_{\beta ^{(k)}}[|y_{k+1}|^2 \mid \rho ] \le |y_k|^2(1 - 2\gamma \mu (1 - \gamma (sL'+L))) + 2\gamma ^2\mathbb {E}_{\beta ^{(k)}}\big[|G(x^*)\rho ^{-1}(\beta )|_2^2 \mid \rho \big]. \]

Bounding the second term in expectation over \(\rho \) by Claim 47 (\(\mathbb {E}_{\rho ,\beta }[|G(x^*)\rho ^{-1}(\beta )|_2^2] \le r(1+\tfrac 1{n-1})\sigma ^2\)) and recursively applying the contraction, summing the geometric series, yields the proposition.

Lemma 46 Cross-term bound [Claim E.3]

For any permutation \(\rho \in \mathcal{S}_n\), with \(G(x)\) as above,

\[ \mathbb {E}_\beta \big[|(G(x_k) - G(x^*))\rho ^{-1}(\beta )|_2^2\big] \le (sL' + L)\, y_k^T\nabla f(x_k). \]
Proof

Using \(\mathbb {E}[\beta ] = \mathbb {1}\), the cyclic property of trace (Lemma 1) and the PSD trace inequality (Lemma 7), with \(M = G(x_k) - G(x^*)\),

\[ \mathbb {E}_\beta [|M\rho ^{-1}(\beta )|_2^2] = \operatorname {Tr}\big(\mathbb {E}_\beta [(\rho ^{-1}(\beta )-\mathbb {1})(\rho ^{-1}(\beta )-\mathbb {1})^T]\, M^TM\big) + |\nabla f(x_k) - \nabla f(x^*)|_2^2 \le s\sum _i|g_i(x_k) - g_i(x^*)|^2 + |\nabla f(x_k) - \nabla f(x^*)|_2^2 . \]

By co-coercivity (Lemma 8) and convexity of the \(f_i\), \(\sum _i|g_i(x_k) - g_i(x^*)|^2 \le L'\sum _i\langle y_k, g_i(x_k) - g_i(x^*)\rangle = L'\langle y_k, \nabla f(x_k) - \nabla f(x^*)\rangle \), and similarly \(|\nabla f(x_k) - \nabla f(x^*)|_2^2 \le L\langle y_k, \nabla f(x_k) - \nabla f(x^*)\rangle \). Since \(\nabla f(x^*) = 0\), combining gives the claim.

Lemma 47 Noise-term bound [Claim E.4]

With \(\sigma ^2 = \sum _i|\nabla f_i(x^*)|_2^2\) and \(r,s\) as above,

\[ \mathbb {E}_{\rho \sim \mathcal{S}_n, \beta }\big[|G(x^*)\rho ^{-1}(\beta )|_2^2\big] \le r\Big(1 + \tfrac 1{n-1}\Big)\sigma ^2 . \]
Proof

Since \(x^*\) is optimal, \(G(x^*)\mathbb {1}= 0\), so \(|G(x^*)\rho ^{-1}(\beta )|_2^2 = \operatorname {Tr}(\rho ^{-1}(\beta )\rho ^{-1}(\beta )^T G(x^*)^TG(x^*))\). Because \(\rho \) is uniform, the matrix \(\mathbb {E}_{\rho ,\beta }[\rho ^{-1}(\beta )\rho ^{-1}(\beta )^T]\) has all diagonal entries equal to \(1 + r\) and all off-diagonal entries equal to some \(b \ge 1 - \tfrac {r}{n-1}\), hence equals \(aI + b\mathbb {1}\mathbb {1}^T\) with \(a \le r(1 + \tfrac 1{n-1})\). Plugging in and using \(G(x^*)\mathbb {1}= 0\) and \(\operatorname {Tr}(G(x^*)^TG(x^*)) = \sigma ^2\),

\[ \mathbb {E}[|G(x^*)\rho ^{-1}(\beta )|_2^2] = a\operatorname {Tr}(G(x^*)^TG(x^*)) + b|G(x^*)\mathbb {1}|_2^2 \le r\Big(1 + \tfrac 1{n-1}\Big)\sigma ^2 . \qedhere \]
Corollary 48 Step size and iteration count [Corollary VI.2 / E.5]

For any desired accuracy \(\epsilon \), choosing \(\gamma = \tfrac {\mu \epsilon }{2\mu \epsilon (sL’+L) + 2r(1+\frac1{n-1})\sigma ^2}\) gives, after

\[ k = 2\log (2\epsilon _0/\epsilon )\Big(\tfrac {sL’}{\mu } + \tfrac {L}{\mu } + \tfrac {r(1+\frac1{n-1})\sigma ^2}{\mu ^2\epsilon }\Big) \]

steps, \(\mathbb {E}[|x_k - x^*|_2^2] \le \epsilon \), where \(\epsilon _0 = |x_0 - x^*|^2\).

Proof

Plug \(\gamma \) into the convergence bound of Proposition 45. The choice makes the noise summand \(\le \tfrac \epsilon 2\); requiring the contraction term \(\le \tfrac \epsilon 2\) needs \(k \ge \tfrac {\log (\epsilon /2\epsilon _0)}{\log (1 - 2\gamma \mu (1-\gamma (sL’+L)))}\). Since \(\gamma {\lt} \tfrac 1{2(sL’+L)} \le \tfrac 1{2\mu }\), we have \(2\gamma \mu (1-\gamma (sL'+L)) \in (0,1)\); using \(\log (1-x) \le -x\) and substituting \(\gamma \) bounds the denominator, giving the stated \(k\).

Proposition 49 Convergence of coded GD with graph schemes [Proposition VI.3]

Let \(f = \sum _{i=1}^n f_i\) be \(\mu \)-strongly convex with \(L\)-Lipschitz gradient, each \(f_i\) convex with \(L'\)-Lipschitz gradient, \(\theta ^*\) the minimizer, and \(\sigma ^2 = \sum _i|\nabla f_i(\theta ^*)|_2^2\). Perform gradient coding with optimal decoding (Algorithm 43) with an assignment matrix for a \(d\)-regular vertex-transitive graph of spectral gap \(d - o(d)\), so \(m = \tfrac {nd}{2}\), and machine straggling probability \(p\). Then for any accuracy \(\epsilon \) there is a step size \(\gamma \) such that after

\[ k = 2\log (\epsilon _0/\epsilon )\Big(\tfrac {L}{\mu } + \log ^2(n)p^{2d - o(d)}\tfrac {L’}{\mu } + \tfrac {p^{d-o(d)}\sigma ^2}{\mu ^2\epsilon }\Big) \]

steps, \(\mathbb {E}[|\theta _k - \theta ^*|_2^2] \le \epsilon \), where \(\epsilon _0 = |\theta _0 - \theta ^*|^2\).

Proof

By the stochastic equivalence of Algorithm 43 and Algorithm 44, apply Corollary 48 with \(\beta = \overline{\alpha ^*}\). Theorem 27 controls the two statistics: \(r = \tfrac 1n\mathbb {E}[|\overline{\alpha ^*} - \mathbb {1}|_2^2] = O(p^{d-o(d)})\) and \(s = |\mathbb {E}[(\overline{\alpha ^*} - \mathbb {1})(\overline{\alpha ^*}-\mathbb {1})^T]|_2 = O(\log ^2(n)p^{2d-o(d)})\). Substituting these into the iteration count of Corollary 48 yields the result.