Approximate Gradient Coding with Optimal Decoding

8 Convergence with adversarial stragglers

Proposition 50 Convergence to a noise floor [Proposition VII.1]

Let \(f = \sum _i f_i\) be \(\mu \)-strongly convex with \(L\)-Lipschitz gradient, each \(f_i\) convex with \(L'\)-Lipschitz gradient, \(x^*\) the minimizer, and \(\sigma ^2 = \sum _i|\nabla f_i(x^*)|_2^2\). Perform gradient descent with \(x_{k+1} = x_k - \gamma \sum _i \alpha _i^{(k)}\nabla f_i(x_k)\) where at each iteration \(|\alpha ^{(k)} - \mathbb {1}|_2^2 \le r^2\). Let \(a = 1 - \tfrac {r\sqrt{L'}}{\sqrt\mu }\) and suppose \(a {\gt} 0\). For any \(0 {\lt} \epsilon \le 1\), with constant step size \(\gamma = \tfrac {\epsilon a\mu }{4(L^2 + 2rLL’ + 4r^2(L’)^2)}\), after

\[ k \le \frac{4(1+\epsilon )(L^2 + 2rLL' + 4r^2(L')^2) \log \big(\tfrac {2a^2\mu ^2|x_0 - x_*|_2^2}{(1+\epsilon )^2 r^2\sigma ^2}\big)}{3a^2\mu ^2\epsilon ^2} \]

iterations, \(|x_k - x_*|_2 \le (1 + \epsilon )\tfrac {r\sigma }{a\mu }\).

Proof

Define \(y_t = x_t - x^*\). Assume the convergence criterion has not been met, i.e. \(|y_t|_2^2 {\gt} (1+\epsilon )^2\tfrac {r^2\sigma ^2}{a^2\mu ^2}\). Using \(\gamma L {\lt} \tfrac {\epsilon a}{6}\) one gets \(|y_t|_2(2\gamma r\sigma )(1 + \gamma L) \le |y_t|_2^2(\tfrac {2\gamma a\mu }{1+\epsilon })(\tfrac {6+\epsilon a}{6})\). Plugging the contraction of Lemma 51 and using \(a \le 1\), \(\epsilon \le 1\),

\[ |y_{t+1}|_2^2 \le |y_t|_2^2\Big(1 - a\gamma \mu \tfrac {3\epsilon }{2(1+\epsilon )}\Big) + 4\gamma ^2 r^2\sigma ^2 . \]

Iterating \(k\) times, either the criterion is met or \(|y_k|_2^2 \le |y_0|_2^2(1 - a\gamma \mu \tfrac {3\epsilon }{2(1+\epsilon )})^k + \tfrac {4\gamma ^2 r^2\sigma ^2}{a\mu \frac{3\epsilon }{2(1+\epsilon )}}\). Plugging the value of \(\gamma \) shows the second (noise) term equals \(\tfrac 12(1+\epsilon )^2\tfrac {r^2\sigma ^2}{a^2\mu ^2}\), and for \(k\) at least the stated bound the first term is \(\le \tfrac 12(1+\epsilon )^2\tfrac {r^2\sigma ^2}{a^2\mu ^2}\), giving \(|x_k - x_*|_2 \le (1+\epsilon )\tfrac {r\sigma }{a\mu }\).

In the setting of Proposition 50, for any step size \(\gamma \),

\[ |x_{k+1} - x^*|_2^2 \le |x_k - x^*|_2^2\Big(1 - \big(2 - \tfrac {\gamma (L^2 + 2rLL’ + 4r^2(L’)^2)}{a\mu }\big)a\gamma \mu \Big) + |x_k - x^*|_2(2\gamma r\sigma )(1 + \gamma L) + 4\gamma ^2 r^2\sigma ^2 . \]
Proof

Let \(g_i(x) = \nabla f_i(x)\), \(G = G(x_k)\) the matrix with \(i\)th column \(g_i(x_k)\), \(y_k = x_k - x^*\). The gradient step gives \(|y_{k+1}|_2^2 \le \max _{\beta : |\beta |_2 \le r}|y_k - \gamma G(\mathbb {1}+ \beta )|_2^2\) (since \(\alpha ^{(k)} = \mathbb {1}+ \beta \) with \(|\beta |_2 \le r\)). By Lagrange multipliers the maximizing \(\beta _*\) satisfies \(\beta _* = \gamma (\lambda I + \gamma ^2 G^TG)^{-1}G^T(y_k - \gamma G\mathbb {1})\) for some negative \(\lambda \) with \(|\beta _*|_2 = r\), leading to

\[ \max _{\beta : |\beta |_2 \le r}|y_k - \gamma G(\mathbb {1}+\beta )|_2^2 \le |y_k - \gamma G\mathbb {1}|_2^2 + 2\gamma ^2 r^2\sigma _1(G^TG) + 2\gamma r|G^T(y_k - \gamma G\mathbb {1})|_2 . \]

Bounding the three quantities by Claims 52, 53, 54, expanding \(|y_k - \gamma G\mathbb {1}|_2^2 = |y_k|^2 - 2\gamma y_k^T G\mathbb {1}+ \gamma ^2|G\mathbb {1}|_2^2\), using \(a = 1 - \tfrac {r\sqrt{L'}}{\sqrt\mu }\) so that \(-1 + \tfrac {r\sqrt{L'}|y_k|_2}{\sqrt{y_k^T G\mathbb {1}}} \le -a\) (from \(y_k^T G\mathbb {1}\ge \mu |y_k|^2\)), collects to the stated contraction.

Lemma 52 Bound on \(|G^TG\mathbb {1}|\) [Claim F.2]

With \(G = G(x_k)\), \(|G^TG\mathbb {1}|_2 \le LL'|y_k|_2^2 + L\sigma |y_k|_2\).

Proof

Split \(|G^TG\mathbb {1}|_2 \le |(G(x_k) - G(x_*))^TG\mathbb {1}|_2 + |G(x_*)^TG\mathbb {1}|_2\). For the first, \(|(G(x_k) - G(x_*))^TG\mathbb {1}|_2^2 = \sum _i((g_i(x_k) - g_i(x_*))^T G\mathbb {1})^2 \le |G\mathbb {1}|_2^2\sum _i|g_i(x_k) - g_i(x_*)|_2^2 \le |G\mathbb {1}|_2^2|y_k|_2^2(L')^2 \le L^2(L')^2|y_k|_2^4\), using \(|G\mathbb {1}|_2 = |\nabla f(x_k)|_2 \le L|y_k|_2\) and \(L'\)-Lipschitzness of each \(g_i\). For the second, \(|G(x_*)^TG\mathbb {1}|_2^2 = \sum _i(g_i(x_*)^TG\mathbb {1})^2 \le |G\mathbb {1}|_2^2\sigma ^2 \le L^2|y_k|_2^2\sigma ^2\). Taking square roots and summing gives the claim.

Lemma 53 Bound on \(|G^T(y_k - \gamma G\mathbb {1})|\) [Claim F.3]

With \(G = G(x_k)\),

\[ |G^T(y_k - \gamma G\mathbb {1})|_2 \le |y_k|_2\big(\sigma + \sqrt{L'\mathbb {1}^TG^Ty_k}\big) + \gamma \big(LL'|y_k|_2^2 + L\sigma |y_k|_2\big). \]
Proof

\(|G^T(y_k - \gamma G\mathbb {1})|_2 \le |(G(x_k) - G(x_*))^Ty_k|_2 + |G(x_*)^Ty_k|_2 + \gamma |G^TG\mathbb {1}|_2\). Now \(|G(x_*)^Ty_k|_2^2 \le |G(x_*)|_2^2|y_k|_2^2 \le \operatorname {Tr}(G(x_*)G(x_*)^T)|y_k|_2^2 = \sigma ^2|y_k|_2^2\), and \(|(G(x_k)-G(x_*))^Ty_k|_2^2 = \sum _i((g_i(x_k) - g_i(x_*))^Ty_k)^2 \le L'|y_k|_2^2\sum _i(g_i(x_k) - g_i(x_*))^Ty_k = |y_k|_2^2(L'\mathbb {1}^TG^Ty_k)\), using convexity of each \(f_i\) and \(\sum _i g_i(x_*) = 0\). Combining with Claim 52 gives the bound.

Lemma 54 Bound on \(\sigma _1(G^TG)\) [Claim F.4]

With \(G = G(x_k)\), \(\sigma _1(G^TG) \le 2|y_k|_2^2(L')^2 + 2\sigma ^2\).

Proof

\(\sigma _1(G^TG) \le \operatorname {Tr}(G^TG) = \sum _i|g_i|_2^2 \le 2\sum _i|g_i(x_k) - g_i(x_*)|_2^2 + 2\sum _i|g_i(x_*)|_2^2 \le 2|y_k|_2^2(L')^2 + 2\sigma ^2\), using \(L'\)-Lipschitzness and the definition of \(\sigma ^2\).

Corollary 55 Noise floor with \(\epsilon = 1\) [Corollary VII.2]

Let \(f, f_i, x^*, \sigma ^2\) be as in Proposition 50. Perform \(\theta _{k+1} = \theta _k - \gamma \sum _i \alpha _i^{(k)}\nabla f_i\) with \(|\alpha ^{(k)} - \mathbb {1}|_2^2 \le r\) at each iteration and assume \(\mu {\gt} rL'\). Then there is a step size \(\gamma \) such that after

\[ k \le \frac{3(L + 2\sqrt r L')^2\log \big(\tfrac {\mu ^2\epsilon _0}{2r\sigma ^2}\big)}{(\mu - \sqrt{\mu rL'})^2} \]

steps, \(|\theta _k - \theta ^*|_2^2 \le \tfrac {4r\sigma ^2}{(\mu - \sqrt{\mu rL'})^2}\), where \(\epsilon _0 = |\theta _0 - \theta ^*|^2\). Combining with the adversarial error bound of Corollary 41, coded gradient descent converges to a noise floor of \(\tfrac {4\sigma ^2 n(2d-\lambda )p}{\mu (\sqrt{2d(1-p)} - \sqrt{n(2d-\lambda )pL'})^2}\).

Proof

Set \(\epsilon = 1\) in Proposition 50 and substitute \(r\) for \(r^2\) (renaming the adversarial-error bound); the \((1+\epsilon )\tfrac {r\sigma }{a\mu }\) floor squares to the stated value with \(a = 1 - \sqrt{rL'/\mu }\). Plugging Corollary 41’s bound \(\tfrac 1n|\alpha ^* - \mathbb {1}|_2^2 \le \tfrac {2d-\lambda }{2d}\tfrac {p}{1-p}\) for the adversarial decoding error gives the explicit noise floor.