Dependency graph

Legend

Boxes: definitions
Ellipses: theorems and lemmas
Blue border: the statement of this result is ready to be formalized; all prerequisites are done
Orange border: the statement of this result is not ready to be formalized; the blueprint needs more work
Blue background: the proof of this result is ready to be formalized; all prerequisites are done
Green border: the statement of this result is formalized
Green background: the proof of this result is formalized
Dark green background: the proof of this result and all its ancestors are formalized
Dark green border: this is in Mathlib

Definition 43GCOD: gradient coding with optimal decoding [Algorithm 2]

Algorithm GCOD\((A, p, \theta _0, \gamma , \{ f_i\} , k)\): in the distribution phase draw a uniform permutation \(\rho \sim \mathrm{Uniform}(\mathcal{S}_n)\) and send \(f_{\rho (i)}\) to all machines \(j\) with \(A_{ij} \neq 0\). In the computation phase, for \(t = 1,\dots ,k\): the server sends \(x_{t-1}\); each machine \(j\) straggles with \(B_j \sim \mathrm{Bernoulli}(p)\) and, if not straggling, returns \(g_j = \sum _i A_{ij}\nabla f_{\rho (i)}(\theta _{t-1})\); the server computes \(w^* \in \arg \min _{w :\ w_j = 0 \text{ if } B_j = 0}|Aw - \mathbb {1}|_2\) and updates \(\theta _t = \theta _{t-1} - \gamma \sum _{j : B_j = 1} w^*_j g_j\).

LaTeX

Lemma 30Growing process [Algorithm 1]

The growing process on input vertex set \(V_0\) explores \(\bigcup _{v \in V_0} C_v\) (the \(G(p)\)-components of \(V_0\)) as follows. Initialize \(t \leftarrow 0\), \(S_0, B_0 \leftarrow \emptyset \), \(F_0 \leftarrow \partial (V_0)\). While \(|F_t| {\gt} 0\): pick \(e \in F_t\) arbitrarily; set \(t \leftarrow t+1\) and draw \(X_t \sim \mathrm{Bernoulli}(1-p)\). If \(X_t = 1\) then \(S_t \leftarrow S_{t-1} \cup e\), \(V_t \leftarrow V_{t-1} \cup \delta (e)\), \(B_t \leftarrow B_{t-1}\); else \(B_t \leftarrow B_{t-1}\cup e\), \(S_t \leftarrow S_{t-1}\), \(V_t \leftarrow V_{t-1}\). Update \(F_t \leftarrow \partial (V_t) \setminus B_t\). Return \(V_t\). Here \(S_t\) are the explored edges in \(G(p)\), \(V_t\) the explored vertices, \(B_t\) the explored edges not in \(G(p)\), and \(F_t\) the frontier.

LaTeX

Definition 44SGD-ALG: stochastically equivalent SGD [Algorithm 3]

Algorithm SGD-ALG\((P_\beta , \theta _0, \gamma , \{ f_i\} , k)\): draw a uniform permutation \(\rho \); for \(t = 1,\dots ,k\), draw \(\beta \sim P_\beta \) and update \(\theta _t = \theta _{t-1} - \gamma \sum _i \beta _i \nabla f_{\rho (i)}(\theta _{t-1})\). For an unbiased scheme with \(\alpha ^* = A(p)(A(p)^TA(p))^{\dagger }A(p)^T\mathbb {1}\) (Definition 14) and \(P_{\overline{\alpha ^*}}\) the distribution of the normalized \(\overline{\alpha ^*}\), the algorithm GCOD\((A,p,x_0,\gamma ,\{ f_i\} ,k)\) is stochastically equivalent to SGD-ALG\((P_{\overline{\alpha ^*}}, x_0, \gamma \mathbb {E}[\alpha _1], \{ f_i\} , k)\).

LaTeX

Corollary 41Adversarial error for good expanders [Corollary V.3]

Let \(A\) be a graph assignment scheme for a \(d\)-regular graph \(G\) with spectral gap \(\lambda = d - o(d)\). For any set of \(\lfloor pm\rfloor \) stragglers there exist decoding coefficients \(w\) with

\[ \tfrac 1n|\alpha - \mathbb {1}|_2^2 \le \tfrac {1+o(1)}{2}\cdot \tfrac {p}{1-p}. \]

LaTeX

Corollary 40Adversarial error for graph schemes [Corollary V.2]

Let \(A\) be a graph assignment scheme for a \(d\)-regular graph \(G\) with spectral expansion \(\lambda \). For any set of \(\lfloor pm\rfloor \) stragglers there exist decoding coefficients \(w\) with

\[ \tfrac 1n|\alpha - \mathbb {1}|_2^2 \le \tfrac {2d-\lambda }{2d}\cdot \tfrac {p}{1-p}. \]

LaTeX

Corollary 55Noise floor with \(\epsilon = 1\) [Corollary VII.2]

Let \(f, f_i, x^*, \sigma ^2\) be as in Proposition 50. Perform \(\theta _{k+1} = \theta _k - \gamma \sum _i \alpha _i^{(k)}\nabla f_i\) with \(|\alpha ^{(k)} - \mathbb {1}|_2^2 \le r\) at each iteration and assume \(\mu {\gt} rL'\). Then there is a step size \(\gamma \) such that after

\[ k \le \frac{3(L + 2\sqrt r L')^2\log \big(\tfrac {\mu ^2\epsilon _0}{2r\sigma ^2}\big)}{(\mu - \sqrt{\mu rL'})^2} \]

steps, \(|\theta _k - \theta ^*|_2^2 \le \tfrac {4r\sigma ^2}{(\mu - \sqrt{\mu rL'})^2}\), where \(\epsilon _0 = |\theta _0 - \theta ^*|^2\). Combining with the adversarial error bound of Corollary 41, coded gradient descent converges to a noise floor of \(\tfrac {4\sigma ^2 n(2d-\lambda )p}{\mu (\sqrt{2d(1-p)} - \sqrt{n(2d-\lambda )pL'})^2}\).

LaTeX

Corollary 37Non-bipartite giant component [Corollary IV.4]

Let \(G\) be any \(d\)-regular \(\lambda \)-spectral expander on \(n\) vertices and suppose for some positive \(\epsilon \) assumptions (1)–(4) of Theorem 27 hold. Then with probability at least \(1 - \tfrac {6}{n}\):

\(G(p)\) has a non-bipartite giant component of size at least \(n\Big(1 - \tfrac {e^2 p^{\lambda (1-\frac{1}{3+\epsilon })}}{(1-pe^{1/\lambda })^2}\Big)\);
every vertex is either in a component of size at most \(k\) (with \(k\) as in Theorem 27) or in a component of size \({\gt} n/2\).

LaTeX

Corollary 48Step size and iteration count [Corollary VI.2 / E.5]

For any desired accuracy \(\epsilon \), choosing \(\gamma = \tfrac {\mu \epsilon }{2\mu \epsilon (sL’+L) + 2r(1+\frac1{n-1})\sigma ^2}\) gives, after

\[ k = 2\log (2\epsilon _0/\epsilon )\Big(\tfrac {sL’}{\mu } + \tfrac {L}{\mu } + \tfrac {r(1+\frac1{n-1})\sigma ^2}{\mu ^2\epsilon }\Big) \]

steps, \(\mathbb {E}[|x_k - x^*|_2^2] \le \epsilon \), where \(\epsilon _0 = |x_0 - x^*|^2\).

LaTeX

Definition 16Decoding error under adversarial stragglers [Definition I.3]

Given \(A \in \mathbb {R}^{N \times m}\), the adversarial decoding error under a \(p\) fraction of adversarial stragglers is

\[ \max _{S \subset [m] :\ |S| \le pm}\ \big[\, |\alpha ^* - \mathbb {1}|_2^2\, \big] = \max _{S \subset [m] :\ |S| \le pm} \ \left[\ \min _{w \in \mathbb {R}^m,\ w_j = 0\ \forall j \in S} |Aw - \mathbb {1}|_2^2\ \right]. \]

LaTeX

Definition 14Moore–Penrose form of \(\alpha ^*\)

For a straggler rate \(p\), let \(A(p)\) be the random matrix obtained from \(A\) by replacing each column independently with the zero column with probability \(p\) (a straggling machine). Then \(\alpha ^*\) is the orthogonal projection of \(\mathbb {1}\) onto the column space of \(A(p)\),

\[ \alpha ^* = A(p)\big(A(p)^T A(p)\big)^{\dagger } A(p)^T \mathbb {1}, \]

where \(M^{\dagger }\) denotes the Moore–Penrose pseudoinverse of \(M\).

LaTeX

Definition 10Assignment matrix

An assignment matrix for \(N\) data points distributed over \(m\) worker machines is a matrix \(A \in \mathbb {R}^{N \times m}\) with \(A_{ij} \neq 0\) if and only if the \(i\)th data point is held by machine \(j\). Writing \(f_i(\theta ) := L(x_i, y_i; \theta )\) for the per-point loss, each non-straggling machine \(j\) returns the single vector \(g_j := \sum _{i=1}^N A_{ij} \nabla f_i(\theta )\) to the parameter server.

LaTeX

Definition 12Coded gradient descent update

At each round \(t\) the parameter server broadcasts \(\theta _t\), each non-straggling machine \(j\) returns \(g_j\), and the server picks a decoding coefficient vector \(w \in \mathbb {R}^m\) with \(w_j = 0\) whenever machine \(j\) straggles, and updates

\[ \theta _{t+1} \leftarrow \theta _t - \gamma \sum _{j=1}^m w_j g_j = \theta _t - \gamma \sum _{i=1}^N \alpha _i \nabla f_i(\theta _t), \qquad \alpha := Aw . \]

If a scheme can always achieve \(\alpha = \mathbb {1}\) (the all-ones vector) it recovers full-batch gradient descent; otherwise we are in the regime of approximate gradient coding.

LaTeX

Definition 6Convex, strongly convex, and Lipschitz-gradient functions

A differentiable function \(f : \mathbb {R}^k \to \mathbb {R}\) is convex if \(f(y) \ge f(x) + \langle \nabla f(x), y - x\rangle \) for all \(x,y\); it is \(\mu \)-strongly convex if \(f - \tfrac {\mu }{2}|\cdot |_2^2\) is convex; and it has an \(L\)-Lipschitz gradient if \(|\nabla f(x) - \nabla f(y)|_2 \le L|x-y|_2\) for all \(x,y\).

LaTeX Lean

Definition 20Graph assignment scheme [Definition II.2]

A graph assignment scheme corresponding to a graph \(G\) with \(n\) vertices and \(m\) edges is the matrix \(A \in \{ 0,1\} ^{n \times m}\) with \(A_{ij} = 1\) iff the \(j\)th edge of \(G\) has vertex \(i\) as an endpoint. Here the \(n\) vertices are the data blocks and the \(m\) edges are the machines, so each machine holds exactly two data blocks. Each block consists of \(\tfrac {dN}{2m}\) data points, the number of blocks is \(n = \tfrac {2m}{d}\), the computational load (points per machine) is \(\ell = \tfrac {dN}{m}\), and the replication factor is \(d = \tfrac {2m}{n}\), independent of block size. Results are stated in terms of the \(n \times m\) block-to-machine matrix.

LaTeX

Definition 13Optimal decoding coefficients

Given an assignment matrix \(A\) and a set of straggling machines, the optimal decoding coefficient vector is any

\[ w^* \in \arg \min _{w :\ w_j = 0 \text{ if machine } j \text{ straggles}} \ |Aw - \mathbb {1}|_2 , \]

and we write \(\alpha ^* := Aw^*\). Here \(|\cdot |_2\) is the Euclidean norm. The vector \(w^*\) need not be unique but \(\alpha ^*\) is.

LaTeX

Definition 15Decoding error under random stragglers [Definition I.2]

Given an assignment matrix \(A \in \mathbb {R}^{N \times m}\), the random decoding error under a \(p\) fraction of random stragglers is

\[ \mathbb {E}_S\! \left[\, |\alpha ^* - \mathbb {1}|_2^2\, \right] = \mathbb {E}_S\! \left[\ \min _{w \in \mathbb {R}^m,\ w_j = 0\ \forall j \in S} |Aw - \mathbb {1}|_2^2\ \right], \]

where \(S \subseteq [m]\) is a random subset including each index independently with probability \(p\). We drop the subscript and write \(\mathbb {E}\) for the expectation over the random straggler set.

LaTeX

Definition 11Replication factor [Definition I.1]

The replication factor of an assignment matrix \(A \in \mathbb {R}^{N \times m}\) is the average number of times a data point is replicated, that is, the number of non-zero entries of \(A\) divided by \(N\).

LaTeX

Definition 21Random sparsification \(G(p)\)

For a graph assignment scheme on \(G = (V,E)\) and straggler probability \(p\), let \(G(p)\) be the random subgraph in which each edge of \(G\) is deleted independently with probability \(p\) (a straggling machine). The decoding vector \(w^*\) has one (possibly nonzero) coordinate \(w^*_e\) per edge \(e\) of \(G(p)\), and \(\alpha ^*_v = \sum _{e \ni v} w^*_e\) is the sum of weights of edges incident to \(v\).

LaTeX

Definition 18Spectral expansion of a graph

Let \(G = (V,E)\) be a \(d\)-regular graph with adjacency matrix \(\mathcal{A}(G)\), whose eigenvalues are \(\lambda _1(\mathcal{A}(G)) = d \ge \lambda _2(\mathcal{A}(G)) \ge \cdots \). The spectral expansion (spectral gap) of \(G\) is \(\lambda := \lambda _1(\mathcal{A}(G)) - \lambda _2(\mathcal{A}(G)) = d - \lambda _2(\mathcal{A}(G))\). \(G\) is a \(\lambda \)-spectral expander if its spectral expansion is at least \(\lambda \).

LaTeX

Definition 17Unbiased coding scheme and the normalization \(\overline{\alpha }\)

A coding scheme is unbiased if \(\mathbb {E}[\alpha ] = c\mathbb {1}\) for some constant \(c\) (so that \(\mathbb {E}\sum _i \alpha _i \nabla f_i(\theta ) = c\nabla f(\theta )\) where \(f := \sum _i f_i\)). For an unbiased scheme we define the normalized vector \(\overline{\alpha } := \alpha /c\), so that \(\mathbb {E}[\overline{\alpha }] = \mathbb {1}\).

LaTeX

Definition 19Vertex-transitive graph

For a graph \(G = (V,E)\) let \(\operatorname {Aut}(G) \subset \mathcal{S}_{|V|}\) be its group of automorphisms. \(G\) is vertex transitive if for all \(u, v \in V\) there is an automorphism \(\sigma \in \operatorname {Aut}(G)\) with \(\sigma (u) = v\).

LaTeX

Lemma 51Per-step contraction [Lemma F.1]

In the setting of Proposition 50, for any step size \(\gamma \),

\[ |x_{k+1} - x^*|_2^2 \le |x_k - x^*|_2^2\Big(1 - \big(2 - \tfrac {\gamma (L^2 + 2rLL’ + 4r^2(L’)^2)}{a\mu }\big)a\gamma \mu \Big) + |x_k - x^*|_2(2\gamma r\sigma )(1 + \gamma L) + 4\gamma ^2 r^2\sigma ^2 . \]

LaTeX

Lemma 22Edge-sum identity [Equation (4)]

For any edge \(e = (u,v)\) of \(G(p)\), the optimal \(\alpha ^*\) satisfies \(\alpha ^*_u + \alpha ^*_v = 2\).

LaTeX

Lemma 25Bipartite components: balance formula

If a connected component \(C = L \cup R\) of \(G(p)\) is bipartite with \(|L| \ge |R|\), then \(\alpha ^*_u = 1 + \tfrac {|L|-|R|}{|L|+|R|}\) for \(u \in R\) and \(\alpha ^*_v = 1 - \tfrac {|L|-|R|}{|L|+|R|}\) for \(v \in L\).

LaTeX

Lemma 4Multiplicative Chernoff bound for Bernoulli sums

Let \(X_1, \dots , X_t\) be i.i.d. Bernoulli random variables with mean \(q\) and let \(\delta \in (0,1)\). Then

\[ \Pr \! \left[\sum _{i=1}^t X_i \le (1-\delta )\, qt\right] \le \exp \! \left(-\tfrac {\delta ^2 qt}{2}\right). \]

LaTeX

Lemma 32Dichotomy of component sizes [Claim IV.8]

With probability at least \(1 - \tfrac {4}{n}\), simultaneously for all vertices \(v\), either the component of \(v\) has size less than \(k\) or has size strictly greater than \(n/2\), where \(k = \tfrac {2(1+\epsilon )}{\epsilon ^2} (2\log (n) - 2\log (\epsilon ) + 2\log (1+\epsilon ) - \log (1-p))\).

LaTeX

Lemma 46Cross-term bound [Claim E.3]

For any permutation \(\rho \in \mathcal{S}_n\), with \(G(x)\) as above,

\[ \mathbb {E}_\beta \big[|(G(x_k) - G(x^*))\rho ^{-1}(\beta )|_2^2\big] \le (sL' + L)\, y_k^T\nabla f(x_k). \]

LaTeX

Lemma 47Noise-term bound [Claim E.4]

With \(\sigma ^2 = \sum _i|\nabla f_i(x^*)|_2^2\) and \(r,s\) as above,

\[ \mathbb {E}_{\rho \sim \mathcal{S}_n, \beta }\big[|G(x^*)\rho ^{-1}(\beta )|_2^2\big] \le r\Big(1 + \tfrac 1{n-1}\Big)\sigma ^2 . \]

LaTeX

Lemma 52Bound on \(|G^TG\mathbb {1}|\) [Claim F.2]

With \(G = G(x_k)\), \(|G^TG\mathbb {1}|_2 \le LL'|y_k|_2^2 + L\sigma |y_k|_2\).

LaTeX

Lemma 53Bound on \(|G^T(y_k - \gamma G\mathbb {1})|\) [Claim F.3]

With \(G = G(x_k)\),

\[ |G^T(y_k - \gamma G\mathbb {1})|_2 \le |y_k|_2\big(\sigma + \sqrt{L'\mathbb {1}^TG^Ty_k}\big) + \gamma \big(LL'|y_k|_2^2 + L\sigma |y_k|_2\big). \]

LaTeX

Lemma 54Bound on \(\sigma _1(G^TG)\) [Claim F.4]

With \(G = G(x_k)\), \(\sigma _1(G^TG) \le 2|y_k|_2^2(L')^2 + 2\sigma ^2\).

LaTeX

Lemma 31Frontier lower bound [Claim IV.7]

At the end of any step \(t\) with \(|V_t| \le n/2\),

\[ |F_t| \ge \sum _{i=1}^t (X_i(\alpha _t + 1) - 1) + |V_0|\alpha _t, \qquad \alpha _t = \lambda \max \! \Big(\tfrac 12,\ 1 - \tfrac {t+|V_0|}{n}\Big). \]

LaTeX

Lemma 35Moment control of small-component sets [Claim IV.11]

For any set \(S\) of \(c \le \log (n)\) vertices, the probability that every vertex of \(S\) lies in a component of size at most \(k\) is at most \(\Big(\tfrac {p^{\lambda (1-\frac{1}{3+\epsilon })}}{(1-p)^2}\Big)^c\).

LaTeX

Lemma 33Few vertices in small components [Claim IV.9]

Let \(I_v = \mathbf{1}[|C_v| \le k]\) and \(Y = \sum _v I_v\). Then

\[ \Pr \! \left[Y \ge \frac{en\, p^{\lambda (1-\frac{1}{3+\epsilon })}}{(1-p)^2}\right] \le \frac{1}{n}. \]

Consequently, with probability \(\ge 1 - \tfrac {4}{n} - \tfrac 1n\), there is a giant component of size \(\ge n\big(1 - \tfrac {e p^{\lambda (1-\frac{1}{3+\epsilon })}}{(1-p)^2}\big)\) and no vertices in components of size between \(k\) and \(n/2\) (establishing Theorem 29).

LaTeX

Lemma 8Co-coercivity of smooth convex functions

For a convex function \(f\) whose gradient has Lipschitz constant \(L\),

\[ |\nabla f(x) - \nabla f(y)|_2^2 \le L\, \langle x - y,\ \nabla f(x) - \nabla f(y)\rangle . \]

LaTeX

Lemma 23Constancy of \(1-\alpha ^*\) on a component

For all vertices \(v\) in a single connected component of \(G(p)\), the value \(|1 - \alpha ^*_v|\) is the same.

LaTeX

Lemma 9Expander Mixing Lemma

Let \(G\) be a \(d\)-regular graph on \(n\) vertices with spectral expansion \(\lambda \) (Definition 18). For any sets of vertices \(S\) and \(T\),

\[ |E(S,T)| \ \ge \ d\frac{|S||T|}{n} - (d-\lambda )\sqrt{|S||T|\Big(1-\tfrac {|S|}{n}\Big)\Big(1-\tfrac {|T|}{n}\Big)} , \]

where \(E(S,T)\) denotes the set of edges between \(S\) and \(T\).

LaTeX

Lemma 3Jensen’s inequality

If \(\varphi \) is convex and \(X\) is an integrable random variable, then \(\varphi (\mathbb {E}[X]) \le \mathbb {E}[\varphi (X)]\).

LaTeX

Lemma 26Linear-time computation of \(w^*\)

Given the set of non-straggling machines, \(w^*\) can be computed in \(O(m)\) time: perform a breadth-first search of \(G(p)\) to find connected components and the sides \(L,R\) of any bipartite component; set \(\alpha ^*_v\) on each component using Lemmas 24–25; then a depth-first search labels each edge with \(w^*_e\) so that the incident edge weights sum to \(\alpha ^*_v\).

LaTeX

Lemma 2Markov’s inequality

If \(X \ge 0\) is a nonnegative random variable and \(a {\gt} 0\), then \(\Pr [X \ge a] \le \mathbb {E}[X]/a\).

LaTeX

Lemma 28Covariance norm bound [Lemma IV.5 / Lemma C.1]

Under the assumptions of Theorem 27,

\[ \big|\mathbb {E}[(\alpha ^* - r\mathbb {1})(\alpha ^* - r\mathbb {1})^T]\big|_2 \le 2k^2\big(t + \tfrac {6}{n}\big)^2 + 24, \]

where \(r, t, k\) are as in Theorem 27.

LaTeX

Lemma 24Non-bipartite components give \(\alpha ^* = 1\)

If a connected component of \(G(p)\) contains an odd cycle (is not bipartite), then \(\alpha ^*_v = 1\) for all vertices \(v\) in that component.

LaTeX

Lemma 34Random-walk extinction probability [Lemma IV.10]

Let \(S(\alpha ) = \sum _{i=1}^{\infty } X_i\) be a random walk where \(X_i = \alpha \) with probability \(1-p\) and \(X_i = -1\) with probability \(p\). Let \(S^{(\beta )}(\alpha ) = \beta + S(\alpha )\). If \(\alpha {\gt} 1\), then for any positive integer \(c\) the probability that \(S^{(c\alpha )}(\alpha )\) ever goes below zero is at most \(\Big(\tfrac {p^{\alpha }}{(1-p)^2}\Big)^c\).

LaTeX

Lemma 5Stirling’s formula bound

For every positive integer \(k\), \(\left(\dfrac {k!}{2^{k/2}(k/2)!}\right)^{1/k} \le \sqrt{k}\).

LaTeX

Lemma 1Cyclic property of trace

For real matrices \(A \in \mathbb {R}^{n \times m}\) and \(B \in \mathbb {R}^{m \times n}\) we have \(\operatorname {Tr}(AB) = \operatorname {Tr}(BA)\). (The paper calls this the “circular law of trace”.)

LaTeX Lean

Lemma 7Trace inequality for PSD matrices

If \(A\) and \(B\) are positive semidefinite matrices, then \(\operatorname {Tr}(AB) \le |A|_2 \, \operatorname {Tr}(B)\), where \(|A|_2\) is the operator norm.

LaTeX

Proposition 39Adversarial error via singular values [Proposition V.1]

Let \(A \in \mathbb {R}^{N \times m}\) be any assignment matrix on \(N\) data points in which each data point is replicated exactly \(d\) times and each machine holds exactly \(\ell \) data points. Let \(\sigma _2\) be the second largest singular value of \(A\). For any set \(S\) of \(s\) stragglers there exist decoding coefficients \(w = w(S)\) such that

\[ \tfrac 1N|\alpha - \mathbb {1}|_2^2 \le \tfrac 1N\Big(\tfrac {\sigma _2}{d}\Big)^2 \tfrac {sm}{m-s}. \]

LaTeX

Proposition 42Tightness of the adversarial bound [Remark V.4]

For any graph assignment scheme with \(\lfloor pm\rfloor \) stragglers and replication factor \(d\), the stragglers can be chosen so that at least \(\tfrac {mp}{d}\) data blocks are held only at straggling machines, whence for any decoding coefficients, \(\tfrac 1n|\alpha - \mathbb {1}|_2^2 \ge \tfrac {mp}{dn} = \tfrac {p}{2}\), using \(nd = 2m\).

LaTeX

Proposition 57Lower bound for any decoding algorithm [Proposition A.3]

Consider any assignment scheme \(A\) with \(m\) machines, \(n\) data blocks and \(n \le \operatorname {nnz}(A) \le dn\). Suppose a decoding algorithm yields an unbiased gradient, \(\mathbb {E}[\alpha ] = c\mathbb {1}\). Then

\[ \tfrac 1n\mathbb {E}[|\overline{\alpha } - \mathbb {1}|_2^2] \ge \tfrac {p^d}{1 - p^d}. \]

LaTeX

Proposition 36Centered binomial moment bound [Proposition IV.12]

For any \(n\), \(q\) and \(c \le \log (n)\),

\[ \mathbb {E}\big[(\mathrm{Binomial}(n,q) - nq)^c\big] \le \big(2q\sqrt{nc}\big)^c . \]

LaTeX

Proposition 50Convergence to a noise floor [Proposition VII.1]

Let \(f = \sum _i f_i\) be \(\mu \)-strongly convex with \(L\)-Lipschitz gradient, each \(f_i\) convex with \(L'\)-Lipschitz gradient, \(x^*\) the minimizer, and \(\sigma ^2 = \sum _i|\nabla f_i(x^*)|_2^2\). Perform gradient descent with \(x_{k+1} = x_k - \gamma \sum _i \alpha _i^{(k)}\nabla f_i(x_k)\) where at each iteration \(|\alpha ^{(k)} - \mathbb {1}|_2^2 \le r^2\). Let \(a = 1 - \tfrac {r\sqrt{L'}}{\sqrt\mu }\) and suppose \(a {\gt} 0\). For any \(0 {\lt} \epsilon \le 1\), with constant step size \(\gamma = \tfrac {\epsilon a\mu }{4(L^2 + 2rLL’ + 4r^2(L’)^2)}\), after

\[ k \le \frac{4(1+\epsilon )(L^2 + 2rLL' + 4r^2(L')^2) \log \big(\tfrac {2a^2\mu ^2|x_0 - x_*|_2^2}{(1+\epsilon )^2 r^2\sigma ^2}\big)}{3a^2\mu ^2\epsilon ^2} \]

iterations, \(|x_k - x_*|_2 \le (1 + \epsilon )\tfrac {r\sigma }{a\mu }\).

LaTeX

Proposition 60Convergence with a debiased scheme [Proposition B.2]

Let \(f = \sum _{i=1}^N f_i\) be \(\mu \)-strongly convex with \(L\)-Lipschitz gradient, each \(f_i\) convex with \(L'\)-Lipschitz gradient, \(\theta ^*\) the minimizer, \(\sigma ^2 = \sum _i|\nabla f_i(\theta ^*)|_2^2\). Suppose a possibly biased coding scheme with computational load \(\ell \) has \(\tfrac 1N\mathbb {E}[|\alpha - \mathbb {1}|_2^2] \le \zeta \) for some \(\zeta \le \tfrac 14\). Debias it via Proposition 59 (load \(\le 2\ell \)) and run gradient coding as in Algorithm 43 with \(w\) from the original scheme, with straggling probability \(p\). Then for any \(\epsilon \) there is a step size \(\gamma \) such that after

\[ k = 2\log (\epsilon /\epsilon _0)\Big(\tfrac {L}{\mu } + 8n\zeta \tfrac {L’}{\mu } + \tfrac {\zeta \sigma ^2}{\mu ^2\epsilon }\Big) \]

steps, \(\mathbb {E}[|\theta _k - \theta ^*|_2^2] \le \epsilon \), where \(\epsilon _0 = |\theta _0 - \theta ^*|^2\).

LaTeX

Proposition 49Convergence of coded GD with graph schemes [Proposition VI.3]

Let \(f = \sum _{i=1}^n f_i\) be \(\mu \)-strongly convex with \(L\)-Lipschitz gradient, each \(f_i\) convex with \(L'\)-Lipschitz gradient, \(\theta ^*\) the minimizer, and \(\sigma ^2 = \sum _i|\nabla f_i(\theta ^*)|_2^2\). Perform gradient coding with optimal decoding (Algorithm 43) with an assignment matrix for a \(d\)-regular vertex-transitive graph of spectral gap \(d - o(d)\), so \(m = \tfrac {nd}{2}\), and machine straggling probability \(p\). Then for any accuracy \(\epsilon \) there is a step size \(\gamma \) such that after

\[ k = 2\log (\epsilon _0/\epsilon )\Big(\tfrac {L}{\mu } + \log ^2(n)p^{2d - o(d)}\tfrac {L’}{\mu } + \tfrac {p^{d-o(d)}\sigma ^2}{\mu ^2\epsilon }\Big) \]

steps, \(\mathbb {E}[|\theta _k - \theta ^*|_2^2] \le \epsilon \), where \(\epsilon _0 = |\theta _0 - \theta ^*|^2\).

LaTeX

Proposition 45Convergence of SGD-ALG [Proposition VI.1]

Let \(f = \sum _{i=1}^n f_i\) be \(\mu \)-strongly convex with \(L\)-Lipschitz gradient, each \(f_i\) convex with \(L'\)-Lipschitz gradient, and let \(x^*\) be the minimizer. Run SGD-ALG\((P_\beta , x_0, \gamma , \{ f_i\} , k)\) from \(x_0\) for \(k\) iterations with step size \(\gamma \le \tfrac {1}{sL’+L}\) and a distribution \(P_\beta \) with \(\mathbb {E}[\beta ] = \mathbb {1}\). Let \(\sigma ^2 = \sum _i|\nabla f_i(x^*)|_2^2\), \(r = \tfrac 1n\mathbb {E}[|\beta - \mathbb {1}|_2^2]\), and \(s = |\mathbb {E}[(\beta - \mathbb {1})(\beta - \mathbb {1})^T]|_2\). Then

\[ \mathbb {E}[|x_k - x^*|_2^2] \le (1 - 2\gamma \mu (1 - \gamma (sL'+L)))^k |x_0 - x^*|_2^2 + \frac{\gamma r(1 + \tfrac 1{n-1})\sigma ^2}{\mu (1 - \gamma (sL'+L))}, \]

the expectation being over \(\rho \) and \(\{ \beta ^{(j)} : j {\lt} k\} \).

LaTeX

Proposition 59Debiasing a biased scheme [Proposition B.1]

Suppose there is an assignment matrix \(A\) with computational load \(\ell \) on \(m\) machines and \(N\) data blocks, and a decoding strategy \(w\) with \(\alpha \) such that \(\mathbb {E}[|\alpha - \mathbb {1}|_2^2] \le \epsilon N\). Then there exists an assignment matrix \(\hat A\) with computational load at most \(2\ell \) on \(m\) machines and \(N\) data blocks, and a decoding strategy \(\hat w\) with \(\hat\alpha \) such that

\[ \mathbb {E}[|\hat\alpha - \mathbb {1}|_2^2] \le \tfrac {2\epsilon }{(1 - \sqrt{2\epsilon })^2}N \qquad \text{and}\qquad \mathbb {E}[\hat\alpha ] = \mathbb {1}. \]

LaTeX

Proposition 56Fixed-decoding lower bound [Proposition A.1]

Consider any assignment scheme \(A\) with \(m\) machines, \(n\) data blocks, and \(n \le \operatorname {nnz}(A) \le dn\). Suppose a fixed decoding scheme is used that yields an unbiased gradient: (1) for some \(\hat w\), \(w_j = \hat w_j\) if machine \(j\) does not straggle and \(0\) otherwise; (2) \(\mathbb {E}[\alpha ] = c\mathbb {1}\). Then

\[ \tfrac 1n\mathbb {E}[|\overline{\alpha } - \mathbb {1}|_2^2] \ge \tfrac {p}{d(1-p)}, \qquad |\mathbb {E}[(\overline{\alpha } - \mathbb {1})(\overline{\alpha } - \mathbb {1})^T]|_2 \ge \tfrac {n}{m}\tfrac {p}{1-p}. \]

(For graph schemes \(m = dn/2\), so the second bound is \(\ge \tfrac {2p}{d(1-p)}\).)

LaTeX

Proposition 58Lower bound robustness [Remark A.4]

The lower bound of Proposition 57 holds even if the distributed algorithm uses a more complicated coding strategy than the linear scheme of the introduction, including non-linear or coordinate-wise coding of the gradients (e.g. multiplying gradients by a matrix).

LaTeX

Proposition 38Lower bound on random error \(\geq p^d/(1-p^d)\) [Remark IV.2]

For a graph assignment scheme with replication factor \(d\), \(\tfrac 1n\mathbb {E}[|\overline{\alpha ^*} - \mathbb {1}|_2^2] \ge p^d\), because \(p^d\) is the probability that a fixed data block is stored only at straggling machines. Theorem 27 therefore implies the variance of \(\alpha ^*\) shrinks exponentially in \(d\); for Ramanujan graphs (\(\lambda \ge d - 2\sqrt{d-1}\)) the bound is tight up to factors in the exponent.

LaTeX

Theorem 27Random straggler error [Theorem IV.1]

Let \(G = (V,E)\) be any vertex-transitive graph with \(n\) vertices, \(m\) edges, and spectral expansion \(\lambda \), and let \(A\) be the graph assignment matrix of \(G\). Suppose for some positive \(\epsilon \):

\(\lambda {\gt} 1.5\);
\((\tfrac {\lambda }{2}+1)(1 - p e^{1/\lambda }) \ge 1 + \epsilon \);
\(\Big(1 - \tfrac {e^2 p^{\lambda (1-\frac{1}{3+\epsilon })}}{(1-pe^{1/\lambda })^2}\Big) \Big(\lambda - (2d-\lambda )\tfrac {e^2 p^{\lambda (1-\frac{1}{3+\epsilon })}}{(1-pe^{1/\lambda })^2}\Big) \ge \epsilon \);
\(\tfrac {n}{\log (n)^2} \ge \tfrac {4(1+\epsilon )}{\epsilon ^2} \big(2 - 2\log (\epsilon ) + 2\log (1+\epsilon ) - \log (1 - pe^{1/\lambda })\big)\).

If each machine straggles independently with probability \(p\), then

\(\mathbb {E}[\alpha ^*] = r\mathbb {1}\) for some \(r \ge 1 - \tfrac {6}{n} - t\);
for all \(i\), \(\mathbb {E}[(\alpha ^*_i - r)^2] \le \tfrac {6}{n} + t\);
\(\big|\mathbb {E}[(\alpha ^* - r\mathbb {1})(\alpha ^* - r\mathbb {1})^T]\big|_2 \le 2k^2\big(t + \tfrac {6}{n}\big)^2 + 24\),

where \(t = \tfrac {e^2 p^{\lambda (1-\frac{1}{3+\epsilon })}}{(1-pe^{1/\lambda })^2}\) and \(k = \tfrac {2(1+\epsilon )}{\epsilon ^2} \big(2\log (n) - 2\log (\epsilon ) + 2\log (1+\epsilon ) - \log (1-p)\big)\), and all expectations are over the random stragglers.

LaTeX

Theorem 29Giant component of a sparsified expander [Theorem IV.3]

Let \(G\) be any \(d\)-regular \(\lambda \)-spectral expander on \(n\) vertices and suppose for some positive \(\epsilon \): (1) \(\lambda {\gt} 1.5\); (2) \((\tfrac {\lambda }{2}+1)(1-p) \ge 1 + \epsilon \); (3) \(\tfrac {n}{\log (n)^2} \ge \tfrac {2(3+\epsilon )(1+\epsilon )}{\epsilon ^2} \big(2 - 2\log (\epsilon ) + 2\log (1+\epsilon ) - \log (1-p)\big)\). Let \(G(p)\) be the random sparsification deleting each edge with probability \(p\). With probability at least \(1 - \tfrac {5}{n}\):

\(G(p)\) has a giant component of size at least \(n\Big(1 - \tfrac {e p^{\lambda (1-\frac{1}{3+\epsilon })}}{(1-p)^2}\Big)\);
every vertex is either in a component of size at most \(k\), where \(k\) is as in Theorem 27, or in a component of size greater than \(n/2\).

LaTeX