9 Appendix A: Lower bounds for random stragglers
Consider any assignment scheme \(A\) with \(m\) machines, \(n\) data blocks, and \(n \le \operatorname {nnz}(A) \le dn\). Suppose a fixed decoding scheme is used that yields an unbiased gradient: (1) for some \(\hat w\), \(w_j = \hat w_j\) if machine \(j\) does not straggle and \(0\) otherwise; (2) \(\mathbb {E}[\alpha ] = c\mathbb {1}\). Then
(For graph schemes \(m = dn/2\), so the second bound is \(\ge \tfrac {2p}{d(1-p)}\).)
Assume WLOG \(\hat w = \mathbb {1}\) (scale columns of \(A\) by \(\hat w_j\)) and \(c = 1\). By independence of machine failures,
By the cyclic property of trace (Lemma 1), \(\mathbb {E}[|\alpha - \mathbb {1}|_2^2] = \operatorname {Tr}(p(1-p)AA^T) = p(1-p)|A|_F^2\). Now \(\mathbb {1}^TA\mathbb {1}= \mathbb {1}^TA\tfrac {\mathbb {E}[w]}{1-p}\cdot (1-p) = \tfrac {\mathbb {1}^T\mathbb {E}[\alpha ]}{1-p}(1-p)\cdot \! \ldots \) more directly \(\mathbb {1}^TA\hat w = \tfrac {\mathbb {1}^T\mathbb {E}[\alpha ]}{1-p} = \tfrac {n}{1-p}\), so to minimize \(|A|_F\) under this constraint all nonzero entries equal \(\tfrac 1{d(1-p)}\), giving \(|A|_F^2 = \tfrac {n}{d(1-p)^2}\) and \(\mathbb {E}[|\alpha - \mathbb {1}|_2^2] \ge \tfrac {pn}{d(1-p)}\). Also \(|AA^T|_2 \ge \tfrac 1m\mathbb {1}^TA^TA\mathbb {1}= \tfrac 1m\tfrac {n}{(1-p)^2}\), and since \(\operatorname {Cov}(\alpha ) = p(1-p)AA^T\) the covariance bound follows.
Consider any assignment scheme \(A\) with \(m\) machines, \(n\) data blocks and \(n \le \operatorname {nnz}(A) \le dn\). Suppose a decoding algorithm yields an unbiased gradient, \(\mathbb {E}[\alpha ] = c\mathbb {1}\). Then
WLOG scale so \(\mathbb {E}[\alpha ] = \mathbb {1}\). For each block \(i\) with replication factor \(d_i\), with probability at least \(p^{d_i}\) all machines holding block \(i\) straggle, forcing \(\alpha _i = 0\). Since \(\mathbb {E}[\alpha _i] = 1\),
Hence \(\sum _i \mathbb {E}[\alpha _i] \ge \sum _i \tfrac {p^{d_i}}{1-p^{d_i}} = n - \sum _i\tfrac 1{1-p^{d_i}}\). Subject to \(\sum _i d_i \le dn\), this is minimized when all \(d_i = d\), giving \(\tfrac 1n\mathbb {E}[|\overline\alpha - \mathbb {1}|_2^2] \ge \tfrac {p^d}{1-p^d}\).
The lower bound of Proposition 57 holds even if the distributed algorithm uses a more complicated coding strategy than the linear scheme of the introduction, including non-linear or coordinate-wise coding of the gradients (e.g. multiplying gradients by a matrix).