9 Appendix A: Lower bounds for random stragglers

Proposition 56 Fixed-decoding lower bound [Proposition A.1]

Consider any assignment scheme \(A\) with \(m\) machines, \(n\) data blocks, and \(n \le \operatorname {nnz}(A) \le dn\). Suppose a fixed decoding scheme is used that yields an unbiased gradient: (1) for some \(\hat w\), \(w_j = \hat w_j\) if machine \(j\) does not straggle and \(0\) otherwise; (2) \(\mathbb {E}[\alpha ] = c\mathbb {1}\). Then

\[ \tfrac 1n\mathbb {E}[|\overline{\alpha } - \mathbb {1}|_2^2] \ge \tfrac {p}{d(1-p)}, \qquad |\mathbb {E}[(\overline{\alpha } - \mathbb {1})(\overline{\alpha } - \mathbb {1})^T]|_2 \ge \tfrac {n}{m}\tfrac {p}{1-p}. \]

(For graph schemes \(m = dn/2\), so the second bound is \(\ge \tfrac {2p}{d(1-p)}\).)

Proof ▶

Assume WLOG \(\hat w = \mathbb {1}\) (scale columns of \(A\) by \(\hat w_j\)) and \(c = 1\). By independence of machine failures,

\[ \mathbb {E}[(\alpha - \mathbb {E}[\alpha ])(\alpha - \mathbb {E}[\alpha ])^T] = A\, \operatorname {Cov}(w)\, A^T = A(p(1-p)I)A^T = p(1-p)AA^T . \]

By the cyclic property of trace (Lemma 1), \(\mathbb {E}[|\alpha - \mathbb {1}|_2^2] = \operatorname {Tr}(p(1-p)AA^T) = p(1-p)|A|_F^2\). Now \(\mathbb {1}^TA\mathbb {1}= \mathbb {1}^TA\tfrac {\mathbb {E}[w]}{1-p}\cdot (1-p) = \tfrac {\mathbb {1}^T\mathbb {E}[\alpha ]}{1-p}(1-p)\cdot \! \ldots \) more directly \(\mathbb {1}^TA\hat w = \tfrac {\mathbb {1}^T\mathbb {E}[\alpha ]}{1-p} = \tfrac {n}{1-p}\), so to minimize \(|A|_F\) under this constraint all nonzero entries equal \(\tfrac 1{d(1-p)}\), giving \(|A|_F^2 = \tfrac {n}{d(1-p)^2}\) and \(\mathbb {E}[|\alpha - \mathbb {1}|_2^2] \ge \tfrac {pn}{d(1-p)}\). Also \(|AA^T|_2 \ge \tfrac 1m\mathbb {1}^TA^TA\mathbb {1}= \tfrac 1m\tfrac {n}{(1-p)^2}\), and since \(\operatorname {Cov}(\alpha ) = p(1-p)AA^T\) the covariance bound follows.

Proposition 57 Lower bound for any decoding algorithm [Proposition A.3]

Consider any assignment scheme \(A\) with \(m\) machines, \(n\) data blocks and \(n \le \operatorname {nnz}(A) \le dn\). Suppose a decoding algorithm yields an unbiased gradient, \(\mathbb {E}[\alpha ] = c\mathbb {1}\). Then

\[ \tfrac 1n\mathbb {E}[|\overline{\alpha } - \mathbb {1}|_2^2] \ge \tfrac {p^d}{1 - p^d}. \]

Proof ▶

WLOG scale so \(\mathbb {E}[\alpha ] = \mathbb {1}\). For each block \(i\) with replication factor \(d_i\), with probability at least \(p^{d_i}\) all machines holding block \(i\) straggle, forcing \(\alpha _i = 0\). Since \(\mathbb {E}[\alpha _i] = 1\),

\[ \mathbb {E}[(\alpha _i - 1)^2] \ge p^{d_i} + (1 - p^{d_i})\Big(\tfrac 1{1 - p^{d_i}} - 1\Big)^2 = \tfrac {p^{d_i}}{1 - p^{d_i}} . \]

Hence \(\sum _i \mathbb {E}[\alpha _i] \ge \sum _i \tfrac {p^{d_i}}{1-p^{d_i}} = n - \sum _i\tfrac 1{1-p^{d_i}}\). Subject to \(\sum _i d_i \le dn\), this is minimized when all \(d_i = d\), giving \(\tfrac 1n\mathbb {E}[|\overline\alpha - \mathbb {1}|_2^2] \ge \tfrac {p^d}{1-p^d}\).

Proposition 58 Lower bound robustness [Remark A.4]

The lower bound of Proposition 57 holds even if the distributed algorithm uses a more complicated coding strategy than the linear scheme of the introduction, including non-linear or coordinate-wise coding of the gradients (e.g. multiplying gradients by a matrix).