8 Representational strength of transformers

Theorem 50 An attention unit solves OV, Theorem 4.1

An attention unit with input and output MLPs and parameters \(d = \ell \), \(d_{\mathrm{in}} = \ell \), \(d_{\mathrm{out}} = 1\), \(m \ge \ell + 1\) can solve \(\mathsf{OV}_{n,\ell }\).

Proof ▶

Let \(v_1, \ldots , v_n \in \{ 0,1\} ^{\ell }\) be an \(\mathsf{OV}_{n,\ell }\) instance; define \(v_{n+1} := 0^{\ell }\) and let \(X \in \mathbb {R}^{(n+1)\times \ell }\) have rows \(X_{i,:} = v_i\). Since \(m \ge \ell + 1\), choose \(Q, K \in \mathbb {R}^{\ell \times m}\) with \(QK^{\top } = -3\log n \cdot I_{\ell }\), let \(V \in \mathbb {R}^{\ell \times 1}\) be the all-one matrix, and take \(\varphi _1\) to be the identity. We claim that if there is a pair \(i \ne j\) with \(\langle v_i, v_j \rangle = 0\) then \(A_{Q,K,V}(X)\) has an entry \(\ge \frac{1}{n+1}\), and otherwise all entries are \(\le \frac{1}{(n+1)^{1.5}}\); the second MLP \(\varphi _2\) then outputs \(1\) iff some entry is \(\ge \frac{1}{n+1}\) (existence guaranteed by Lemma 53).

The \(i\)-th entry of \(A_{Q,K,V}(X)\) equals

\[ \sum _{j=1}^{n} \frac{\exp (-3\log n \cdot \langle v_i, v_j \rangle )}{\sum _{k=1}^{n}\exp (-3\log n \cdot \langle v_i, v_k \rangle ) + 1}\cdot \lVert v_j \rVert _1 = \frac{\sum _{j=1}^{n} n^{-3\langle v_i, v_j \rangle }\cdot \lVert v_j \rVert _1}{\sum _{k=1}^{n} n^{-3\langle v_i, v_k \rangle } + 1}, \]

where \(\lVert v_j \rVert _1 = (XV)_j\) (the all-one \(V\) sums the entries of \(v_j\)) and the \(+1\) in the denominator is the contribution of \(v_{n+1} = 0^{\ell }\) (giving \(n^{0} = 1\)). If there is a pair with \(\langle v_{i^{*}}, v_{j^{*}} \rangle = 0\), the \(i^{*}\)-th entry is at least

\[ \frac{n^{-3\langle v_{i^{*}}, v_{j^{*}} \rangle }\cdot \lVert v_{j^{*}} \rVert _1}{n+1} \ge \frac{1}{n+1}, \]

since \(\lVert v_{j^{*}} \rVert _1 \ge 1\) and the denominator is at most \(n+1\). Otherwise \(\langle v_i, v_j \rangle \ge 1\) for all \(i,j\), so the \(i\)-th entry is at most

\[ \sum _{j=1}^{n} \frac{n^{-3\langle v_i, v_j \rangle } \cdot \lVert v_j \rVert _1}{1} \le \sum _{j=1}^{n} \frac{\ell }{n^{3\langle v_i, v_j \rangle }} \le n\cdot \frac{\ell }{n^{3}} {\lt} \frac{1}{n^{1.5}}, \]

using \(\lVert v_j \rVert _1 \le \ell \) and \(\ell = O(\operatorname {polylog} n)\).

Theorem 51 An attention unit solves Max-IP and Min-IP decision, Theorem C.1

An attention unit with input and output MLPs and parameters \(d = \ell \), \(d_{\mathrm{in}} = \ell + 1\), \(d_{\mathrm{out}} = 1\), \(m \ge \ell + 1\) can solve \(\mathsf{Max\text{-}IP}_{n,\ell ,t}\) and \(\mathsf{Min\text{-}IP}_{n,\ell ,t}\) for \(1 \le t \le \ell \).

Proof ▶

Max-IP. Given \(v_1, \ldots , v_n \in \{ 0,1\} ^{\ell }\) and \(V \in \mathbb {R}^{n\times \ell }\) with \(V_{i,:} = v_i\), set \(x_i = (v_i, 1) \in \{ 0,1\} ^{\ell +1}\), \(x_{n+1} = (0, \ldots , 0, t+1) \in \mathbb {R}^{\ell +1}\), and let \(X \in \mathbb {R}^{(n+1)\times (\ell +1)}\) have rows \(X_{i,:} = x_i\). Choose \(Q, K \in \mathbb {R}^{(\ell +1)\times m}\) with \(QK^{\top } = 3\log n \cdot I_{\ell +1}\), and \(\mathcal V = (1, \ldots , 1, 0) \in \mathbb {R}^{\ell +1}\). The input MLP \(\varphi _1\) produces the rows \(x_i\) from the \(v_i\) via Lemma 54. Then the \(i\)-th entry of \(A_{Q,K,\mathcal V}(X)\) equals

\[ \frac{\sum _{j=1}^{n} n^{3\langle x_i, x_j \rangle }\cdot \lVert v_j \rVert _1}{\sum _{k=1}^{n} n^{3\langle x_i, x_k \rangle } + n^{3(t+1)}}. \]

Let \(i^{*} \ne j^{*}\) maximize \(\langle v_{i^{*}}, v_{j^{*}} \rangle \). If \(\langle v_{i^{*}}, v_{j^{*}} \rangle \ge t\), then \(\langle x_{i^{*}}, x_{j^{*}} \rangle \ge t+1\) and the \(i^{*}\)-th entry is at least

\[ \frac{n^{3\langle x_{i^{*}}, x_{j^{*}} \rangle }\cdot \lVert v_{j^{*}} \rVert _1}{\sum _{k} n^{3\langle x_{i^{*}}, x_k \rangle } + n^{3(t+1)}} \ge \frac{\lVert v_{j^{*}} \rVert _1}{n+1} \ge \frac{1}{n+1}. \]

If instead \(\langle v_i, v_j \rangle \le t-1\) for all \(i \ne j\), then \(\langle x_i, x_j \rangle \le t\) and the \(i\)-th entry is at most

\[ \frac{\sum _{j=1}^{n} n^{3\langle x_i, x_j \rangle }\cdot \lVert v_j \rVert _1}{n^{3(t+1)}} \le \frac{\ell \cdot n\cdot n^{3t}}{n^{3(t+1)}} = \frac{\ell }{n^{2}} {\lt} \frac{1}{(n+1)^{1.5}}. \]

Thus \(\varphi _2\) outputs \(1\) iff some entry is \(\ge \frac{1}{n+1}\) (Lemma 53), solving \(\mathsf{Max\text{-}IP}_{n,\ell ,t}\).

Min-IP. The proof is the same except \(QK^{\top } = -3\log n \cdot I_{\ell +1}\) and \(x_{n+1} = (0, \ldots , 0, -(t+1)) \in \mathbb {R}^{\ell +1}\). The \(i\)-th entry of \(A_{Q,K,\mathcal V}(X)\) becomes

\[ \frac{\sum _{j=1}^{n} n^{-3\langle x_i, x_j \rangle }\cdot \lVert v_j \rVert _1}{\sum _{k=1}^{n} n^{-3\langle x_i, x_k \rangle } + n^{-3(t+1)}}. \]

Let \(i^{*}, j^{*}\) minimize \(\langle x_{i^{*}}, x_{j^{*}} \rangle \). If \(\langle v_{i^{*}}, v_{j^{*}} \rangle \le t\) then \(\langle x_{i^{*}}, x_{j^{*}} \rangle \le t+1\) and the \(i^{*}\)-th entry is at least \(\frac{1}{n+1}\); if \(\langle v_i, v_j \rangle \ge t+1\) for all \(i \ne j\) then \(\langle x_i, x_j \rangle \ge t+2\) and the \(i\)-th entry is at most \(\frac{\ell }{n^{2}} {\lt} \frac{1}{(n+1)^{1.5}}\). Again \(\varphi _2\) thresholds at \(\frac{1}{n+1}\).

Theorem 52 An attention unit solves MSD and LSD decision, Theorem C.2

An attention unit with input and output MLPs and parameters \(d = \ell \), \(d_{\mathrm{in}} = \ell + 1\), \(d_{\mathrm{out}} = 1\), \(m \ge \ell + 1\) can solve \(\mathsf{MSD}_{n,\ell ,t}\) and \(\mathsf{LSD}_{n,\ell ,t}\) for any \(t \in [0,1]\).

Proof ▶

When \(t = 0\) the problem is trivial (output \(1\)), so assume \(t \ne 0\). Given an \(\mathsf{MSD}_{n,\ell ,t}\) instance \(v_1, \ldots , v_n\) with \(V_{i,:} = v_i\), set \(x_i = \left(\frac{v_i}{\lVert v_i \rVert _1}, 1\right) \in \mathbb {R}^{\ell +1}\), \(x_{n+1} = (0, \ldots , 0, t+1)\), and \(X\) with rows \(x_i\). Take \(QK^{\top } = 3\log n \cdot I_{\ell +1}\) and \(\mathcal V = (1, \ldots , 1, 0)\). Because cosine similarity is exactly the inner product after normalizing the embeddings, the analysis of Theorem 51 (Max-IP case) applies once \(\varphi _1\) produces the normalized rows \(x_i\); the required normalizing MLP exists by Lemma 55. The proof for \(\mathsf{LSD}_{n,\ell ,t}\) is identical to the Min-IP case of Theorem 51 after applying the same normalizing \(\varphi _1\) (Lemma 55).