Fundamental Limitations on Subquadratic Alternatives to Transformers

3 Document similarity embeddings and problems

Definition 12 Bag-of-words embedding
#

Fix a list of \(\ell \) key words. The bag-of-words embedding of a document \(D\) is the vector \(\operatorname {BOW}(D) \in \{ 0,1\} ^{\ell }\) whose \(i\)-th entry is \(1\) iff the \(i\)-th key word occurs in \(D\). (E.g. a document containing only "There are ten apples on the apple tree", with key words "apple", "tree", "computer", "ten", has embedding \((1,1,0,1)\).)

Definition 13 Cosine similarity

For nonzero document embeddings \(v, w \in \mathbb {R}^{d}\), their cosine similarity is \(\dfrac {\langle v, w \rangle }{\lVert v \rVert \cdot \lVert w \rVert } \in [0,1]\) when \(v,w \in \{ 0,1\} ^d\); the value \(1\) means complete similarity and \(0\) means no similarity. (Used instead of the raw inner product so that two vectors with similar directions but large magnitude are still considered close.)

Definition 14 Most Similar Document, \(\mathsf{MSD}_{n,\ell }\), Definition 2.10

Given \(n\) document embeddings \(v_1, \ldots , v_n \in \{ 0,1\} ^{\ell }\) (assumed nonzero), \(\mathsf{MSD}_{n,\ell }\) asks to find \(1 \le i,j \le n\), \(i \ne j\), maximizing \(\dfrac {\langle v_i, v_j \rangle }{\lVert v_i \rVert \cdot \lVert v_j \rVert }\). Note \(\mathsf{MSD}\) differs from \(\mathsf{Max\text{-}IP}\) because of the normalization by \(\lVert v_i \rVert \lVert v_j \rVert \).

Definition 15 \(\gamma \)-MSD, Definition 2.11

Given \(v_1, \ldots , v_n \in \{ 0,1\} ^{\ell }\), \(\gamma \text{-}\mathsf{MSD}_{n,\ell }\) asks to find \(i^{*}, j^{*}\) with \(i^{*}\ne j^{*}\) such that

\[ \frac{1}{\gamma }\cdot \max _{1 \le i,j \le n} \dfrac {\langle v_i, v_j \rangle }{\lVert v_i \rVert \cdot \lVert v_j \rVert } \le \dfrac {\langle v_{i^{*}}, v_{j^{*}} \rangle }{\lVert v_{i^{*}} \rVert \cdot \lVert v_{j^{*}} \rVert } \le \max _{1 \le i,j \le n} \dfrac {\langle v_i, v_j \rangle }{\lVert v_i \rVert \cdot \lVert v_j \rVert }, \]

i.e. a pair whose cosine similarity is within a factor \(\gamma \) of the maximum.

Definition 16 MSD decision version, \(\mathsf{MSD}_{n,\ell ,t}\), Definition 2.12

Given \(v_1, \ldots , v_n \in \{ 0,1\} ^{\ell }\) and \(t \in [0,1]\), \(\mathsf{MSD}_{n,\ell ,t}\) asks to determine whether there exist \(i \ne j\) with \(\dfrac {\langle v_i, v_j \rangle }{\lVert v_i \rVert \cdot \lVert v_j \rVert } \ge t\).

Definition 17 Least Similar Document, \(\mathsf{LSD}_{n,\ell }\), Definition A.14

Given nonzero \(v_1, \ldots , v_n \in \{ 0,1\} ^{\ell }\), \(\mathsf{LSD}_{n,\ell }\) asks to find \(i \ne j\) minimizing \(\dfrac {\langle v_i, v_j \rangle }{\lVert v_i \rVert \cdot \lVert v_j \rVert }\).

Definition 18 \(\gamma \)-LSD, Definition A.15

Given \(v_1, \ldots , v_n \in \{ 0,1\} ^{\ell }\), \(\gamma \text{-}\mathsf{LSD}_{n,\ell }\) asks to find \(i^{*} \ne j^{*}\) such that

\[ \min _{1 \le i,j \le n} \dfrac {\langle v_i, v_j \rangle }{\lVert v_i \rVert \cdot \lVert v_j \rVert } \le \dfrac {\langle v_{i^{*}}, v_{j^{*}} \rangle }{\lVert v_{i^{*}} \rVert \cdot \lVert v_{j^{*}} \rVert } \le \gamma \cdot \min _{1 \le i,j \le n} \dfrac {\langle v_i, v_j \rangle }{\lVert v_i \rVert \cdot \lVert v_j \rVert }. \]
Definition 19 LSD decision version, \(\mathsf{LSD}_{n,\ell ,t}\), Definition A.16

Given \(v_1, \ldots , v_n \in \{ 0,1\} ^{\ell }\) and \(t \in [0,1]\), \(\mathsf{LSD}_{n,\ell ,t}\) asks to determine whether there exist \(i \ne j\) with \(\dfrac {\langle v_i, v_j \rangle }{\lVert v_i \rVert \cdot \lVert v_j \rVert } \le t\).