- Boxes
- definitions
- Ellipses
- theorems and lemmas
- Blue border
- the statement of this result is ready to be formalized; all prerequisites are done
- Orange border
- the statement of this result is not ready to be formalized; the blueprint needs more work
- Blue background
- the proof of this result is ready to be formalized; all prerequisites are done
- Green border
- the statement of this result is formalized
- Green background
- the proof of this result is formalized
- Dark green background
- the proof of this result and all its ancestors are formalized
- Dark green border
- this is in Mathlib
Assuming SETH or OVC, for every \(\varepsilon {\gt}0\) there exists a constant \(c{\gt}0\) such that \(\mathsf{LSD}_{n,\ell }\) cannot be solved in \(O(n^{2-\varepsilon })\) time when \(\ell \ge c\log n\). Moreover, the same lower bound holds for bichromatic \(\gamma \text{-}\mathsf{LSD}_{n,\ell }\) for all \(\gamma \ge 1\), and for \(\mathsf{LSD}_{n,\ell ,t}\) for some \(t \in [0,1]\).
Assuming SETH or OVC, for every \(\varepsilon {\gt}0\) there exists a constant \(c{\gt}0\) such that \(\mathsf{MSD}_{n,\ell }\) cannot be solved in \(O(n^{2-\varepsilon })\) time when \(\ell \ge (\log n)^{\frac{c\log n}{(\log \log n)^{2}}}\). Moreover the same lower bound holds for bichromatic \(\gamma \text{-}\mathsf{MSD}_{n,\ell }\) for all \(1 \le \gamma \le (1 + \tfrac {1}{\log \log n})^{\frac{\log n}{(\log \log n)^{2}}}\) and for \(\mathsf{MSD}_{n,\ell ,t}\) for some \(t \in [0,1]\).
For \(A, B \subseteq \{ 0,1\} ^{\ell }\) with \(|A| = |B| = n\) and a threshold \(\alpha \), the bichromatic \(\gamma \)-Additive-Max-IP problem (\(\gamma \)-Additive-BMax-IP) asks to distinguish: Yes: there is \((a,b)\in A\times B\) with \(\langle a, b \rangle \ge \alpha \); No: for all \((a,b)\in A\times B\), \(\langle a, b \rangle {\lt} \alpha - \gamma \).
For \(A, B \subseteq \{ 0,1\} ^{\ell }\) with \(|A| = |B| = n\) and \(\alpha \in [0,1]\), the bichromatic \(\gamma \)-Additive-MSD problem asks to distinguish:
Yes: there exists \((a,b) \in A \times B\) with \(\dfrac {\langle a, b \rangle }{\lVert a \rVert \lVert b \rVert } \ge \alpha \);
No: for every \((a,b) \in A \times B\), \(\dfrac {\langle a, b \rangle }{\lVert a \rVert \lVert b \rVert } {\lt} \alpha - \gamma \).
For input dimension \(d_{\mathrm{in}} \in \mathbb {N}\), output dimension \(d_{\mathrm{out}} \in \mathbb {N}\), embedding dimension \(m \in \mathbb {N}\), and matrices \(Q, K \in \mathbb {R}^{d_{\mathrm{in}} \times m}\) and \(V \in \mathbb {R}^{d_{\mathrm{in}} \times d_{\mathrm{out}}}\), an attention is the mapping \(A_{Q,K,V} : \mathbb {R}^{n \times d_{\mathrm{in}}} \to \mathbb {R}^{n \times d_{\mathrm{out}}}\) defined by
We write \(\mathcal A_{d_{\mathrm{in}}, m, d_{\mathrm{out}}}\) for the set of all such attentions.
Given two sets \(A, B\) of vectors in \(\{ 0,1\} ^{\ell }\), bichromatic \(\mathsf{Max\text{-}IP}_{n,\ell }\) (resp. its \(\gamma \)-approximate and decision versions) asks to find \(i,j\) achieving (resp. \(\gamma \)-approximating, deciding) the maximal \(\langle a_i, b_j \rangle \).
Given two sets \(A, B\) of vectors in \(\{ 0,1\} ^{\ell }\), bichromatic \(\mathsf{Min\text{-}IP}_{n,\ell }\) (resp. its \(\gamma \)-approximate and decision versions) asks to find \(i,j\) achieving (resp. \(\gamma \)-approximating, deciding) the minimal \(\langle a_i, b_j \rangle \) over \(a_i \in A\), \(b_j \in B\).
Given two sets \(A = \{ a_1, \ldots , a_n\} \) and \(B = \{ b_1, \ldots , b_n\} \) of vectors in \(\{ 0,1\} ^{\ell }\), bichromatic \(\mathsf{OV}_{n,\ell }\) asks whether there exist \(i,j\) with \(\langle a_i, b_j \rangle = 0\).
Fix a list of \(\ell \) key words. The bag-of-words embedding of a document \(D\) is the vector \(\operatorname {BOW}(D) \in \{ 0,1\} ^{\ell }\) whose \(i\)-th entry is \(1\) iff the \(i\)-th key word occurs in \(D\). (E.g. a document containing only "There are ten apples on the apple tree", with key words "apple", "tree", "computer", "ten", has embedding \((1,1,0,1)\).)
For nonzero document embeddings \(v, w \in \mathbb {R}^{d}\), their cosine similarity is \(\dfrac {\langle v, w \rangle }{\lVert v \rVert \cdot \lVert w \rVert } \in [0,1]\) when \(v,w \in \{ 0,1\} ^d\); the value \(1\) means complete similarity and \(0\) means no similarity. (Used instead of the raw inner product so that two vectors with similar directions but large magnitude are still considered close.)
Given \(v_1, \ldots , v_n \in \{ 0,1\} ^{\ell }\), \(\gamma \text{-}\mathsf{LSD}_{n,\ell }\) asks to find \(i^{*} \ne j^{*}\) such that
Given \(v_1, \ldots , v_n \in \{ 0,1\} ^{\ell }\), \(\gamma \text{-}\mathsf{Max\text{-}IP}_{n,\ell }\) asks to find \(i \ne j\) such that \(\langle v_i, v_j \rangle \) is a \(\gamma \)-approximation of the maximal inner product.
Given \(v_1, \ldots , v_n \in \{ 0,1\} ^{\ell }\), \(\gamma \text{-}\mathsf{Min\text{-}IP}_{n,\ell }\) asks to find \(i \ne j\) such that \(\langle v_i, v_j \rangle \) is a \(\gamma \)-approximation of the minimal inner product.
Given \(v_1, \ldots , v_n \in \{ 0,1\} ^{\ell }\), \(\gamma \text{-}\mathsf{MSD}_{n,\ell }\) asks to find \(i^{*}, j^{*}\) with \(i^{*}\ne j^{*}\) such that
i.e. a pair whose cosine similarity is within a factor \(\gamma \) of the maximum.
For \(v, w \in \mathbb {R}^{\ell }\) we write \(\langle v, w \rangle = \sum _{i=1}^{\ell } v[i]\, w[i]\) for their (standard) inner product, where \(v[i]\) denotes the \(i\)-th entry of \(v\). For binary vectors \(v,w \in \{ 0,1\} ^{\ell }\) this counts the coordinates on which both are \(1\). Recalled only so later nodes can depend on it.
For \(v \in \mathbb {R}^{a}\) and \(w \in \mathbb {R}^{b}\), the Kronecker product \(v \otimes w \in \mathbb {R}^{ab}\) is the vector whose entries are all the products of an entry of \(v\) and an entry of \(w\). We write \(v^{\otimes q}\) for the \(q\)-fold Kronecker power \(v \otimes \cdots \otimes v\). Standard.
\(k\mathsf{SAT}\) is the satisfiability problem for Boolean formulas in conjunctive normal form in which every clause has at most \(k\) literals; \(n\) denotes the number of variables. Recalled to state SETH.
For \(v \in \mathbb {R}^{\ell }\), \(\lVert v \rVert = \sqrt{\langle v, v \rangle }\) is the \(\ell _2\) norm. We also use the \(\ell _1\) norm \(\lVert v \rVert _1 = \sum _{i=1}^{\ell } |v[i]|\); for a binary vector \(\lVert v \rVert _1\) is its number of \(1\) entries. Standard.
Given nonzero \(v_1, \ldots , v_n \in \{ 0,1\} ^{\ell }\), \(\mathsf{LSD}_{n,\ell }\) asks to find \(i \ne j\) minimizing \(\dfrac {\langle v_i, v_j \rangle }{\lVert v_i \rVert \cdot \lVert v_j \rVert }\).
Given \(v_1, \ldots , v_n \in \{ 0,1\} ^{\ell }\) and \(t \in [0,1]\), \(\mathsf{LSD}_{n,\ell ,t}\) asks to determine whether there exist \(i \ne j\) with \(\dfrac {\langle v_i, v_j \rangle }{\lVert v_i \rVert \cdot \lVert v_j \rVert } \le t\).
Given \(v_1, \ldots , v_n \in \{ 0,1\} ^{\ell }\), \(\mathsf{Max\text{-}IP}_{n,\ell }\) asks to find a pair \(i \ne j\) such that \(\langle v_i, v_j \rangle \) is maximal.
Given \(v_1, \ldots , v_n \in \{ 0,1\} ^{\ell }\) and \(0 \le t \le \ell \), \(\mathsf{Max\text{-}IP}_{n,\ell ,t}\) asks to determine whether there exist \(i \ne j\) with \(\langle v_i, v_j \rangle \ge t\).
Given \(v_1, \ldots , v_n \in \{ 0,1\} ^{\ell }\), \(\mathsf{Min\text{-}IP}_{n,\ell }\) asks to find a pair \(i \ne j\) such that \(\langle v_i, v_j \rangle \) is minimum.
Given \(v_1, \ldots , v_n \in \{ 0,1\} ^{\ell }\) and \(0 \le t \le \ell \), \(\mathsf{Min\text{-}IP}_{n,\ell ,t}\) asks to determine whether there exists a pair \(i \ne j\) with \(\langle v_i, v_j \rangle \le t\).
A multi-layer perceptron (MLP) is represented by some continuous function \(\varphi : \mathbb {R}^{a} \to \mathbb {R}^{b}\) for positive integers \(a,b\) (modelling, following the universal approximation theorem, any function approximable by a neural network). It is applied to a matrix row-wise: for \(X \in \mathbb {R}^{n \times a}\), \(\varphi (X) = (\varphi (X_1), \ldots , \varphi (X_n)) \in \mathbb {R}^{n \times b}\).
Given \(n\) document embeddings \(v_1, \ldots , v_n \in \{ 0,1\} ^{\ell }\) (assumed nonzero), \(\mathsf{MSD}_{n,\ell }\) asks to find \(1 \le i,j \le n\), \(i \ne j\), maximizing \(\dfrac {\langle v_i, v_j \rangle }{\lVert v_i \rVert \cdot \lVert v_j \rVert }\). Note \(\mathsf{MSD}\) differs from \(\mathsf{Max\text{-}IP}\) because of the normalization by \(\lVert v_i \rVert \lVert v_j \rVert \).
Given \(v_1, \ldots , v_n \in \{ 0,1\} ^{\ell }\) and \(t \in [0,1]\), \(\mathsf{MSD}_{n,\ell ,t}\) asks to determine whether there exist \(i \ne j\) with \(\dfrac {\langle v_i, v_j \rangle }{\lVert v_i \rVert \cdot \lVert v_j \rVert } \ge t\).
Given binary vectors \(v_1, \ldots , v_n \in \{ 0,1\} ^{\ell }\), \(\mathsf{OV}_{n,\ell }\) asks to determine whether there exists a pair \(i \ne j\) with \(\langle v_i, v_j \rangle = 0\).
For a vector \(v \in \mathbb {R}^{n}\),
For a matrix \(A \in \mathbb {R}^{n \times n}\) the softmax operator is applied row-wise: \(\operatorname {softmax}(A)_{i,:} = \operatorname {softmax}(A_{i,:})\).
A transformer \(\mathsf{TF}\) solves a (decision) problem whose answer on instance \(E \in \mathbb {R}^{n \times d}\) is \(\mathsf{TF}(E)\). For a problem on vectors \(v_1, \ldots , v_n\) with answer a Boolean, \(\mathsf{TF}\) solves it if for every input \(E\) with \(E_{i,:} = v_i\) for all \(i\) one has \(\mathsf{TF}(E) = 1\) exactly when the answer is "yes" and \(\mathsf{TF}(E) = 0\) otherwise.
A transformer is a mapping \(\mathsf{TF}: \mathbb {R}^{n \times d} \to \mathbb {R}\) specified by an attention unit \(A_{Q,K,V}\) and two MLPs \(\varphi _1 : \mathbb {R}^{n \times d} \to \mathbb {R}^{n \times d_{\mathrm{in}}}\) and \(\varphi _2 : \mathbb {R}^{n \times d_{\mathrm{out}}} \to \mathbb {R}\). On an embedding matrix \(E \in \mathbb {R}^{n \times d}\) it outputs
This models a single-attention-unit transformer (first MLP, then the attention unit, then the second MLP).
An algorithm for a problem on inputs of size parameter \(n\) runs in truly subquadratic time if it runs in time \(O(n^{2-\varepsilon })\) for some fixed constant \(\varepsilon {\gt} 0\) (independent of the instance). Throughout, "cannot be solved in \(O(n^{2-\varepsilon })\) time" is understood with \(\varepsilon \) quantified as in each statement. We say a problem \(\mathcal P\) reduces to \(\mathcal Q\) if a truly subquadratic algorithm for \(\mathcal Q\) yields one for \(\mathcal P\); the two are subquadratic equivalent if each reduces to the other.
Bichromatic \(\frac{\log n}{\ell }\)-Additive-\(\mathsf{MSD}_{n,\ell }\) with \(\ell = O(\log n)\) and bichromatic \((\log n)\)-Additive-Max-IP\(_{n,\ell '}\) with \(\ell ' = O(\log n)\) are subquadratic equivalent.
Finite sums, finite products and compositions of continuous functions between finite-dimensional real vector spaces are continuous, as are constant maps, coordinate projections, and the maps \(x \mapsto \max (x,c)\), \(x \mapsto \min (x,c)\). Recalled so the piecewise-linear MLP constructions below can be shown continuous.
For \(v, w \in \mathbb {R}^{a}\) and \(x, y \in \mathbb {R}^{b}\) we have \(\langle v \otimes x, w \otimes y \rangle = \langle v, w \rangle \cdot \langle x, y \rangle \). In particular \(\langle v^{\otimes q}, w^{\otimes q} \rangle = \langle v, w \rangle ^{q}\) and \(\lVert v^{\otimes q} \rVert = \lVert v \rVert ^{q}\). Standard property of tensor products.
Suppose there is an algorithm \(\mathcal A\) for bichromatic \((\gamma \text{-})\mathsf{Max\text{-}IP}_{n,\ell }\) running in time \(O(n^{2-\varepsilon }\operatorname {poly}(\ell ))\) for some \(\varepsilon {\gt}0\). Then there is an algorithm for \((\gamma \text{-})\mathsf{Max\text{-}IP}_{n,\ell }\) running in time \(O(n^{2-\varepsilon '}\operatorname {poly}(\ell ))\) for some \(\varepsilon '{\gt}0\).
Suppose there is an algorithm \(\mathcal A\) for bichromatic \(\mathsf{Max\text{-}IP}_{n,\ell }\) running in time \(O(n^{2-\varepsilon }\operatorname {poly}(\ell ))\) for some \(\varepsilon {\gt}0\). Then there is an algorithm for \(\mathsf{OV}_{n,\ell }\) running in time \(O(n^{2-\varepsilon }\operatorname {poly}(\ell ))\).
Suppose there is an algorithm \(\mathcal A\) for bichromatic \(\mathsf{Min\text{-}IP}_{n,\ell }\) running in time \(O(n^{2-\varepsilon }\operatorname {poly}(\ell ))\) for some \(\varepsilon {\gt}0\). Then there is an algorithm for \(\mathsf{Min\text{-}IP}_{n,\ell }\) running in time \(O(n^{2-\varepsilon '}\operatorname {poly}(\ell ))\) for some \(\varepsilon '{\gt}0\). The same holds for \(\gamma \text{-}\mathsf{Min\text{-}IP}_{n,\ell }\) and the decision version \(\mathsf{Min\text{-}IP}_{n,\ell ,t}\).
For any \(\gamma \ge 1\), \(\mathsf{Min\text{-}IP}_{n,\ell }\) and \(\gamma \text{-}\mathsf{Min\text{-}IP}_{n,\ell }\) are both at least as hard as \(\mathsf{OV}_{n,\ell }\). Consequently, assuming OVC, for any \(\varepsilon {\gt}0\) there is \(c{\gt}0\) such that \(\mathsf{Min\text{-}IP}_{n,c\log n}\) cannot be solved in \(O(n^{2-\varepsilon })\) time.
There exists a continuous function \(f : \mathbb {R}^{\ell } \to \mathbb {R}^{\ell +1}\) such that
There exists a continuous function \(f : \mathbb {R}^{\ell } \to \mathbb {R}^{\ell +1}\) such that
For any \(a, b \in \mathbb {R}\) with \(b {\gt} a\), there exists a continuous function \(f : \mathbb {R}^{\ell } \to \mathbb {R}\) such that
Suppose there is an algorithm for \(\gamma \text{-}\mathsf{MSD}_{n,\ell }\) with \(\ell = (\log n)^{\frac{c\log n}{(\log \log n)^{2}}}\) for any constant \(c{\gt}0\) and \(\gamma \le (1 + \tfrac {1}{\log \log n})^{\frac{\log n}{(\log \log n)^{2}}}\) that runs in \(O(n^{2-\varepsilon })\) time for any \(\varepsilon {\gt}0\). Then there is an algorithm for \((1 + \tfrac {1}{\log \log n})\text{-}\mathsf{MSD}_{n, (\log n)^{k}}\) with running time \(O(n^{2-\varepsilon })\) for any constant \(k{\gt}0\).
Suppose there is an algorithm for \((1 + \tfrac {1}{\log \log n})\text{-}\mathsf{MSD}_{n, (\log n)^{k}}\) for any constant \(k{\gt}0\) with \(O(n^{2-\varepsilon })\) running time. Then there is an algorithm for bichromatic \(\frac{\log n}{\ell }\)-Additive-\(\mathsf{MSD}_{n, c\log n}\) for any \(c{\gt}0\) in \(O(n^{2-\varepsilon '})\) time for some \(\varepsilon '{\gt}0\).
There exists an algorithm for \(\mathsf{OV}_{n,\ell }\) running in time \(O(n^{2-\varepsilon }\cdot \operatorname {poly}(\ell ))\) for some \(\varepsilon {\gt}0\) if and only if there exists an algorithm for bichromatic \(\mathsf{OV}_{n,\ell }\) running in time \(O(n^{2-\varepsilon '}\cdot \operatorname {poly}(\ell ))\) for some \(\varepsilon '{\gt}0\).
Suppose there is an algorithm for \((\log n)\)-Additive-BMax-IP\(_{n, c\log n}\) for any constant \(c{\gt}0\) running in \(O(n^{2-\varepsilon })\) time for some \(\varepsilon {\gt}0\). Then there is an algorithm for \(\mathsf{OV}_{n, c'\log n}\) for any constant \(c'{\gt}0\) running in \(O(n^{2-\varepsilon '})\) time for some \(\varepsilon '{\gt}0\), refuting SETH and OVC.
Suppose there is an algorithm for \((1 + \tfrac {1}{\log \log n})\text{-}\mathsf{MSD}_{n, (\log n)^{k}}\) for any constant \(k{\gt}0\) running in \(O(n^{2-\varepsilon })\) time for any \(\varepsilon {\gt}0\). Then there is an algorithm for bichromatic \((\log n)\)-Additive-Max-IP\(_{n, c\log n}\) for any constant \(c{\gt}0\) running in \(O(n^{2-\varepsilon '})\) time for some \(\varepsilon '{\gt}0\).
An attention unit with input and output MLPs and parameters \(d = \ell \), \(d_{\mathrm{in}} = \ell + 1\), \(d_{\mathrm{out}} = 1\), \(m \ge \ell + 1\) can solve \(\mathsf{Max\text{-}IP}_{n,\ell ,t}\) and \(\mathsf{Min\text{-}IP}_{n,\ell ,t}\) for \(1 \le t \le \ell \).
An attention unit with input and output MLPs and parameters \(d = \ell \), \(d_{\mathrm{in}} = \ell + 1\), \(d_{\mathrm{out}} = 1\), \(m \ge \ell + 1\) can solve \(\mathsf{MSD}_{n,\ell ,t}\) and \(\mathsf{LSD}_{n,\ell ,t}\) for any \(t \in [0,1]\).
An attention unit with input and output MLPs and parameters \(d = \ell \), \(d_{\mathrm{in}} = \ell \), \(d_{\mathrm{out}} = 1\), \(m \ge \ell + 1\) can solve \(\mathsf{OV}_{n,\ell }\).
Assuming SETH or OVC, for every \(\varepsilon {\gt}0\) there exists a constant \(c{\gt}0\) such that \(\gamma \text{-}\mathsf{LSD}_{n,\ell }\) cannot be solved in \(O(n^{2-\varepsilon })\) time for any \(\gamma \ge 1\) when \(\ell = c\log n\).
Assuming SETH or OVC, for every \(\varepsilon {\gt}0\) there exists a constant \(c{\gt}0\) such that \(\gamma \text{-}\mathsf{MSD}_{n,\ell }\) cannot be solved in \(O(n^{2-\varepsilon })\) time when