6 Hardness of document similarity
Assuming SETH or OVC, for every \(\varepsilon {\gt}0\) there exists a constant \(c{\gt}0\) such that \(\gamma \text{-}\mathsf{LSD}_{n,\ell }\) cannot be solved in \(O(n^{2-\varepsilon })\) time for any \(\gamma \ge 1\) when \(\ell = c\log n\).
Assume for contradiction that some algorithm \(\mathcal A\) solves \(\gamma \text{-}\mathsf{LSD}_{n, c\log n}\) in time \(O(n^{2-\varepsilon })\) for some \(\gamma \ge 1\), \(\varepsilon {\gt}0\) and every constant \(c{\gt}0\). We show that \(\mathsf{OV}_{n, c\log n}\) can then be solved in \(O(n^{2-\varepsilon })\) time for every \(c\), refuting OVC (and hence SETH, by Theorem 25).
Given \(v_1, \ldots , v_n \in \{ 0,1\} ^{\ell }\) with \(\ell = c\log n\): if any vector is the zero vector (checkable in \(O(n\ell )\) time) output "yes". Otherwise run \(\mathcal A\) to obtain \(i^{*}, j^{*}\) with
Since all \(\lVert v_i \rVert {\gt} 0\), the quantity \(\min _{i,j}\dfrac {\langle v_i, v_j \rangle }{\lVert v_i \rVert \cdot \lVert v_j \rVert }\) equals \(0\) iff there is an orthogonal pair (cosine similarity \(0 \Leftrightarrow \) inner product \(0\)). As \(\gamma \ge 1\), \(\dfrac {\langle v_{i^{*}}, v_{j^{*}} \rangle }{\lVert v_{i^{*}} \rVert \cdot \lVert v_{j^{*}} \rVert } = 0\) iff the true minimum is \(0\), i.e. iff \(\mathcal A\) returns an orthogonal pair. Thus checking whether \(\langle v_{i^{*}}, v_{j^{*}} \rangle = 0\) decides \(\mathsf{OV}_{n,\ell }\). The total time is \(O(n\ell + n^{2-\varepsilon }) = O(n^{2-\varepsilon })\), refuting SETH/OVC.
Assuming SETH or OVC, for every \(\varepsilon {\gt}0\) there exists a constant \(c{\gt}0\) such that \(\mathsf{LSD}_{n,\ell }\) cannot be solved in \(O(n^{2-\varepsilon })\) time when \(\ell \ge c\log n\). Moreover, the same lower bound holds for bichromatic \(\gamma \text{-}\mathsf{LSD}_{n,\ell }\) for all \(\gamma \ge 1\), and for \(\mathsf{LSD}_{n,\ell ,t}\) for some \(t \in [0,1]\).
Since \(\gamma \text{-}\mathsf{LSD}_{n,\ell }\) (with \(\gamma = 1\)) and bichromatic \(\gamma \text{-}\mathsf{LSD}_{n,\ell }\) are both easier than \(\mathsf{LSD}_{n,\ell }\), the lower bound of Theorem 39 transfers to them. For the decision version: there must exist \(t \in [0,1]\) such that \(\mathsf{LSD}_{n,\ell ,t}\) cannot be solved in \(O(n^{2-\varepsilon })\) time when \(\ell = c\log n\), because otherwise binary search over the \(O(\ell ^{3})\) possible threshold values would give a \(O(n^{2-\varepsilon })\) algorithm for \(\mathsf{LSD}_{n,\ell }\), contradicting the result for \(\mathsf{LSD}_{n,\ell }\). (Making \(\ell \) larger only makes the problem harder, giving the bound for all \(\ell \ge c\log n\).)
Assuming SETH or OVC, for every \(\varepsilon {\gt}0\) there exists a constant \(c{\gt}0\) such that \(\gamma \text{-}\mathsf{MSD}_{n,\ell }\) cannot be solved in \(O(n^{2-\varepsilon })\) time when
This follows from a combination of Lemma 45, Lemma 48, Lemma 47 and Theorem 49. A truly subquadratic algorithm for \(\gamma \text{-}\mathsf{MSD}_{n,\ell }\) with the stated \(\ell , \gamma \) would, by Lemma 45, give a truly subquadratic algorithm for \((1 + \tfrac {1}{\log \log n})\text{-}\mathsf{MSD}_{n, (\log n)^{k}}\) for any constant \(k{\gt}0\); by Lemma 48 (which itself combines Theorem 46 and Lemma 47) this gives a truly subquadratic algorithm for bichromatic \(\frac{\log n}{\ell }\)-Additive-\(\mathsf{MSD}_{n,c\log n}\); and finally Theorem 49 turns a truly subquadratic algorithm for the corresponding bichromatic additive Max-IP into a truly subquadratic algorithm for \(\mathsf{OV}_{n, c'\log n}\), refuting SETH and OVC (Conjecture 24, Theorem 25). The hardness also transfers to harder problems including \(\mathsf{MSD}_{n,\ell }\) and bichromatic \(\gamma \text{-}\mathsf{MSD}_{n,\ell }\).
Assuming SETH or OVC, for every \(\varepsilon {\gt}0\) there exists a constant \(c{\gt}0\) such that \(\mathsf{MSD}_{n,\ell }\) cannot be solved in \(O(n^{2-\varepsilon })\) time when \(\ell \ge (\log n)^{\frac{c\log n}{(\log \log n)^{2}}}\). Moreover the same lower bound holds for bichromatic \(\gamma \text{-}\mathsf{MSD}_{n,\ell }\) for all \(1 \le \gamma \le (1 + \tfrac {1}{\log \log n})^{\frac{\log n}{(\log \log n)^{2}}}\) and for \(\mathsf{MSD}_{n,\ell ,t}\) for some \(t \in [0,1]\).
As in Corollary 40: \(\gamma \text{-}\mathsf{MSD}\) and bichromatic \(\gamma \text{-}\mathsf{MSD}\) are easier than \(\mathsf{MSD}\), so the bound of Theorem 41 transfers; and for the decision version \(\mathsf{MSD}_{n,\ell ,t}\), the number of distinct relevant thresholds \(t\) is \(O(\ell ^{3})\), so a truly subquadratic algorithm for all \(t\) would solve \(\mathsf{MSD}_{n,\ell }\) by binary search, contradicting Theorem 41; hence some \(t\) is hard.