3 Document similarity embeddings and problems
Fix a list of \(\ell \) key words. The bag-of-words embedding of a document \(D\) is the vector \(\operatorname {BOW}(D) \in \{ 0,1\} ^{\ell }\) whose \(i\)-th entry is \(1\) iff the \(i\)-th key word occurs in \(D\). (E.g. a document containing only "There are ten apples on the apple tree", with key words "apple", "tree", "computer", "ten", has embedding \((1,1,0,1)\).)
For nonzero document embeddings \(v, w \in \mathbb {R}^{d}\), their cosine similarity is \(\dfrac {\langle v, w \rangle }{\lVert v \rVert \cdot \lVert w \rVert } \in [0,1]\) when \(v,w \in \{ 0,1\} ^d\); the value \(1\) means complete similarity and \(0\) means no similarity. (Used instead of the raw inner product so that two vectors with similar directions but large magnitude are still considered close.)
Given \(n\) document embeddings \(v_1, \ldots , v_n \in \{ 0,1\} ^{\ell }\) (assumed nonzero), \(\mathsf{MSD}_{n,\ell }\) asks to find \(1 \le i,j \le n\), \(i \ne j\), maximizing \(\dfrac {\langle v_i, v_j \rangle }{\lVert v_i \rVert \cdot \lVert v_j \rVert }\). Note \(\mathsf{MSD}\) differs from \(\mathsf{Max\text{-}IP}\) because of the normalization by \(\lVert v_i \rVert \lVert v_j \rVert \).
Given \(v_1, \ldots , v_n \in \{ 0,1\} ^{\ell }\), \(\gamma \text{-}\mathsf{MSD}_{n,\ell }\) asks to find \(i^{*}, j^{*}\) with \(i^{*}\ne j^{*}\) such that
i.e. a pair whose cosine similarity is within a factor \(\gamma \) of the maximum.
Given \(v_1, \ldots , v_n \in \{ 0,1\} ^{\ell }\) and \(t \in [0,1]\), \(\mathsf{MSD}_{n,\ell ,t}\) asks to determine whether there exist \(i \ne j\) with \(\dfrac {\langle v_i, v_j \rangle }{\lVert v_i \rVert \cdot \lVert v_j \rVert } \ge t\).
Given nonzero \(v_1, \ldots , v_n \in \{ 0,1\} ^{\ell }\), \(\mathsf{LSD}_{n,\ell }\) asks to find \(i \ne j\) minimizing \(\dfrac {\langle v_i, v_j \rangle }{\lVert v_i \rVert \cdot \lVert v_j \rVert }\).
Given \(v_1, \ldots , v_n \in \{ 0,1\} ^{\ell }\), \(\gamma \text{-}\mathsf{LSD}_{n,\ell }\) asks to find \(i^{*} \ne j^{*}\) such that
Given \(v_1, \ldots , v_n \in \{ 0,1\} ^{\ell }\) and \(t \in [0,1]\), \(\mathsf{LSD}_{n,\ell ,t}\) asks to determine whether there exist \(i \ne j\) with \(\dfrac {\langle v_i, v_j \rangle }{\lVert v_i \rVert \cdot \lVert v_j \rVert } \le t\).