20 Cutting Data Down to Size: Hashing
Given a domain
\(\)D\( and a range \)Σ\( (typically with \)|Σ| < |D|\(), a hash function is a map \[ h : D \to \Sigma . \] \)
Given
\(\)D\(, \)Σ\(, and an integer \)m ≥1\(, a hash family \)H\( is a set \){h_1,…,h_m}\( such that for each \)i [m]\(, \[ h_i : D \to \Sigma . \] \)
A hash family
\(\)H = {h_1,…,h_m}\( defined over the domain \)D\( and range \)Σ\( is said to be \)ε\(-almost universal (for some \)0 < ε≤1\() if for every \)x ≠y D\(, \[ \Pr _i\left[h_i(x) = h_i(y)\right] \leq \varepsilon , \] where \)i\( is chosen uniformly at random from \)[m]\(. \)
Let
\(\)H = {h_1,…,h_m}\( with domain \)D\( and range \)Σ\( be an \)ε\(-almost universal hash family. Then for any (distinct) \)x,a_1,a_2,…,a_N D\(, \[ \mathbb {E}_i\left[\left|\{ a_j \mid h_i(x) = h_i(a_j)\} \right|\right] \leq \varepsilon \cdot N, \] where the expectation is taken over a uniformly random choice of \)i [m]\(. \)
Fix
\(\)a_j\( for \)j [N]\(. By definition of an \)ε\(-almost universal hash family, \[ \Pr _i[h_i(x) = h_i(a_j)] \leq \varepsilon . \] We wish to bound \)E_i∑_j=1^N 1_h_i(a_j) = h_i(x)\(. By linearity of expectation, this equals \[ \sum _{j=1}^N \Pr _i[h_i(a_j) = h_i(x)] \leq \sum _{j=1}^N \varepsilon = \varepsilon N, \] which completes the proof. \)
Given an
\(\)O\(-almost universal hash family with domain \)D\( and range \)Σ\( such that \)|Σ| = O(N)\(, there exists a randomized data structure that, given \)N\( elements \)a_1,…,a_N D\(, supports searching, insertion, and deletion in expected \)O(1)\( time while using \)O(N)\( space in the worst case. \)
Let
\(\)H = {h_1,…,h_m}\( be an \)ε\(-almost universal hash family, with \)ε= O(1/N)\(, from domain \)D\( to range \)Σ\( with \)|Σ| = O(N)\(. Build an array of linked lists with one entry in the array for each value \)v Σ\(. Pick a random hash function \)h_i H\(. For each \)a_j\(, \)j [N]\(, add it to the linked list corresponding to \)h_i(a_j)\(. To search for \)x D\(, scan the linked list corresponding to \)h_i(x)\( and check whether \)x\( occurs in it. Insertion of \)x\( appends \)x\( to the list for \)h_i(x)\(; deletion of \)x\( first performs a search and then removes \)x\( from the corresponding list, if present. These algorithms are clearly correct. \textbf{Space:} the linked lists together use \)O(N)\( space, and the array has \)O(|Σ|) = O(N)\( entries, giving total space \)O(N + |Σ|) = O(N)\(. \textbf{Time:} assuming a hash value can be computed in \)O(1)\( time, insertion takes \)O(1)\( time, so pre-processing \)N\( elements takes \)O(N + |Σ|) = O(N)\( time. Deletion, after the search step, takes \)O(1)\( time if the lists are doubly linked. Finally, the time to search for \)x\( is dominated by the length of the linked list for \)h_i(x)\(, i.e. by the number of \)a_j\( with \)h_i(x) = h_i(a_j)\(. By Lemma~ \ref{lem:c20-collision-bound} (Claim 20.2.3), the expected number of such \)a_j\( is at most \)εN = O(1)\(. Hence searching takes expected \)O(1)\( time. \)
Given a hash family
\(\)H = {h_1,…,h_n}\(, where for each \)i [n]\(, \)h_i : D Σ\(, the associated code is \[ C_{\mathcal{H}} : D \to \Sigma ^n, \qquad C_{\mathcal{H}}(x) = \left(h_1(x), h_2(x), \dots , h_n(x)\right). \] Conversely, given an \)(n,k)_Σ\( code \)C\(, the associated hash family is \)H_C = {h_1,…,h_n}\(, where for every \)i [n]\(, \[ h_i : \Sigma ^k \to \Sigma , \qquad h_i(x) = C(x)_i \quad \text{for every } x \in \Sigma ^k. \] \)
Let
\(\)H = {h_1,…,h_n}\( be an \)ε\(-almost universal hash function. Then the code \)C_H\( has distance at least \)(1-ε)n\(. On the other hand, if \)C\( is an \)(n,k,δn)\(-code, then \)H_C\( is a \)(1-δ)\(-almost universal hash function. \)
First claim. Let
\(\)H = {h_1,…,h_n}\( be \)ε\(-almost universal. Fix arbitrary \)x ≠y D\(. By definition of \)C_H\(, \[ \{ i \mid h_i(x) = h_i(y)\} = \{ i \mid C_{\mathcal{H}}(x)_i = C_{\mathcal{H}}(y)_i\} . \] Hence \[ \Pr _i[h_i(x) = h_i(y)] = \frac{|\{ i \mid h_i(x) = h_i(y)\} |}{n} = \frac{n - \Delta (C_{\mathcal{H}}(x), C_{\mathcal{H}}(y))}{n} = 1 - \frac{\Delta (C_{\mathcal{H}}(x),C_{\mathcal{H}}(y))}{n}, \] where the second equality is the definition of Hamming distance. By \)ε\(-almost universality, this probability is at most \)ε\(, so \[ \Delta (C_{\mathcal{H}}(x), C_{\mathcal{H}}(y)) \geq n(1-\varepsilon ). \] Since \)x ≠y\( were arbitrary, \)C_H\( has minimum distance at least \)n(1-ε)\(. \textbf{Second claim.} Let \)C\( be an \)(n,k,δn)\(-code and let \)H_C = {h_1,…,h_n}\( be its associated hash family. Fix arbitrary \)x ≠y Σ^k\(. By definition of \)H_C\(, \[ \{ i \mid h_i(x) = h_i(y)\} = \{ i \mid C(x)_i = C(y)_i\} , \] so, exactly as above, \[ \Pr _i[h_i(x) = h_i(y)] = 1 - \frac{\Delta (C(x),C(y))}{n}. \] Since \)C\( has minimum distance at least \)δn\(, \)Δ(C(x),C(y)) ≥δn\(, and hence \[ \Pr _i[h_i(x) = h_i(y)] \leq 1 - \delta . \] As \)x ≠y\( were arbitrary, \)H_C\( is a \)(1-δ)\(-almost universal hash family. \)
Let
\(\)C\( be a \)[q,k,q-k+1]_q\( Reed--Solomon code (domain \)Σ^k = F_q^k\(, i.e. polynomials over \)F_q\( of degree less than \)k\(, evaluated at the \)q\( points of \)F_q\(). The associated hash family \)H_C\( (via Definition~ \ref{def:c20-hash-code-correspondence}) is called the polynomial hash: for \)x F_q^k\( viewed as a polynomial \)f_x\( of degree \)< k\(, and \)i F_q\(, \[ h_i(x) = f_x(i). \] \)
The polynomial hash family
\(\)H_C\( of Construction~ \ref{def:c20-polynomial-hash}, built from the \)[q,k,q-k+1]_q\( Reed--Solomon code \)C\(, is a \)\(-almost universal hash family. \)
The code
\(\)C\( has parameters \)(n,k,δn) = (q,k,q-k+1)\(, so \)δ= \(. By Proposition~ \ref{prop:c20-hash-code-equiv} (second claim, applied to \)C\(), \)H_C\( is \)(1-δ)\(-almost universal, and \[ 1 - \delta = 1 - \frac{q-k+1}{q} = \frac{k-1}{q}. \] \)
For any integer
\(\)N ≥1\(, choosing \)q\( to be the smallest power of \)2\( larger than \)N\( and \)k = O(1)\( in Construction~ \ref{def:c20-polynomial-hash} yields an \)O(1/N)\(-almost universal hash family with domain \)D = F_q^k\( (so \)|D| = N^O(1)\() and range \)Σ= F_q\( of size \)|Σ| = O(N)\(. \)
By Corollary 439, the polynomial hash family is
\(\)\(-almost universal. With \)k = O(1)\( and \)q = Θ(N)\( (the smallest power of \)2\( exceeding \)N\(), we get \) = O(1/N)\(. Since \)q = Θ(N)\(, the range size is \)|Σ| = q = O(N)\(, as required. \)
We assume the server follows the following general protocol. First, upon receiving
\(\)x\(, the server performs some computation that terminates on \)x\( to produce \)y\(, and stores \)y\( (e.g., \)y = x\(, or \)y\( is a compressed version of \)x\(). Then, upon receiving a challenge \)(i,j)\( for \)x\(, it uses an algorithm \)A\( that always terminates, and returns the answer \)a ←A(y,j)\(. (We take \)A\( to be independent of \)y\( and \)j\( in the sense that the algorithm's code does not depend on which \)y,j\( it is invoked with, though its output naturally depends on its inputs.) The server may use an arbitrary but finite amount of time to compute its answer. \)
Input: Data
\(\)x D\(, hash family \)H = {h_1,…,h_m}\( over domain \)D\(. \begin{enumerate} \item Client computes an index \end{enumerate}\)i\( for \)x\(. \item Client picks a random \)j [m]\(. \item Client computes \)z ←h_j(x)\( and stores \)(i,j,z)\(. \item Client sends \)x\( to the server. \)
Input: Index
\(\)i\( of data \)x D\( (with associated stored triple \)(i,j,z)\( from Algorithm~ \ref{alg:c20-preprocessing}).\\ \textbf{Output:} \)1\( if the server has \)x\(, and \)0\( otherwise. \begin{enumerate} \item Client sends a challenge \end{enumerate}\)(i,j)\( to the server. \item Client receives an answer \)a\(. \item If \)a = z\(, return \)1\(; else return \)0\(. \)
Input: Algorithm
\(\)A\(, stored server data \)y\(.\\ \textbf{Output:} \)x’\(. \begin{enumerate} \item \end{enumerate}\)z ←A(y,j)_j [m]\(. \item Run the MLD algorithm for \)C_H\( on \)z\(, and let \)C_H(x’)\( be its output. \item Return \)x’\(. \)
Assume that the hash family
\(\)H\( is an \)ε\(-almost universal hash family. Then if the server passes the protocol in Algorithm~ \ref{alg:c20-verification} with probability \)> + \(, then the server has enough information (i.e., its stored \)y\() to recreate \)x\(. \)
We present an algorithm that computes
\(\)x\( from \)y\( (it need not be efficient, only terminate): Algorithm~ \ref{alg:c20-decompression}. We show its output \)x’\( equals \)x\(. \textbf{Claim:} \)Δ(z, C_H(x)) < (1-ε)\(, where \)z\( is as computed in Step 1 of Algorithm~ \ref{alg:c20-decompression}. To see this, suppose the server passes the protocol in Algorithm~ \ref{alg:c20-verification} (i.e., the client outputs \)1\(). Then it must be that \)z_j = A(y,j) = h_j(x)\( for the challenged \)j\(, and by construction of the decompression algorithm, this reasoning is applied uniformly over all \)j [m]\( that could have been challenged: since the server's response strategy \)A\( does not depend on \)j\( except through the query itself, and the client accepts with probability \)> + \( over the random choice of \)j [m]\(, it follows that for more than \)m + \( positions \)j [m]\( we have \)z_j = C_H(x)_j\( (using \)h_j(x) = C_H(x)_j\( by definition of \)C_H\(). Hence \[ \Delta (z, C_{\mathcal{H}}(x)) {\lt} m - m\left(\frac12 + \frac{\varepsilon }{2}\right) = \frac{m}{2}(1-\varepsilon ), \] proving the claim. By Proposition~ \ref{prop:c20-hash-code-equiv}, \)C_H\( has minimum distance at least \)m(1-ε)\(. Since \)Δ(z,C_H(x)) < (1-ε)\( is strictly less than half this minimum distance, the correctness of the MLD algorithm (Algorithm 1.4.1) within half the minimum distance (Proposition 1.4.2) implies that running MLD on \)z\( for \)C_H\( returns \)C_H(x)\(. Thus Step 2 of Algorithm~ \ref{alg:c20-decompression} outputs \)C_H(x)\(, and so \)x’ = x\(, as desired. \)
Given a string
\(\)x\(, its Kolmogorov complexity, denoted \)K(x)\(, is the minimum of \)|y| + |D|\( over all pairs \)(y,D)\( such that \)D\( is a decompression algorithm which, on input \)y\(, outputs \)x\( (where \)|x|\( and \)|D|\( denote the lengths, in bits, of \)x\( and of a description of \)D\(, respectively). \)
Input: Algorithm
\(\)A\(, stored server data \)y\(, advice \)a [L]\(.\\ \textbf{Output:} \)x\(. \begin{enumerate} \item \end{enumerate}\)z ←A(y,j)_j [m]\(. \item \)L ←∅\(. \item For each \)x’ D\(: if \)Δ(C_H(x’), z) ≤1 - εm\(, add \)x’\( to \)L\(. \item Return the \)a\(-th element of \)L\(. \)
Assume that the hash family
\(\)H\( is an \)ε\(-almost universal hash family. Then if the server passes the protocol in Algorithm~ \ref{alg:c20-verification} with probability \)> ε\(, then the amount of information the server has stored for \)x\( is at least \)K(x) - O(|x|)\(. \)
Suppose, towards a contradiction, that
\(\)|y| < K(x) - O(|x|)\(; we derive a description of \)x\( that is shorter than \)K(x)\(, a contradiction to Definition~ \ref{def:c20-kolmogorov-complexity}. By Proposition~ \ref{prop:c20-hash-code-equiv}, \)C_H\( is a \)q\(-ary code (where \)q = |Σ|\() of minimum distance at least \)m(1-ε)\(. By the Johnson bound (canonical thm:c07-johnson-bound), \)C_H\( is \)1-ε, L\(-list decodable for \[ L \leq q m^2. \tag {20.1} \] Consider Algorithm~ \ref{alg:c20-decompression-list}, which takes \)y\( and an advice string \)a [L]\( and (we claim) can be made to output \)x\( for a suitable choice of \)a\(. \textbf{Claim:} \)Δ(z, C_H(x)) < m1 - ε\(, where \)z\( is as computed in Step 1 of Algorithm~ \ref{alg:c20-decompression-list}. Indeed, if the server passes the protocol in Algorithm~ \ref{alg:c20-verification}, then (as in the proof of Theorem~ \ref{thm:c20-data-possession}) \)z_j = A(y,j) = h_j(x) = C_H(x)_j\( for the challenged position, and since the client accepts with probability \)> ε\( over the random choice of \)j [m]\(, this equality holds for more than \)mε\( positions \)j [m]\(. Hence \[ \Delta (z,C_{\mathcal{H}}(x)) {\lt} m - m\sqrt{\varepsilon } = m\left(1-\sqrt{\varepsilon }\right), \] proving the claim. The claim shows \)x\( satisfies the condition in Step 3 of Algorithm~ \ref{alg:c20-decompression-list}, so \)x L\( at the end of the loop. Since \)|L| ≤L\( by the list-decodability bound, we may take \)a\( to be the index of \)x\( within \)L\(, and then Algorithm~ \ref{alg:c20-decompression-list} on input \)(A, y, a)\( outputs \)x\(. Thus \)(y,a)\(, together with (a fixed, \)O(1)\(-bit description of) Algorithm~ \ref{alg:c20-decompression-list} as decompression procedure, gives a complete description of \)x\(, of total length \)|y| + |a| + O(1) = |y| + L + O(1)\(. By Definition~ \ref{def:c20-kolmogorov-complexity}, this length is at least \)K(x)\(, so \[ |y| \geq K(x) - \log L - O(1) \geq K(x) - O(\log |x|), \] where the last inequality uses (20.1) together with \)q,m\( being \)|x|^O(1)\( (so \)L = O(|x|)\(). This contradicts our assumption \)|y| < K(x) - O(|x|)\(, completing the proof. \)