20 Cutting Data Down to Size: Hashing

Definition 431 Hash Function, Def 20.1.1

Given a domain

\(\)D\( and a range \)Σ\( (typically with \)|Σ| < |D|\(), a hash function is a map \[ h : D \to \Sigma . \] \)

Definition

Definition 432 Hash Family, Def 20.2.1

Given

\(\)D\(, \)Σ\(, and an integer \)m ≥1\(, a hash family \)H\( is a set \){h_1,…,h_m}\( such that for each \)i [m]\(, \[ h_i : D \to \Sigma . \] \)

Definition

Definition 433 Almost Universal Hash Family, Def 20.2.2

A hash family

\(\)H = {h_1,…,h_m}\( defined over the domain \)D\( and range \)Σ\( is said to be \)ε\(-almost universal (for some \)0 < ε≤1\() if for every \)x ≠y D\(, \[ \Pr _i\left[h_i(x) = h_i(y)\right] \leq \varepsilon , \] where \)i\( is chosen uniformly at random from \)[m]\(. \)

Definition

Lemma 434 Claim 20.2.3

Let

\(\)H = {h_1,…,h_m}\( with domain \)D\( and range \)Σ\( be an \)ε\(-almost universal hash family. Then for any (distinct) \)x,a_1,a_2,…,a_N D\(, \[ \mathbb {E}_i\left[\left|\{ a_j \mid h_i(x) = h_i(a_j)\} \right|\right] \leq \varepsilon \cdot N, \] where the expectation is taken over a uniformly random choice of \)i [m]\(. \)

Lemma

Proof ▶

Fix

\(\)a_j\( for \)j [N]\(. By definition of an \)ε\(-almost universal hash family, \[ \Pr _i[h_i(x) = h_i(a_j)] \leq \varepsilon . \] We wish to bound \)E_i∑_j=1^N 1_h_i(a_j) = h_i(x)\(. By linearity of expectation, this equals \[ \sum _{j=1}^N \Pr _i[h_i(a_j) = h_i(x)] \leq \sum _{j=1}^N \varepsilon = \varepsilon N, \] which completes the proof. \)

Proof ▶

Proposition 435 Prop 20.2.4

Given an

\(\)O\(-almost universal hash family with domain \)D\( and range \)Σ\( such that \)|Σ| = O(N)\(, there exists a randomized data structure that, given \)N\( elements \)a_1,…,a_N D\(, supports searching, insertion, and deletion in expected \)O(1)\( time while using \)O(N)\( space in the worst case. \)

Proposition

Proof ▶

Let

\(\)H = {h_1,…,h_m}\( be an \)ε\(-almost universal hash family, with \)ε= O(1/N)\(, from domain \)D\( to range \)Σ\( with \)|Σ| = O(N)\(. Build an array of linked lists with one entry in the array for each value \)v Σ\(. Pick a random hash function \)h_i H\(. For each \)a_j\(, \)j [N]\(, add it to the linked list corresponding to \)h_i(a_j)\(. To search for \)x D\(, scan the linked list corresponding to \)h_i(x)\( and check whether \)x\( occurs in it. Insertion of \)x\( appends \)x\( to the list for \)h_i(x)\(; deletion of \)x\( first performs a search and then removes \)x\( from the corresponding list, if present. These algorithms are clearly correct. \textbf{Space:} the linked lists together use \)O(N)\( space, and the array has \)O(|Σ|) = O(N)\( entries, giving total space \)O(N + |Σ|) = O(N)\(. \textbf{Time:} assuming a hash value can be computed in \)O(1)\( time, insertion takes \)O(1)\( time, so pre-processing \)N\( elements takes \)O(N + |Σ|) = O(N)\( time. Deletion, after the search step, takes \)O(1)\( time if the lists are doubly linked. Finally, the time to search for \)x\( is dominated by the length of the linked list for \)h_i(x)\(, i.e. by the number of \)a_j\( with \)h_i(x) = h_i(a_j)\(. By Lemma~ \ref{lem:c20-collision-bound} (Claim 20.2.3), the expected number of such \)a_j\( is at most \)εN = O(1)\(. Hence searching takes expected \)O(1)\( time. \)

Proof ▶

Definition 436 Def 20.3.1

Given a hash family

\(\)H = {h_1,…,h_n}\(, where for each \)i [n]\(, \)h_i : D Σ\(, the associated code is \[ C_{\mathcal{H}} : D \to \Sigma ^n, \qquad C_{\mathcal{H}}(x) = \left(h_1(x), h_2(x), \dots , h_n(x)\right). \] Conversely, given an \)(n,k)_Σ\( code \)C\(, the associated hash family is \)H_C = {h_1,…,h_n}\(, where for every \)i [n]\(, \[ h_i : \Sigma ^k \to \Sigma , \qquad h_i(x) = C(x)_i \quad \text{for every } x \in \Sigma ^k. \] \)

Definition

Proposition 437 Prop 20.3.2

Let

\(\)H = {h_1,…,h_n}\( be an \)ε\(-almost universal hash function. Then the code \)C_H\( has distance at least \)(1-ε)n\(. On the other hand, if \)C\( is an \)(n,k,δn)\(-code, then \)H_C\( is a \)(1-δ)\(-almost universal hash function. \)

Proposition

Proof ▶

First claim. Let

\(\)H = {h_1,…,h_n}\( be \)ε\(-almost universal. Fix arbitrary \)x ≠y D\(. By definition of \)C_H\(, \[ \{ i \mid h_i(x) = h_i(y)\} = \{ i \mid C_{\mathcal{H}}(x)_i = C_{\mathcal{H}}(y)_i\} . \] Hence \[ \Pr _i[h_i(x) = h_i(y)] = \frac{|\{ i \mid h_i(x) = h_i(y)\} |}{n} = \frac{n - \Delta (C_{\mathcal{H}}(x), C_{\mathcal{H}}(y))}{n} = 1 - \frac{\Delta (C_{\mathcal{H}}(x),C_{\mathcal{H}}(y))}{n}, \] where the second equality is the definition of Hamming distance. By \)ε\(-almost universality, this probability is at most \)ε\(, so \[ \Delta (C_{\mathcal{H}}(x), C_{\mathcal{H}}(y)) \geq n(1-\varepsilon ). \] Since \)x ≠y\( were arbitrary, \)C_H\( has minimum distance at least \)n(1-ε)\(. \textbf{Second claim.} Let \)C\( be an \)(n,k,δn)\(-code and let \)H_C = {h_1,…,h_n}\( be its associated hash family. Fix arbitrary \)x ≠y Σ^k\(. By definition of \)H_C\(, \[ \{ i \mid h_i(x) = h_i(y)\} = \{ i \mid C(x)_i = C(y)_i\} , \] so, exactly as above, \[ \Pr _i[h_i(x) = h_i(y)] = 1 - \frac{\Delta (C(x),C(y))}{n}. \] Since \)C\( has minimum distance at least \)δn\(, \)Δ(C(x),C(y)) ≥δn\(, and hence \[ \Pr _i[h_i(x) = h_i(y)] \leq 1 - \delta . \] As \)x ≠y\( were arbitrary, \)H_C\( is a \)(1-δ)\(-almost universal hash family. \)

Proof ▶

Construction 438 The Polynomial Hash Function

Let

\(\)C\( be a \)[q,k,q-k+1]_q\( Reed--Solomon code (domain \)Σ^k = F_q^k\(, i.e. polynomials over \)F_q\( of degree less than \)k\(, evaluated at the \)q\( points of \)F_q\(). The associated hash family \)H_C\( (via Definition~ \ref{def:c20-hash-code-correspondence}) is called the polynomial hash: for \)x F_q^k\( viewed as a polynomial \)f_x\( of degree \)< k\(, and \)i F_q\(, \[ h_i(x) = f_x(i). \] \)

Construction

Corollary 439 Almost-universality of the polynomial hash

The polynomial hash family

\(\)H_C\( of Construction~ \ref{def:c20-polynomial-hash}, built from the \)[q,k,q-k+1]_q\( Reed--Solomon code \)C\(, is a \)\(-almost universal hash family. \)

Corollary

Proof ▶

The code

\(\)C\( has parameters \)(n,k,δn) = (q,k,q-k+1)\(, so \)δ= \(. By Proposition~ \ref{prop:c20-hash-code-equiv} (second claim, applied to \)C\(), \)H_C\( is \)(1-δ)\(-almost universal, and \[ 1 - \delta = 1 - \frac{q-k+1}{q} = \frac{k-1}{q}. \] \)

Proof ▶

Corollary 440 Resolution of Question 20.2.1

For any integer

\(\)N ≥1\(, choosing \)q\( to be the smallest power of \)2\( larger than \)N\( and \)k = O(1)\( in Construction~ \ref{def:c20-polynomial-hash} yields an \)O(1/N)\(-almost universal hash family with domain \)D = F_q^k\( (so \)|D| = N^O(1)\() and range \)Σ= F_q\( of size \)|Σ| = O(N)\(. \)

Corollary

Proof ▶

By Corollary 439, the polynomial hash family is

\(\)\(-almost universal. With \)k = O(1)\( and \)q = Θ(N)\( (the smallest power of \)2\( exceeding \)N\(), we get \) = O(1/N)\(. Since \)q = Θ(N)\(, the range size is \)|Σ| = q = O(N)\(, as required. \)

Proof ▶

Assumption 441 Server protocol model

We assume the server follows the following general protocol. First, upon receiving

\(\)x\(, the server performs some computation that terminates on \)x\( to produce \)y\(, and stores \)y\( (e.g., \)y = x\(, or \)y\( is a compressed version of \)x\(). Then, upon receiving a challenge \)(i,j)\( for \)x\(, it uses an algorithm \)A\( that always terminates, and returns the answer \)a ←A(y,j)\(. (We take \)A\( to be independent of \)y\( and \)j\( in the sense that the algorithm's code does not depend on which \)y,j\( it is invoked with, though its output naturally depends on its inputs.) The server may use an arbitrary but finite amount of time to compute its answer. \)

Assumption

Algorithm 442 Pre-Processing for Data Possession Verification, Alg 20.4.1

Input: Data

\(\)x D\(, hash family \)H = {h_1,…,h_m}\( over domain \)D\(. \begin{enumerate} \item Client computes an index \end{enumerate}\)i\( for \)x\(. \item Client picks a random \)j [m]\(. \item Client computes \)z ←h_j(x)\( and stores \)(i,j,z)\(. \item Client sends \)x\( to the server. \)

Algorithm

Algorithm 443 Verification for Data Possession Verification, Alg 20.4.2

Input: Index

\(\)i\( of data \)x D\( (with associated stored triple \)(i,j,z)\( from Algorithm~ \ref{alg:c20-preprocessing}).\\ \textbf{Output:} \)1\( if the server has \)x\(, and \)0\( otherwise. \begin{enumerate} \item Client sends a challenge \end{enumerate}\)(i,j)\( to the server. \item Client receives an answer \)a\(. \item If \)a = z\(, return \)1\(; else return \)0\(. \)

Algorithm

Algorithm 444 Decompression Algorithm, Alg 20.4.3

Input: Algorithm

\(\)A\(, stored server data \)y\(.\\ \textbf{Output:} \)x’\(. \begin{enumerate} \item \end{enumerate}\)z ←A(y,j)_j [m]\(. \item Run the MLD algorithm for \)C_H\( on \)z\(, and let \)C_H(x’)\( be its output. \item Return \)x’\(. \)

Algorithm

Theorem 445 Thm 20.4.1

Assume that the hash family

\(\)H\( is an \)ε\(-almost universal hash family. Then if the server passes the protocol in Algorithm~ \ref{alg:c20-verification} with probability \)> + \(, then the server has enough information (i.e., its stored \)y\() to recreate \)x\(. \)

Theorem

Proof ▶

We present an algorithm that computes

\(\)x\( from \)y\( (it need not be efficient, only terminate): Algorithm~ \ref{alg:c20-decompression}. We show its output \)x’\( equals \)x\(. \textbf{Claim:} \)Δ(z, C_H(x)) < (1-ε)\(, where \)z\( is as computed in Step 1 of Algorithm~ \ref{alg:c20-decompression}. To see this, suppose the server passes the protocol in Algorithm~ \ref{alg:c20-verification} (i.e., the client outputs \)1\(). Then it must be that \)z_j = A(y,j) = h_j(x)\( for the challenged \)j\(, and by construction of the decompression algorithm, this reasoning is applied uniformly over all \)j [m]\( that could have been challenged: since the server's response strategy \)A\( does not depend on \)j\( except through the query itself, and the client accepts with probability \)> + \( over the random choice of \)j [m]\(, it follows that for more than \)m + \( positions \)j [m]\( we have \)z_j = C_H(x)_j\( (using \)h_j(x) = C_H(x)_j\( by definition of \)C_H\(). Hence \[ \Delta (z, C_{\mathcal{H}}(x)) {\lt} m - m\left(\frac12 + \frac{\varepsilon }{2}\right) = \frac{m}{2}(1-\varepsilon ), \] proving the claim. By Proposition~ \ref{prop:c20-hash-code-equiv}, \)C_H\( has minimum distance at least \)m(1-ε)\(. Since \)Δ(z,C_H(x)) < (1-ε)\( is strictly less than half this minimum distance, the correctness of the MLD algorithm (Algorithm 1.4.1) within half the minimum distance (Proposition 1.4.2) implies that running MLD on \)z\( for \)C_H\( returns \)C_H(x)\(. Thus Step 2 of Algorithm~ \ref{alg:c20-decompression} outputs \)C_H(x)\(, and so \)x’ = x\(, as desired. \)

Proof ▶

Definition 446 Kolmogorov Complexity, Def 20.4.2

Given a string

\(\)x\(, its Kolmogorov complexity, denoted \)K(x)\(, is the minimum of \)|y| + |D|\( over all pairs \)(y,D)\( such that \)D\( is a decompression algorithm which, on input \)y\(, outputs \)x\( (where \)|x|\( and \)|D|\( denote the lengths, in bits, of \)x\( and of a description of \)D\(, respectively). \)

Definition

Algorithm 447 Decompression Algorithm Using List Decoding, Alg 20.4.4

Input: Algorithm

\(\)A\(, stored server data \)y\(, advice \)a [L]\(.\\ \textbf{Output:} \)x\(. \begin{enumerate} \item \end{enumerate}\)z ←A(y,j)_j [m]\(. \item \)L ←∅\(. \item For each \)x’ D\(: if \)Δ(C_H(x’), z) ≤1 - εm\(, add \)x’\( to \)L\(. \item Return the \)a\(-th element of \)L\(. \)

Algorithm

Theorem 448 Thm 20.4.3

Assume that the hash family

\(\)H\( is an \)ε\(-almost universal hash family. Then if the server passes the protocol in Algorithm~ \ref{alg:c20-verification} with probability \)> ε\(, then the amount of information the server has stored for \)x\( is at least \)K(x) - O(|x|)\(. \)

Theorem

Proof ▶

Suppose, towards a contradiction, that

\(\)|y| < K(x) - O(|x|)\(; we derive a description of \)x\( that is shorter than \)K(x)\(, a contradiction to Definition~ \ref{def:c20-kolmogorov-complexity}. By Proposition~ \ref{prop:c20-hash-code-equiv}, \)C_H\( is a \)q\(-ary code (where \)q = |Σ|\() of minimum distance at least \)m(1-ε)\(. By the Johnson bound (canonical thm:c07-johnson-bound), \)C_H\( is \)1-ε, L\(-list decodable for \[ L \leq q m^2. \tag {20.1} \] Consider Algorithm~ \ref{alg:c20-decompression-list}, which takes \)y\( and an advice string \)a [L]\( and (we claim) can be made to output \)x\( for a suitable choice of \)a\(. \textbf{Claim:} \)Δ(z, C_H(x)) < m1 - ε\(, where \)z\( is as computed in Step 1 of Algorithm~ \ref{alg:c20-decompression-list}. Indeed, if the server passes the protocol in Algorithm~ \ref{alg:c20-verification}, then (as in the proof of Theorem~ \ref{thm:c20-data-possession}) \)z_j = A(y,j) = h_j(x) = C_H(x)_j\( for the challenged position, and since the client accepts with probability \)> ε\( over the random choice of \)j [m]\(, this equality holds for more than \)mε\( positions \)j [m]\(. Hence \[ \Delta (z,C_{\mathcal{H}}(x)) {\lt} m - m\sqrt{\varepsilon } = m\left(1-\sqrt{\varepsilon }\right), \] proving the claim. The claim shows \)x\( satisfies the condition in Step 3 of Algorithm~ \ref{alg:c20-decompression-list}, so \)x L\( at the end of the loop. Since \)|L| ≤L\( by the list-decodability bound, we may take \)a\( to be the index of \)x\( within \)L\(, and then Algorithm~ \ref{alg:c20-decompression-list} on input \)(A, y, a)\( outputs \)x\(. Thus \)(y,a)\(, together with (a fixed, \)O(1)\(-bit description of) Algorithm~ \ref{alg:c20-decompression-list} as decompression procedure, gives a complete description of \)x\(, of total length \)|y| + |a| + O(1) = |y| + L + O(1)\(. By Definition~ \ref{def:c20-kolmogorov-complexity}, this length is at least \)K(x)\(, so \[ |y| \geq K(x) - \log L - O(1) \geq K(x) - O(\log |x|), \] where the last inequality uses (20.1) together with \)q,m\( being \)|x|^O(1)\( (so \)L = O(|x|)\(). This contradicts our assumption \)|y| < K(x) - O(|x|)\(, completing the proof. \)

Proof ▶