← All blueprints

Essential Coding Theory — Blueprint

22 Finding Defectives: Group Testing

Definition 460 Query / test
#

Given

\(\)N\( individuals indexed by \)[N]\(, a \emph{query} (or \emph{test}) \)S\( is a subset \)S ⊆[N]\(. If \)x = (x_1,…,x_N) {0,1}^N\( is the (unknown) indicator vector of the individuals possessing a given biomarker (so \)x_i = 1\( iff individual \)i\( is defective), the answer to the query \)S\( is \[ A(S) = \begin{cases} 1 & \text{if } \sum _{i \in S} x_i \ge 1, \\ 0 & \text{otherwise.} \end{cases} \] Equivalently, \)A(S) = _i S x_i\( is the logical-OR of the bits of \)x\( indexed by \)S\(. The \emph{group testing problem} is: given an upper bound \)d\( on the number of defectives (i.e.\ \)wt(x) ≤d\(), compute \)x\( while minimizing the number of tests/queries used. \)

Definition
#
Definition 461 Adaptive vs. non-adaptive group testing
#

A group testing scheme is adaptive if each test may be chosen based on the outcomes of all previously performed tests. A group testing scheme is non-adaptive if the entire collection of tests is fixed in advance, before any test outcome is observed.

Definition 462 \(\)t(d,N)\(, Definition 22.1.1\)

Given a subset of \(\)N\( items with \)d\( defects represented as \)x {0,1}^N\( with \)wt(x) ≤d\(, the minimum number of \emph{non-adaptive} tests that one would have to make (over all valid non-adaptive schemes that correctly determine \)x\() is defined as \)t(d,N)\(. \)

Definition 463 \(\)t^a(d,N)\(, Definition 22.1.2\)

Given a set of \(\)N\( items with \)d\( defects, \)t^a(d,N)\( is defined as the minimum number of \emph{adaptive} tests that one would have to make to detect all the defective items. \)

Proposition 464 Trivial bounds, Proposition 22.1.3

For every

\(\)1 ≤d ≤N\(, we have \[ 1 \le t^a(d,N) \le t(d,N) \le N. \] \)

Proposition
#
Proof

The inequality

\(\)t(d,N) ≤N\( follows from the naive scheme of testing all individuals with singleton tests \){1},…,{N}\( (which is trivially non-adaptive). The inequality \)1 ≤t^a(d,N)\( is trivial (if \)N ≥1\( and \)d ≥1\( at least one test is needed to distinguish the all-zero input from a weight-one input). Finally, \)t^a(d,N) ≤t(d,N)\( holds because any non-adaptive test schedule can be simulated by an adaptive scheme that simply runs all of the non-adaptive tests as its first round of tests and ignores the intermediate outcomes when choosing later tests; adaptive schemes can only do at least as well since later tests may be chosen using information revealed by earlier ones. \)

Proof
Definition 465 Group testing matrix

For

\(\)S ⊆[N]\(, let \)χ_S {0,1}^N\( denote its indicator vector, i.e.\ \)χ_S(i) = 1\( iff \)i S\(. A non-adaptive group testing scheme with \)t\( tests \)S_1,…,S_t\( is represented by the \)t ×N\( \emph{testing matrix} \)M = (M_ij)_i [t], j [N]\( whose \)i\(-th row is \)χ_S_i\(, i.e.\ \)M_ij = 1\( if \)j S_i\( and \)M_ij=0\( otherwise. (Testing each individual as a singleton set corresponds to \)M\( being the \)N ×N\( identity matrix.) Interpreting multiplication as logical AND (\)∧\() and addition as logical OR (\)∨\(), we write \)M ⊙x\( for the vector \)r {0,1}^t\( of test outcomes, i.e.\ \)r = M ⊙x\( where \)r_i = 1\( iff row \)i\( of \)M\( and \)x\( share a common \)1\(-position. For a column index \)j [N]\( we also write \)M^j {0,1}^t\( for the \)j\(-th column of \)M\(, thought of interchangeably as a subset of \)[t]\( (namely, the set of rows in which it has a \)1\(). The goal of non-adaptive group testing is to design \)M\( with as few rows \)t\( as possible so that \)x\( can be recovered from \)M ⊙x\(. \)

Definition
#
Proposition 466 Counting lower bound, Proposition 22.2.1

For every

\(\)1 ≤d ≤N\(, \[ t^a(d,N) \ge d \log \frac{N}{d}. \] \)

Proposition
#
Proof

Fix any valid adaptive group testing scheme with

\(\)t\( tests. Observe that if \)x ≠y {0,1}^N\( with \)wt(x), wt(y) ≤d\(, then \)r(x) ≠r(y)\(, where \)r(x)\( denotes the vector of test results obtained when running the (adaptive) scheme on input \)x\(, and similarly for \)r(y)\(: two valid inputs cannot produce the same result vector, since otherwise the scheme could not distinguish \)x\( from \)y\(. This observation implies that the number of distinct test-outcome vectors is at least the number of distinct binary vectors of Hamming weight at most \)d\( in \){0,1}^N\(, i.e.\ \)Vol_2(d,N)\( (the volume of a Hamming ball of radius \)d\(, Definition~ \ref{def:c01-hamming-ball}). On the other hand, the number of possible distinct \)t\(-bit outcome vectors is at most \)2^t\(. Combining the two facts, \[ 2^t \ge \mathrm{Vol}_2(d,N), \] and hence \[ t \ge \log \mathrm{Vol}_2(d,N). \] Recall that \[ \mathrm{Vol}_2(d,N) \ge \binom {N}{d} \ge \left(\frac{N}{d}\right)^d, \] where the first inequality is the definition of the volume of a Hamming ball as a sum of binomial coefficients (of which \)Nd\(, the term for weight exactly \)d\(, is one summand) and the second is the standard bound \)Nd ≥(N/d)^d\(. Combining these, \)t ≥d (N/d)\(, as desired, and since this holds for every valid adaptive scheme with \)t\( tests, \)t^a(d,N) ≥d(N/d)\(. \)

Proof
Proposition 467 Bound for d=1\(, Proposition 22.3.1\)
\[ t(1,N) \le \lceil \log (N+1) \rceil + 1. \]
Proof

We prove the upper bound by exhibiting a matrix that solves non-adaptive group testing for

\(\)d=1\(. Let the group testing matrix \)M\( be the parity check matrix for the \)[2^m-1,2^m-m-1,3]_2\( Hamming code (Definition~ \ref{def:c01-hamming-code}), i.e.\ \)H_m\(, where the \)i\(-th column is the binary representation of \)i\(. This works because if \)x\( has \)wt(x) ≤1\( and \)x = e_i\( for some \)i [N]\(, then \)M ⊙x = M^i\(, the \)i\(-th column of \)M\(, and by construction \)M^i\( is exactly the binary representation of \)i\(; hence \)i\( (and thus \)x\() can be read off directly from \)M ⊙x\(. If instead \)wt(x) = 0\(, then \)M ⊙x = 0\(, which is exactly \)x\( and is not equal to the binary representation of any \)i [N]\( (since \)i ≥1\(). Hence \)M ⊙x\( uniquely identifies \)x\( whenever \)wt(x) ≤1\(. If \)N ≠2^m - 1\( for any \)m\(, take the matrix \)H_m\( for the (unique) \)m\( with \)2^m-1 - 1 < N < 2^m - 1\( and pad \)x\( with \)0\(s at the end to length \)2^m-1\( before applying the above scheme; decoding remains trivial since a binary representation is directly read off. In either case the number of tests (rows of \)M\() is at most \)m = (N+1) + 1\(, which completes the proof. \)

Proof
Definition 468 \(\)d\(-separable matrix, Definition 22.3.2\)

A \(\)t ×N\( matrix \)M\( is \)\(\)d\(-separable\) if and only if for every \(\)S_1 ≠S_2 ⊆[N]\( such that \)|S_1|,|S_2| ≤d\(, we have \[ \bigcup _{j \in S_1} M^j \ne \bigcup _{i \in S_2} M^i. \] \)

Algorithm 469 Decoder for Separable Matrices, Algorithm 22.3.1

Input: Result vector

\(\)r\( and a \)d\(-separable matrix \)M\(. \textbf{Output:} \)x\( if \)r = M ⊙x\(, else \texttt{Fail}. \begin{enumerate} \item \end{enumerate} \)R ←{i ∣r_i = 1}\(. \item For every \)T ⊆[N]\( with \)|T| ≤d\(: \begin{enumerate} \item \end{enumerate}\)S_T ←_i T M^i\(. \item If \)R = S_T\(: set \)x {0,1}^N\( with \)x_i = 1\( iff \)i T\(, and return \)x\(. \)

Return Fail.

Correctness follows directly from Definition 468: the set \(\)T\( found (if any) satisfies \)_i T M^i = R = _i supp(x) M^i\(, and \)d\(-separability guarantees that no other set of size \)≤d\( has the same union, so \)T = supp (x)\(. The algorithm runs in time \)N^Θ(d)\(, since it enumerates all subsets \)T ⊆[N]\( of size at most \)d\(. \)

Algorithm
#
Definition 470 \(\)d\(-disjunct matrix, Definition 22.3.3\)

A \(\)t ×N\( matrix \)M\( is \)\(\)d\(-disjunct\) if and only if for every \(\)S ⊂[N]\( with \)|S| ≤d\( and for every \)j S\(, there exists \)i [t]\( such that \)M_ij = 1\( but \)M_ik = 0\( for all \)k S\(. Or, equivalently, \[ M^j \not\subseteq \bigcup _{k \in S} M^k. \] \)

Lemma 471 Disjunct implies separable, Lemma 22.3.4

Any

\(\)d\(-disjunct matrix is also \)d\(-separable. \)

Lemma
#
Proof

For contradiction, assume

\(\)M\( is \)d\(-disjunct but not \)d\(-separable. Since \)M\( is not \)d\(-separable, there exist two distinct subsets \)S ≠T ⊂[N]\( of size at most \)d\( each such that \[ \bigcup _{k \in S} M^k = \bigcup _{k \in T} M^k. \] Since \)S ≠T\(, there exists \)j T ∖S\(. But then \[ M^j \subseteq \bigcup _{k \in T} M^k = \bigcup _{k \in S} M^k, \] where the equality is the displayed equality above. However, since by assumption \)j S\(, this directly contradicts the fact that \)M\( is \)d\(-disjunct (Definition~ \ref{def:c22-disjunct}). This contradiction completes the proof. \)

Proof
Algorithm 472 Naive Decoder for Disjunct Matrices, Algorithm 22.3.2

Input: Result vector

\(\)r\( and a \)d\(-disjunct matrix \)M\(. \textbf{Output:} \)x\( if \)M ⊙x = r\(, else \texttt{Fail}. \begin{enumerate} \item Initialize \end{enumerate} \)x {0,1}^N\( to be the all-ones vector. \item For every \)i [t]\(: if \)r_i = 0\(, then for every \)j [N]\( with \)M_ij = 1\(, set \)x_j ←0\(. \item If \)M ⊙x = r\(, return \)x\(; else return \texttt{Fail}. \)

Algorithm
#
Lemma 473 Efficient decoding of disjunct matrices, Lemma 22.3.5

There exists an

\(\)O(tN)\(-time decoding algorithm for any \)t ×N\( matrix that is \)d\(-disjunct. \)

Lemma
#
Proof

The proof follows from two observations, which together justify Algorithm 472 (Algorithm 22.3.2).

First, say we have a matrix

\(\)M\(, a vector \)x\(, and \)r = M ⊙x\( such that \)r_i = 1\(. Then there exists a column \)j\( that made it possible, i.e.\ if \)r_i = 1\( then there is a \)j\( such that \)M_ij = 1\( and \)x_j = 1\(. Second, let \)T = {ℓ∣x_ℓ= 1}\( with \)|T| ≤d\(, and let \)j\( be a column not in \)T\(. Consider the row \)i\( such that \)T\( has all zeros in the \)i\(-th row and \)M_ij=1\(; such a row exists by \)d\(-disjunctness (Definition~ \ref{def:c22-disjunct}, applied with \)S=T\(), and for this row \)r_i = 0\(. Conversely, if \)r_i = 0\(, then for every \)j [N]\( with \)M_ij=1\(, it must be the case that \)x_j = 0\( (else, by the first observation applied to row \)i\(, we would need \)r_i=1\(). These two observations show that Algorithm~ \ref{alg:c22-disjunct-decoder} correctly sets to \)0\( exactly the coordinates \)x_j\( with \)j T\(, and hence correctly recovers \)x\(. The algorithm runs in time \)O(tN)\(: Step 2 examines each of the \)t\( rows once, and for each row with \)r_i=0\( scans all \)N\( columns. \)

Proof
Lemma 474 Sufficient condition for disjunctness, Lemma 22.4.1

Let

\(\)1 ≤d ≤N\( be an integer and \)M\( be a \)t ×N\( matrix such that \begin{enumerate} \item[(i)]for every \end{enumerate}\)j [N]\(, the \)j\(-th column has at least \)w_\( ones in it, i.e.\ \)|M^j| ≥w_\(; and \item [(ii)] for every \)i ≠j [N]\(, the \)i\(-th and \)j\(-th columns have at most \)a_\( ones in common, i.e.\ \)|M^i ∩M^j| ≤a_\(, \)

for some integers \(\)a_ ≤w_ ≤t\(. Then \)M\( is a \) \(-disjunct matrix. \)

Lemma
#
Proof

For notational convenience, define

\(\)d = \(. Fix an arbitrary \)S ⊂[N]\( with \)|S| ≤d\( and a \)j S\(. We must show that \[ M^j \not\subseteq \bigcup _{i \in S} M^i, \] or equivalently \[ M^j \not\subseteq \bigcup _{i \in S} \left(M^i \cap M^j\right). \] We prove this by showing that \[ \left| M^j \setminus \bigcup _{i \in S} \left(M^i \cap M^j\right) \right| {\gt} 0. \] Indeed, consider the following sequence of relations: \begin{align*} \left| M^j \setminus \bigcup _{i \in S} \left(M^i \cap M^j\right)\right| & = |M^j| - \left|\bigcup _{i \in S}\left(M^i \cap M^j\right)\right| \\ & \ge |M^j| - \sum _{i \in S} \left|\left(M^i \cap M^j\right)\right| \\ & \ge w_{\min } - |S| \cdot a_{\max } \\ & \ge w_{\min } - d \cdot a_{\max } \\ & \ge w_{\min } - \frac{w_{\min }-1}{a_{\max }}\cdot a_{\max } \\ & = 1. \end{align*} Here the first equality holds because \)M^j\( is the disjoint union of the set on the right and the set being subtracted. The first inequality follows from the fact that the size of a union of sets is at most the sum of their sizes. The second inequality follows from the definitions of \)w_\( and \)a_\( (hypotheses (i) and (ii)). The third follows from \)|S| ≤d\(. The fourth follows from the definition of \)d\(. This shows the displayed set has size at least \)1\(, so it is nonempty, which completes the proof. \)

Proof
Definition 475 Matrix from a code
#

Let

\(\)C ⊆[q]^t\( be a code with codewords \)C = {c_1, …,c_N}\(. The matrix \)M_C\( associated to \)C\( is the \)t ×N\( matrix whose \)i\(-th column is \)c_i\(, i.e. \[ M_C = \begin{bmatrix} \uparrow & \uparrow & & \uparrow \\ \mathbf{c}_1 & \mathbf{c}_2 & \cdots & \mathbf{c}_N \\ \downarrow & \downarrow & & \downarrow \end{bmatrix}. \] \)

Definition
#
Construction 476 Reduction of disjunct-matrix design to codes

By Lemma 474, to construct a

\(\) \(-disjunct matrix it suffices to design a binary code \)C^* ⊆{0,1}^t\( such that (i) for every \)c C^*\(, \)wt(c) ≥w_\(, and (ii) for every \)c^1 ≠c^2 C^*\(, \)|{i ∣c^1_i = c^2_i = 1}| ≤a_\(, and then take \)M = M_C^*\( (Definition~ \ref{def:c22-code-to-matrix}). \)

Construction
#

22.4.1 Kautz–Singleton Construction

Construction 477 Kautz–Singleton construction

Let

\(\)C^* = C_out ○C_in\( be a concatenated code (in the sense of code concatenation, Ch.\ 10), where \)C_out\( is a \)[q,k,q-k+1]_q\( Reed--Solomon code (Definition~ \ref{def:c05-reed-solomon}) and the inner code \)C_in : F_q {0,1}^q\( is defined by \)C_in(i) = e_i\( for every \)i F_q\( (identified with \)[q]\(). Note that \)M_C_in\( is the \)q ×q\( identity matrix, and that \)N = q^k\( and \)t = q^2\(. The group testing matrix produced by this construction is \)M_C^*\( (Definition~ \ref{def:c22-code-to-matrix}). \)

Construction
#
Theorem 478 Kautz–Singleton, Theorem 22.4.2

For every integer

\(\)d ≥1\( and large enough \)N ≥d\(, there exists a \)t ×N\( matrix that is \)d\(-disjunct where \[ t = O\! \left(d^2 (\log _d N)^2\right). \] \)

Theorem
#
Proof

We instantiate the Kautz–Singleton construction (Construction 477) and analyze the resulting matrix

\(\)M = M_C^*\( via Lemma~ \ref{lem:c22-sufficient-disjunct}. Recall \)N = q^k\( and \)t = q^2\(. Every column of \)M\( has exactly \)q\( ones in it (since the inner encoding of any symbol is a weight-one vector \)e_i {0,1}^q\(, and every codeword of \)C_out\( has \)q\( (i.e.\ all) coordinates over \)F_q\( each contributing one block of length \)q\(). In other words, \)w_ = q\(. Next we estimate \)a_\(. Divide the \)t = q^2\( rows into \)q\(-sized chunks, and index them by pairs \)(i,j) [q] ×[q]\(. For \)(i,j) [q]×[q]\( and a column index \)ℓ[N]\( (corresponding to a codeword \)c_ℓC^*\(, which in turn corresponds to a message \)m F_q^k\(), we have \)M_(i,j),ℓ = 1\( if and only if \)c_ℓ(j) = i\( (i.e.\ the \)j\(-th coordinate of \)C_out (m)\( equals \)i\(, under a fixed bijection \)F_q ↔[q]\(). Hence, the number of rows in which columns \)ℓ_1\( and \)ℓ_2\( both have a \)1\( is exactly the number of coordinate positions \)j\( where \)c_ℓ_1(j) = c_ℓ_2(j)\(, which equals \)q - Δ(c_ℓ_1,c_ℓ_2)\(. Since \)C_out\( is a \)[q,k,q-k+1]_q\( code (Definition~ \ref{def:c05-reed-solomon}), any two distinct codewords of \)C_out\( agree in at most \)k-1\( positions. Hence any two distinct columns of \)M\( agree (have a common \)1\() in at most \)k-1\( rows, i.e.\ \)a_ = k-1\(. By Lemma~ \ref{lem:c22-sufficient-disjunct}, \)M\( is \)d\(-disjunct for \[ d = \left\lfloor \frac{q-1}{k-1} \right\rfloor . \] We now pick \)q\( and \)k\( (in terms of \)d\() so that this holds. Note that \)d = Θ((q-1)/(k-1))\( gives \)q = O(kd)\(. Since \)N = q^k\(, we have \)k = _q N\(, so \)q = O(d _q N)\(, i.e.\ \)q q = O(d N)\(, which gives \[ q = O(d \log _d N). \] Recalling that \)t = q^2\( completes the proof. \)

Proof
Example 479 Example 22.4.3

Let

\(\)k=1\( and \)q=3\(. By the choice of the \)[3,1]_3\( Reed--Solomon code, \)C_ out = {(0,0,0),(1,1,1),(2,2,2)}\(, so \[ M_{C_{\mathrm{out}}} = \begin{bmatrix} 0 & 1 & 2 \\ 0 & 1 & 2 \\ 0 & 1 & 2 \end{bmatrix}. \] Concatenating each entry with the inner code \)C_in(i) = e_i {0,1}^3\( produces the \)9 ×3\( matrix \)M_C^*\(, where every group of three consecutive rows of \)M_C^*\( corresponds to one row of \)M_C_out\(. \)

Example
#
Definition 480 Data stream algorithm, Definition 22.5.1
#

A data stream algorithm is an algorithm satisfying the following four requirements.

  1. The algorithm makes one sequential pass over the input.

  2. The algorithm uses poly-log space (in particular, it cannot store the entire input).

  3. The algorithm has poly-log update time, i.e. whenever a new item appears, the algorithm updates its internal state in poly-log time.

  4. The algorithm has poly-log reporting time, i.e. at any point of time the algorithm can return the required answers in poly-log time.

Definition 481 Frequency, Definition 22.5.2
#

Given a stream of pairs

\(\)(i_1,u_1),…,(i_m,u_m)\( with \)i_ℓ[N]\( and \)u_ℓZ\(, let \)f_ℓ\( denote the total count for item (stock) id \)ℓ\(. Initially \)f_ℓ= 0\(; upon seeing a pair \)(ℓ,u_ℓ)\(, update \)f_ℓ←f_ℓ+ u_ℓ\(. \)

Definition
#
Definition 482 Hot Items Problem, Definition 22.5.3

Given

\(\)N\( different items, and \)m\( input pairs of data \)(i_ℓ,u_ℓ)\( for \)1 ≤ℓ≤m\(, where \)i_ℓ[N]\( indicates the item index and \)u_ℓ\( indicates the corresponding count, the \emph{hot items problem} requires updating the count \)f_ℓ\( (\)1 ≤ℓ≤N\() for each item, and outputting all item indices \)j\( such that \[ f_j {\gt} \frac{\sum _{\ell =1}^{N} u_\ell }{d}. \] Any such item is called a \emph{hot item}. (The hot items problem is also called the \emph{heavy hitters problem}.) \)

Definition
#
Theorem 483 Space lower bound for exact hot items, Theorem 22.5.4

Computing hot items exactly by a deterministic one-pass algorithm needs

\(\)Ω(N)\( space (even allowing exponential time). \)

Theorem
#
Proof

22.5.1 Connection to Group Testing

We assume throughout this subsection the heavy-tail distribution assumption on the input data, namely

\[ \sum _{\ell :\, \text{not hot}} f_\ell \le \frac{m}{d}. \]
Algorithm 484 Initialization, Algorithm 22.5.1
#

Output: initialize the counters.

\(\)m ←0\(. \item For every \)j [t]\(: \)C_j ←0\(. \)

Algorithm
#
Algorithm 485 Update, Algorithm 22.5.2

Input: pair

\(\)(i,u)\(, \)i [N]\(, \)u Z\(. \textbf{Output:} updates the counters. \begin{enumerate} \item \end{enumerate} \)m ←m+1\(. \item For every \)j [t]\(: if \)M_ij = 1\(, then \)C_j ←C_j + u\(. \)

Here \(\)M\( is a fixed \)t ×N\( matrix and \)C_1,…,C_t\( are counters, \)C_i\( maintaining the total count of all items present in the \)i\(-th row of \)M\(. \)

Algorithm
#
Lemma 486 Observation 22.5.5

If

\(\)j\( is a hot item and \)M_ij=1\(, then \)C_i > \(. \)

Lemma
#
Proof

Let

\(\)i [t]\( be such that \)M_ij=1\(. Then at any point of time, \[ C_i = \sum _{k:\, M_{ik}=1} f_k \ge f_j, \] where the equality follows by an easy induction on Algorithm~ \ref{alg:c22-hot-items-update} (each update to \)f_k\( for \)M_ik=1\( is mirrored by an equal update to \)C_i\(). Since \)j\( is a hot item, we have \)f_j > m/d\( (Definition~ \ref{def:c22-hot-items}, applied with \)∑_ℓu_ℓ= m\(), which completes the proof. \)

Proof
Lemma 487 Observation 22.5.6

For any

\(\)1 ≤i ≤t\(, if all \)j\( with \)M_ij=1\( are not hot items, then \)C_i ≤\(. \)

Lemma
#
Proof

Fix

\(\)i [t]\( such that every \)j [N]\( with \)M_ij=1\( is not a hot item. Then by the same argument as in the proof of Lemma~ \ref{lem:c22-hot-item-obs1}, \[ C_i = \sum _{k:\, M_{ik}=1} f_k. \] Since every \)k\( with \)M_ik=1\( is not hot, this sum is a sub-sum of \)∑_ℓ:  not hot f_ℓ≤m/d\( (the heavy-tail assumption), so \)C_i ≤m/d\(. \)

Proof
Algorithm 488 Report Heavy Items, Algorithm 22.5.3

Input: counters

\(\)m\( and \)C_1,…,C_t\(. \textbf{Output:} the heavy (hot) items. \begin{enumerate} \item For every \end{enumerate} \)j [t]\(: set \)r_j ←1\( if \)C_j > m/d\(, else \)r_j ←0\(. \item Let \)x\( be the result of decoding (for group testing) \)r = (r_1,…,r_t)\( against \)M\(. \item Return \){i ∣x_i = 1}\(. \)

Define \(\)x = (x_1,…,x_N) {0,1}^N\( with \)x_j = 1\( iff \)j\( is a hot item. By Lemmas~ \ref{lem:c22-hot-item-obs1} and~ \ref{lem:c22-hot-item-obs2}, \)r_i = 1\( if and only if \)C_i > m/d\(, which happens if and only if \)_j: M_ij=1 x_j = 1\(; that is, \)r = M ⊙x\(. Since \)wt(x) < d\( (there are at most \)d\( hot items), reporting the hot items reduces exactly to decoding the group testing instance \)(M,r)\(. \)

Algorithm
#
Theorem 489 Data stream algorithm for hot items, Theorem 22.5.7

There exists a data streaming algorithm that computes the

\(\)d\( hot items with one pass, \)O(t N)\( space for \)t = O(d^2 _d^2 N)\(, \)O(t  polylog  t)\( update time, and \)poly(t)\( reporting time. \)

Theorem
#
Proof

We take

\(\)M\( to be the \)d\(-disjunct matrix of Theorem~ \ref{thm:c22-kautz-singleton} (Kautz--Singleton construction), with \)t = O(d^2 _d^2 N)\(, and use Algorithms~ \ref{alg:c22-hot-items-init},~ \ref{alg:c22-hot-items-update}, and~ \ref{alg:c22-report-heavy-items} to initialize, update, and report, respectively. We check the four requirements of Definition~ \ref{def:c22-data-stream-algorithm}. \emph{One-pass.} Using non-adaptive group testing, the algorithm above can be implemented in one pass: Algorithm~ \ref{alg:c22-hot-items-update} is invoked once per input pair, and the choice of \)M\( (which is fixed in advance) satisfies the non-adaptivity requirement. \emph{Poly-log space.} We must maintain the counters \)C_1,…,C_t\( and \)m\(. Each counter is at most \)mN\( in absolute value, so it can be represented in \)O(N + m)\( bits, giving a total of \)O((N + m) t)\( bits, i.e.\ \)O(t N)\( space (treating \)m = O(N)\(, e.g.\ if \)m = poly(N)\(). Since \)t = O(d^2 _d^2 N)\(, this is poly-logarithmic in \)N\( provided \)d = poly N\(. We do not store the matrix \)M\( itself (which would need \)Ω(tN)\( space); instead \)M\(’s entries are computed on the fly (see next item). \emph{Poly-log update time.} Recall \)M = M_C^*\( with \)C^* = C_ out ○C_in\(, \)C_out\( a \)[q,k,q-k+1]_q\( Reed--Solomon code (Definition~ \ref{def:c05-reed-solomon}) and \)M_C_in\( the \)q ×q\( identity matrix, so \)N = q^k\(, \)t = q^2\(. Column \)M^j\(, for a message \)m F_q^k\( corresponding to item \)j\(, equals (a suitable encoding of) the codeword \)C_out(m)\(: since \)C_out\( is a linear code (Definition~ \ref{def:c02-linear-code}), \)C_out(m)\( can be computed in \)O(q^2 ×polylog q)\( time by a matrix-vector product against the (implicitly stored, low-space) generator matrix, and hence \)M^j\( can be computed in \)O(q^2 ×polylog q) = O(t ×polylog t)\( time. Step 2 of Algorithm~ \ref{alg:c22-hot-items-update} therefore runs in \)O(t ×polylog t)\( time overall. \emph{Poly(t) reporting time.} The run time of Algorithm~ \ref{alg:c22-report-heavy-items} is dominated by Step 2 (decoding). The only decoding algorithm exhibited in this chapter for a general \)d\(-disjunct matrix is Algorithm~ \ref{alg:c22-disjunct-decoder}~ /~ Lemma~ \ref{lem:c22-disjunct-decoding-time}, which runs in time \)Ω(tN)\(, not \)poly(t)\(. Granting the existence of such a \)poly(t)\(-time decoder for this specific \)M\( (which uses the algebraic structure of the Reed--Solomon outer code), Step 2 runs in \)poly(t)\( time, completing the proof. \)

Proof
Theorem 490 Adaptive bound, Theorem 22.6.1
\[ t^a(d,N) = \Theta \! \left(d \log (N/d)\right). \]
Proof

The lower bound

\(\)t^a(d,N) = Ω(d(N/d))\( is exactly Proposition~ \ref{prop:c22-adaptive-lower-bound}. \notready \)

Proof
Theorem 491 Non-adaptive bounds, Theorem 22.6.2
\[ \Omega \! \left(d^2 \log _d N\right) \le t(d,N) \le O\! \left(d^2 \min \! \left(\log (N/d), \log _d^2 N\right)\right). \]
Proof

The upper bound term

\(\)O(d^2 _d^2 N)\( follows directly from Theorem~ \ref{thm:c22-kautz-singleton} (Kautz--Singleton). \notready \)

Proof
Corollary 492 Adaptive/non-adaptive gap, Corollary 22.6.3
\[ \frac{t(d,N)}{t^a(d,N)} \ge \Omega \! \left(\frac{d}{\log d}\right). \]
Proof

By Theorem 491,

\(\)t(d,N) ≥Ω(d^2 _d N)\(, and by Theorem~ \ref{thm:c22-adaptive-bound-summary}, \)t^a(d,N) = Θ(d(N/d)) = O(d N)\(. Hence \[ \frac{t(d,N)}{t^a(d,N)} \ge \Omega \! \left(\frac{d^2 \log _d N}{d \log N} \right) = \Omega \! \left(\frac{d \log _d N}{\log N}\right). \] Writing \)_d N = N / d\(, the right-hand side equals \)Ω(d / d)\(, which is the claimed bound. \)

Proof