22 Finding Defectives: Group Testing

Definition 460 Query / test

Given

\(\)N\( individuals indexed by \)[N]\(, a \emph{query} (or \emph{test}) \)S\( is a subset \)S ⊆[N]\(. If \)x = (x_1,…,x_N) {0,1}^N\( is the (unknown) indicator vector of the individuals possessing a given biomarker (so \)x_i = 1\( iff individual \)i\( is defective), the answer to the query \)S\( is \[ A(S) = \begin{cases} 1 & \text{if } \sum _{i \in S} x_i \ge 1, \\ 0 & \text{otherwise.} \end{cases} \] Equivalently, \)A(S) = _i S x_i\( is the logical-OR of the bits of \)x\( indexed by \)S\(. The \emph{group testing problem} is: given an upper bound \)d\( on the number of defectives (i.e.\ \)wt(x) ≤d\(), compute \)x\( while minimizing the number of tests/queries used. \)

Definition

Definition 461 Adaptive vs. non-adaptive group testing

A group testing scheme is adaptive if each test may be chosen based on the outcomes of all previously performed tests. A group testing scheme is non-adaptive if the entire collection of tests is fixed in advance, before any test outcome is observed.

Definition 462 \(\)t(d,N)\(, Definition 22.1.1\)

Given a subset of \(\)N\( items with \)d\( defects represented as \)x {0,1}^N\( with \)wt(x) ≤d\(, the minimum number of \emph{non-adaptive} tests that one would have to make (over all valid non-adaptive schemes that correctly determine \)x\() is defined as \)t(d,N)\(. \)

Definition 463 \(\)t^a(d,N)\(, Definition 22.1.2\)

Given a set of \(\)N\( items with \)d\( defects, \)t^a(d,N)\( is defined as the minimum number of \emph{adaptive} tests that one would have to make to detect all the defective items. \)

Proposition 464 Trivial bounds, Proposition 22.1.3

For every

\(\)1 ≤d ≤N\(, we have \[ 1 \le t^a(d,N) \le t(d,N) \le N. \] \)

Proposition

Proof ▶

The inequality

\(\)t(d,N) ≤N\( follows from the naive scheme of testing all individuals with singleton tests \){1},…,{N}\( (which is trivially non-adaptive). The inequality \)1 ≤t^a(d,N)\( is trivial (if \)N ≥1\( and \)d ≥1\( at least one test is needed to distinguish the all-zero input from a weight-one input). Finally, \)t^a(d,N) ≤t(d,N)\( holds because any non-adaptive test schedule can be simulated by an adaptive scheme that simply runs all of the non-adaptive tests as its first round of tests and ignores the intermediate outcomes when choosing later tests; adaptive schemes can only do at least as well since later tests may be chosen using information revealed by earlier ones. \)

Proof ▶

Definition 465 Group testing matrix

For

\(\)S ⊆[N]\(, let \)χ_S {0,1}^N\( denote its indicator vector, i.e.\ \)χ_S(i) = 1\( iff \)i S\(. A non-adaptive group testing scheme with \)t\( tests \)S_1,…,S_t\( is represented by the \)t ×N\( \emph{testing matrix} \)M = (M_ij)_i [t], j [N]\( whose \)i\(-th row is \)χ_S_i\(, i.e.\ \)M_ij = 1\( if \)j S_i\( and \)M_ij=0\( otherwise. (Testing each individual as a singleton set corresponds to \)M\( being the \)N ×N\( identity matrix.) Interpreting multiplication as logical AND (\)∧\() and addition as logical OR (\)∨\(), we write \)M ⊙x\( for the vector \)r {0,1}^t\( of test outcomes, i.e.\ \)r = M ⊙x\( where \)r_i = 1\( iff row \)i\( of \)M\( and \)x\( share a common \)1\(-position. For a column index \)j [N]\( we also write \)M^j {0,1}^t\( for the \)j\(-th column of \)M\(, thought of interchangeably as a subset of \)[t]\( (namely, the set of rows in which it has a \)1\(). The goal of non-adaptive group testing is to design \)M\( with as few rows \)t\( as possible so that \)x\( can be recovered from \)M ⊙x\(. \)

Definition

Proposition 466 Counting lower bound, Proposition 22.2.1

For every

\(\)1 ≤d ≤N\(, \[ t^a(d,N) \ge d \log \frac{N}{d}. \] \)

Proposition

Proof ▶

Fix any valid adaptive group testing scheme with

\(\)t\( tests. Observe that if \)x ≠y {0,1}^N\( with \)wt(x), wt(y) ≤d\(, then \)r(x) ≠r(y)\(, where \)r(x)\( denotes the vector of test results obtained when running the (adaptive) scheme on input \)x\(, and similarly for \)r(y)\(: two valid inputs cannot produce the same result vector, since otherwise the scheme could not distinguish \)x\( from \)y\(. This observation implies that the number of distinct test-outcome vectors is at least the number of distinct binary vectors of Hamming weight at most \)d\( in \){0,1}^N\(, i.e.\ \)Vol_2(d,N)\( (the volume of a Hamming ball of radius \)d\(, Definition~ \ref{def:c01-hamming-ball}). On the other hand, the number of possible distinct \)t\(-bit outcome vectors is at most \)2^t\(. Combining the two facts, \[ 2^t \ge \mathrm{Vol}_2(d,N), \] and hence \[ t \ge \log \mathrm{Vol}_2(d,N). \] Recall that \[ \mathrm{Vol}_2(d,N) \ge \binom {N}{d} \ge \left(\frac{N}{d}\right)^d, \] where the first inequality is the definition of the volume of a Hamming ball as a sum of binomial coefficients (of which \)Nd\(, the term for weight exactly \)d\(, is one summand) and the second is the standard bound \)Nd ≥(N/d)^d\(. Combining these, \)t ≥d (N/d)\(, as desired, and since this holds for every valid adaptive scheme with \)t\( tests, \)t^a(d,N) ≥d(N/d)\(. \)

Proof ▶

Proposition 467 Bound for d=1\(, Proposition 22.3.1\)

\[ t(1,N) \le \lceil \log (N+1) \rceil + 1. \]

Proof ▶

We prove the upper bound by exhibiting a matrix that solves non-adaptive group testing for

\(\)d=1\(. Let the group testing matrix \)M\( be the parity check matrix for the \)[2^m-1,2^m-m-1,3]_2\( Hamming code (Definition~ \ref{def:c01-hamming-code}), i.e.\ \)H_m\(, where the \)i\(-th column is the binary representation of \)i\(. This works because if \)x\( has \)wt(x) ≤1\( and \)x = e_i\( for some \)i [N]\(, then \)M ⊙x = M^i\(, the \)i\(-th column of \)M\(, and by construction \)M^i\( is exactly the binary representation of \)i\(; hence \)i\( (and thus \)x\() can be read off directly from \)M ⊙x\(. If instead \)wt(x) = 0\(, then \)M ⊙x = 0\(, which is exactly \)x\( and is not equal to the binary representation of any \)i [N]\( (since \)i ≥1\(). Hence \)M ⊙x\( uniquely identifies \)x\( whenever \)wt(x) ≤1\(. If \)N ≠2^m - 1\( for any \)m\(, take the matrix \)H_m\( for the (unique) \)m\( with \)2^m-1 - 1 < N < 2^m - 1\( and pad \)x\( with \)0\(s at the end to length \)2^m-1\( before applying the above scheme; decoding remains trivial since a binary representation is directly read off. In either case the number of tests (rows of \)M\() is at most \)m = (N+1) + 1\(, which completes the proof. \)

Proof ▶

Definition 468 \(\)d\(-separable matrix, Definition 22.3.2\)

A \(\)t ×N\( matrix \)M\( is \)\(\)d\(-separable\) if and only if for every \(\)S_1 ≠S_2 ⊆[N]\( such that \)|S_1|,|S_2| ≤d\(, we have \[ \bigcup _{j \in S_1} M^j \ne \bigcup _{i \in S_2} M^i. \] \)

Algorithm 469 Decoder for Separable Matrices, Algorithm 22.3.1

Input: Result vector

\(\)r\( and a \)d\(-separable matrix \)M\(. \textbf{Output:} \)x\( if \)r = M ⊙x\(, else \texttt{Fail}. \begin{enumerate} \item \end{enumerate} \)R ←{i ∣r_i = 1}\(. \item For every \)T ⊆[N]\( with \)|T| ≤d\(: \begin{enumerate} \item \end{enumerate}\)S_T ←_i T M^i\(. \item If \)R = S_T\(: set \)x {0,1}^N\( with \)x_i = 1\( iff \)i T\(, and return \)x\(. \)

Return Fail.

Correctness follows directly from Definition 468: the set \(\)T\( found (if any) satisfies \)_i T M^i = R = _i supp(x) M^i\(, and \)d\(-separability guarantees that no other set of size \)≤d\( has the same union, so \)T = supp (x)\(. The algorithm runs in time \)N^Θ(d)\(, since it enumerates all subsets \)T ⊆[N]\( of size at most \)d\(. \)

Algorithm

Definition 470 \(\)d\(-disjunct matrix, Definition 22.3.3\)

A \(\)t ×N\( matrix \)M\( is \)\(\)d\(-disjunct\) if and only if for every \(\)S ⊂[N]\( with \)|S| ≤d\( and for every \)j S\(, there exists \)i [t]\( such that \)M_ij = 1\( but \)M_ik = 0\( for all \)k S\(. Or, equivalently, \[ M^j \not\subseteq \bigcup _{k \in S} M^k. \] \)

Lemma 471 Disjunct implies separable, Lemma 22.3.4

Any

\(\)d\(-disjunct matrix is also \)d\(-separable. \)

Lemma

Proof ▶

For contradiction, assume

\(\)M\( is \)d\(-disjunct but not \)d\(-separable. Since \)M\( is not \)d\(-separable, there exist two distinct subsets \)S ≠T ⊂[N]\( of size at most \)d\( each such that \[ \bigcup _{k \in S} M^k = \bigcup _{k \in T} M^k. \] Since \)S ≠T\(, there exists \)j T ∖S\(. But then \[ M^j \subseteq \bigcup _{k \in T} M^k = \bigcup _{k \in S} M^k, \] where the equality is the displayed equality above. However, since by assumption \)j S\(, this directly contradicts the fact that \)M\( is \)d\(-disjunct (Definition~ \ref{def:c22-disjunct}). This contradiction completes the proof. \)

Proof ▶

Algorithm 472 Naive Decoder for Disjunct Matrices, Algorithm 22.3.2

Input: Result vector

\(\)r\( and a \)d\(-disjunct matrix \)M\(. \textbf{Output:} \)x\( if \)M ⊙x = r\(, else \texttt{Fail}. \begin{enumerate} \item Initialize \end{enumerate} \)x {0,1}^N\( to be the all-ones vector. \item For every \)i [t]\(: if \)r_i = 0\(, then for every \)j [N]\( with \)M_ij = 1\(, set \)x_j ←0\(. \item If \)M ⊙x = r\(, return \)x\(; else return \texttt{Fail}. \)

Algorithm

Lemma 473 Efficient decoding of disjunct matrices, Lemma 22.3.5

There exists an

\(\)O(tN)\(-time decoding algorithm for any \)t ×N\( matrix that is \)d\(-disjunct. \)

Lemma

Proof ▶

The proof follows from two observations, which together justify Algorithm 472 (Algorithm 22.3.2).

First, say we have a matrix

\(\)M\(, a vector \)x\(, and \)r = M ⊙x\( such that \)r_i = 1\(. Then there exists a column \)j\( that made it possible, i.e.\ if \)r_i = 1\( then there is a \)j\( such that \)M_ij = 1\( and \)x_j = 1\(. Second, let \)T = {ℓ∣x_ℓ= 1}\( with \)|T| ≤d\(, and let \)j\( be a column not in \)T\(. Consider the row \)i\( such that \)T\( has all zeros in the \)i\(-th row and \)M_ij=1\(; such a row exists by \)d\(-disjunctness (Definition~ \ref{def:c22-disjunct}, applied with \)S=T\(), and for this row \)r_i = 0\(. Conversely, if \)r_i = 0\(, then for every \)j [N]\( with \)M_ij=1\(, it must be the case that \)x_j = 0\( (else, by the first observation applied to row \)i\(, we would need \)r_i=1\(). These two observations show that Algorithm~ \ref{alg:c22-disjunct-decoder} correctly sets to \)0\( exactly the coordinates \)x_j\( with \)j T\(, and hence correctly recovers \)x\(. The algorithm runs in time \)O(tN)\(: Step 2 examines each of the \)t\( rows once, and for each row with \)r_i=0\( scans all \)N\( columns. \)

Proof ▶

Lemma 474 Sufficient condition for disjunctness, Lemma 22.4.1

Let

\(\)1 ≤d ≤N\( be an integer and \)M\( be a \)t ×N\( matrix such that \begin{enumerate} \item[(i)]for every \end{enumerate}\)j [N]\(, the \)j\(-th column has at least \)w_\( ones in it, i.e.\ \)|M^j| ≥w_\(; and \item [(ii)] for every \)i ≠j [N]\(, the \)i\(-th and \)j\(-th columns have at most \)a_\( ones in common, i.e.\ \)|M^i ∩M^j| ≤a_\(, \)

for some integers \(\)a_ ≤w_ ≤t\(. Then \)M\( is a \) \(-disjunct matrix. \)

Lemma

Proof ▶

For notational convenience, define

\(\)d = \(. Fix an arbitrary \)S ⊂[N]\( with \)|S| ≤d\( and a \)j S\(. We must show that \[ M^j \not\subseteq \bigcup _{i \in S} M^i, \] or equivalently \[ M^j \not\subseteq \bigcup _{i \in S} \left(M^i \cap M^j\right). \] We prove this by showing that \[ \left| M^j \setminus \bigcup _{i \in S} \left(M^i \cap M^j\right) \right| {\gt} 0. \] Indeed, consider the following sequence of relations: \begin{align*} \left| M^j \setminus \bigcup _{i \in S} \left(M^i \cap M^j\right)\right| & = |M^j| - \left|\bigcup _{i \in S}\left(M^i \cap M^j\right)\right| \\ & \ge |M^j| - \sum _{i \in S} \left|\left(M^i \cap M^j\right)\right| \\ & \ge w_{\min } - |S| \cdot a_{\max } \\ & \ge w_{\min } - d \cdot a_{\max } \\ & \ge w_{\min } - \frac{w_{\min }-1}{a_{\max }}\cdot a_{\max } \\ & = 1. \end{align*} Here the first equality holds because \)M^j\( is the disjoint union of the set on the right and the set being subtracted. The first inequality follows from the fact that the size of a union of sets is at most the sum of their sizes. The second inequality follows from the definitions of \)w_\( and \)a_\( (hypotheses (i) and (ii)). The third follows from \)|S| ≤d\(. The fourth follows from the definition of \)d\(. This shows the displayed set has size at least \)1\(, so it is nonempty, which completes the proof. \)

Proof ▶

Definition 475 Matrix from a code

Let

\(\)C ⊆[q]^t\( be a code with codewords \)C = {c_1, …,c_N}\(. The matrix \)M_C\( associated to \)C\( is the \)t ×N\( matrix whose \)i\(-th column is \)c_i\(, i.e. \[ M_C = \begin{bmatrix} \uparrow & \uparrow & & \uparrow \\ \mathbf{c}_1 & \mathbf{c}_2 & \cdots & \mathbf{c}_N \\ \downarrow & \downarrow & & \downarrow \end{bmatrix}. \] \)

Definition

Construction 476 Reduction of disjunct-matrix design to codes

By Lemma 474, to construct a

\(\) \(-disjunct matrix it suffices to design a binary code \)C^* ⊆{0,1}^t\( such that (i) for every \)c C^*\(, \)wt(c) ≥w_\(, and (ii) for every \)c^1 ≠c^2 C^*\(, \)|{i ∣c^1_i = c^2_i = 1}| ≤a_\(, and then take \)M = M_C^*\( (Definition~ \ref{def:c22-code-to-matrix}). \)

Construction

22.4.1 Kautz–Singleton Construction

Construction 477 Kautz–Singleton construction

Let

\(\)C^* = C_out ○C_in\( be a concatenated code (in the sense of code concatenation, Ch.\ 10), where \)C_out\( is a \)[q,k,q-k+1]_q\( Reed--Solomon code (Definition~ \ref{def:c05-reed-solomon}) and the inner code \)C_in : F_q {0,1}^q\( is defined by \)C_in(i) = e_i\( for every \)i F_q\( (identified with \)[q]\(). Note that \)M_C_in\( is the \)q ×q\( identity matrix, and that \)N = q^k\( and \)t = q^2\(. The group testing matrix produced by this construction is \)M_C^*\( (Definition~ \ref{def:c22-code-to-matrix}). \)

Construction

Theorem 478 Kautz–Singleton, Theorem 22.4.2

For every integer

\(\)d ≥1\( and large enough \)N ≥d\(, there exists a \)t ×N\( matrix that is \)d\(-disjunct where \[ t = O\! \left(d^2 (\log _d N)^2\right). \] \)

Theorem

Proof ▶

We instantiate the Kautz–Singleton construction (Construction 477) and analyze the resulting matrix

\(\)M = M_C^*\( via Lemma~ \ref{lem:c22-sufficient-disjunct}. Recall \)N = q^k\( and \)t = q^2\(. Every column of \)M\( has exactly \)q\( ones in it (since the inner encoding of any symbol is a weight-one vector \)e_i {0,1}^q\(, and every codeword of \)C_out\( has \)q\( (i.e.\ all) coordinates over \)F_q\( each contributing one block of length \)q\(). In other words, \)w_ = q\(. Next we estimate \)a_\(. Divide the \)t = q^2\( rows into \)q\(-sized chunks, and index them by pairs \)(i,j) [q] ×[q]\(. For \)(i,j) [q]×[q]\( and a column index \)ℓ[N]\( (corresponding to a codeword \)c_ℓC^*\(, which in turn corresponds to a message \)m F_q^k\(), we have \)M_(i,j),ℓ = 1\( if and only if \)c_ℓ(j) = i\( (i.e.\ the \)j\(-th coordinate of \)C_out (m)\( equals \)i\(, under a fixed bijection \)F_q ↔[q]\(). Hence, the number of rows in which columns \)ℓ_1\( and \)ℓ_2\( both have a \)1\( is exactly the number of coordinate positions \)j\( where \)c_ℓ_1(j) = c_ℓ_2(j)\(, which equals \)q - Δ(c_ℓ_1,c_ℓ_2)\(. Since \)C_out\( is a \)[q,k,q-k+1]_q\( code (Definition~ \ref{def:c05-reed-solomon}), any two distinct codewords of \)C_out\( agree in at most \)k-1\( positions. Hence any two distinct columns of \)M\( agree (have a common \)1\() in at most \)k-1\( rows, i.e.\ \)a_ = k-1\(. By Lemma~ \ref{lem:c22-sufficient-disjunct}, \)M\( is \)d\(-disjunct for \[ d = \left\lfloor \frac{q-1}{k-1} \right\rfloor . \] We now pick \)q\( and \)k\( (in terms of \)d\() so that this holds. Note that \)d = Θ((q-1)/(k-1))\( gives \)q = O(kd)\(. Since \)N = q^k\(, we have \)k = _q N\(, so \)q = O(d _q N)\(, i.e.\ \)q q = O(d N)\(, which gives \[ q = O(d \log _d N). \] Recalling that \)t = q^2\( completes the proof. \)

Proof ▶

Example 479 Example 22.4.3

Let

\(\)k=1\( and \)q=3\(. By the choice of the \)[3,1]_3\( Reed--Solomon code, \)C_ out = {(0,0,0),(1,1,1),(2,2,2)}\(, so \[ M_{C_{\mathrm{out}}} = \begin{bmatrix} 0 & 1 & 2 \\ 0 & 1 & 2 \\ 0 & 1 & 2 \end{bmatrix}. \] Concatenating each entry with the inner code \)C_in(i) = e_i {0,1}^3\( produces the \)9 ×3\( matrix \)M_C^*\(, where every group of three consecutive rows of \)M_C^*\( corresponds to one row of \)M_C_out\(. \)

Example

Definition 480 Data stream algorithm, Definition 22.5.1

A data stream algorithm is an algorithm satisfying the following four requirements.

The algorithm makes one sequential pass over the input.
The algorithm uses poly-log space (in particular, it cannot store the entire input).
The algorithm has poly-log update time, i.e. whenever a new item appears, the algorithm updates its internal state in poly-log time.
The algorithm has poly-log reporting time, i.e. at any point of time the algorithm can return the required answers in poly-log time.

Definition 481 Frequency, Definition 22.5.2

Given a stream of pairs

\(\)(i_1,u_1),…,(i_m,u_m)\( with \)i_ℓ[N]\( and \)u_ℓZ\(, let \)f_ℓ\( denote the total count for item (stock) id \)ℓ\(. Initially \)f_ℓ= 0\(; upon seeing a pair \)(ℓ,u_ℓ)\(, update \)f_ℓ←f_ℓ+ u_ℓ\(. \)

Definition

Definition 482 Hot Items Problem, Definition 22.5.3

Given

\(\)N\( different items, and \)m\( input pairs of data \)(i_ℓ,u_ℓ)\( for \)1 ≤ℓ≤m\(, where \)i_ℓ[N]\( indicates the item index and \)u_ℓ\( indicates the corresponding count, the \emph{hot items problem} requires updating the count \)f_ℓ\( (\)1 ≤ℓ≤N\() for each item, and outputting all item indices \)j\( such that \[ f_j {\gt} \frac{\sum _{\ell =1}^{N} u_\ell }{d}. \] Any such item is called a \emph{hot item}. (The hot items problem is also called the \emph{heavy hitters problem}.) \)

Definition

Theorem 483 Space lower bound for exact hot items, Theorem 22.5.4

Computing hot items exactly by a deterministic one-pass algorithm needs

\(\)Ω(N)\( space (even allowing exponential time). \)

Theorem

Proof ▶

22.5.1 Connection to Group Testing

We assume throughout this subsection the heavy-tail distribution assumption on the input data, namely

\[ \sum _{\ell :\, \text{not hot}} f_\ell \le \frac{m}{d}. \]

Algorithm 484 Initialization, Algorithm 22.5.1

Output: initialize the counters.

\(\)m ←0\(. \item For every \)j [t]\(: \)C_j ←0\(. \)

Algorithm

Algorithm 485 Update, Algorithm 22.5.2

Input: pair

\(\)(i,u)\(, \)i [N]\(, \)u Z\(. \textbf{Output:} updates the counters. \begin{enumerate} \item \end{enumerate} \)m ←m+1\(. \item For every \)j [t]\(: if \)M_ij = 1\(, then \)C_j ←C_j + u\(. \)

Here \(\)M\( is a fixed \)t ×N\( matrix and \)C_1,…,C_t\( are counters, \)C_i\( maintaining the total count of all items present in the \)i\(-th row of \)M\(. \)

Algorithm

Lemma 486 Observation 22.5.5

\(\)j\( is a hot item and \)M_ij=1\(, then \)C_i > \(. \)

Lemma

Proof ▶

Let

\(\)i [t]\( be such that \)M_ij=1\(. Then at any point of time, \[ C_i = \sum _{k:\, M_{ik}=1} f_k \ge f_j, \] where the equality follows by an easy induction on Algorithm~ \ref{alg:c22-hot-items-update} (each update to \)f_k\( for \)M_ik=1\( is mirrored by an equal update to \)C_i\(). Since \)j\( is a hot item, we have \)f_j > m/d\( (Definition~ \ref{def:c22-hot-items}, applied with \)∑_ℓu_ℓ= m\(), which completes the proof. \)

Proof ▶

Lemma 487 Observation 22.5.6

For any

\(\)1 ≤i ≤t\(, if all \)j\( with \)M_ij=1\( are not hot items, then \)C_i ≤\(. \)

Lemma

Proof ▶

Fix

\(\)i [t]\( such that every \)j [N]\( with \)M_ij=1\( is not a hot item. Then by the same argument as in the proof of Lemma~ \ref{lem:c22-hot-item-obs1}, \[ C_i = \sum _{k:\, M_{ik}=1} f_k. \] Since every \)k\( with \)M_ik=1\( is not hot, this sum is a sub-sum of \)∑_ℓ: not hot f_ℓ≤m/d\( (the heavy-tail assumption), so \)C_i ≤m/d\(. \)

Proof ▶

Algorithm 488 Report Heavy Items, Algorithm 22.5.3

Input: counters

\(\)m\( and \)C_1,…,C_t\(. \textbf{Output:} the heavy (hot) items. \begin{enumerate} \item For every \end{enumerate} \)j [t]\(: set \)r_j ←1\( if \)C_j > m/d\(, else \)r_j ←0\(. \item Let \)x\( be the result of decoding (for group testing) \)r = (r_1,…,r_t)\( against \)M\(. \item Return \){i ∣x_i = 1}\(. \)

Define \(\)x = (x_1,…,x_N) {0,1}^N\( with \)x_j = 1\( iff \)j\( is a hot item. By Lemmas~ \ref{lem:c22-hot-item-obs1} and~ \ref{lem:c22-hot-item-obs2}, \)r_i = 1\( if and only if \)C_i > m/d\(, which happens if and only if \)_j: M_ij=1 x_j = 1\(; that is, \)r = M ⊙x\(. Since \)wt(x) < d\( (there are at most \)d\( hot items), reporting the hot items reduces exactly to decoding the group testing instance \)(M,r)\(. \)

Algorithm

Theorem 489 Data stream algorithm for hot items, Theorem 22.5.7

There exists a data streaming algorithm that computes the

\(\)d\( hot items with one pass, \)O(t N)\( space for \)t = O(d^2 _d^2 N)\(, \)O(t polylog t)\( update time, and \)poly(t)\( reporting time. \)

Theorem

Proof ▶

We take

\(\)M\( to be the \)d\(-disjunct matrix of Theorem~ \ref{thm:c22-kautz-singleton} (Kautz--Singleton construction), with \)t = O(d^2 _d^2 N)\(, and use Algorithms~ \ref{alg:c22-hot-items-init},~ \ref{alg:c22-hot-items-update}, and~ \ref{alg:c22-report-heavy-items} to initialize, update, and report, respectively. We check the four requirements of Definition~ \ref{def:c22-data-stream-algorithm}. \emph{One-pass.} Using non-adaptive group testing, the algorithm above can be implemented in one pass: Algorithm~ \ref{alg:c22-hot-items-update} is invoked once per input pair, and the choice of \)M\( (which is fixed in advance) satisfies the non-adaptivity requirement. \emph{Poly-log space.} We must maintain the counters \)C_1,…,C_t\( and \)m\(. Each counter is at most \)mN\( in absolute value, so it can be represented in \)O(N + m)\( bits, giving a total of \)O((N + m) t)\( bits, i.e.\ \)O(t N)\( space (treating \)m = O(N)\(, e.g.\ if \)m = poly(N)\(). Since \)t = O(d^2 _d^2 N)\(, this is poly-logarithmic in \)N\( provided \)d = poly N\(. We do not store the matrix \)M\( itself (which would need \)Ω(tN)\( space); instead \)M\(’s entries are computed on the fly (see next item). \emph{Poly-log update time.} Recall \)M = M_C^*\( with \)C^* = C_ out ○C_in\(, \)C_out\( a \)[q,k,q-k+1]_q\( Reed--Solomon code (Definition~ \ref{def:c05-reed-solomon}) and \)M_C_in\( the \)q ×q\( identity matrix, so \)N = q^k\(, \)t = q^2\(. Column \)M^j\(, for a message \)m F_q^k\( corresponding to item \)j\(, equals (a suitable encoding of) the codeword \)C_out(m)\(: since \)C_out\( is a linear code (Definition~ \ref{def:c02-linear-code}), \)C_out(m)\( can be computed in \)O(q^2 ×polylog q)\( time by a matrix-vector product against the (implicitly stored, low-space) generator matrix, and hence \)M^j\( can be computed in \)O(q^2 ×polylog q) = O(t ×polylog t)\( time. Step 2 of Algorithm~ \ref{alg:c22-hot-items-update} therefore runs in \)O(t ×polylog t)\( time overall. \emph{Poly(t) reporting time.} The run time of Algorithm~ \ref{alg:c22-report-heavy-items} is dominated by Step 2 (decoding). The only decoding algorithm exhibited in this chapter for a general \)d\(-disjunct matrix is Algorithm~ \ref{alg:c22-disjunct-decoder}~ /~ Lemma~ \ref{lem:c22-disjunct-decoding-time}, which runs in time \)Ω(tN)\(, not \)poly(t)\(. Granting the existence of such a \)poly(t)\(-time decoder for this specific \)M\( (which uses the algebraic structure of the Reed--Solomon outer code), Step 2 runs in \)poly(t)\( time, completing the proof. \)

Proof ▶

Theorem 490 Adaptive bound, Theorem 22.6.1

\[ t^a(d,N) = \Theta \! \left(d \log (N/d)\right). \]

Proof ▶

The lower bound

\(\)t^a(d,N) = Ω(d(N/d))\( is exactly Proposition~ \ref{prop:c22-adaptive-lower-bound}. \notready \)

Proof ▶

Theorem 491 Non-adaptive bounds, Theorem 22.6.2

\[ \Omega \! \left(d^2 \log _d N\right) \le t(d,N) \le O\! \left(d^2 \min \! \left(\log (N/d), \log _d^2 N\right)\right). \]

Proof ▶

The upper bound term

\(\)O(d^2 _d^2 N)\( follows directly from Theorem~ \ref{thm:c22-kautz-singleton} (Kautz--Singleton). \notready \)

Proof ▶

Corollary 492 Adaptive/non-adaptive gap, Corollary 22.6.3

\[ \frac{t(d,N)}{t^a(d,N)} \ge \Omega \! \left(\frac{d}{\log d}\right). \]

Proof ▶

By Theorem 491,

\(\)t(d,N) ≥Ω(d^2 _d N)\(, and by Theorem~ \ref{thm:c22-adaptive-bound-summary}, \)t^a(d,N) = Θ(d(N/d)) = O(d N)\(. Hence \[ \frac{t(d,N)}{t^a(d,N)} \ge \Omega \! \left(\frac{d^2 \log _d N}{d \log N} \right) = \Omega \! \left(\frac{d \log _d N}{\log N}\right). \] Writing \)_d N = N / d\(, the right-hand side equals \)Ω(d / d)\(, which is the claimed bound. \)

Proof ▶