Group Testing with Blocks of Positives and Inhibitors

The main goal of group testing is to identify a small number of specific items among a large population of items. In this paper, we consider specific items as positives and inhibitors and non-specific items as negatives. In particular, we consider a novel model called group testing with blocks of positives and inhibitors. A test on a subset of items is positive if the subset contains at least one positive and does not contain any inhibitors, and it is negative otherwise. In this model, the input items are linearly ordered, and the positives and inhibitors are subsets of small blocks (at unknown locations) of consecutive items over that order. We also consider two specific instantiations of this model. The first instantiation contains a single block of consecutive items that consists of exactly known numbers of positives and inhibitors. The second instantiation contains a single block of consecutive items that contains up to known numbers of positives and inhibitors. Our contribution is to propose efficient encoding and decoding schemes such that the numbers of tests used to identify only positives or both positives and inhibitors are smaller than those in the state-of-the-art schemes. Moreover, the decoding times mostly scale with the number of tests and are significantly smaller than the state-of-the-art ones, which scale with both the number of tests and the number of items.


Introduction
Group testing [1] was first introduced to reduce the time and cost of testing draftees who were possibly positive for syphilis. In this problem, syphilitic draftees are greatly outnumbered by non-syphilitic draftees. The main idea of group testing is that, instead of testing draftees individually, sets of draftees are pooled and tested. If the test outcome of a pool is positive, then at least one draftee in that pool is syphilitic; otherwise, none of the draftees in the pool is syphilitic. Since this seminal work, group testing has usually been treated as a problem of identifying a small number of specific items in a large population of items. The specific items depend on context and affect how a test on a subset of items is positive or negative.
There are two general strategies for designing tests [2]. The first is adaptive group testing in which the design of a test depends on the designs of the previous tests. This approach usually attains an information-theoretic bound for the number of tests but consumes a substantial amount of time for implementation because of several design stages. To remedy its time-consuming nature while achieving a relatively low number of tests, non-adaptive group testing (NAGT) is used. In this strategy, all tests are designed independently and can be performed in parallel. Because of its advantage, NAGT has been used in a wide range of applications, such as computational and molecular biology [2,3], networking [4], COVID-19 [5,6], and neuroscience [7]. In this work, our focus is on the non-adaptive testing strategy.
NAGT can be represented by a $t \times n$ binary matrix $T = (t_{ij})$, where $n$ is the number of items and $t$ is the number of tests. An entry $t_{ij} = 1$ means that item (column) $j$ belongs to test (row) $i$, and $t_{ij} = 0$ means otherwise. The $j$th item is represented by the $j$th column of the matrix. The procedure to produce the measurement matrix is called construction, the procedure to obtain the outcomes of tests using the measurement matrix is called encoding, and the procedure to recover specific items from the outcomes is called decoding. A measurement matrix is random if some tests are generated by a probabilistic scheme, whereas it is deterministic if every test is deterministic. A measurement matrix is strongly explicit (respectively, explicit) if generating a column of it takes time and space polynomial in the number of rows (respectively, in the number of rows and the number of columns).
Some distribution settings may apply on specific items. There are two common settings: (i) the probabilistic setting, in which there is some probability distribution used on specific items, and the identification error probability is allowed; and (ii) the combinatorial setting, which is our focus here, and no probability distribution is used on specific items.
Consider standard group testing in which specific items are only positives. Suppose a test on a subset of items is positive if the subset contains at least one positive and is negative otherwise. Throughout the paper, log refers to base 2 logarithms. Given a population of $n$ items with up to $d$ positives, there are a number of works attaining a low number of tests, say $t = O(d^2 \log^{1+o(1)} n)$, and/or a fast decoding time, say poly(d, ln n) [8][9][10][11][12][13][14], in the combinatorial setting. In the probabilistic setting, Bondorf et al. [15] show that the number of tests can be reduced to $O(d \log n)$ with a decoding time of $O(d^2 \log d \cdot \log n)$. Price and Scarlett [16] later improved the decoding time to $O(d \log n)$.

New Model and Problem Definition
Because of the natural phenomenon in biology, a new type of item called inhibitor was introduced in group testing [3] and studied [17][18][19][20]. An inhibitor item causes a negative outcome for any test it is involved in. On the other hand, a test on a subset of items is positive if the subset does not contain any inhibitor and contains at least one positive.
Group testing with blocks of positives has been recently presented by Bui et al. [21], which is a generalization of group testing with consecutive positives [22][23][24][25][26][27]. In this model, input n items are linearly ordered, and all positives belong to at most k blocks of consecutive items and each block has up to d consecutive items.
Combining the two models above, we consider a novel model called group testing with blocks of positives and inhibitors. The input n items are linearly ordered. We sub-categorize the model into three models and illustrate them in Figure 1. The first model contains one block of d + h consecutive items, and that block contains exactly d positives and h inhibitors. The second model, which generalizes the first one, contains one block of D ≥ d + h consecutive items, and that block contains up to d positives and h inhibitors. The third model, which is the most general one, contains multiple blocks, say k, of consecutive items, in which each block of size up to D ≥ d + h contains up to d positives and h inhibitors. Note that the known upper bounds for k, h, and d are assumed to be obtained from previous statistics.
We formulate the three models above as follows. Sets of the form $C = \{c_1, \ldots, c_k\}$ used in this work are equipped with linear order $c_i \prec c_{i+1}$ for $1 \le i < k$. We index the population of $n$ items from 1 to $n$, namely $N = \{1, 2, \ldots, n\}$. Let $x = (x_1, \ldots, x_n)^T \in \{-1, 0, 1\}^n$ be the representation vector of the $n$ items, where $x_j = 1$ indicates that item $j$ is positive, $x_j = 0$ indicates that item $j$ is negative, and $x_j = -1$ indicates that item $j$ is inhibitory. A test on a subset of items is positive if the subset contains at least one positive and does not contain any inhibitors. Otherwise, the test outcome is negative.
The test notation is denoted as . Let $p = (p_1, \ldots, p_n) \in \{0, 1\}^n$ be the test representation vector. Then, the outcome of the test p with the input vector x, namely p x, is positive (1) if there does not exist a $j$ such that $p_j = 1$ and $x_j = -1$, and there exists a $j$ such that $p_j = 1$ and $x_j = 1$. The test outcome is negative (0) otherwise. Given a measurement matrix M of size $t \times n$ and an input vector x, the corresponding outcome vector is M x = $[y_1, \ldots, y_t]^T$, where $y_i$ = M(i, :) x.
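The test rule above can be made concrete with a small sketch. The function names `pool_outcome` and `outcome_vector` and the toy matrix are illustrative assumptions, not part of the original scheme; items are stored 0-indexed in the lists.

```python
from typing import List, Sequence


def pool_outcome(p: Sequence[int], x: Sequence[int]) -> int:
    """Outcome of one test p on input x, where x[j] is 1 (positive),
    0 (negative), or -1 (inhibitor)."""
    has_positive = any(pj == 1 and xj == 1 for pj, xj in zip(p, x))
    has_inhibitor = any(pj == 1 and xj == -1 for pj, xj in zip(p, x))
    # Positive only if some positive is present and no inhibitor is present.
    return 1 if has_positive and not has_inhibitor else 0


def outcome_vector(M: Sequence[Sequence[int]], x: Sequence[int]) -> List[int]:
    """One outcome per row (test) of the measurement matrix M."""
    return [pool_outcome(row, x) for row in M]


# Toy instance: item 2 is positive, item 3 is an inhibitor (1-indexed).
x = [0, 1, -1, 0]
M = [
    [1, 1, 0, 0],  # contains the positive only
    [0, 1, 1, 0],  # contains the positive and the inhibitor
    [1, 0, 0, 1],  # contains negatives only
]
print(outcome_vector(M, x))  # [1, 0, 0]
```

Note that a matrix with t rows yields exactly t outcomes, one per test.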
There are two common decoding types based on classification strategy. The first is to only identify the positives while the second is to identify both positives and inhibitors. Our objective is to find an efficient encoding and decoding scheme to satisfy two decoding types, i.e., minimizing the number of tests and the decoding time.

Contributions
Overview: We study group testing with blocks of positives and inhibitors and provide efficient encoding and decoding schemes to tackle it. By leveraging the knowledge of positives and inhibitors belonging to a small interval of size D, our objective is to identify the position of some positive, say j * ; then, one could claim that the indices of all positives and inhibitors must belong to the range from max{1, j * − D + 1} to min{j * + D − 1, n}. To precisely identify positives and inhibitors, appropriate tests are designed to accomplish this task.
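As a small illustration of the range argument in the overview, the following sketch computes the candidate range around a located positive and checks, by brute force, that every block of D consecutive items containing that positive lies inside it. The function name `candidate_window` and the values n = 12, D = 6 are assumptions chosen for illustration.

```python
from typing import Tuple


def candidate_window(j_star: int, D: int, n: int) -> Tuple[int, int]:
    """Range (1-indexed, inclusive) that must contain every positive and
    inhibitor once a positive at position j_star is known."""
    return max(1, j_star - D + 1), min(j_star + D - 1, n)


n, D = 12, 6
for j_star in range(1, n + 1):
    lo, hi = candidate_window(j_star, D, n)
    # Enumerate every block [start, start + D - 1] inside [1, n] that contains j_star.
    for start in range(max(1, j_star - D + 1), min(j_star, n - D + 1) + 1):
        assert lo <= start and start + D - 1 <= hi

print(candidate_window(3, D, n))   # (1, 8)
print(candidate_window(10, D, n))  # (5, 12)
```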
Our proposed scheme includes two procedures, namely the filtering and scrutinizing procedures. The tests in the filtering procedure remove most negative items and leave one or more subsets of size up to 2D that contain all positives, all inhibitors, and possibly some negatives. The tests in the scrutinizing procedure remove all negatives and then identify positives and inhibitors. The details of the two procedures are specified in accordance with each specific problem.
The contributions for a single block of (consecutive) positives and inhibitors are summarized in Theorem 1. The proofs for the results of the first and second models are described in Sections 3 and 4, respectively. The contribution for blocks of positives and inhibitors is summarized in the following theorem, which is proved later in Section 5.
Suppose that a population of $n$ items is linearly ordered and the positives and inhibitors belong to blocks of consecutive items in which each block has a size of up to $D$ and contains up to $d$ positives and $h$ inhibitors. Then, there exists a deterministic and strongly explicit measurement matrix of size $O\left(Dk^2 \left(D + \log\frac{n}{D}\right) \log\frac{n}{D}\right) \times n$ that can be used to identify all positives in O( tests to identify all positives and inhibitors in $O\left(Dk^2 \log\frac{n}{kD} \left(D + \log\frac{n}{D}\right) + k^4 D^4 \log\frac{n}{kD}\right)$ time.

Preliminaries
Disjunct matrices were first introduced by Kautz and Singleton [28] as superimposed codes and then generalized by Stinson and Wei [29] and D'yachkov et al. [30]. We later use them for identifying both positives and inhibitors. We denote M(i, :) and M(:, j) as the ith row and the jth column of matrix M. A binary t × n matrix is (n, v, u)-disjunct (Definition 1) if, for any u columns and any other v columns, there exists a row in which all of the u columns have entry 1 and all of the v columns have entry 0. Chen et al. [31] gave an upper bound on the number of rows for (n, v, u)-disjunct matrices as follows.

Theorem 3 ([31] Theorem 3.2).
For any positive integers v, u, and n with x = v + u ≤ n, there exists a t × n (n, v, u)-disjunct matrix with the following.
Once u = 1, (n, v, 1)-disjunct matrices become v-disjunct matrices. The following theorem states the construction and decoding time for a d-disjunct matrix.

Theorem 4 ([11] Theorem 16).
Let 1 ≤ d ≤ n. Then, there exists a deterministic and explicit t × n d-disjunct matrix with $t = O(d^2 \log n)$ that can be decoded in time polynomial in t.
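To make the disjunctness property tangible, here is a brute-force checker. It assumes the reading of (n, v, u)-disjunctness that matches how the matrix R is used later in the decoding argument: for any u columns and any other v columns, some row has 1 in all u columns and 0 in all v columns. The function name `is_disjunct` and the two toy matrices are illustrative; the check is exponential and only suitable for tiny examples.

```python
from itertools import combinations
from typing import Sequence


def is_disjunct(M: Sequence[Sequence[int]], v: int, u: int) -> bool:
    """Brute-force check of the assumed (n, v, u)-disjunct property."""
    n = len(M[0])
    for ones in combinations(range(n), u):
        rest = [c for c in range(n) if c not in ones]
        for zeros in combinations(rest, v):
            # Look for a row covering all of `ones` and avoiding all of `zeros`.
            if not any(
                all(row[c] == 1 for c in ones) and all(row[c] == 0 for c in zeros)
                for row in M
            ):
                return False
    return True


identity = [[1 if r == c else 0 for c in range(4)] for r in range(4)]
# One row per unordered pair of columns.
pairs = [[1 if c in pair else 0 for c in range(4)] for pair in combinations(range(4), 2)]

print(is_disjunct(identity, 3, 1))  # True: the identity is 3-disjunct
print(is_disjunct(identity, 1, 2))  # False: no row contains two items
print(is_disjunct(pairs, 2, 2))     # True
```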

Single Block of Consecutive Positives and Inhibitors
In this section, we consider the case when the positives and inhibitors are consecutive and the numbers of positives and inhibitors are known in advance. Set D = d + h.

Filtering Matrices
We create h + 2 filtering matrices in the filtering procedure as follows. Let $F = [f_1, \ldots, f_\kappa]$ be an $f \times \kappa$ (indexing) binary matrix whose $j$th column is the $f$-bit binary representation of integer $j$, where $f = \lceil \log(\kappa + 1) \rceil$. It is obvious that the index $j$ is uniquely identified by $f_j$. We then generate h + 2 binary matrices $F^{(1)}, \ldots, F^{(h+2)}$. For example, let n = 12, d = 4, and h = 2. We obtain a = (d + h)/(h + 1) = 2 and κ = n/a = 6.
In every h + 2 consecutive columns of $F^{(u)}$, there exists only one non-zero column. Therefore, it is used to "isolate" each super item in the h + 1 super items generated from the set of D positives and inhibitors. Let $y_{F^{(u)}}$ be the outcome vector obtained by using the testing matrix $F^{(u)}$ with the set of super items Ē. In particular, if $F^{(u)}(i, j) = 1$ (respectively, $F^{(u)}(i, j) = 0$), then all items in super item j belong (respectively, do not belong) to test i. Therefore, we obtain the following.
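As an illustration of the indexing matrix F introduced at the start of this subsection, the sketch below builds F for κ = 6 (the value from the example with n = 12, d = 4, h = 2) and converts a column back to its index. The helper names and the most-significant-bit-first row order are assumptions; the text only requires that column j encode the integer j.

```python
from typing import List, Sequence


def index_matrix(kappa: int) -> List[List[int]]:
    """f x kappa matrix whose j-th column (j = 1..kappa) is the f-bit binary
    representation of j, most significant bit in the first row."""
    f = kappa.bit_length()  # equals ceil(log2(kappa + 1))
    return [[(j >> (f - 1 - i)) & 1 for j in range(1, kappa + 1)] for i in range(f)]


def column_to_index(column: Sequence[int]) -> int:
    """Convert a binary column (most significant bit first) to its integer."""
    value = 0
    for bit in column:
        value = 2 * value + bit
    return value


F = index_matrix(6)
for row in F:
    print(row)
# [0, 0, 0, 1, 1, 1]
# [0, 1, 1, 0, 0, 1]
# [1, 0, 1, 0, 1, 0]

third_column = [row[2] for row in F]
print(column_to_index(third_column))  # 3
```

Because every column index from 1 to κ maps to a distinct non-zero column, a single surviving column in an outcome vector can be decoded by this conversion.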

Sanitizing Matrices
In the sanitizing procedure, the measurement matrix depends on whether the objective is to identify positives only or to identify both positives and inhibitors. For the first objective, we design an $s \times n$ matrix S such that S(i, j) = 1 if i ≡ j mod s and S(i, j) = 0 otherwise, where s = 2D − 1. In other words, each test contains items spaced 2D − 1 apart in the linear order. For example, when n = 12, d = 4, and h = 2, we have s = 11, and S consists of the 11 × 11 identity matrix followed by a twelfth column that is identical to the first column.
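The definition of S translates directly into code. The sketch below builds S for the example parameters n = 12 and D = d + h = 6 and checks that any 2D − 1 consecutive columns form a permutation matrix. The function name `sanitizing_matrix` is an assumption made for illustration.

```python
from typing import List


def sanitizing_matrix(n: int, D: int) -> List[List[int]]:
    """s x n matrix with S(i, j) = 1 iff i is congruent to j mod s, s = 2D - 1
    (rows i = 1..s and columns j = 1..n, stored 0-indexed)."""
    s = 2 * D - 1
    return [[1 if (j - i) % s == 0 else 0 for j in range(1, n + 1)] for i in range(1, s + 1)]


n, D = 12, 6
S = sanitizing_matrix(n, D)
s = 2 * D - 1

# Every window of s consecutive columns has exactly one 1 per row and per column.
for start in range(n - s + 1):
    window = [row[start:start + s] for row in S]
    assert all(sum(row) == 1 for row in window)
    assert all(sum(row[c] for row in window) == 1 for c in range(s))

print(S[0])  # [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
```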
It is straightforward that every column in S (respectively, F (u) ) is deterministic and strongly explicit because each column in it can be generated in time and space of O(2D) = O(d + h) (respectively, O(log κ)).
For the second objective, i.e., the objective of identifying both positives and inhibitors, we design an additional matrix R along with matrix S. Let R be an r × n (n, 2D − 3, 2)-disjunct matrix as defined in Definition 1. Therefore, we have $r = O(D^3 \log(n/D)) = O((d + h)^3 \log(n/(d + h)))$ as in Theorem 3.
Let y S = [y S (1), . . . , y S (s)] T (respectively, y R = [y R (1), . . . , y R (r)] T ) be the outcome vector by using the testing matrix S (respectively, R) with input set N. In particular, we have the following. y S = S x and y R = R x.

Decoding Procedure and Correctness
We first approximately locate some positive items by using outcome vectors y F (1) , . . . , y F (h+1) . Then, we can locate a set of up to 2D − 1 items that contains all positives and inhibitors. We call this set the set of interest. By using y S , we can exactly identify all positives in that set. Meanwhile, if y R is also used, all inhibitors are also identified. The details of the decoding procedure are as follows.
Let λ be an index such that $y_{F^{(\lambda)}}$ is not a zero vector. There always exists such a λ. Indeed, because there are h inhibitors among the D consecutive positives and inhibitors, and each super item contains up to D/(h + 1) items, the total number of items contained in super items having inhibitors is up to hD/(h + 1) < D. Therefore, there must exist a super item ᾱ, for some 1 ≤ α ≤ κ, that does not contain any inhibitor but contains positives. Let λ be the index such that $F^{(\lambda)}(:, \alpha) \neq 0$. Since two consecutive non-zero columns in $F^{(u)}$ are spaced by h + 2, $y_{F^{(\lambda)}}$ = $F^{(\lambda)}$ χĒ = $F^{(\lambda)}(:, \alpha)$. Therefore, to identify α, we convert the non-zero vector $y_{F^{(\lambda)}}$ into a decimal number. The indices of all positives and inhibitors, thus, must belong to the range from max{1, j* − D + 1} to min{j* + D − 1, n}. The decoding complexity of identifying λ is therefore $O(hf)$.
Because of the construction of S, a matrix composed of 2D − 1 consecutive columns in it is a permutation of a (2D − 1) × (2D − 1) identity matrix. Therefore, given the indices from max{1, α − D + 1} to min{α + D − 1, n}, one can identify which item is positive based on the corresponding outcome vector $y_S$. The decoding complexity of $y_S$ is, therefore, O(s) = O(D), because each of its s = 2D − 1 entries is inspected once.
After identifying d positives, the set of interest contains up to 2D − 1 − d = d + 2h − 1 potential inhibitors. Because of the construction of R, for any two items and other 2D − 3 items, there exists a test that contains the two items and does not contain the other 2D − 3 items. Therefore, one can decide whether a potential inhibitor is truly an inhibitor by finding a row of R that contains it and an identified positive but none of the remaining items in the set of interest. Since every item outside the set of interest is negative, the outcome of that test is negative if and only if the candidate is an inhibitor. Since there are up to 2D − 1 − d potential inhibitors and the number of rows in R is r, this procedure to identify inhibitors takes O(r(2D − 1 − d)) = O(r(d + h)).
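The positives-only step can be simulated end to end on the running example (n = 12, d = 4, h = 2, D = 6). The sketch below is a minimal illustration under assumptions: the block is placed at items 4 to 9, the position `j_star = 6` of one positive is taken as already known from the filtering step, and all function names are hypothetical.

```python
from typing import List, Sequence


def pool_outcome(p: Sequence[int], x: Sequence[int]) -> int:
    """1 iff the test contains at least one positive and no inhibitor."""
    has_pos = any(pj == 1 and xj == 1 for pj, xj in zip(p, x))
    has_inh = any(pj == 1 and xj == -1 for pj, xj in zip(p, x))
    return 1 if has_pos and not has_inh else 0


def sanitizing_matrix(n: int, D: int) -> List[List[int]]:
    """S(i, j) = 1 iff i is congruent to j mod s, with s = 2D - 1."""
    s = 2 * D - 1
    return [[1 if (j - i) % s == 0 else 0 for j in range(1, n + 1)] for i in range(1, s + 1)]


def decode_positives(y_S: Sequence[int], j_star: int, D: int, n: int) -> List[int]:
    """Read off the positives inside the set of interest around j_star."""
    s = 2 * D - 1
    lo, hi = max(1, j_star - D + 1), min(j_star + D - 1, n)
    # Item j is the only item of the set of interest in test ((j - 1) mod s) + 1.
    return [j for j in range(lo, hi + 1) if y_S[(j - 1) % s] == 1]


n, D = 12, 6
# Items 4..9 form the block: positives at 4, 6, 7, 9 and inhibitors at 5, 8.
x = [0, 0, 0, 1, -1, 1, 1, -1, 1, 0, 0, 0]
y_S = [pool_outcome(row, x) for row in sanitizing_matrix(n, D)]
print(decode_positives(y_S, j_star=6, D=D, n=n))  # [4, 6, 7, 9]
```

Inhibitors and negatives both produce a 0 in y_S here, which is why the additional matrix R is needed to tell them apart.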

Decoding Complexity and Number of Tests
As analyzed in the previous section, to identify the positives only, the number of required tests and the decoding complexity are as follows.
To identify both the positives and inhibitors, i.e., classify all items, the required number of tests is as follows.
The corresponding decoding complexity is as follows. O

Single Block of Positives and Inhibitors
In this section, we consider the case when the positives and inhibitors are not necessarily consecutive but belong to a small block (set) of consecutive items of size up to D ≥ d + h, where d and h are the maximum numbers of positives and inhibitors in the population of n items.
In the encoding procedure, we use the same techniques as in Section 3.1 but adjust some parameters. In the filtering procedure, we set a = 1, i.e., every super item reduces to an item. Therefore, κ is equal to n. Moreover, we create D filtering matrices, i.e., h + 1 is replaced by D. In the sanitizing procedure, the parameter s in the s × n matrix S is set to be 2D − 1. Matrix R is an r × n (n, 2D − 3, 2)-disjunct matrix as defined in Definition 1. Therefore, we have $r = O(D^3 \log(n/D))$ as in Theorem 3.
Since the decoding procedure and the proofs of correctness are the same as in Section 3.2, we only focus on the required numbers of tests and the decoding complexities. Each matrix $F^{(u)}$ has a size of f × n, where f = log n, for u = 1, . . . , D.
The numbers of tests in matrices S and R are s = 2D − 1 and r = O(D 3 log(n/D)), respectively.
To identify the positives only, the number of required tests and the decoding complexity are as follows.
To identify both the positives and inhibitors, the required number of tests is described as follows.
The corresponding decoding complexity is as follows.

Blocks of Positives and Inhibitors
In this section, we consider a model consisting of multiple blocks of positives and inhibitors, in which all positives and inhibitors belong to at most k special blocks of consecutive items and each block has up to D consecutive items. Moreover, each special block contains up to d positives and up to h inhibitors.

Encoding Procedure
We generate D sets from the set of n items N as follows. For u = 1, . . . , D, let $N^{(u)} = \{u, D + u, 2D + u, \ldots\}$ be the set of all items j ≤ n with j ≡ u mod D, and let $x^{(u)} = (x_u, x_{D+u}, x_{2D+u}, \ldots)^T$ be the corresponding sub-vector of x. It is obvious that $n_u = |N^{(u)}| \le \lceil n/D \rceil$. Since each special block has up to D items and there are up to k special blocks, each set $N^{(u)}$ must contain up to k positives and inhibitors in total. Moreover, for each special block τ, there exists an index $u_\tau$ such that some positive item in that block belongs to $N^{(u_\tau)}$ because two consecutive items in $N^{(u)}$ are spaced apart by D and each special block has up to D items.
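The splitting of N into D interleaved sets can be sketched as follows. The function name `residue_classes` and the values n = 12, D = 5 are assumptions for illustration; the loop verifies that a window of D consecutive items meets each set at most once, which is the fact used to bound the number of positives and inhibitors per set by k.

```python
import math
from typing import Dict, List


def residue_classes(n: int, D: int) -> Dict[int, List[int]]:
    """N^(u) = all items j <= n with j congruent to u mod D, for u = 1..D."""
    return {u: list(range(u, n + 1, D)) for u in range(1, D + 1)}


n, D = 12, 5
classes = residue_classes(n, D)
print(classes[1])  # [1, 6, 11]
print(classes[3])  # [3, 8]

# Each set has at most ceil(n / D) items.
assert all(len(items) <= math.ceil(n / D) for items in classes.values())

# Any D consecutive items intersect each set in at most one item.
for start in range(1, n - D + 2):
    block = set(range(start, start + D))
    assert all(len(block & set(items)) <= 1 for items in classes.values())
```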
Let $M^{(u)}$ be an $m_u \times n_u$ k-disjunct matrix. We then obtain $m_u = O(k^2 \log n_u) = O(k^2 \log(n/D))$ as in Theorem 4. Let $B^{(u)}$ be a $b \times n_u$ index matrix whose $j$th column is $B^{(u)}_j := \begin{bmatrix} b_j \\ \bar{b}_j \end{bmatrix}$ for $j = 1, 2, \ldots, n_u$, where $b = 2 \log n_u$, $b_j$ is the $\log n_u$-bit binary representation of integer $j - 1$, and $\bar{b}_j$ is the complement of $b_j$. Item $j$ is characterized by column $B^{(u)}_j$, and the weight of every column in $B^{(u)}$ is $b/2 = \log n_u$. Furthermore, the index $j$ is uniquely identified by $b_j$. For example, if we set $n_u = 8$, then $b = 2 \log n_u = 6$, and the matrix in (4) becomes the 6 × 8 matrix whose $j$th column stacks the 3-bit binary representation of $j - 1$ on top of its complement.
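The index matrix with complemented columns can be generated as in the sketch below, shown for the example size of 8 columns. The function name, the most-significant-bit-first ordering, and the rounding up of the logarithm for sizes that are not powers of two are assumptions made for illustration.

```python
from typing import List


def complemented_index_matrix(n_u: int) -> List[List[int]]:
    """b x n_u matrix whose j-th column stacks the binary representation of
    j - 1 (most significant bit first) on top of its bitwise complement."""
    w = max(1, (n_u - 1).bit_length())  # ceil(log2 n_u) for n_u >= 2
    top = [[((j - 1) >> (w - 1 - i)) & 1 for j in range(1, n_u + 1)] for i in range(w)]
    bottom = [[1 - bit for bit in row] for row in top]
    return top + bottom


B = complemented_index_matrix(8)
for row in B:
    print(row)
# [0, 0, 0, 0, 1, 1, 1, 1]
# [0, 0, 1, 1, 0, 0, 1, 1]
# [0, 1, 0, 1, 0, 1, 0, 1]
# [1, 1, 1, 1, 0, 0, 0, 0]
# [1, 1, 0, 0, 1, 1, 0, 0]
# [1, 0, 1, 0, 1, 0, 1, 0]

# Every column has weight b / 2 = 3.
assert all(sum(row[c] for row in B) == 3 for c in range(8))
```

The fixed column weight is what later allows a single positive to be distinguished from several positives by counting the positive outcomes.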
Finally, matrices S and R are defined as in Section 3.1.2. Note that D is not set to be d + h here.
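We are now ready to generate a filtering matrix and a scrutinizing matrix. The filtering matrix corresponding to matrix $M^{(u)}$ is as follows:

$$F^{(u)} = \begin{bmatrix} M^{(u)}(1, :) \\ B^{(u)} \times \mathrm{diag}(M^{(u)}(1, :)) \\ \vdots \\ M^{(u)}(m_u, :) \\ B^{(u)} \times \mathrm{diag}(M^{(u)}(m_u, :)) \end{bmatrix},$$

where diag(·) is a diagonal matrix generated by the input vector. The vector observed after performing the tests given by the measurement matrix $F^{(u)}$ consists, for $i = 1, 2, \ldots, m_u$, of the entry $y^{(u)}_i$, which is the outcome of test $M^{(u)}(i, :)$ on $x^{(u)}$, followed by the vector $\mathbf{y}^{(u)}_i$, which collects the outcomes of the b tests $B^{(u)} \times \mathrm{diag}(M^{(u)}(i, :))$ on $x^{(u)}$. Entry $y^{(u)}_i$ indicates whether there exist only negatives and positives in that test.
If the answer is yes, vector $\mathbf{y}^{(u)}_i$ tells us whether there exists only one positive or more than one positive.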
Let expand($M^{(u)}(i, :)$) be a vector of length n derived from $M^{(u)}(i, :)$ as follows: for any $j \in N^{(u)}$ with $M^{(u)}(i, j) = 1$, every entry in expand($M^{(u)}(i, :)$) indexed from max{j − D + 1, 1} to min{j + D − 1, n} is set to be 1. This vector is used to identify a block of 2D − 1 consecutive items that contains at least one positive item. In particular, to identify positives only, the scrutinizing matrix corresponding to matrix $M^{(u)}$ is defined as follows:

$$\begin{bmatrix} S \times \mathrm{diag}(\mathrm{expand}(M^{(u)}(1, :))) \\ \vdots \\ S \times \mathrm{diag}(\mathrm{expand}(M^{(u)}(m_u, :))) \end{bmatrix},$$

where S is defined in Section 3.1.2, and the outcome vector obtained by using this matrix consists of the vectors $s^{(u)}_i$ = S (diag(expand($M^{(u)}(i, :)$)) × x), for $i = 1, 2, \ldots, m_u$. To identify both positives and inhibitors, an additional scrutinizing (n, kD − 2, 2)-disjunct matrix R is used. Let r be the outcome vector by using this matrix.
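The expand operation can be sketched as below. The sketch assumes that a row of the k-disjunct matrix is indexed by the elements u, u + D, u + 2D, ... of the corresponding interleaved set and that entries not covered by any range stay 0; the function name and the values n = 12, D = 3, u = 2 are illustrative.

```python
from typing import List, Sequence


def expand(row: Sequence[int], u: int, D: int, n: int) -> List[int]:
    """Length-n 0/1 vector: for each selected item j = u + pos * D of the
    interleaved set, mark items max(j - D + 1, 1) .. min(j + D - 1, n)."""
    out = [0] * n
    for pos, bit in enumerate(row):
        if bit == 1:
            j = u + pos * D  # item index (1-based) in the full population
            for t in range(max(j - D + 1, 1), min(j + D - 1, n) + 1):
                out[t - 1] = 1
    return out


n, D, u = 12, 3, 2  # the interleaved set is {2, 5, 8, 11}
print(expand([0, 1, 0, 0], u, D, n))  # [0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
print(expand([1, 0, 0, 1], u, D, n))  # [1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1]
```

Each selected item contributes a range of at most 2D − 1 consecutive items centered on it, so any block of size up to D that contains the item is fully covered.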

Decoding Procedure and Correctness
For each u ∈ {1, . . . , D}, we first scan each $y^{(u)}$ to locate some positive item in some block. The decoding procedure is as follows. First, find 1 ≤ i ≤ $m_u$ such that $y^{(u)}_i = 1$ and $|\mathbf{y}^{(u)}_i| = \log n_u$. Then, similarly to the arguments in Section 3.2, since any matrix composed of 2D − 1 consecutive columns in S is a permutation of a (2D − 1) × (2D − 1) identity matrix, one can identify which item is positive based on the corresponding outcome vector $s^{(u)}_i$. Finally, all inhibitors in a block can be identified by using $r^{(u)}_i$.
Such i always exists in the first step. Indeed, as proved in Section 5.1, for each special block τ, there exists an index $u_\tau$ such that some positive item in that block belongs to $N^{(u_\tau)}$. Since each set $N^{(u_\tau)}$ contains up to k positives and inhibitors and $M^{(u_\tau)}$ is a k-disjunct matrix, there must exist row i such that $M^{(u_\tau)}(i, :)$ contains only that positive. Therefore, $y^{(u_\tau)}_i = 1$ and $|\mathbf{y}^{(u_\tau)}_i| = \log n_{u_\tau}$. Conversely, when $y^{(u)}_i = 1$, there must exist at least one positive item in $N^{(u)}$ in test $M^{(u)}(i, :)$. Moreover, since $\mathbf{y}^{(u)}_i$ is the outcome of $B^{(u)}$ on diag($M^{(u)}(i, :)$) × $x^{(u)}$ and every column in $B^{(u)}$ has weight of $\log n_u$, there must exist only one positive item in $N^{(u)}$ in that test. Otherwise, $|\mathbf{y}^{(u)}_i| > \log n_u$, because two distinct columns of $B^{(u)}$ differ in at least one bit, so their union has weight larger than $\log n_u$. Let α be the index of that positive item. In the second step, the indices of positives and inhibitors in its block then range from max{1, α − D + 1} to min{α + D − 1, n}.
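The first decoding step can be illustrated with a small simulation on one interleaved set of 8 items. This is a sketch under assumptions: the three candidate rows are hand-picked rather than taken from an actual k-disjunct matrix, the helper names are hypothetical, and the returned position is the 1-based position inside the interleaved set rather than the index in the full population.

```python
from typing import List, Optional, Sequence


def pool_outcome(p: Sequence[int], x: Sequence[int]) -> int:
    """1 iff the test contains at least one positive and no inhibitor."""
    has_pos = any(pj == 1 and xj == 1 for pj, xj in zip(p, x))
    has_inh = any(pj == 1 and xj == -1 for pj, xj in zip(p, x))
    return 1 if has_pos and not has_inh else 0


def complemented_index_matrix(n_u: int) -> List[List[int]]:
    """Column j stacks the bits of j - 1 on top of their complement."""
    w = max(1, (n_u - 1).bit_length())
    top = [[((j - 1) >> (w - 1 - i)) & 1 for j in range(1, n_u + 1)] for i in range(w)]
    return top + [[1 - bit for bit in row] for row in top]


def locate_single_positive(row: Sequence[int], x_u: Sequence[int],
                           B: Sequence[Sequence[int]]) -> Optional[int]:
    """Return the position of the unique positive in the test `row`, or None
    if the test has an inhibitor, no positive, or more than one positive."""
    w = len(B) // 2
    if pool_outcome(row, x_u) != 1:
        return None
    # Outcomes of the sub-tests B x diag(row): each keeps only items in `row`.
    y_vec = [pool_outcome([b & r for b, r in zip(b_row, row)], x_u) for b_row in B]
    if sum(y_vec) != w:  # weight above w signals more than one positive
        return None
    return int("".join(str(bit) for bit in y_vec[:w]), 2) + 1


B = complemented_index_matrix(8)
x_u = [0, 0, 1, 0, 0, -1, 0, 1]  # positives at positions 3 and 8, inhibitor at 6

print(locate_single_positive([1, 1, 1, 1, 0, 0, 0, 0], x_u, B))  # 3
print(locate_single_positive([0, 0, 1, 0, 0, 0, 0, 1], x_u, B))  # None (two positives)
print(locate_single_positive([0, 0, 1, 0, 0, 1, 0, 0], x_u, B))  # None (inhibitor present)
```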
Because of the construction of vector expand($M^{(u)}(i, :)$), every item indexed from max{1, α − D + 1} to min{α + D − 1, n} is present in the characteristic vector diag(expand($M^{(u)}(i, :)$)) × x. Therefore, $s^{(u)}_i$ can be used to identify the positives in this range. In the last step, since R is an (n, kD − 2, 2)-disjunct matrix, for any block of positives and inhibitors, there exists a row such that it contains only a positive and an inhibitor. That inhibitor is thus identified. This procedure takes $O(k \times r(2D - 1)) = O(k^4 D^4 \log(n/(kD)))$.

Decoding Complexity and Number of Tests
There are D deterministic and strongly explicit matrices $F^{(u)}$ in the filtering procedure. Since each has $m_u(1 + b) = O(k^2 \log n_u \times \log n_u) = O(k^2 \log^2(n/D))$ tests, the total number of tests in the filtering procedure is $O(Dk^2 \log^2(n/D))$. The decoding complexity by using these tests is also $O(Dk^2 \log^2(n/D))$.

Comparison
We compare our proposed schemes with existing schemes, namely Ganesan et al. [32], Chang et al. [33], and Bui et al. [20] in Table 1. There are eight criteria to consider here. The first four criteria are about the structure of the population of n items. They are the number of blocks, the number of items in a block, the number of positives (in a block if applicable), and the number of inhibitors (in a block if applicable). The fifth criterion is the decoding type. The sixth is the construction type, which describes how measurement matrices can be achieved. The seventh and the last are the number of tests and the decoding time.
Consider the decoding type as "positives only." The construction type in our proposed schemes for the first and the second model, i.e., the number of blocks is one, is deterministic and strongly explicit. They are better than the schemes proposed by Chang et al. [33]
Table 1. Comparison with existing schemes (only a fragment of one row survives: Bui et al. [20]; Det., strongly explicit; Not applicable; d; d).

Potential Applications
Bruno et al. [34] addressed a group testing-based solution in genetic mapping and sequencing. In this application, the authors consider linear DNA, which consists of consecutive segments of the DNA. Each segment is placed in a pool, called clones, in an order consistent with the order of their appearance in the linear DNA. A collection of such clones is called a linear DNA library. From this point, we can ask where segments (clones) of interest are in the linear DNA library [35]. The segments of interest here can be considered as positives and other segments can be considered as negatives. A pool that contains at least one segment of interest returns a positive outcome when performing testing and returns a negative outcome otherwise.
We extend the application above to a potential application as follows. Given a linear DNA library, we would like to find segments of DNA that express a certain biological property and segments of DNA that inhibit the segments expressing a certain biological property. The first and second types of segments of interest are considered as positives and inhibitors, respectively, while the remaining segments are considered as negatives. Because of the nature of DNA, an inhibitor is usually close to positives. Therefore, the blocks of positives and inhibitors model can be used to identify both positives and inhibitors.

Conclusions
In this paper, we presented efficient encoding and decoding procedures to identify positives and/or inhibitors in a single block of (consecutive) positives and inhibitors or in blocks of positives and inhibitors. The numbers of tests and the decoding times in our proposed schemes are usually smaller than those in existing works. An extension of this work to other settings in group testing such as threshold group testing or complex group testing is still an open problem.