2.1. Parameterized Complexity
One of the decisions made when developing fixed-parameter algorithms is the choice of the parameter. Natural parameters occurring in the problem definition of Pattern-Guided k-Anonymity are the number n of rows, the number m of columns, the alphabet size $|\Sigma|$, the number p of pattern vectors, the anonymity degree k, and the cost bound s. In general, the number of rows will arguably be large and, thus, the cost bound s also tends to be large. Since fixed-parameter algorithms are fast when the parameter is small, trying to exploit these two parameters tends to be of little use in realistic scenarios. However, an analysis of the adult dataset [22], prepared as described by Machanavajjhala et al. [23], shows that some of the other mentioned parameters are small: The dataset has
columns, and the alphabet size is 73. Furthermore, it is natural to assume that also the number of pattern vectors is not that large. Indeed, compared to the
rows, even the number of
all possible pattern vectors
is relatively small. Finally, there are applications where k, the degree of anonymity, is small [24]. In summary, fixed-parameter tractability with respect to the parameters $|\Sigma|$, m, k, or p could be of practical relevance. Unfortunately, by reducing from 3-Set Cover, we can show that Pattern-Guided k-Anonymity is NP-hard in very restricted cases.
Theorem 1. Pattern-Guided k-Anonymity is NP-complete even for two pattern vectors, three columns, and $k = 3$.
Proof. We reduce from the NP-hard 3-Set Cover [25]: Given a set family $\mathcal{F} = \{S_1, \ldots, S_\alpha\}$ with $|S_i| = 3$ over a universe $U = \{u_1, \ldots, u_\beta\}$ and a positive integer h, the task is to decide whether there is a subfamily $\mathcal{F}' \subseteq \mathcal{F}$ of size at most h such that $\bigcup_{S \in \mathcal{F}'} S = U$. In the reduction, we need unique entries in the constructed input matrix M. For ease of notation, we introduce the ▵-symbol with an unusual semantics. Each occurrence of a ▵-symbol stands for a different unique symbol in the alphabet Σ. One could informally state this as “▵ ≠ ▵”. We now describe the construction. Let $(\mathcal{F}, U, h)$ be the 3-Set Cover instance. We construct an equivalent instance $(M, P, k, s)$ of Pattern-Guided k-Anonymity as follows: Initialize M and P as empty matrices. Then, for each element $u_i \in U$, add the row $(u_i, \triangle, \triangle)$ twice to the input matrix M. For each set $S_i \in \mathcal{F}$ with $S_i = \{u_a, u_b, u_c\}$, add to M the three rows $(u_a, S_i, S_i)$, $(u_b, S_i, S_i)$, and $(u_c, S_i, S_i)$. Finally, set $k = 3$, $s = 4|U| + 3|\mathcal{F}| + 3h$, and add to P the pattern vectors (□, ☆, ☆) and (☆, □, □).
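The construction can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `reduce_3_set_cover`, the tuple encoding of the unique ▵-symbols and of the set labels, and the cost bound $s = 4|U| + 3|\mathcal{F}| + 3h$ (reconstructed from the suppression counts in the correctness argument) are not taken verbatim from the text.

```python
from itertools import count

BOX, STAR = "□", "☆"  # □ keeps an entry, ☆ suppresses it


def reduce_3_set_cover(universe, family, h):
    """Build (M, P, k, s) from a 3-Set Cover instance (hypothetical helper)."""
    fresh = count()  # source of pairwise different ▵-symbols
    M = []
    for u in universe:
        for _ in range(2):  # each element row is added twice
            M.append((u, ("tri", next(fresh)), ("tri", next(fresh))))
    for idx, S in enumerate(family):
        assert len(S) == 3, "every set is assumed to have exactly three elements"
        label = ("set", idx)  # symbol standing for the set itself
        for u in sorted(S):
            M.append((u, label, label))
    P = [(BOX, STAR, STAR), (STAR, BOX, BOX)]
    k = 3
    # Assumed cost bound, matching the counts 4|U| + 6h + 3(|F| - h).
    s = 4 * len(universe) + 3 * len(family) + 3 * h
    return M, P, k, s
```

For a universe of six elements, three sets, and h = 2, the sketch produces 2 · 6 + 3 · 3 = 21 rows over three columns.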
We show the correctness of the above construction by proving that $(\mathcal{F}, U, h)$ is a yes-instance of 3-Set Cover if and only if $(M, P, k, s)$ is a yes-instance of Pattern-Guided k-Anonymity.
“⇒:” If $(\mathcal{F}, U, h)$ is a yes-instance of 3-Set Cover, then there exists a set cover $\mathcal{F}' \subseteq \mathcal{F}$ of size at most h. We suppress the following elements in M: First, suppress all ▵-entries in M. This gives $4|U|$ suppressions. Then, for each $S_i \in \mathcal{F}'$, suppress all $S_i$-entries in M. This gives $6|\mathcal{F}'|$ suppressions. Finally, for each $S_i \notin \mathcal{F}'$, suppress the first column of all rows containing the entry $S_i$. These are $3(|\mathcal{F}| - |\mathcal{F}'|)$ suppressions. Let $M'$ denote the matrix with the suppressed elements. Note that $M'$ contains $4|U| + 3|\mathcal{F}| + 3|\mathcal{F}'| \le s$ suppressed entries. Furthermore, in each row in $M'$, either the first element is suppressed or the last two elements. Hence, each row of $M'$ matches one of the two pattern vectors of P. Finally, observe that $M'$ is 3-anonymous: The three rows corresponding to a set $S_i \notin \mathcal{F}'$ are identical: the first column is suppressed, and the next two columns contain the symbol $S_i$. Since $\mathcal{F}'$ is a set cover, there exists for each element $u \in U$ a set $S_i \in \mathcal{F}'$ such that $u \in S_i$. Thus, by construction, the two rows corresponding to the element u and the row $(u, S_i, S_i)$ in M coincide in $M'$: The first column contains the entry u, and the other two columns are suppressed. Finally, for each row $(u, S_i, S_i)$ in M that corresponds to a set $S_i \in \mathcal{F}'$, the row in $M'$ coincides with the two rows corresponding to the element u: Again, the first column contains the entry u, and the other two columns are suppressed.
“⇐:” If $(M, P, k, s)$ is a yes-instance of Pattern-Guided k-Anonymity, then there is a 3-anonymous matrix $M'$ that is obtained from M by suppressing at most s elements, and each row of $M'$ matches one of the two pattern vectors in P. Since M and, thus, $M'$ contain $2|U| + 3|\mathcal{F}|$ rows, $M'$ contains at most $s = 4|U| + 3|\mathcal{F}| + 3h$ suppressions, and each pattern vector contains a ☆-symbol, there are at most $2|U| + 3h$ rows in $M'$ containing two suppressions and at least $3|\mathcal{F}| - 3h$ rows containing one suppression. Furthermore, since the rows in M corresponding to the elements of U contain the unique symbol ▵ in the last two columns, these rows are suppressed in the last two columns in $M'$. Thus, at most $3h$ rows corresponding to sets of $\mathcal{F}$ have two suppressions in $M'$. Observe that for each set $S_i \in \mathcal{F}$ the entries in the last two columns of the corresponding rows are $S_i$. There is no other occurrence of this entry in M. Hence, the at least $3|\mathcal{F}| - 3h$ rows in $M'$ with one suppression correspond to at least $|\mathcal{F}| - h$ sets in $\mathcal{F}$. Thus, the at most $3h$ rows in $M'$ that correspond to sets of $\mathcal{F}$ and contain two suppressions correspond to at most h sets of $\mathcal{F}$. Denote these h sets by $\mathcal{F}'$. We now show that $\mathcal{F}'$ is a set cover for the 3-Set Cover instance. Assume by contradiction that $\mathcal{F}'$ is not a set cover, and hence, there is an element $u \in U \setminus \bigcup_{S \in \mathcal{F}'} S$. However, since $M'$ is 3-anonymous, there has to be a row r in $M'$ that corresponds to some set $S_i \in \mathcal{F}$ such that this row coincides with the two rows $r_1$ and $r_2$ corresponding to u. Since all rows in $M'$ corresponding to elements of U contain two suppressions in the last two columns, the row r also contains two suppressions in the last two columns. Thus, $S_i \in \mathcal{F}'$. Furthermore, r has to coincide with $r_1$ and $r_2$ in the first column, that is, r contains as the entry in the first column the symbol u. Hence, $u \in S_i$, a contradiction. ☐
Blocki and Williams [7] showed that, while 3-Anonymity is NP-complete [7,8], 2-Anonymity is polynomial-time solvable, by reducing it in polynomial time to the polynomial-time solvable Simplex Matching problem [26], defined as follows:
Simplex Matching
- Input: A hypergraph $H = (V, E)$ with hyperedges of size two and three, a positive integer h, and a cost function $c \colon E \to \mathbb{N}$ such that:
  1. $\{u, v, w\} \in E \Rightarrow \{u, v\}, \{v, w\}, \{u, w\} \in E$, and
  2. $c(\{u, v\}) + c(\{v, w\}) + c(\{u, w\}) \le 2 \cdot c(\{u, v, w\})$.
- Question: Is there a subset of the hyperedges $E' \subseteq E$ such that for all $v \in V$ there is exactly one edge in $E'$ containing v and $\sum_{e \in E'} c(e) \le h$?
We slightly adjust their reduction to obtain polynomial-time solvability for Pattern-Guided 2-Anonymity. Together with Theorem 1, this yields a complexity dichotomy for Pattern-Guided k-Anonymity with respect to the parameter k.
Theorem 2. Pattern-Guided 2-Anonymity is polynomial-time solvable.
Proof. We reduce Pattern-Guided 2-Anonymity to Simplex Matching. To this end, we first introduce some notation. Let $(M, P, s)$ be the Pattern-Guided 2-Anonymity instance. For a set A of rows and a pattern vector p in P, the set $A(p)$ is obtained from A by suppressing entries in the rows of A such that each row matches p (see Definition 2). The set $P(A)$ contains all pattern vectors p such that $A(p)$ is a set of identical rows. Intuitively, $P(A)$ contains all “suitable” pattern vectors to make the rows in A identical.
Now, construct the hypergraph $H = (V, E)$ as follows: Initialize $V = \emptyset$ and $E = \emptyset$. For each row r in M, add a vertex $v_r$ to V. For a vertex subset $V' \subseteq V$, let $M(V')$ be the set of the corresponding rows in M. For each vertex subset $V' \subseteq V$ of size $2 \le |V'| \le 3$, add the hyperedge $V'$ if $P(M(V')) \ne \emptyset$. Let $p(V')$ be a pattern vector in $P(M(V'))$ with the minimum number of ☆-symbols. Denote this number of ☆-symbols of $p(V')$ by ℓ. Then, set $c(V') = \ell \cdot |V'|$. Note that this is exactly the cost to “anonymize” the rows in $M(V')$ with the pattern vector $p(V')$. Finally, set the cost bound $h = s$. This completes the construction.
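The hypergraph construction can be illustrated with the following Python sketch. It is not the authors' implementation: the names `suitable_patterns` and `build_simplex_matching_instance`, the representation of hyperedges as tuples of row indices, and the edge cost ℓ · |V'| (the number of ☆-symbols of the cheapest suitable pattern vector times the number of rows) are assumptions made for illustration.

```python
from itertools import combinations

BOX, STAR = "□", "☆"


def suitable_patterns(rows, patterns):
    """Pattern vectors that make all given rows identical after suppression."""
    result = []
    for p in patterns:
        masked = {tuple(STAR if q == STAR else x for x, q in zip(r, p)) for r in rows}
        if len(masked) == 1:
            result.append(p)
    return result


def build_simplex_matching_instance(M, P, s):
    """Return (vertices, cost, h); cost maps each hyperedge to its cost."""
    vertices = list(range(len(M)))  # one vertex per row of M
    cost = {}
    for size in (2, 3):
        for edge in combinations(vertices, size):
            suitable = suitable_patterns([M[i] for i in edge], P)
            if suitable:  # add the hyperedge only if some pattern fits
                ell = min(p.count(STAR) for p in suitable)
                cost[edge] = ell * size
    return vertices, cost, s  # the cost bound h is set to s
```

Enumerating all pairs and triples of rows takes cubic time in the number of rows, which is polynomial but would need pruning for large matrices.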
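First, we show that Conditions 1 and 2 are fulfilled. Clearly, as each pattern vector that makes some row set A identical also makes each subset of A identical, it follows that for any $V' \subseteq V$ and any $V'' \subseteq V'$, it holds that $P(M(V')) \subseteq P(M(V''))$. Hence, Condition 1 is fulfilled. Furthermore, it follows that $c(\{u, v\}) \le \frac{2}{3} \cdot c(\{u, v, w\})$ for each $\{u, v, w\} \in E$, implying:
$$c(\{u, v\}) + c(\{v, w\}) + c(\{u, w\}) \le 3 \cdot \tfrac{2}{3} \cdot c(\{u, v, w\}) = 2 \cdot c(\{u, v, w\}).$$
Thus, Condition 2 is fulfilled.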
Observe that the construction can be easily performed in polynomial time. Hence, it remains to be shown that $(M, P, s)$ is a yes-instance of Pattern-Guided 2-Anonymity if and only if $(H, c, h)$ is a yes-instance of Simplex Matching.
“⇒:” Let $M'$ be a 2-anonymous matrix obtained from M by suppressing at most s elements such that each row of $M'$ matches a pattern vector in P. Let $\mathcal{R}$ be the set of all row types in $M'$. We construct a matching $E'$ for H as follows: First, partition the rows in each row type such that each part contains two or three rows. For each part Q, add to $E'$ the set of the vertices corresponding to the rows in Q. By construction, the cost bound is satisfied, and all vertices are matched.
“⇐:” Let $E' \subseteq E$ be a matching, and let $e \in E'$. Recall that $M(e)$ denotes the set of rows corresponding to the vertices in e. By construction, $P(M(e)) \ne \emptyset$. We construct $M'$ from M by suppressing, for each $e \in E'$, entries in the rows $M(e)$ such that they match $p(e)$. Observe that $M'$ is 2-anonymous, and each row matches a pattern vector. Furthermore, by construction, there are at most s suppressions in $M'$. Thus, $(M, P, s)$ is a yes-instance. ☐
Contrasting the general intractability result of Theorem 1, we will show fixed-parameter tractability with respect to the combined parameter $(|\Sigma|, m)$. To this end, we additionally use as a parameter the number t of different input rows. Indeed, we show fixed-parameter tractability with respect to the combined parameter $(t, p)$. This implies fixed-parameter tractability with respect to the combined parameter $(|\Sigma|, m)$, as $t \le |\Sigma|^m$ and $p \le 2^m$. This results from an adaptation of combinatorial algorithms from previous work [13,27].
Before presenting the algorithm, we introduce some notation. We distinguish between the input row types of the input matrix M and the output row types of the matrix $M'$. Note that in the beginning, we can compute the input row types of M in $O(nm)$ time using a trie [28], but the output row types are unknown. By the definition of Pattern-Guided k-Anonymity, each output row type $R'$ has to match a pattern vector $v \in P$. We call $R'$ an instance of v.
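To make the notions of input row type and instance concrete, here is a minimal Python sketch. It is an illustration only: it groups rows with a hash-based `Counter` rather than the trie mentioned above, and the helper names `input_row_types`, `instance_of`, and `possible_output_row_types` are hypothetical.

```python
from collections import Counter

BOX, STAR = "□", "☆"


def input_row_types(M):
    """Map each distinct row of M to the number of times it occurs."""
    return Counter(tuple(row) for row in M)


def instance_of(row, pattern):
    """Suppress the entries of `row` at the ☆-positions of `pattern`."""
    return tuple(STAR if q == STAR else x for x, q in zip(row, pattern))


def possible_output_row_types(M, P):
    """All instances of pattern vectors arising from some input row type."""
    types = input_row_types(M)
    return {instance_of(r, v) for v in P for r in types}
```

Since every pattern vector yields at most one instance per input row type, the returned set has at most t · p elements, where t is the number of distinct rows and p the number of pattern vectors.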
Theorem 3. Pattern-Guided k-Anonymity can be solved in time, where p is the number of pattern vectors and t is the number of different rows in the input matrix M.
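Proof. We present an algorithm running in two phases:
- Phase 1: Guess for each possible output row type whether it is used in $M'$. Denote with $\mathcal{Z}$ the set of all output row types in $M'$ according to the guessing result.
- Phase 2: Check whether there exists a k-anonymous matrix $M'$ that can be obtained from M by suppressing at most s elements, such that $M'$ respects the guessing result in Phase 1; that is, the set of row types in $M'$ is exactly $\mathcal{Z}$.
As to Phase 1, observe that the number of possible output row types is at most $t \cdot p$: For each pattern vector, there exist at most t different instances, one for each input row type. Hence, Phase 1 can be realized by simply trying all $2^{tp}$ possibilities. In contrast, Phase 2 can be computed in polynomial time using the so-called Row Assignment problem [27]. To this end, we introduce $T_{\text{in}} = \{1, \ldots, t\}$ and $T_{\text{out}} = \{1, \ldots, r\}$, where r is the number of used output row types according to the guessing result of Phase 1, formally, $r = |\mathcal{Z}|$. With this notation, we can state Row Assignment.
Row Assignment
- Input: Nonnegative integers k, s, $\omega_1, \ldots, \omega_r$, and $n_1, \ldots, n_t$ with $\sum_{i=1}^{t} n_i = n$, and a function $h \colon T_{\text{in}} \times T_{\text{out}} \to \{0, 1\}$.
- Question: Is there a function $g \colon T_{\text{in}} \times T_{\text{out}} \to \{0, \ldots, n\}$ such that:
  $h(i, j) \cdot n \ge g(i, j)$ for all $i \in T_{\text{in}}$, $j \in T_{\text{out}}$ (1)
  $\sum_{i=1}^{t} g(i, j) \ge k$ for all $j \in T_{\text{out}}$ (2)
  $\sum_{j=1}^{r} g(i, j) = n_i$ for all $i \in T_{\text{in}}$ (3)
  $\sum_{i=1}^{t} \sum_{j=1}^{r} g(i, j) \cdot \omega_j \le s$ (4)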
We now discuss how we use Row Assignment to solve Phase 2. The function h captures the guessing in Phase 1: If the input row type i is “compatible” with the output row type j, then $h(i, j) = 1$; otherwise, $h(i, j) = 0$. Here, an input row type R is compatible with an output row type $R'$ if the rows in both row types are identical in the non-☆-positions or, equivalently, if any row of R can be made identical to any row of $R'$ by just replacing entries with the ☆-symbol. The integer $\omega_j$ is set to the number of stars in the output row type $R'_j$ in $\mathcal{Z}$; that is, $\omega_j$ captures the cost of “assigning” a compatible row of M to $R'_j$. In $n_i$, the size (number of rows) of the input row type $R_i$ is stored. The integers with the same names in Row Assignment and Pattern-Guided k-Anonymity also store the same values.
Next, we show that solving Row Assignment indeed correctly realizes Phase 2: Since the output row types of $M'$ are given from Phase 1, it remains to specify how many rows each output row type contains, such that $M'$ can be obtained from M by suppressing at most s entries, and $M'$ is k-anonymous. Due to Phase 1, it is clear that each row in $M'$ matches a pattern vector in P. To ensure that $M'$ can be obtained from M by suppressing entries, we “assign” rows of M to compatible output row types. Herein, assigning a row means suppressing entries in that row such that the modified row belongs to the particular output row type. This assignment is captured by the function g: The number of rows from the input row type $R_i$ that are assigned to the output row type $R'_j$ is $g(i, j)$. Inequality (1) ensures that we only assign rows to compatible output row types. The k-anonymity requirement is guaranteed by Inequality (2). Equation (3) ensures that all rows of M are assigned. Finally, the cost bound is satisfied due to Inequality (4). Hence, solving Row Assignment indeed solves Phase 2.
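The role of the four constraints can be illustrated by a small checker that verifies a candidate assignment g. This is a sketch under assumptions: the exact form of constraints (1) to (4) is reconstructed from their prose description above, and the function name `is_valid_row_assignment` as well as the list-of-lists encoding of g and h are hypothetical. It only verifies a given g; it does not implement the polynomial-time algorithm from [27].

```python
def is_valid_row_assignment(g, h, sizes, omega, k, s):
    """Check a candidate g against the (assumed) constraints (1)-(4).

    g[i][j]  : number of rows of input row type i assigned to output type j
    h[i][j]  : 1 if input type i is compatible with output type j, else 0
    sizes[i] : number of rows of input row type i
    omega[j] : number of ☆-symbols in output row type j
    """
    t, r = len(sizes), len(omega)
    n = sum(sizes)
    # (1) rows are assigned only to compatible output row types
    if any(g[i][j] > h[i][j] * n for i in range(t) for j in range(r)):
        return False
    # (2) every used output row type receives at least k rows
    if any(sum(g[i][j] for i in range(t)) < k for j in range(r)):
        return False
    # (3) every row of every input row type is assigned
    if any(sum(g[i]) != sizes[i] for i in range(t)):
        return False
    # (4) the total number of suppressions respects the cost bound
    return sum(g[i][j] * omega[j] for i in range(t) for j in range(r)) <= s
```

For instance, with two input row types of sizes 2 and 1 that are both compatible with a single output row type containing one ☆-symbol, assigning all three rows to it is valid for k = 3 and s = 3, but not for s = 2.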
Analyzing the running time, we get the following: Computing the input row types in M can be done in $O(nm)$ time. In Phase 1, the algorithm tries $2^{tp}$ possibilities. For each of these possibilities, we have to check which input row types are compatible with which output row types. This is clearly doable in
time. Finally,
Row Assignment can be solved in
(Lemma 1 in [
27]). Since
, we roughly upper-bound this by
. Putting all this together, we arrive at the statement of the theorem. ☐
In other words, Theorem 3 implies that Pattern-Guided k-Anonymity can be solved in linear time if t and p are constants.
2.3. Greedy Heuristic
In this section, we provide a greedy heuristic based on the ideas of the fixed-parameter algorithm of Theorem 3 presented in Section 2.1. The fixed-parameter algorithm basically does an exhaustive search on the assignment of rows to pattern vectors. More precisely, for each row type R and each pattern vector v, it tries both possibilities of whether rows of R are assigned to v or not. Furthermore, in the ILP formulation, all assignments of rows to pattern vectors are possible. In contrast, our greedy heuristic just picks for each input row type R the “cheapest” pattern vector v and then assigns all compatible rows of M to v. This is realized as follows: We consider all pattern vectors, one after the other, ordered by increasing number of ☆-symbols. This ensures that we start with the “cheapest” pattern vector. Then, we assign as many rows of M as possible to the current pattern vector v: We consider every instance $R'$ of v, and if there are at least k rows in M that are compatible with $R'$, then we assign all compatible rows to $R'$. Once a row is assigned, it will not be reassigned to any other output row type, and hence, the row will be deleted from M. Overall, this gives a running time of
. See Algorithm 1 for the pseudocode of the greedy heuristic. If at some point there are fewer than k remaining rows in M, then these rows will be fully suppressed. Note that this slightly deviates from our formal definition of Pattern-Guided k-Anonymity. However, since fully suppressed rows do not reveal any data, this potential violation of the k-anonymity requirement does not matter.
Algorithm 1 Greedy Heuristic ($M$, $P$, $k$)
Sort pattern vectors P by cost (increasing order)
for each $v \in P$ do
    for each instance $R'$ of v do
        if $\ge k$ rows are compatible with $R'$ then
            Assign all compatible rows of M to $R'$
            Delete the assigned rows from M.
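A possible Python rendering of Algorithm 1 is sketched below. It is an illustration, not the authors' code: the function names `apply_pattern` and `greedy_anonymize` are hypothetical, rows and pattern vectors are assumed to be tuples, an instance is considered usable when at least k remaining rows are compatible with it, and leftover rows are fully suppressed as described in the text.

```python
BOX, STAR = "□", "☆"


def apply_pattern(row, pattern):
    """Suppress the entries of `row` at the ☆-positions of `pattern`."""
    return tuple(STAR if q == STAR else x for x, q in zip(row, pattern))


def greedy_anonymize(M, P, k):
    """Return the suppressed rows in the same order as the rows of M."""
    out = [None] * len(M)
    remaining = set(range(len(M)))  # indices of rows not yet assigned
    # Consider pattern vectors by increasing number of ☆-symbols.
    for v in sorted(P, key=lambda p: p.count(STAR)):
        # Group the remaining rows by the instance of v they are compatible with.
        groups = {}
        for i in remaining:
            groups.setdefault(apply_pattern(M[i], v), []).append(i)
        for instance, members in groups.items():
            if len(members) >= k:
                for i in members:
                    out[i] = instance
                remaining.difference_update(members)  # "delete" assigned rows
    # Rows that never reached a group of size k are fully suppressed.
    for i in remaining:
        out[i] = (STAR,) * len(M[i])
    return out
```

Grouping rows by their masked form finds, for each instance of v, exactly the rows compatible with it, so no explicit enumeration of instances is needed.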
Our greedy heuristic clearly does not always provide optimal solutions. Our experiments indicate, however, that it is very fast, that it typically provides solutions close to the optimum, and that it outperforms the Mondrian algorithm [21] on most datasets we tested. While this demonstrates the practicality of our heuristic (Algorithm 1), the following result shows that from the viewpoint of polynomial-time approximation algorithmics, it is weak in the worst case.
Theorem 4. Algorithm 1 for Pattern-Guided k-Anonymity runs in time and provides a factor m-approximation. This approximation bound is asymptotically tight for Algorithm 1.
Proof. Since the running time is already discussed above, it remains to show the approximation factor. Let $s_{\text{greedy}}$ be the number of suppressions in a solution provided by Algorithm 1 and $s_{\text{opt}}$ be the number of suppressions in an optimal solution. We show that for every instance, it holds that $s_{\text{greedy}} \le m \cdot s_{\text{opt}}$. Let M be a matrix, $M_{\text{greedy}}$ be the suppressed matrix produced by Algorithm 1, and $M_{\text{opt}}$ be the suppressed matrix corresponding to an optimal solution. First, observe that for any row in $M_{\text{opt}}$ not containing any suppressed entry, it follows that the corresponding row in $M_{\text{greedy}}$ also does not contain any suppression. Clearly, each row in $M_{\text{greedy}}$ has at most m entries suppressed. Thus, each row in $M_{\text{greedy}}$ has at most m times more suppressed entries than the corresponding row in $M_{\text{opt}}$ and, hence, $s_{\text{greedy}} \le m \cdot s_{\text{opt}}$.
To show that this upper bound is asymptotically tight, consider the following instance. Set $k = m$, and let M be as follows: The matrix M contains k times the row with the symbol 1 in every entry. Furthermore, for each $i \in \{1, \ldots, m\}$, there are $k - 1$ rows in M such that all but the $i$-th entry contain the symbol 1. In the $i$-th entry, each of the $k - 1$ rows contains a uniquely occurring symbol. The pattern mask contains $m + 2$ pattern vectors: For $i \in \{1, \ldots, m\}$, the $i$-th pattern vector contains $m - 1$ □-symbols and one ☆-symbol at the $i$-th position. The last two pattern vectors are the all-□ and all-☆ vectors. Algorithm 1 will suppress nothing in the k all-1 rows and will suppress every entry of the remaining rows. This gives $s_{\text{greedy}} = m \cdot (k - 1) \cdot m = m^2 (m - 1)$ suppressions. However, an optimal solution suppresses in each row exactly one entry: The rows containing in all but the $i$-th entry the symbol 1 are suppressed in the $i$-th entry. Furthermore, to ensure the anonymity requirement, in the submatrix with the k rows containing the symbol 1 in every entry, the diagonal is suppressed. Thus, the number of suppressions is equal to the number of rows; that is, $s_{\text{opt}} = k + m(k - 1) = m^2$. Hence, $s_{\text{greedy}} / s_{\text{opt}} = m - 1$. ☐