2.1. Parameterized Complexity
One of the decisions made when developing fixed-parameter algorithms is the choice of the parameter. Natural parameters occurring in the problem definition of Pattern-Guided k-Anonymity are the number n of rows, the number m of columns, the alphabet size $|\Sigma|$, the number p of pattern vectors, the anonymity degree k, and the cost bound s. In general, the number of rows will arguably be large and, thus, the cost bound s also tends to be large. Since fixed-parameter algorithms are fast only when the parameter is small, trying to exploit these two parameters tends to be of little use in realistic scenarios. However, analyzing the adult dataset [22] prepared as described by Machanavajjhala et al. [23], it turns out that some of the other mentioned parameters are small: The dataset has $m=9$ columns, and the alphabet size is 73. Furthermore, it is natural to assume that the number of pattern vectors is also not that large. Indeed, compared to the $n=32,561$ rows, even the number of all possible pattern vectors, ${2}^{9}=512$, is relatively small. Finally, there are applications where k, the degree of anonymity, is small [24]. Summarizing, we can state that fixed-parameter tractability with respect to the parameters $|\Sigma|$, m, k, or p could be of practical relevance. Unfortunately, by reducing from 3-Set Cover, we can show that Pattern-Guided k-Anonymity is NP-hard even in very restricted cases.
Theorem 1. Pattern-Guided k-Anonymity is NP-complete even for two pattern vectors, three columns, and $k=3$.
Proof. We reduce from the NP-hard 3-Set Cover [25]: Given a set family $\mathcal{F}=\{{S}_{1},\dots ,{S}_{\alpha}\}$ with $|{S}_{i}|=3$ over a universe $U=\{{u}_{1},\dots ,{u}_{\beta}\}$ and a positive integer h, the task is to decide whether there is a subfamily ${\mathcal{F}}^{\prime}\subseteq \mathcal{F}$ of size at most h such that ${\bigcup}_{S\in {\mathcal{F}}^{\prime}}S=U$. In the reduction, we need unique entries in the constructed input matrix M. For ease of notation, we introduce the ▵-symbol with an unusual semantics: Each occurrence of a ▵-symbol stands for a different unique symbol in the alphabet $\Sigma$. One could informally state this as “$▵\ne ▵$”. We now describe the construction. Let $(\mathcal{F},U,h)$ be the 3-Set Cover instance. We construct an equivalent instance $(M,P,k,s)$ of Pattern-Guided k-Anonymity as follows: Initialize M and P as empty matrices. Then, for each element ${u}_{i}\in U$, add the row $({u}_{i},▵,▵)$ twice to the input matrix M. For each set ${S}_{i}\in \mathcal{F}$ with ${S}_{i}=\{{u}_{a},{u}_{b},{u}_{c}\}$, add to M the three rows $({u}_{a},{S}_{i},{S}_{i})$, $({u}_{b},{S}_{i},{S}_{i})$, and $({u}_{c},{S}_{i},{S}_{i})$. Finally, set $k=3$, $s=4|U|+3|\mathcal{F}|+3h$, and add to P the pattern vectors (□, ☆, ☆) and (☆, □, □).
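For concreteness, the construction can be sketched in a few lines of Python (our illustration, not part of the original proof; the generator `fresh` realizes the ▵-semantics by drawing a new unique symbol for every occurrence):

```python
from itertools import count

def reduce_3sc(sets, universe, h):
    """Build the Pattern-Guided k-Anonymity instance (M, P, k, s)
    from a 3-Set Cover instance (F, U, h)."""
    fresh = (f"#{i}" for i in count())       # every draw is one ▵-occurrence
    M = []
    for u in universe:                       # two rows (u, ▵, ▵) per element
        M.append((u, next(fresh), next(fresh)))
        M.append((u, next(fresh), next(fresh)))
    for name, S in sets.items():             # three rows (u, S_i, S_i) per set
        for u in sorted(S):
            M.append((u, name, name))
    P = [("□", "☆", "☆"), ("☆", "□", "□")]
    return M, P, 3, 4 * len(universe) + 3 * len(sets) + 3 * h

M, P, k, s = reduce_3sc({"S1": {"u1", "u2", "u3"}}, ["u1", "u2", "u3"], h=1)
# |M| = 2·|U| + 3·|F| = 9 rows; s = 4·3 + 3·1 + 3·1 = 18
```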
We show the correctness of the above construction by proving that $(\mathcal{F},U,h)$ is a yes-instance of 3-Set Cover if and only if $(M,P,3,s)$ is a yes-instance of Pattern-Guided k-Anonymity.
“⇒:” If $(\mathcal{F},U,h)$ is a yes-instance of 3-Set Cover, then there exists a set cover ${\mathcal{F}}^{\prime}$ of size at most h. We suppress the following entries in M: First, suppress all ▵-entries in M. This gives $4|U|$ suppressions. Then, for each ${S}_{i}\in {\mathcal{F}}^{\prime}$, suppress all ${S}_{i}$-entries in M. This gives $6|{\mathcal{F}}^{\prime}|$ suppressions. Finally, for each ${S}_{j}\notin {\mathcal{F}}^{\prime}$, suppress the first column of all rows containing the entry ${S}_{j}$. These are $3(|\mathcal{F}|-|{\mathcal{F}}^{\prime}|)$ suppressions. Let ${M}^{\prime}$ denote the matrix with the suppressed entries. Note that ${M}^{\prime}$ contains $4|U|+3|\mathcal{F}|+3|{\mathcal{F}}^{\prime}|\le s$ suppressed entries. Furthermore, in each row of ${M}^{\prime}$, either the first entry is suppressed or the last two entries are. Hence, each row of ${M}^{\prime}$ matches one of the two pattern vectors of P. Finally, observe that ${M}^{\prime}$ is 3-anonymous: The three rows corresponding to a set ${S}_{j}\notin {\mathcal{F}}^{\prime}$ are identical: the first column is suppressed, and the next two columns contain the symbol ${S}_{j}$. Since ${\mathcal{F}}^{\prime}$ is a set cover, for each element ${u}_{j}$ there exists a set ${S}_{i}\in {\mathcal{F}}^{\prime}$ such that ${u}_{j}\in {S}_{i}$. Thus, by construction, the two rows corresponding to the element ${u}_{j}$ and the row $({u}_{j},{S}_{i},{S}_{i})$ in M coincide in ${M}^{\prime}$: The first column contains the entry ${u}_{j}$, and the other two columns are suppressed. Finally, for each row $({u}_{i},{S}_{j},{S}_{j})$ in M that corresponds to a set ${S}_{j}\in {\mathcal{F}}^{\prime}$, the row in ${M}^{\prime}$ coincides with the two rows corresponding to the element ${u}_{i}$: Again, the first column contains the entry ${u}_{i}$, and the other two columns are suppressed.
“⇐:” If $(M,P,3,s)$ is a yes-instance of Pattern-Guided k-Anonymity, then there is a 3-anonymous matrix ${M}^{\prime}$ that is obtained from M by suppressing at most s entries, and each row of ${M}^{\prime}$ matches one of the two pattern vectors in P. Since M and, thus, ${M}^{\prime}$ contain $2|U|+3|\mathcal{F}|$ rows, ${M}^{\prime}$ contains at most $s=4|U|+3|\mathcal{F}|+3h$ suppressions, and each pattern vector contains a ☆-symbol, there are at most $2|U|+3h$ rows in ${M}^{\prime}$ containing two suppressions and at least $3|\mathcal{F}|-3h$ rows containing one suppression. Furthermore, since the $2|U|$ rows in M corresponding to the elements of U contain the unique symbol ▵ in the last two columns, these rows are suppressed in the last two columns in ${M}^{\prime}$. Thus, at most $3h$ rows corresponding to sets of $\mathcal{F}$ have two suppressions in ${M}^{\prime}$. Observe that for each set ${S}_{i}\in \mathcal{F}$, the entries in the last two columns of the corresponding rows are ${S}_{i}$, and there is no other occurrence of this entry in M. Hence, the at least $3|\mathcal{F}|-3h$ rows in ${M}^{\prime}$ with one suppression correspond to at least $|\mathcal{F}|-h$ sets of $\mathcal{F}$. Thus, the at most $3h$ rows in ${M}^{\prime}$ that correspond to sets of $\mathcal{F}$ and contain two suppressions correspond to at most h sets of $\mathcal{F}$. Denote these sets by ${\mathcal{F}}^{\prime}$. We now show that ${\mathcal{F}}^{\prime}$ is a set cover for the 3-Set Cover instance. Assume towards a contradiction that ${\mathcal{F}}^{\prime}$ is not a set cover, and hence, there is an element $u\in U\setminus ({\bigcup}_{S\in {\mathcal{F}}^{\prime}}S)$. However, since ${M}^{\prime}$ is 3-anonymous, there has to be a row r in ${M}^{\prime}$ that corresponds to some set ${S}_{i}$ such that this row coincides with the two rows ${r}_{1}^{u}$ and ${r}_{2}^{u}$ corresponding to u.
Since all rows in ${M}^{\prime}$ corresponding to elements of U contain two suppressions in the last two columns, the row r also contains two suppressions in the last two columns. Thus, ${S}_{i}\in {\mathcal{F}}^{\prime}$. Furthermore, r has to coincide with ${r}_{1}^{u}$ and ${r}_{2}^{u}$ in the first column, that is, the first entry of r is the symbol u. Hence, $u\in {S}_{i}$, a contradiction. ☐
Blocki and Williams [7] showed that, while 3-Anonymity is NP-complete [7,8], 2-Anonymity is polynomial-time solvable: they reduced it in polynomial time to the polynomial-time solvable Simplex Matching [26], defined as follows:
Simplex Matching
Input: A hypergraph $H=(V,E)$ with hyperedges of size two and three, a positive integer h, and a cost function $cost: E\to \mathbb{N}$ such that:
(1) $\{u,v,w\}\in E\Rightarrow \{u,v\},\{v,w\},\{u,w\}\in E$, and
(2) $cost(\{u,v\})+cost(\{v,w\})+cost(\{u,w\})\le 2\cdot cost(\{u,v,w\})$.
Question: Is there a subset of the hyperedges ${E}^{\prime}\subseteq E$ such that for all $v\in V$, there is exactly one edge in ${E}^{\prime}$ containing v and ${\sum}_{e\in {E}^{\prime}}cost(e)\le h$?
We slightly adjust their reduction to obtain polynomial-time solvability for Pattern-Guided 2-Anonymity, which, together with Theorem 1, yields a complexity dichotomy for Pattern-Guided k-Anonymity with respect to the parameter k.
Theorem 2. Pattern-Guided 2-Anonymity is polynomial-time solvable.
Proof. We reduce Pattern-Guided 2-Anonymity to Simplex Matching. To this end, we first introduce some notation. Let $(M,P,2,s)$ be the Pattern-Guided 2-Anonymity instance. For a set $A=\{{r}_{1},\dots ,{r}_{\ell}\}$ of rows and a pattern vector p in P, the set $A(p)$ is obtained from A by suppressing entries in the rows of A such that each row matches p (see Definition 2). The set $P(A)$ contains all pattern vectors p such that $A(p)$ is a set of identical rows. Intuitively, $P(A)$ contains all “suitable” pattern vectors to make the rows in A identical.
Now, construct the hypergraph $H=(V,E)$ as follows: Initialize $V=\varnothing$ and $E=\varnothing$. For each row r in M, add a vertex ${v}_{r}$ to V. For a vertex subset ${V}^{\prime}\subseteq V$, let $M({V}^{\prime})$ be the set of the corresponding rows in M. For each vertex subset ${V}^{\prime}\subseteq V$ of size $2\le |{V}^{\prime}|\le 3$, add the hyperedge ${V}^{\prime}$ if $P(M({V}^{\prime}))\ne \varnothing$. Let $p({V}^{\prime})$ be a pattern vector in $P(M({V}^{\prime}))$ with the minimum number of ☆-symbols, and denote this number of ☆-symbols of $p({V}^{\prime})$ by ℓ. Then, set $cost({V}^{\prime})=\ell \cdot |{V}^{\prime}|$. Note that this is exactly the cost to “anonymize” the rows in $M({V}^{\prime})$ with the pattern vector $p({V}^{\prime})$. Finally, set the cost bound $h=s$. This completes the construction.
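To make the notation concrete, $A(p)$, $P(A)$, and the resulting edge cost can be computed directly; the following Python sketch is our own illustration (the helper names are ours, not the paper's):

```python
def suppress(row, p):
    """A(p) for a single row: each ☆-position of the pattern is suppressed."""
    return tuple("*" if c == "☆" else x for x, c in zip(row, p))

def suitable(rows, patterns):
    """P(A): all pattern vectors making the rows of A identical."""
    return [p for p in patterns if len({suppress(r, p) for r in rows}) == 1]

def edge_cost(rows, patterns):
    """cost(V') = ℓ·|V'| with ℓ the minimum ☆-count over P(M(V'));
    None signals that no hyperedge is added."""
    cand = suitable(rows, patterns)
    return min(p.count("☆") for p in cand) * len(rows) if cand else None

patterns = [("□", "☆"), ("☆", "☆")]
# two rows differing only in the second column: one ☆ suffices, cost = 1·2
assert edge_cost([("a", "b"), ("a", "c")], patterns) == 2
```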
First, we show that Conditions (1) and (2) are fulfilled. Clearly, as each pattern vector that makes some row set A identical also makes each subset of A identical, it follows for any ${V}^{\prime \prime}\subseteq {V}^{\prime}\subseteq V$ that $P(M({V}^{\prime}))\subseteq P(M({V}^{\prime \prime}))$. In particular, whenever the hyperedge $\{u,v,w\}$ is added, its three subedges are added as well. Hence, Condition (1) is fulfilled. Furthermore, it follows that $2\cdot cost(\{u,v,w\})\ge 3\cdot cost(\{u,v\})$ for each $\{u,v,w\}\in E$, implying:
$cost(\{u,v\})+cost(\{v,w\})+cost(\{u,w\})\le 3\cdot \tfrac{2}{3}\cdot cost(\{u,v,w\})=2\cdot cost(\{u,v,w\}).$
Thus, Condition (2) is fulfilled.
Observe that the construction can easily be performed in polynomial time. Hence, it remains to show that $(M,P,2,s)$ is a yes-instance of Pattern-Guided 2-Anonymity if and only if $(H,s,cost)$ is a yes-instance of Simplex Matching.
“⇒:” Let ${M}^{\prime}$ be a 2-anonymous matrix obtained from M by suppressing at most s entries such that each row of ${M}^{\prime}$ matches a pattern vector in P. Let $\mathcal{R}$ be the set of all row types in ${M}^{\prime}$. We construct a matching ${E}^{\prime}$ for H as follows: First, partition the rows of each row type such that each part contains two or three rows; since ${M}^{\prime}$ is 2-anonymous, each row type contains at least two rows, so such a partition exists. For each part Q, add to ${E}^{\prime}$ the set of the vertices corresponding to the rows in Q. By construction, the cost bound is satisfied, and all vertices are matched.
“⇐:” Let ${E}^{\prime}\subseteq E$ be a matching, and let $e\in {E}^{\prime}$. Recall that $M(e)$ denotes the set of rows corresponding to the vertices in e. By construction, $P(M(e))\ne \varnothing$. We construct ${M}^{\prime}$ from M by suppressing, for each $e\in {E}^{\prime}$, entries in the rows $M(e)$ such that they match $p(e)$. Observe that ${M}^{\prime}$ is 2-anonymous, as each hyperedge contains at least two vertices, and each row matches a pattern vector. Furthermore, by construction, there are at most s suppressions in ${M}^{\prime}$. Thus, $(M,P,2,s)$ is a yes-instance. ☐
Contrasting the general intractability result of Theorem 1, we will show fixed-parameter tractability with respect to the combined parameter $(|\Sigma|,m)$. To this end, we additionally use as a parameter the number t of different input rows. Indeed, we show fixed-parameter tractability with respect to the combined parameter $(t,p)$. This implies fixed-parameter tractability with respect to the combined parameter $(|\Sigma|,m)$, as $|\Sigma|^{m}\ge t$ and $|\Sigma|^{m}\ge {2}^{m}\ge p$. This results from an adaptation of combinatorial algorithms from previous work [13,27].
Before presenting the algorithm, we introduce some notation. We distinguish between the input row types of the input matrix M and the output row types of the matrix ${M}^{\prime}$. Note that, in the beginning, we can compute the input row types of M in $O(nm)$ time using a trie [28], but the output row types are unknown. By the definition of Pattern-Guided k-Anonymity, each output row type ${R}^{\prime}$ has to match a pattern vector $v\in P$. We call ${R}^{\prime}$ an instance of v.
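For illustration, grouping into input row types can equally be done with a hash map, which also gives expected $O(nm)$ time; this sketch is ours, not part of the algorithm description:

```python
from collections import Counter

def input_row_types(M):
    """Map each distinct row (an input row type) to its multiplicity n_i."""
    return Counter(map(tuple, M))

types = input_row_types([("a", "b"), ("a", "b"), ("c", "d")])
# t = 2 input row types, with sizes n_1 = 2 and n_2 = 1
```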
Theorem 3. Pattern-Guided k-Anonymity can be solved in $O({2}^{tp}\cdot {t}^{6}{p}^{5}\cdot m+nm)$ time, where p is the number of pattern vectors and t is the number of different rows in the input matrix M.
Proof. We present an algorithm running in two phases:
Phase 1: Guess, for each possible output row type, whether it is used in ${M}^{\prime}$. Denote by $\mathcal{R}$ the set of all output row types in ${M}^{\prime}$ according to the guessing result.
Phase 2: Check whether there exists a k-anonymous matrix ${M}^{\prime}$ that can be obtained from M by suppressing at most s entries such that ${M}^{\prime}$ respects the guessing result of Phase 1; that is, the set of row types in ${M}^{\prime}$ is exactly $\mathcal{R}$.
As to Phase 1, observe that the number of possible output row types is at most $t\cdot p$: For each pattern vector, there exist at most t different instances, one for each input row type. Hence, Phase 1 can be realized by simply trying all ${2}^{t\cdot p}$ possibilities. Phase 2, in turn, can be computed in polynomial time using the so-called Row Assignment problem [27]. To this end, we introduce ${T}_{\mathrm{in}}:=\{1,\dots ,t\}$ and ${T}_{\mathrm{out}}:=\{1,\dots ,r\}$, where r is the number of used output row types according to the guessing result of Phase 1; formally, $r=|\mathcal{R}|$. With this notation, we can state Row Assignment.
Row Assignment
Input: Nonnegative integers k, s, ${\omega}_{1},\dots ,{\omega}_{r}$, and ${n}_{1},\dots ,{n}_{t}$ with ${\sum}_{i=1}^{t}{n}_{i}=n$, and a function $h:{T}_{\mathrm{in}}\times {T}_{\mathrm{out}}\to \{0,1\}$.
Question: Is there a function $g:{T}_{\mathrm{in}}\times {T}_{\mathrm{out}}\to \{0,\dots ,n\}$ such that:
$g(i,j)\le h(i,j)\cdot n$ for all $i\in {T}_{\mathrm{in}}$ and $j\in {T}_{\mathrm{out}}$, (1)
$k\le {\sum}_{i=1}^{t}g(i,j)$ for all $j\in {T}_{\mathrm{out}}$, (2)
${\sum}_{j=1}^{r}g(i,j)={n}_{i}$ for all $i\in {T}_{\mathrm{in}}$, and (3)
${\sum}_{i=1}^{t}{\sum}_{j=1}^{r}g(i,j)\cdot {\omega}_{j}\le s$? (4)
We now discuss how we use Row Assignment to solve Phase 2. The function h captures the guessing result of Phase 1: If the input row type i is “compatible” with the output row type j, then $h(i,j)=1$; otherwise, $h(i,j)=0$. Here, an input row type R is compatible with an output row type ${R}^{\prime}$ if the rows in both row types are identical in the non-☆-positions or, equivalently, if any row of R can be made identical to any row of ${R}^{\prime}$ by just replacing entries with the ☆-symbol. The integer ${\omega}_{j}$ is set to the number of ☆-symbols in the ${j}^{\text{th}}$ output row type ${R}_{j}$ in $\mathcal{R}$; that is, ${\omega}_{j}$ captures the cost of “assigning” a compatible row of M to ${R}_{j}$. In ${n}_{i}$, the size (number of rows) of the ${i}^{\text{th}}$ input row type is stored. The integers with the same names in Row Assignment and Pattern-Guided k-Anonymity store the same values.
Next, we show that solving Row Assignment indeed correctly realizes Phase 2: Since the output row types of ${M}^{\prime}$ are given by Phase 1, it remains to specify how many rows each output row type contains such that ${M}^{\prime}$ can be obtained from M by suppressing at most s entries and ${M}^{\prime}$ is k-anonymous. Due to Phase 1, it is clear that each row in ${M}^{\prime}$ matches a pattern vector in P. To ensure that ${M}^{\prime}$ can be obtained from M by suppressing entries, we “assign” rows of M to compatible output row types. Herein, assigning means suppressing the entries in the particular row such that the modified row belongs to the particular output row type. This assigning is captured by the function g: The number of rows from the input row type ${R}_{i}$ that are assigned to the output row type ${R}_{j}$ is $g(i,j)$. Inequality (1) ensures that we only assign rows to compatible output row types. The k-anonymity requirement is guaranteed by Inequality (2). Equation (3) ensures that all rows of M are assigned. Finally, the cost bound is satisfied due to Inequality (4). Hence, solving Row Assignment indeed solves Phase 2.
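As a sanity check, the four conditions (compatibility, k-anonymity, completeness of the assignment, and the cost bound) can be verified directly for a candidate assignment g. The following Python sketch is our own illustration, with g and h given as nested lists indexed by input and output row types:

```python
def row_assignment_ok(g, h, n_sizes, omega, k, s):
    """Check a candidate g against the four Row Assignment conditions."""
    t, r, n = len(n_sizes), len(omega), sum(n_sizes)
    compatible = all(g[i][j] <= h[i][j] * n                      # Inequality (1)
                     for i in range(t) for j in range(r))
    anonymous = all(sum(g[i][j] for i in range(t)) >= k          # Inequality (2)
                    for j in range(r))
    complete = all(sum(g[i][j] for j in range(r)) == n_sizes[i]  # Equation (3)
                   for i in range(t))
    cheap = sum(g[i][j] * omega[j]                               # Inequality (4)
                for i in range(t) for j in range(r)) <= s
    return compatible and anonymous and complete and cheap

# one input row type of 3 rows, one compatible output type with one ☆ per row
assert row_assignment_ok([[3]], [[1]], [3], [1], k=2, s=3)
```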
Analyzing the running time, we get the following: Computing the input row types of M can be done in $O(nm)$ time. In Phase 1, the algorithm tries ${2}^{tp}$ possibilities. For each of these possibilities, we have to check which input row types are compatible with which output row types. This is clearly doable in $O(trm)$ time. Finally, Row Assignment can be solved in $O((t+r)\cdot \log(t+r)\cdot (t\cdot r+(t+r)\log(t+r)))$ time (Lemma 1 in [27]). Since $r\le tp$, we roughly upper-bound this by $O({(tp)}^{4})$. Putting all this together, we arrive at the statement of the theorem. ☐
In other words, Theorem 3 implies that Pattern-Guided k-Anonymity can be solved in linear time if t and p are constants.
2.3. Greedy Heuristic
In this section, we provide a greedy heuristic based on the ideas of the fixed-parameter algorithm of Theorem 3 presented in Section 2.1. The fixed-parameter algorithm basically performs an exhaustive search over the assignments of rows to pattern vectors. More precisely, for each row type R and each pattern vector v, it tries both possibilities of whether rows of R are assigned to v or not. Furthermore, in the ILP formulation, all assignments of rows to pattern vectors are possible. In contrast, our greedy heuristic simply picks, for each input row type R, the “cheapest” pattern vector v and then assigns all compatible rows of M to v. This is realized as follows: We consider all pattern vectors, one after the other, ordered by increasing number of ☆-symbols. This ensures that we start with the “cheapest” pattern vector. Then, we assign as many rows of M as possible to v: We consider every instance ${R}^{\prime}$ of v, and if there are at least k rows in M that are compatible with ${R}^{\prime}$, then we assign all compatible rows to ${R}^{\prime}$. Once a row is assigned, it will not be reassigned to any other output row type, and hence, the row is deleted from M. Overall, this gives a running time of $O(pnm)$. See Algorithm 1 for the pseudocode of the greedy heuristic. If at some point there are fewer than k remaining rows in M, then these rows are fully suppressed. Note that this slightly deviates from our formal definition of Pattern-Guided k-Anonymity. However, since fully suppressed rows do not reveal any data, this potential violation of the k-anonymity requirement does not matter.
Algorithm 1 Greedy Heuristic ($M,P,k$)
1: Sort pattern vectors P by cost (increasing order)
2: for each $v\in P$ do
3:    for each instance ${R}^{\prime}$ of v do
4:       if $\ge k$ rows are compatible with ${R}^{\prime}$ then
5:          Assign all compatible rows of M to ${R}^{\prime}$
6:          Delete the assigned rows from M

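A runnable version of the greedy heuristic might look as follows; this Python sketch is ours (rows are tuples, '*' marks a suppressed entry, and leftover groups of fewer than k rows are fully suppressed, as discussed above):

```python
from collections import Counter

def suppress(row, p):
    # A row assigned to pattern p gets its ☆-positions suppressed.
    return tuple("*" if c == "☆" else x for x, c in zip(row, p))

def greedy(M, P, k):
    remaining = [tuple(r) for r in M]
    out = []
    for p in sorted(P, key=lambda p: p.count("☆")):   # cheapest pattern first
        counts = Counter(suppress(r, p) for r in remaining)
        big = {inst for inst, c in counts.items() if c >= k}
        keep = []
        for r in remaining:
            inst = suppress(r, p)
            if inst in big:
                out.append(inst)      # assign r to an instance R' of p
            else:
                keep.append(r)        # r stays available for later patterns
        remaining = keep
    out += [("*",) * len(r) for r in remaining]       # fewer than k rows left
    return out

rows = greedy([("a", "b"), ("a", "c"), ("d", "e")],
              [("□", "□"), ("□", "☆"), ("☆", "☆")], k=2)
# ("a","b") and ("a","c") collapse under (□,☆); ("d","e") is fully suppressed
```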
Our greedy heuristic clearly does not always provide optimal solutions. Our experiments indicate, however, that it is very fast, that it typically provides solutions close to the optimum, and that it outperforms the Mondrian algorithm [21] on most datasets we tested. While this demonstrates the practicality of our heuristic (Algorithm 1), the following result shows that, from the viewpoint of polynomial-time approximation algorithmics, it is weak in the worst case.
Theorem 4. Algorithm 1 for Pattern-Guided k-Anonymity runs in $O(pnm)$ time and provides a factor-m approximation. This approximation bound is asymptotically tight for Algorithm 1.
Proof. Since the running time has already been discussed above, it remains to show the approximation factor. Let ${s}_{\text{heur}}$ be the number of suppressions in a solution provided by Algorithm 1 and ${s}_{\text{opt}}$ be the number of suppressions in an optimal solution. We show that for every instance, it holds that ${s}_{\text{heur}}\le m\cdot {s}_{\text{opt}}$. Let M be a matrix, ${M}_{\text{heur}}^{\prime}$ be the suppressed matrix produced by Algorithm 1, and ${M}_{\text{opt}}^{\prime}$ be the suppressed matrix corresponding to an optimal solution. First, observe that for any row in ${M}_{\text{opt}}^{\prime}$ not containing any suppressed entry, the corresponding row in ${M}_{\text{heur}}^{\prime}$ also does not contain any suppression. Clearly, each row in ${M}_{\text{heur}}^{\prime}$ has at most m entries suppressed. Thus, each row in ${M}_{\text{heur}}^{\prime}$ has at most m times as many suppressed entries as the corresponding row in ${M}_{\text{opt}}^{\prime}$ and, hence, ${s}_{\text{heur}}\le m\cdot {s}_{\text{opt}}$.
To show that this upper bound is asymptotically tight, consider the following instance. Set $k=m$, and let M be as follows: The matrix M contains k times the row with the symbol 1 in every entry. Furthermore, for each $i\in \{1,\dots ,m\}$, there are $k-1$ rows in M such that all but the ${i}^{\text{th}}$ entry contain the symbol 1. In the ${i}^{\text{th}}$ entry, each of these $k-1$ rows contains a uniquely occurring symbol. The pattern mask contains $m+2$ pattern vectors: For $i\in \{1,\dots ,m\}$, the ${i}^{\text{th}}$ pattern vector contains $m-1$ □-symbols and one ☆-symbol at the ${i}^{\text{th}}$ position. The last two pattern vectors are the all-□ and the all-☆ vector. Algorithm 1 will suppress nothing in the k all-1 rows and will suppress every entry of the remaining rows. This gives ${s}_{\text{heur}}=(k-1)\cdot {m}^{2}=(m-1)\cdot {m}^{2}$ suppressions. However, an optimal solution suppresses exactly one entry in each row: The rows containing the symbol 1 in all but the ${i}^{\text{th}}$ entry are suppressed in the ${i}^{\text{th}}$ entry. Furthermore, to ensure the anonymity requirement, in the submatrix consisting of the k rows containing the symbol 1 in every entry, the diagonal is suppressed. Thus, the number of suppressions equals the number of rows; that is, ${s}_{\text{opt}}=k+(k-1)m={m}^{2}$. Hence, ${s}_{\text{heur}}=(m-1)\cdot {s}_{\text{opt}}$. ☐
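The lower-bound instance is easy to generate; the following Python sketch is our own (the symbols `u{i}{j}` stand in for the uniquely occurring entries):

```python
def tight_instance(m):
    """The tightness instance of Theorem 4, with k = m."""
    k = m
    M = [(1,) * m] * k                        # k all-1 rows
    for i in range(m):                        # k-1 rows per column i
        for j in range(k - 1):
            row = [1] * m
            row[i] = f"u{i}{j}"               # uniquely occurring symbol
            M.append(tuple(row))
    # m one-☆ pattern vectors, plus the all-□ and the all-☆ vector
    P = [tuple("☆" if c == i else "□" for c in range(m)) for i in range(m)]
    P += [("□",) * m, ("☆",) * m]
    return M, P, k

M, P, k = tight_instance(3)
# |M| = k + (k-1)·m = 9 rows, |P| = m + 2 = 5 pattern vectors
```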