Algorithms 2013, 6(4), 678-701; https://doi.org/10.3390/a6040678

Article
Pattern-Guided k-Anonymity
Institut für Softwaretechnik und Theoretische Informatik, TU Berlin, Berlin, 10587, Germany
*
Author to whom correspondence should be addressed.
Received: 2 August 2013; in revised form: 9 October 2013 / Accepted: 9 October 2013 / Published: 17 October 2013

Abstract
We suggest a user-oriented approach to combinatorial data anonymization. A data matrix is called k-anonymous if every row appears at least k times—the goal of the NP-hard k-Anonymity problem then is to make a given matrix k-anonymous by suppressing (blanking out) as few entries as possible. Building on previous work and coping with corresponding deficiencies, we describe an enhanced k-anonymization problem called Pattern-Guided k-Anonymity, where the users specify in which combinations suppressions may occur. In this way, the user of the anonymized data can express the differing importance of various data features. We show that Pattern-Guided k-Anonymity is NP-hard. We complement this by a fixed-parameter tractability result based on a “data-driven parameterization” and, based on this, develop an exact integer linear program (ILP)-based solution method, as well as a simple, but very effective, greedy heuristic. Experiments on several real-world datasets show that our heuristic easily matches up to the established “Mondrian” algorithm for k-Anonymity in terms of the quality of the anonymization and outperforms it in terms of running time.
Keywords:
NP-hardness; parameterized complexity; integer linear programming; exact algorithms; heuristics; experiments

1. Introduction

Making a matrix k-anonymous, that is, each row has to occur at least k times, is a classic model for (combinatorial) data privacy [1,2]. We omit considerations on the also very popular model of “differential privacy” [3], which has a more statistical than a combinatorial flavor. It is well-known that there are certain weaknesses of the k-anonymity concept, for example, when the anonymized data is used multiple times [1]. Here, we focus on k-anonymity, which, due to its simplicity and good interpretability, continues to be of interest in current applications. The idea behind k-anonymity is that each row of the matrix represents an individual, and the k-fold appearance of the corresponding row shall avoid the situation in which the person or object behind can be identified. To reach this goal, clearly, some information loss has to be accepted, that is, some entries of the matrix have to be suppressed (blanked out); in this way, information about certain attributes (represented by the columns of the matrix) is lost. Thus, the natural goal is to minimize this loss of information when transforming an arbitrary data matrix into a k-anonymous one. The corresponding optimization problem k-Anonymity is NP-hard (even in special cases) and hard to approximate [4,5,6,7,8]. Nevertheless, it played a significant role in many applications, thereby mostly relying on heuristic approaches for making a matrix k-anonymous [2,9,10].
It was observed that care has to be taken concerning the “usefulness” (also in terms of expressiveness) of the anonymized data [11,12]. Indeed, depending on the application that has to work on the k-anonymized data, certain entry suppressions may “hurt” less than others. For instance, considering medical data records, the information about eye color may be less informative than information about blood pressure. Hence, it would be useful for the user of the anonymized data to specify information that may help in doing the anonymization process in a more sophisticated way. Thus, in recent work [13], we proposed a “pattern-guided” approach to data anonymization, in a way that allows the user to specify which combinations of attributes are less harmful to suppress than others. More specifically, the approach allows “pattern vectors”, which may be considered as blueprints for the structure of anonymized rows—each row has to be matched with exactly one of the pattern vectors (we will become more precise about this when formally introducing our new model). The corresponding proposed optimization problem [13], however, has the clear weakness that each pattern vector can only be used once, disallowing that there are different incarnations of the very same anonymization pattern. While this might be useful for the clustering perspective of the problem [13], we see no reason to justify this constraint from the viewpoint of data privacy. This leads us to proposing a modified model, whose usefulness for practical data anonymization tasks is supported by experiments on real-world data and comparison with a known k-anonymization algorithm.
Altogether, with our new model, we can improve both on k-Anonymity by letting the data user influence the anonymization process, as well as on the previous model [13] by allowing full flexibility for the data user to influence the anonymization process. Notably, the previous model is more suitable for homogeneous team formation instead of data anonymization [14].
An extended abstract of this work appeared in the Proceedings of the Joint Conference of the 7th International Frontiers of Algorithmics Workshop and the 9th International Conference on Algorithmic Aspects of Information and Management (FAW-AAIM ’13), Dalian, China, Volume 7924, of Lecture Notes in Computer Science, pages 350–361, June, 2013. © Springer. This full version contains all proof details and an extended experimental section. Furthermore, we provide the new result that Pattern-Guided 2-Anonymity is polynomial-time solvable (Theorem 2).
Formal Introduction of the New Model. 
A row type is a maximal set of identical rows of a matrix.
Definition 1. 
(k-anonymous [15,16,17]) A matrix is k-anonymous if every row type contains at least k rows in the matrix, that is, for every row in the matrix, one can find at least $k - 1$ other identical rows.
Matrices are made k-anonymous by suppressing some of their entries. Formally, suppressing an entry $M[i,j]$ of an $n \times m$ matrix $M$ over alphabet $\Sigma$ with $1 \le i \le n$ and $1 \le j \le m$ means to simply replace $M[i,j] \in \Sigma$ by the new symbol “☆”, ending up with a matrix over the alphabet $\Sigma \cup \{☆\}$.
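As a minimal illustration of Definition 1 and of entry suppression, the following Python sketch counts row types and tests k-anonymity. The names `STAR`, `suppress`, and `is_k_anonymous` are illustrative assumptions, not part of the original implementation, and positions are 0-based here, whereas the text uses 1-based indices.

```python
from collections import Counter

STAR = "*"  # assumed stand-in for the suppression symbol


def suppress(matrix, positions):
    """Return a copy of matrix with the entries at the given (row, column) positions set to STAR."""
    result = [list(row) for row in matrix]
    for i, j in positions:
        result[i][j] = STAR
    return [tuple(row) for row in result]


def is_k_anonymous(matrix, k):
    """A matrix is k-anonymous if every row type contains at least k rows."""
    row_types = Counter(tuple(row) for row in matrix)
    return all(count >= k for count in row_types.values())


M = [("a", "x"), ("a", "y"), ("b", "z"), ("b", "z")]
M_suppressed = suppress(M, [(0, 1), (1, 1)])  # blank out the second entry of rows 0 and 1
print(is_k_anonymous(M, 2))             # False
print(is_k_anonymous(M_suppressed, 2))  # True
```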
Our central enhancement of the k-Anonymity model lies in the user-specific pattern mask guiding the anonymization process: Every row in the k-anonymous output matrix has to conform to one of the given pattern vectors. Note that both the input table and the given patterns mathematically are matrices, but we use different terms to more easily distinguish between them: the “pattern mask” consists of “pattern vectors”, and the “input matrix” consists of “rows”.
Definition 2. 
A row $r$ in a matrix $M \in (\Sigma \cup \{☆\})^{n \times m}$ matches a pattern vector $v \in \{□, ☆\}^{m}$ if and only if $\forall\, 1 \le i \le m \colon r[i] = ☆ \iff v[i] = ☆$, that is, $r$ and $v$ have ☆-symbols at the same positions.
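Definition 2 can be sketched in a few lines of Python. The encodings `STAR` and `BOX` and the helper `apply_pattern` are assumptions introduced for illustration only.

```python
STAR = "*"  # assumed encoding of the star symbol (suppressed entry)
BOX = "#"   # assumed encoding of the box symbol (entry is kept)


def matches(row, pattern):
    """Definition 2: row and pattern carry STAR at exactly the same positions."""
    return len(row) == len(pattern) and all(
        (entry == STAR) == (mark == STAR) for entry, mark in zip(row, pattern)
    )


def apply_pattern(row, pattern):
    """Suppress the entries of an unsuppressed row at the STAR positions of pattern."""
    return tuple(STAR if mark == STAR else entry for entry, mark in zip(row, pattern))


row = ("34", "nurse", "F")
pattern = (BOX, STAR, BOX)
print(matches(row, pattern))                          # False
print(apply_pattern(row, pattern))                    # ('34', '*', 'F')
print(matches(apply_pattern(row, pattern), pattern))  # True
```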
With these definitions, we can now formally define our central computational problem. The decisive difference with respect to our previous model [13] is that in our new model, two non-identical output rows can match the same pattern vector.
Pattern-Guided k-Anonymity
Input: A matrix $M \in \Sigma^{n \times m}$, a pattern mask $P \in \{□, ☆\}^{p \times m}$, and two positive integers $k$ and $s$.
Question: Can one suppress at most $s$ entries of $M$ in order to obtain a k-anonymous matrix $M'$, such that each row type of $M'$ matches at least one pattern vector of $P$?
For some concrete examples, we refer to Section 3.6.
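To make the problem statement concrete, the following hedged Python sketch verifies whether a candidate output matrix is a feasible solution. The function name `is_valid_solution` and the symbol encodings are assumptions for illustration; the sketch also assumes that the input matrix contains no suppressed entries.

```python
from collections import Counter

STAR, BOX = "*", "#"  # assumed encodings of the star and box symbols


def is_valid_solution(M, M_out, P, k, s):
    """Check whether M_out is a feasible solution for the instance (M, P, k, s)."""
    if len(M) != len(M_out):
        return False
    suppressions = 0
    for row, out in zip(M, M_out):
        if len(row) != len(out):
            return False
        for original, new in zip(row, out):
            if new == STAR:
                suppressions += 1   # this entry was blanked out
            elif new != original:
                return False        # entries may only be suppressed, never altered
        # Definition 2: the output row must share its STAR positions with some pattern vector.
        if not any(
            len(v) == len(out) and all((e == STAR) == (q == STAR) for e, q in zip(out, v))
            for v in P
        ):
            return False
    row_types = Counter(tuple(out) for out in M_out)
    return suppressions <= s and all(c >= k for c in row_types.values())


M = [("a", "x"), ("a", "y"), ("b", "z"), ("b", "z")]
P = [(BOX, BOX), (BOX, STAR)]
M_out = [("a", STAR), ("a", STAR), ("b", "z"), ("b", "z")]
print(is_valid_solution(M, M_out, P, k=2, s=2))  # True
print(is_valid_solution(M, M_out, P, k=2, s=1))  # False: two suppressions exceed the budget
```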
Our Results. Describing a polynomial-time many-to-one reduction from the NP-hard 3-Set Cover problem, we show that Pattern-Guided k-Anonymity is NP-complete, even if the input matrix only consists of three columns, there are only two pattern vectors, and $k = 3$. Motivated by this computational intractability result, we develop an exact algorithm that solves Pattern-Guided k-Anonymity in $O(2^{tp} \cdot t^6 p^5 \cdot m + nm)$ time for an $n \times m$ input matrix $M$, $p$ pattern vectors, and the number of different rows in $M$ being $t$. In other words, this shows that Pattern-Guided k-Anonymity is fixed-parameter tractable for the combined parameter $(t, p)$ and actually can be solved in linear time if $t$ and $p$ take constant values. (The fundamental idea behind parameterized complexity analysis [18,19,20] is, given a computationally hard problem $Q$, to identify a parameter $\ell$ (typically, a positive integer or a tuple of positive integers) for $Q$ and to determine whether size-$s$ instances of $Q$ can be solved in $f(\ell) \cdot s^{O(1)}$ time, where $f$ is an arbitrary computable function.) This result appears to be of practical interest only in special cases (“small” values for $t$ and $p$ are needed). It nevertheless paves the way for a formulation of an integer linear program for Pattern-Guided k-Anonymity that exactly solves moderate-size instances of Pattern-Guided k-Anonymity in reasonable time. Furthermore, our fixed-parameter tractability result also leads to a simple and efficient greedy heuristic, whose practical competitiveness is underlined by a set of experiments with real-world data, also favorably comparing with the Mondrian algorithm for k-Anonymity [21].
In particular, our empirical findings strongly indicate that, even when neglecting the aspect of potentially stronger expressiveness on the data user side provided by Pattern-Guided k-Anonymity, in combination with the greedy algorithm, it allows for high-quality and very fast data anonymization, being comparable in terms of anonymization quality with the established Mondrian algorithm [21], but significantly outperforming it in terms of time efficiency.

2. Complexity and Algorithms

This section is organized as follows. In Section 2.1, we prove the NP-hardness of Pattern-Guided 3-Anonymity restricted to two pattern vectors and three columns. To complement this intractability result, we also present a polynomial-time algorithm for Pattern-Guided 2-Anonymity and a fixed-parameter algorithm for Pattern-Guided k-Anonymity. In Section 2.2 and Section 2.3, we extract the basic ideas of the fixed-parameter algorithm for an integer linear program (ILP) formulation and a greedy heuristic.

2.1. Parameterized Complexity

One of the decisions made when developing fixed-parameter algorithms is the choice of the parameter. Natural parameters occurring in the problem definition of Pattern-Guided k-Anonymity are the number $n$ of rows, the number $m$ of columns, the alphabet size $|\Sigma|$, the number $p$ of pattern vectors, the anonymity degree $k$, and the cost bound $s$. In general, the number of rows will arguably be large and, thus, the cost bound $s$ also tends to be large. Since fixed-parameter algorithms are fast when the parameter is small, trying to exploit these two parameters tends to be of little use in realistic scenarios. However, analyzing the adult dataset [22] prepared as described by Machanavajjhala et al. [23], it turns out that some of the other mentioned parameters are small: The dataset has $m = 9$ columns, and the alphabet size is 73. Furthermore, it is natural to assume that the number of pattern vectors is also not that large. Indeed, compared to the $n = 32{,}561$ rows, even the number of all possible pattern vectors, $2^9 = 512$, is relatively small. Finally, there are applications where $k$, the degree of anonymity, is small [24]. Summarizing, we can state that fixed-parameter tractability with respect to the parameters $|\Sigma|$, $m$, $k$, or $p$ could be of practical relevance. Unfortunately, by reducing from 3-Set Cover, we can show that Pattern-Guided k-Anonymity is NP-hard in very restricted cases.
Theorem 1. 
Pattern-Guided k-Anonymity is NP-complete even for two pattern vectors, three columns, and $k = 3$.
Proof. We reduce from the NP-hard 3-Set Cover [25]: Given a set family $\mathcal{F} = \{S_1, \ldots, S_\alpha\}$ with $|S_i| = 3$ over a universe $U = \{u_1, \ldots, u_\beta\}$ and a positive integer $h$, the task is to decide whether there is a subfamily $\mathcal{F}' \subseteq \mathcal{F}$ of size at most $h$ such that $\bigcup_{S \in \mathcal{F}'} S = U$. In the reduction, we need unique entries in the constructed input matrix $M$. For ease of notation, we introduce the $\triangle$-symbol with an unusual semantics. Each occurrence of a $\triangle$-symbol stands for a different unique symbol in the alphabet $\Sigma$. One could informally state this as “$\triangle \neq \triangle$”. We now describe the construction. Let $(\mathcal{F}, U, h)$ be the 3-Set Cover instance. We construct an equivalent instance $(M, P, k, s)$ of Pattern-Guided k-Anonymity as follows: Initialize $M$ and $P$ as empty matrices. Then, for each element $u_i \in U$, add the row $(u_i, \triangle, \triangle)$ twice to the input matrix $M$. For each set $S_i \in \mathcal{F}$ with $S_i = \{u_a, u_b, u_c\}$, add to $M$ the three rows $(u_a, S_i, S_i)$, $(u_b, S_i, S_i)$, and $(u_c, S_i, S_i)$. Finally, set $k = 3$, $s = 4|U| + 3|\mathcal{F}| + 3h$, and add to $P$ the pattern vectors (□, ☆, ☆) and (☆, □, □).
We show the correctness of the above construction by proving that $(\mathcal{F}, U, h)$ is a yes-instance of 3-Set Cover if and only if $(M, P, 3, s)$ is a yes-instance of Pattern-Guided k-Anonymity.
“⇒:” If $(\mathcal{F}, U, h)$ is a yes-instance of 3-Set Cover, then there exists a set cover $\mathcal{F}'$ of size at most $h$. We suppress the following elements in $M$: First, suppress all $\triangle$-entries in $M$. This gives $4|U|$ suppressions. Then, for each $S_i \in \mathcal{F}'$, suppress all $S_i$-entries in $M$. This gives $6|\mathcal{F}'|$ suppressions. Finally, for each $S_j \notin \mathcal{F}'$, suppress the first column of all rows containing the entry $S_j$. These are $3(|\mathcal{F}| - |\mathcal{F}'|)$ suppressions. Let $M'$ denote the matrix with the suppressed elements. Note that $M'$ contains $4|U| + 3|\mathcal{F}| + 3|\mathcal{F}'| \le s$ suppressed entries. Furthermore, in each row in $M'$, either the first element is suppressed or the last two elements. Hence, each row of $M'$ matches one of the two pattern vectors of $P$. Finally, observe that $M'$ is 3-anonymous: The three rows corresponding to a set $S_j \notin \mathcal{F}'$ are identical: the first column is suppressed, and the next two columns contain the symbol $S_j$. Since $\mathcal{F}'$ is a set cover, there exists for each element $u_j$ a set $S_i \in \mathcal{F}'$ such that $u_j \in S_i$. Thus, by construction, the two rows corresponding to the element $u_j$ and the row $(u_j, S_i, S_i)$ in $M$ coincide in $M'$: The first column contains the entry $u_j$ and the other two columns are suppressed. Finally, for each row $(u_i, S_j, S_j)$ in $M$ that corresponds to a set $S_j \in \mathcal{F}'$, the row in $M'$ coincides with the two rows corresponding to the element $u_i$: Again, the first column contains the entry $u_i$ and the other two columns are suppressed.
“⇐:” If $(M, P, 3, s)$ is a yes-instance of Pattern-Guided k-Anonymity, then there is a 3-anonymous matrix $M'$ that is obtained from $M$ by suppressing at most $s$ elements, and each row of $M'$ matches one of the two pattern vectors in $P$. Since $M$ and, so, $M'$ contain $2|U| + 3|\mathcal{F}|$ rows, $M'$ contains at most $s = 4|U| + 3|\mathcal{F}| + 3h$ suppressions, and each pattern vector contains a ☆-symbol, there are at most $2|U| + 3h$ rows in $M'$ containing two suppressions and at least $3|\mathcal{F}| - 3h$ rows containing one suppression. Furthermore, since the $2|U|$ rows in $M'$ corresponding to the elements of $U$ contain the unique symbol $\triangle$ in the last two columns in $M$, these rows are suppressed in the last two columns. Thus, at most $3h$ rows corresponding to sets of $\mathcal{F}$ have two suppressions in $M'$. Observe that for each set $S_i \in \mathcal{F}$ the entries in the last two columns of the corresponding rows are $S_i$. There is no other occurrence of this entry in $M$. Hence, the at least $3|\mathcal{F}| - 3h$ rows in $M'$ with one suppression correspond to $|\mathcal{F}| - h$ sets in $\mathcal{F}$. Thus, the at most $3h$ rows in $M'$ that correspond to sets of $\mathcal{F}$ and contain two suppressions correspond to at most $h$ sets of $\mathcal{F}$. Denote these $h$ sets by $\mathcal{F}'$. We now show that $\mathcal{F}'$ is a set cover for the 3-Set Cover instance. Assume by contradiction that $\mathcal{F}'$ is not a set cover, and hence, there is an element $u \in U \setminus (\bigcup_{S \in \mathcal{F}'} S)$. However, since $M'$ is 3-anonymous, there has to be a row $r$ in $M'$ that corresponds to some set $S_i$ such that this row coincides with the two rows $r_1^u$ and $r_2^u$ corresponding to $u$. Since all rows in $M'$ corresponding to elements of $U$ contain two suppressions in the last two columns, the row $r$ also contains two suppressions in the last two columns. Thus, $S_i \in \mathcal{F}'$. Furthermore, $r$ has to coincide with $r_1^u$ and $r_2^u$ in the first column, that is, $r$ contains as the entry in the first column the symbol $u$. Hence, $u \in S_i$, a contradiction. ☐
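The construction used in the proof can be sketched as follows. This is an illustrative Python rendering under assumptions: the function name `reduce_3_set_cover` is hypothetical, each occurrence of the unique symbol is modeled by a fresh tagged tuple, and each set $S_i$ is represented by a label built from its index.

```python
from itertools import count

STAR, BOX = "*", "#"  # assumed encodings of the star and box symbols


def reduce_3_set_cover(family, universe, h):
    """Build (M, P, k, s) from a 3-Set Cover instance, following the proof of Theorem 1."""
    fresh = count()

    def unique():
        # Every call yields a symbol that occurs nowhere else in M.
        return ("unique", next(fresh))

    M = []
    for u in universe:
        for _ in range(2):                  # each element row is added twice
            M.append((u, unique(), unique()))
    for index, S in enumerate(family):
        label = ("set", index)              # plays the role of the symbol S_i
        for u in sorted(S):
            M.append((u, label, label))
    P = [(BOX, STAR, STAR), (STAR, BOX, BOX)]
    k = 3
    s = 4 * len(universe) + 3 * len(family) + 3 * h
    return M, P, k, s


universe = [1, 2, 3, 4]
family = [{1, 2, 3}, {2, 3, 4}, {1, 2, 4}]
M, P, k, s = reduce_3_set_cover(family, universe, h=2)
print(len(M), k, s)  # 17 3 31
```

The instance has $2|U| + 3|\mathcal{F}| = 17$ rows and cost bound $s = 16 + 9 + 6 = 31$, in line with the counting in the proof.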
Blocki and Williams [7] showed that, while 3-Anonymity is NP-complete [7,8], 2-Anonymity is polynomial-time solvable, by reducing it in polynomial time to the polynomial-time solvable Simplex Matching problem [26], defined as follows:
Simplex Matching
Input: 
A hypergraph $H = (V, E)$ with hyperedges of size two and three, a positive integer $h$, and a cost function $\mathrm{cost} \colon E \to \mathbb{N}$, such that:
  • $\{u, v, w\} \in E \Rightarrow \{u, v\}, \{v, w\}, \{u, w\} \in E$ and
  • $\mathrm{cost}(\{u, v\}) + \mathrm{cost}(\{v, w\}) + \mathrm{cost}(\{u, w\}) \le 2 \cdot \mathrm{cost}(\{u, v, w\})$.
Question: 
Is there a subset of the hyperedges $E' \subseteq E$, such that for all $v \in V$, there is exactly one edge in $E'$ containing $v$ and $\sum_{e \in E'} \mathrm{cost}(e) \le h$?
We slightly adjust their reduction to obtain polynomial-time solvability for Pattern-Guided 2-Anonymity, together with Theorem 1, yielding a complexity dichotomy for Pattern-Guided k-Anonymity with respect to the parameter k.
Theorem 2. Pattern-Guided 2-Anonymity is polynomial-time solvable.
Proof. We reduce Pattern-Guided 2-Anonymity to Simplex Matching. To this end, we first introduce some notation. Let $(M, P, 2, s)$ be the Pattern-Guided 2-Anonymity instance. For a set of rows $A = \{r_1, \ldots, r_\ell\}$ and a pattern vector $p$ in $P$, the set $A(p)$ is obtained from $A$ by suppressing entries in the rows of $A$ such that each row matches $p$ (see Definition 2). The set $P(A)$ contains all pattern vectors $p$ such that $A(p)$ is a set of identical rows. Intuitively, $P(A)$ contains all “suitable” pattern vectors to make the rows in $A$ identical.
Now, construct the hypergraph $H = (V, E)$ as follows: Initialize $V = \emptyset$ and $E = \emptyset$. For each row $r$ in $M$, add a vertex $v_r$ to $V$. For a vertex subset $V' \subseteq V$, let $M(V')$ be the set of the corresponding rows in $M$. For each vertex subset $V' \subseteq V$ of size $2 \le |V'| \le 3$, add the hyperedge $V'$ if $P(M(V')) \neq \emptyset$. Let $p(V')$ be a pattern vector in $P(M(V'))$ with the minimum number of ☆-symbols. Denote this number of ☆-symbols of $p(V')$ by $\ell$. Then, set $\mathrm{cost}(V') = \ell \cdot |V'|$. Note that this is exactly the cost to “anonymize” the rows in $M(V')$ with the pattern vector $p(V')$. Finally, set the cost bound $h = s$. This completes the construction.
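The hypergraph construction can be sketched in Python as follows. The helper names `min_star_cost` and `build_hypergraph` are assumptions, vertices are represented by 0-based row indices, and the sketch enumerates all pairs and triples by brute force, which is polynomial but not tuned for large inputs.

```python
from itertools import combinations

STAR, BOX = "*", "#"  # assumed encodings of the star and box symbols


def min_star_cost(rows, patterns):
    """Fewest stars among the pattern vectors that make all rows identical, or None if none does."""
    best = None
    for p in patterns:
        masked = {tuple(STAR if q == STAR else a for a, q in zip(r, p)) for r in rows}
        if len(masked) == 1:  # p is a suitable pattern vector for these rows
            stars = sum(q == STAR for q in p)
            best = stars if best is None else min(best, stars)
    return best


def build_hypergraph(M, P):
    """Map each hyperedge (a frozenset of two or three row indices) to its cost."""
    cost = {}
    for size in (2, 3):
        for edge in combinations(range(len(M)), size):
            stars = min_star_cost([M[i] for i in edge], P)
            if stars is not None:
                cost[frozenset(edge)] = stars * size
    return cost


M = [("a", "x"), ("a", "y"), ("a", "z"), ("b", "z")]
P = [(BOX, BOX), (BOX, STAR), (STAR, STAR)]
cost = build_hypergraph(M, P)
print(cost[frozenset({0, 1})], cost[frozenset({0, 1, 2})])  # 2 3
```

Because the all-☆ pattern vector is part of `P` in this example, every pair and every triple becomes a hyperedge; without it, some vertex subsets would have no suitable pattern vector and would be skipped.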
First, we show that Conditions 1 and 2 are fulfilled. Clearly, as each pattern vector that makes some row set $A$ identical also makes each subset of $A$ identical, it follows that for any $V' \subseteq V$ and any $V'' \subseteq V'$, it holds that $P(M(V')) \subseteq P(M(V''))$. Hence, Condition 1 is fulfilled. Furthermore, it follows that $2 \cdot \mathrm{cost}(\{u, v, w\}) \ge 3 \cdot \mathrm{cost}(\{u, v\})$ for each $u, v, w \in V$, implying:
$\mathrm{cost}(\{u, v\}) + \mathrm{cost}(\{v, w\}) + \mathrm{cost}(\{u, w\}) \le \frac{6}{3} \cdot \mathrm{cost}(\{u, v, w\}) = 2 \cdot \mathrm{cost}(\{u, v, w\})$
Thus, Condition 2 is fulfilled.
Observe that the construction can be easily performed in polynomial time. Hence, it remains to be shown that $(M, P, 2, s)$ is a yes-instance of Pattern-Guided 2-Anonymity if and only if $(H, s, \mathrm{cost})$ is a yes-instance of Simplex Matching.
“⇒:” Let $M'$ be a 2-anonymous matrix obtained from $M$ by suppressing at most $s$ elements such that each row of $M'$ matches a pattern vector in $P$. Let $\mathcal{R}$ be the set of all row types in $M'$. We construct a matching $E'$ for $H$ as follows: First, partition the rows in each row type, such that each part contains two or three rows. For each part $Q$, add to $E'$ the set of the vertices corresponding to the rows in $Q$. By construction, the cost bound is satisfied, and all vertices are matched.
“⇐:” Let $E' \subseteq E$ be a matching, and let $e \in E'$. Recall that $M(e)$ denotes the set of rows corresponding to the vertices in $e$. By construction, $P(M(e)) \neq \emptyset$. We construct $M'$ from $M$ by suppressing for each $e \in E'$ entries in the rows $M(e)$ such that they match $p(e)$. Observe that $M'$ is 2-anonymous, and each row matches a pattern vector. Furthermore, by construction, there are at most $s$ suppressions in $M'$. Thus, $(M, P, 2, s)$ is a yes-instance. ☐
Contrasting the general intractability result of Theorem 1, we will show fixed-parameter tractability with respect to the combined parameter $(|\Sigma|, m)$. To this end, we additionally use as a parameter the number $t$ of different input rows. Indeed, we show fixed-parameter tractability with respect to the combined parameter $(t, p)$. This implies fixed-parameter tractability with respect to the combined parameter $(|\Sigma|, m)$, as $|\Sigma|^m \ge t$ and $|\Sigma|^m \ge 2^m \ge p$. This results from an adaptation of combinatorial algorithms from previous work [13,27].
Before presenting the algorithm, we introduce some notation. We distinguish between the input row types of the input matrix $M$ and the output row types of the matrix $M'$. Note that in the beginning, we can compute the input row types of $M$ in $O(nm)$ time using a trie [28], but the output row types are unknown. By the definition of Pattern-Guided k-Anonymity, each output row type $R'$ has to match a pattern vector $v \in P$. We call $R'$ an instance of $v$.
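Computing the input row types and their sizes $n_i$ is straightforward. The sketch below uses a hash-based counter instead of the trie mentioned above, which is a swapped-in technique: it gives expected $O(nm)$ time under the assumption that entries hash in constant time. The function name `input_row_types` is illustrative.

```python
from collections import Counter


def input_row_types(M):
    """Map each distinct row of M (as a tuple) to the number of times it occurs."""
    return Counter(tuple(row) for row in M)


M = [("a", "x"), ("b", "z"), ("a", "x"), ("b", "z"), ("a", "y")]
types = input_row_types(M)
t = len(types)  # number of different input rows
print(t, types[("a", "x")])  # 3 2
```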
Theorem 3. Pattern-Guided k-Anonymity can be solved in $O(2^{tp} \cdot t^6 p^5 \cdot m + nm)$ time, where $p$ is the number of pattern vectors and $t$ is the number of different rows in the input matrix $M$.
Proof. We present an algorithm running in two phases:
Phase 1: 
Guess for each possible output row type whether it is used in $M'$. Denote with $\mathcal{R}'$ the set of all output row types in $M'$ according to the guessing result.
Phase 2: 
Check whether there exists a k-anonymous matrix $M'$ that can be obtained from $M$ by suppressing at most $s$ elements, such that $M'$ respects the guessing result in Phase 1; that is, the set of row types in $M'$ is exactly $\mathcal{R}'$.
As to Phase 1, observe that the number of possible output row types is at most $t \cdot p$: For each pattern vector, there exist at most $t$ different instances, one for each input row type. Hence, Phase 1 can be realized by simply trying all $2^{t \cdot p}$ possibilities. In contrast, Phase 2 can be computed in polynomial time using the so-called Row Assignment problem [27]. To this end, we introduce $T_{\mathrm{in}} := \{1, \ldots, t\}$ and $T_{\mathrm{out}} := \{1, \ldots, r\}$, where $r$ is the number of used output row types according to the guessing result of Phase 1, formally, $r = |\mathcal{R}'|$. With this notation, we can state Row Assignment.
Row Assignment
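Input: 
Nonnegative integers $k$, $s$, $\omega_1, \ldots, \omega_r$ and $n_1, \ldots, n_t$ with $\sum_{i=1}^{t} n_i = n$, and a function $h \colon T_{\mathrm{in}} \times T_{\mathrm{out}} \to \{0, 1\}$.
Question: 
Is there a function $g \colon T_{\mathrm{in}} \times T_{\mathrm{out}} \to \{0, \ldots, n\}$, such that:
$h(i, j) \cdot n \ge g(i, j) \quad \forall i \in T_{\mathrm{in}}\ \forall j \in T_{\mathrm{out}}$ (1)
$\sum_{i=1}^{t} g(i, j) \ge k \quad \forall j \in T_{\mathrm{out}}$ (2)
$\sum_{j=1}^{r} g(i, j) = n_i \quad \forall i \in T_{\mathrm{in}}$ (3)
$\sum_{i=1}^{t} \sum_{j=1}^{r} g(i, j) \cdot \omega_j \le s$ (4)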
We now discuss how we use Row Assignment to solve Phase 2. The function $h$ captures the guessing in Phase 1: If the input row type $i$ is “compatible” with the output row type $j$, then $h(i, j) = 1$, otherwise, $h(i, j) = 0$. Here, an input row type $R$ is compatible with an output row type $R'$ if the rows in both row types are identical in the non-☆-positions or, equivalently, if any row of $R$ can be made identical to any row of $R'$ by just replacing entries with the ☆-symbol. The integer $\omega_i$ is set to the number of stars in the $i$th output row type $R'_i$ in $\mathcal{R}'$; that is, $\omega_i$ captures the cost of “assigning” a compatible row of $M$ to $R'_i$. In $n_i$, the size (number of rows) of the $i$th input row type is stored. The integers with the same names in Row Assignment and Pattern-Guided k-Anonymity also store the same values.
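The ingredients of a Row Assignment instance can be sketched as follows: the candidate output row types (instances of pattern vectors), the compatibility function $h$, and the weights $\omega$. All function names are assumptions for illustration, and entries are assumed to be strings so that candidates can be sorted. Phase 1 would guess a subset of the candidates computed here.

```python
STAR, BOX = "*", "#"  # assumed encodings of the star and box symbols


def instance_of(row, pattern):
    """The instance of pattern produced by suppressing row at the STAR positions."""
    return tuple(STAR if q == STAR else a for a, q in zip(row, pattern))


def candidate_output_types(row_types, patterns):
    """All instances of all pattern vectors; there are at most t * p of them."""
    return sorted({instance_of(r, v) for r in row_types for v in patterns})


def compatible(in_row, out_row):
    """h(i, j) = 1 exactly if in_row agrees with out_row on every non-STAR position."""
    return all(b == STAR or a == b for a, b in zip(in_row, out_row))


def weight(out_row):
    """omega: the number of stars of an output row type."""
    return sum(b == STAR for b in out_row)


row_types = [("a", "x"), ("a", "y"), ("b", "z")]
patterns = [(BOX, BOX), (BOX, STAR)]
candidates = candidate_output_types(row_types, patterns)
print(len(candidates))  # 5
h = [[int(compatible(r, c)) for c in candidates] for r in row_types]
omega = [weight(c) for c in candidates]
```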
Next, we show that solving Row Assignment indeed correctly realizes Phase 2: Since the output row types of $M'$ are given from Phase 1, it remains to specify how many rows each output row type contains, such that $M'$ can be obtained from $M$ by suppressing at most $s$ entries, and $M'$ is k-anonymous. Due to Phase 1, it is clear that each row in $M'$ matches a pattern vector in $P$. To ensure that $M'$ can be obtained from $M$ by suppressing entries, we “assign” rows of $M$ to compatible output row types. Herein, this assigning means to suppress the entries in the particular row, such that the modified row belongs to the particular output row type. This assigning is captured by the function $g$: The number of rows from the input row type $R_i$ that are assigned to the output row type $R'_j$ is $g(i, j)$. Inequality (1) ensures that we only assign rows to compatible output row types. The k-anonymous requirement is guaranteed by Inequality (2). Equation (3) ensures that all rows of $M$ are assigned. Finally, the cost bound is satisfied, due to Inequality (4). Hence, solving Row Assignment indeed solves Phase 2.
Analyzing the running time, we get the following: Computing the input row types in $M$ can be done in $O(nm)$ time. In Phase 1, the algorithm tries $2^{tp}$ possibilities. For each of these possibilities, we have to check which input row types are compatible with which output row types. This is clearly doable in $O(trm)$ time. Finally, Row Assignment can be solved in $O((t + r) \cdot \log(t + r) \cdot (t \cdot r + (t + r) \log(t + r)))$ time (Lemma 1 in [27]). Since $r \le tp$, we roughly upper-bound this by $O((tp)^4)$. Putting all this together, we arrive at the statement of the theorem. ☐
In other words, Theorem 3 implies that Pattern-Guided k-Anonymity can be solved in linear time if t and p are constants.

2.2. ILP Formulation

Next, we describe an integer linear program (ILP) formulation for Pattern-Guided k-Anonymity employing the ideas behind the fixed-parameter algorithm of Theorem 3. More specifically, our ILP contains the integer variables $x_{i,j}$ denoting the number of rows from type $i$ being assigned into an output row type compatible with pattern vector $j$. The binary variable $u_{j,l}$ is 1 if instance $l$ of pattern vector $j$ is used in the solution; that is, there is at least one row mapped to it, otherwise, it is set to 0. Furthermore, $n_i$ denotes the number of rows of type $i$, $\omega_j$ denotes the costs of pattern vector $j$, and $k$ is the required degree of anonymity. Let $\hat{p}_i \le t$ denote the number of instances of pattern vector $i$, and let $c(i, j, l)$ be 1 if mapping row $i$ to pattern vector $j$ produces pattern vector instance $l$, otherwise $c(i, j, l) = 0$. With this notation, we can state our ILP formulation:
$\min \sum_{i=1}^{t} \sum_{j=1}^{p} x_{i,j} \cdot \omega_j$ (5)
$\sum_{i=1}^{t} c(i, j, l) \cdot x_{i,j} \le u_{j,l} \cdot n \quad 1 \le l \le \hat{p}_j,\ 1 \le j \le p$ (6)
$\sum_{i=1}^{t} c(i, j, l) \cdot x_{i,j} + k \cdot (1 - u_{j,l}) \ge k \quad 1 \le l \le \hat{p}_j,\ 1 \le j \le p$ (7)
$\sum_{j=1}^{p} x_{i,j} = n_i \quad 1 \le i \le t$ (8)
The goal function (5) ensures that the solution has a minimum number of suppressions. Constraint (6) ensures that the variables $u_{j,l}$ are consistently set with the variables $x_{i,j}$; that is, if there is some positive variable $x_{i,j}$ indicating that the instance $l$ of pattern vector $j$ is used, then $u_{j,l} = 1$. Constraint (7) ensures that every pattern vector instance that is used by the solution contains at least $k$ rows. Constraint (8) ensures that the solution uses as many rows from each row type as available.
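To illustrate how the constraints interact, the following Python sketch evaluates a given assignment against (6) to (8) and returns the objective value (5). It is a plain checker, not an ILP solver, and its data layout is an assumption: `c_index[i][j]` stores the index $l$ of the instance produced by mapping row type $i$ to pattern vector $j$, so $c(i, j, l) = 1$ exactly when `c_index[i][j] == l`.

```python
def evaluate_ilp_solution(x, u, n_rows, omega, c_index, k):
    """Return the objective (5) if (x, u) satisfies constraints (6)-(8), otherwise None."""
    t, p = len(n_rows), len(omega)
    n = sum(n_rows)
    if any(x[i][j] < 0 for i in range(t) for j in range(p)):
        return None
    for j in range(p):
        for l in range(len(u[j])):
            # Rows mapped to instance l of pattern vector j: sum over i of c(i, j, l) * x[i][j].
            load = sum(x[i][j] for i in range(t) if c_index[i][j] == l)
            if load > u[j][l] * n:              # constraint (6)
                return None
            if load + k * (1 - u[j][l]) < k:    # constraint (7)
                return None
    for i in range(t):
        if sum(x[i]) != n_rows[i]:              # constraint (8)
            return None
    return sum(x[i][j] * omega[j] for i in range(t) for j in range(p))


# Row types (a,x), (a,y), (b,z) with sizes 1, 1, 2; pattern 0 keeps both entries,
# pattern 1 suppresses the second entry.
n_rows = [1, 1, 2]
omega = [0, 1]
c_index = [[0, 0], [1, 0], [2, 1]]
x = [[0, 1], [0, 1], [2, 0]]
u = [[0, 0, 1], [1, 0]]
print(evaluate_ilp_solution(x, u, n_rows, omega, c_index, k=2))  # 2
```

In practice, the same data would be handed to an ILP solver; the checker only makes explicit what a feasible solution has to satisfy.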
We remark that, like Theorem 3, our ILP formulation also yields fixed-parameter tractability with respect to the combined parameter $(t, p)$. This is due to the famous result of Lenstra [29] and the fact that the number of variables in the ILP is bounded by $O(tp)$. Theorem 3, however, provides a direct combinatorial algorithm with better worst-case running time bounds. Nevertheless, in the experimental section, we decided to use the ILP formulation and not the combinatorial algorithm, based on the experience that there are very strong (commercial) ILP solvers that, in practice, typically perform much better than the worst-case analysis predicts.

2.3. Greedy Heuristic

In this section, we provide a greedy heuristic based on the ideas of the fixed-parameter algorithm of Theorem 3 presented in Section 2.1. The fixed-parameter algorithm basically does an exhaustive search on the assignment of rows to pattern vectors. More precisely, for each row type $R$ and each pattern vector $v$, it tries both possibilities of whether rows of $R$ are assigned to $v$ or not. Furthermore, in the ILP formulation, all assignments of rows to pattern vectors are possible. In contrast, our greedy heuristic just picks for each input row type $R$ the “cheapest” pattern vector $v$ and then assigns all compatible rows of $M$ to $v$. This is realized as follows: We consider all pattern vectors, one after the other, ordered by increasing number of ☆-symbols. This ensures that we start with the “cheapest” pattern vector. Then, we assign as many rows of $M$ as possible to the current pattern vector $v$: We consider every instance $R'$ of $v$, and if there are at least $k$ rows in $M$ that are compatible with $R'$, then we assign all compatible rows to $R'$. Once a row is assigned, it will not be reassigned to any other output row type, and hence, the row is deleted from $M$. Overall, this gives a running time of $O(pnm)$. See Algorithm 1 for the pseudo-code of the greedy heuristic. If at some point there are fewer than $k$ remaining rows in $M$, then these rows are fully suppressed. Note that this slightly deviates from our formal definition of Pattern-Guided k-Anonymity. However, since fully suppressed rows do not reveal any data, this potential violation of the k-anonymity requirement does not matter.
Algorithm 1 Greedy Heuristic $(M, P, k)$
  Sort pattern vectors $P$ by cost (increasing order)
  for each $v \in P$ do
    for each instance $R'$ of $v$ do
      if at least $k$ rows are compatible with $R'$ then
        Assign all compatible rows of $M$ to $R'$
        Delete the assigned rows from $M$
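A possible Python rendering of Algorithm 1 is sketched below. It is an illustration under assumptions, not the implementation used in the experiments: the name `greedy_anonymize` and the symbol encodings are hypothetical, rows are grouped by the instance they would produce (which is equivalent to collecting the rows compatible with each instance), and leftover rows are fully suppressed as described in the text.

```python
from collections import defaultdict

STAR, BOX = "*", "#"  # assumed encodings of the star and box symbols


def greedy_anonymize(M, P, k):
    """Greedy heuristic sketch: return the suppressed matrix as a list of tuples."""
    m = len(M[0]) if M else 0
    out = [None] * len(M)
    remaining = set(range(len(M)))  # indices of rows not yet assigned
    # Consider the pattern vectors by increasing number of stars ("cheapest" first).
    for v in sorted(P, key=lambda vec: sum(q == STAR for q in vec)):
        groups = defaultdict(list)
        for i in remaining:
            instance = tuple(STAR if q == STAR else a for a, q in zip(M[i], v))
            groups[instance].append(i)
        for instance, members in groups.items():
            if len(members) >= k:
                for i in members:
                    out[i] = instance
                remaining.difference_update(members)  # assigned rows are never revisited
    for i in remaining:
        out[i] = (STAR,) * m  # rows that could not be assigned are fully suppressed
    return out


M = [("a", "x"), ("a", "y"), ("b", "z"), ("b", "z"), ("c", "w")]
P = [(BOX, BOX), (BOX, STAR)]
print(greedy_anonymize(M, P, k=2))
# [('a', '*'), ('a', '*'), ('b', 'z'), ('b', 'z'), ('*', '*')]
```

Each pattern vector triggers one pass over the remaining rows with $O(m)$ work per row, which matches the $O(pnm)$ bound under the assumption of constant-time hashing.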
Our greedy heuristic clearly does not always provide optimal solutions. Our experiments indicate, however, that it is very fast and that it typically provides solutions close to the optimum and outperforms the Mondrian algorithm [21] in most datasets we tested. While this demonstrates the practicality of our heuristic (Algorithm 1), the following result shows that from the viewpoint of polynomial-time approximation algorithmics, it is weak in the worst case.
Theorem 4. Algorithm 1 for Pattern-Guided k-Anonymity runs in $O(pnm)$ time and provides a factor $m$-approximation. This approximation bound is asymptotically tight for Algorithm 1.
Proof. Since the running time is already discussed above, it remains to show the approximation factor. Let $s_{\mathrm{heur}}$ be the number of suppressions in a solution provided by Algorithm 1 and $s_{\mathrm{opt}}$ be the number of suppressions in an optimal solution. We show that for every instance, it holds that $s_{\mathrm{heur}} \le m \cdot s_{\mathrm{opt}}$. Let $M$ be a matrix, $M_{\mathrm{heur}}$ be the suppressed matrix produced by Algorithm 1, and $M_{\mathrm{opt}}$ be the suppressed matrix corresponding to an optimal solution. First, observe that for any row in $M_{\mathrm{opt}}$ not containing any suppressed entry, it follows that the corresponding row in $M_{\mathrm{heur}}$ also does not contain any suppression. Clearly, each row in $M_{\mathrm{heur}}$ has at most $m$ entries suppressed. Thus, each row in $M_{\mathrm{heur}}$ has at most $m$ times more suppressed entries than the corresponding row in $M_{\mathrm{opt}}$ and, hence, $s_{\mathrm{heur}} \le m \cdot s_{\mathrm{opt}}$.
To show that this upper bound is asymptotically tight, consider the following instance. Set $k = m$, and let $M$ be as follows: The matrix $M$ contains $k$ times the row with the symbol 1 in every entry. Furthermore, for each $i \in \{1, \ldots, m\}$, there are $k - 1$ rows in $M$, such that all but the $i$th entry contain the symbol 1. In the $i$th entry, each of the $k - 1$ rows contains a uniquely occurring symbol. The pattern mask contains $m + 2$ pattern vectors: For $i \in \{1, \ldots, m\}$, the $i$th pattern vector contains $m - 1$ □-symbols and one ☆-symbol at the $i$th position. The last two pattern vectors are the all-□ and all-☆ vectors. Algorithm 1 will suppress nothing in the $k$ all-1 rows and will suppress every entry of the remaining rows. This gives $s_{\mathrm{heur}} = (k - 1) \cdot m^2 = (m - 1) \cdot m^2$ suppressions. However, an optimal solution suppresses in each row exactly one entry: The rows containing in all but the $i$th entry the symbol 1 are suppressed in the $i$th entry. Furthermore, to ensure the anonymity requirement, in the submatrix with the $k$ rows containing the symbol 1 in every entry, the diagonal is suppressed. Thus, the number of suppressions is equal to the number of rows; that is, $s_{\mathrm{opt}} = k + (k - 1)m = m^2$. Hence, $s_{\mathrm{heur}} = (m - 1) \cdot s_{\mathrm{opt}}$. ☐
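As a sanity check of these counts, consider the smallest interesting case $m = k = 3$. The following worked instance is added for illustration and is not part of the original proof:

```latex
\begin{align*}
n &= k + m(k-1) = 3 + 3 \cdot 2 = 9 \text{ rows},\\
s_{\mathrm{heur}} &= (k-1) \cdot m^2 = 2 \cdot 9 = 18,\\
s_{\mathrm{opt}} &= k + (k-1)\,m = 3 + 2 \cdot 3 = 9 = m^2,\\
\frac{s_{\mathrm{heur}}}{s_{\mathrm{opt}}} &= \frac{18}{9} = 2 = m - 1.
\end{align*}
```

The heuristic fully suppresses the six rows that carry a unique symbol, whereas the optimal solution suppresses exactly one entry in each of the nine rows.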

3. Implementation and Experiments

In this section, we present the results of our experimental evaluation of the heuristic presented in Section 2.3 and the ILP-formulation presented in Section 2.2.

3.1. Data

We use the following datasets for our experimental evaluations. The first three datasets are taken from the UCI machine learning repository [22].
  • Adult ([30]) This was extracted from a dataset of the US Census Bureau Data Extraction System. It consists of 32,561 records over the 15 attributes: age, work class, final weight, education, education number, marital status, occupation, relationship, race, sex, capital gain, capital loss, hours per week, native country, and salary class. Since the final weight entry is unique for roughly one half of the records, we removed it from the dataset.
    Following Machanavajjhala et al. [23], we also prepared this dataset with the nine attributes, age, work class, education, marital status, occupation, race, sex, native country, and salary class. This second variant is called Adult-2 in the following.
  • Nursery ([31]) The Nursery dataset was derived from a hierarchical decision model originally developed to rank applications for nursery schools; see Olave et al. [32] for a detailed description. It contains 12,960 records over the eight attributes: parents, has nurse, form, children, housing, finance, social, and health. All entries are encoded as positive integers.
  • CMC ([33]) This dataset is a subset of the 1987 National Indonesia Contraceptive Prevalence Survey. It contains 1,473 records over 10 attributes. The attributes are wife’s age, wife’s education, husband’s education, number of children, wife’s religion, wife working, husband’s occupation, standard of living, exposure, and contraceptive.
  • Canada ([34]) This dataset is taken from the Canada Vigilance Adverse Reaction Online Database and contains information about suspected adverse reactions (also known as side effects) to health products. The original dataset was collected in November 2012 and contains 324,489 records over 43 attributes. (See [35] for the list of the attribute names.) Since some values are contained in multiple attributes (as numerical code, in English and in French), other attributes contain unique values for each record and some attributes are empty in most records, we removed all these attributes from the dataset. We ended up with 324,489 records over the nine attributes: type of report, gender, age, report outcome, weight, height, serious adverse reaction, reporter type, and report source.

3.2. Implementation Setup

All our experiments are performed on an Intel Xeon E5-1620 3.6 GHz machine with 64 GB memory under the Debian GNU/Linux 6.0 operating system. The heuristic is implemented in Haskell, as Haskell is reasonably fast [36] and makes parallelization easy. Pattern vectors are stored in the standard list data structures provided by Haskell; the input matrix is stored as a list of lists. The ILP implementation uses ILOG CPLEX via its C++ API. Both implementations are licensed under GPL Version 3. The source code is available [37].

3.3. Quality Criteria

We now briefly describe the measurements used in the next subsection to evaluate the experimental results.
Obvious criteria for the evaluation are the number of suppressions and the running time. Furthermore, we use the average and the minimum size of the output row types as already done by Li et al. [38] and Machanavajjhala et al. [23], as well as the number of output row types. Perhaps the most difficult measure to describe is “usefulness”, introduced by Loukides and Shao [11]. According to Loukides and Shao, usefulness “is based on the following observation: close values […] enhance usefulness, as they will require a small amount of modification to achieve a k-anonymisation. […] A small value in usefulness implies that tuples [= rows] are close together w.r.t. these attributes, hence require less modification to satisfy k-anonymity”.
Formally, it is defined as follows.
Definition 3 ([11]). Let $M \in \Sigma^{n \times m}$. Let $1 \le i \le m$ be an integer, and let $\Sigma_i \subseteq \Sigma$ be the domain of the $i$th column in $M$, that is, the set of symbols used in the $i$th column. For a subset $V_i \subseteq \Sigma_i$, the attribute diversity, denoted by $d_A(M, i, V_i)$, is
$$d_A(M, i, V_i) = \begin{cases} \dfrac{\max(V_i) - \min(V_i)}{\max(\Sigma_i) - \min(\Sigma_i)} & \text{for numerical attributes,} \\[2ex] \dfrac{|V_i|}{|\Sigma_i|} & \text{for non-numerical attributes,} \end{cases}$$
where $\max(V_i)$, $\min(V_i)$, $\max(\Sigma_i)$ and $\min(\Sigma_i)$ denote the maximum and minimum values in $V_i$ and $\Sigma_i$, respectively.
Informally speaking, the attribute diversity is a measurement of how many symbols of $\Sigma_i$ are in $V_i$. The next definition extends this to diversity for a subset of rows of a given matrix.
Definition 4 ([11]). Let $M \in \Sigma^{n \times m}$ be a matrix, and let $R \subseteq M$ be a matrix containing a subset of rows of $M$. The tuple diversity of $R$, denoted by $d_T(M, R)$, is
$$d_T(M, R) = \sum_{i=1}^{m} d_A(M, i, \alpha(R, i)),$$
where $\alpha(R, i)$ denotes the domain of the $i$th column of $R$.
With these notations, one can define the usefulness measure.
Definition 5 ([11]). Let $M \in \Sigma^{n \times m}$ be a matrix, and let $M' \in (\Sigma \cup \{☆\})^{n \times m}$ be a k-anonymous matrix obtained from $M$ by suppressing entries. Let $\mathcal{R} = \{R_1, \ldots, R_\ell\}$ be the row types of $M'$. Further, denote with $\mathrm{original}(R_i)$ the set of rows in $M$ that form, after having applied the suppression operations, the output row type $R_i$.
The usefulness of this partition of $M$ is:
$$\mathrm{usefulness} = \frac{1}{\ell} \sum_{i=1}^{\ell} d_T(M, \mathrm{original}(R_i)).$$
Roughly speaking, the usefulness is the average tuple diversity of all output row types. In general, small usefulness values are better, and the values lie between zero and the number m of columns.
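To make Definitions 3 to 5 concrete, here is a minimal Python sketch that computes the usefulness for a toy matrix. The function names, the representation of the partition as lists of original rows, and the convention that a constant numerical column contributes diversity 0 are assumptions made for this illustration; they are not taken from [11] or from the implementation used in the experiments.

```python
def attribute_diversity(domain, values, numerical):
    """d_A: spread of `values` relative to the full column `domain`."""
    if numerical:
        full_range = max(domain) - min(domain)
        if full_range == 0:          # assumption: constant column counts as 0
            return 0.0
        return (max(values) - min(values)) / full_range
    return len(set(values)) / len(set(domain))


def tuple_diversity(matrix, rows, numerical_cols):
    """d_T: sum of the attribute diversities over all columns."""
    total = 0.0
    for i in range(len(matrix[0])):
        domain = [r[i] for r in matrix]      # symbols of column i in M
        values = [r[i] for r in rows]        # symbols of column i in R
        total += attribute_diversity(domain, values, i in numerical_cols)
    return total


def usefulness(matrix, groups, numerical_cols):
    """Average tuple diversity; each group holds the original rows
    that end up in one output row type."""
    return sum(tuple_diversity(matrix, g, numerical_cols)
               for g in groups) / len(groups)


# Toy matrix: column 0 is numerical (age), column 1 is non-numerical (sex).
M = [(20, "f"), (30, "f"), (40, "m"), (60, "m")]
groups = [[(20, "f"), (30, "f")], [(40, "m"), (60, "m")]]
u = usefulness(M, groups, numerical_cols={0})
```

In this toy example the first group has tuple diversity 10/40 + 1/2 = 0.75 and the second 20/40 + 1/2 = 1.0, so the usefulness is 0.875.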

3.4. Evaluation

We tested our greedy heuristic in two types of experiments. In the first type (Section 3.5), we compare our results with those of the well-known Mondrian [21] heuristic. We decided to compare with an existing implementation of Mondrian ([39]), since we could not find a more recent implementation of a k-Anonymity algorithm that is freely available. By specifying all possible pattern vectors, we “misuse” our greedy heuristic to solve the classical k-Anonymity problem. In the second type (Section 3.6), we solve k-Anonymity and Pattern-Guided k-Anonymity and analyze the distance of the results provided by our greedy heuristic from an optimal solution (with a minimum number of suppressed entries). Such an optimal solution is provided by the ILP implementation. We provide tables comparing two algorithms, where a cell is highlighted by gray background whenever the value is at least as good as the corresponding value for the other algorithm.

3.5. Heuristic vs. Mondrian

In this subsection, we compare our greedy heuristic with Mondrian. Observe that the Mondrian algorithm does not suppress entries, but replaces them with more general ones. Hence, the number of suppressions is not a suitable quality criterion for the comparison; instead, we use the usefulness as defined in Section 3.3. Overall, we use the following criteria:
  • Usefulness value u;
  • Running time r in seconds;
  • Number $\#h$ of output row types;
  • Average size $h_{\mathrm{avg}}$ of the output row types; and
  • Maximum size $h_{\max}$ of the output row types.
Except for $\#h$, lower values indicate better solutions.
For each dataset, we computed k-anonymous datasets with our greedy heuristic and Mondrian for $k \in \{2, 3, \ldots, 10, 25, 50, 75, 100\}$. In the presented tables comparing the results of the Greedy Heuristic and Mondrian, we highlight the best obtained values with light gray background.
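The row-type criteria above can be read off an anonymized matrix by grouping identical rows. The following Python sketch shows one way to do this; the function name and the dictionary keys are illustrative assumptions, and `*` stands in for a suppressed entry.

```python
from collections import Counter


def row_type_stats(anonymized_rows):
    """Group identical rows and summarize the sizes of the row types."""
    sizes = Counter(map(tuple, anonymized_rows)).values()
    return {
        "num_types": len(sizes),             # number of output row types
        "avg_size": sum(sizes) / len(sizes), # average size of a row type
        "max_size": max(sizes),              # maximum size of a row type
        "min_size": min(sizes),              # degree of anonymity reached
    }


rows = [("a", "*"), ("a", "*"), ("b", "*"), ("b", "*"), ("b", "*")]
stats = row_type_stats(rows)
```

Note that the average size is always the number of rows divided by the number of row types, and that the minimum size is the largest k for which the matrix is k-anonymous.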
General Observations. The running time behavior of the tested algorithms is somewhat unexpected. Whereas Mondrian gets faster with increasing k, our greedy heuristic gets faster with decreasing k. The reason why the greedy heuristic is faster for small values of k is that usually the cheap pattern vectors are used, and hence, the number of remaining input rows decreases soon. On the contrary, when k is large, the cheap pattern vectors cannot be used, and hence, the greedy heuristic tests many pattern vectors before it actually starts with removing rows from the input matrix. Thus, for larger values of k, the greedy heuristic comes closer to its worst-case running time of $O(pnm)$ with $p = 2^m$.
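To give a feeling for the size of this worst-case bound when all pattern vectors are specified, the following Python sketch evaluates $p = 2^m$ and the product $p \cdot n \cdot m$ for the datasets of Section 3.1. The record and attribute counts are those stated there (Adult without the final weight attribute); the product ignores the constants hidden in the O-notation, so it is only an order-of-magnitude indication, and the function names are assumptions for this illustration.

```python
# name: (n records, m attributes), as listed in Section 3.1
datasets = {
    "Adult": (32561, 14),
    "Adult-2": (32561, 9),
    "Nursery": (12960, 8),
    "CMC": (1473, 10),
    "Canada": (324489, 9),
}


def num_all_patterns(m):
    """Number of pattern vectors when every pattern is allowed."""
    return 2 ** m


def worst_case_bound(n, m):
    """The product p * n * m with p = 2^m (constants ignored)."""
    return num_all_patterns(m) * n * m


bounds = {name: worst_case_bound(n, m) for name, (n, m) in datasets.items()}
```

For example, Adult with 14 attributes admits 16,384 pattern vectors, whereas Adult-2 with nine attributes admits only 512.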
Adult. Our greedy heuristic could anonymize the Adult dataset in less than three minutes for all tested values of k. For k = 2 and k = 3, Mondrian took more than half an hour to anonymize the dataset. However, in contrast to all other values of k, Mondrian was faster for k = 75 and k = 100. Except for $h_{\max}$ with $k \ge 25$, all quality measures indicate that our heuristic produces the better solution.
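Table 1. Heuristic vs. Mondrian: Results for the Adult dataset.

| k | Heuristic u | Heuristic r | Heuristic #h | Heuristic h_avg | Heuristic h_max | Mondrian u | Mondrian r | Mondrian #h | Mondrian h_avg | Mondrian h_max |
|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 2.062 | 5.502 | 14,589 | 2.232 | 16 | 3.505 | 2,789.400 | 11,136 | 2.709 | 61 |
| 3 | 2.290 | 13.226 | 9,208 | 3.536 | 18 | 3.782 | 1,803.510 | 7,306 | 4.128 | 61 |
| 4 | 2.470 | 19.538 | 6,670 | 4.882 | 25 | 4.007 | 1,337.860 | 5,432 | 5.553 | 61 |
| 5 | 2.615 | 24.867 | 5,199 | 6.263 | 31 | 4.191 | 1,061.960 | 4,325 | 6.974 | 61 |
| 6 | 2.738 | 29.663 | 4,315 | 7.546 | 42 | 4.362 | 885.939 | 3,597 | 8.385 | 61 |
| 7 | 2.851 | 34.126 | 3,669 | 8.875 | 53 | 4.498 | 754.652 | 3,053 | 9.879 | 61 |
| 8 | 2.942 | 37.629 | 3,193 | 10.198 | 53 | 4.622 | 659.184 | 2,663 | 11.326 | 61 |
| 9 | 3.026 | 41.216 | 2,832 | 11.498 | 52 | 4.766 | 588.347 | 2,368 | 12.737 | 69 |
| 10 | 3.106 | 44.779 | 2,559 | 12.724 | 56 | 4.875 | 535.872 | 2,145 | 14.062 | 69 |
| 25 | 3.840 | 79.281 | 1,046 | 31.129 | 161 | 6.009 | 229.248 | 850 | 35.485 | 90 |
| 50 | 4.462 | 117.008 | 537 | 60.635 | 317 | 6.729 | 127.392 | 430 | 70.144 | 135 |
| 75 | 4.873 | 144.536 | 354 | 91.980 | 317 | 7.339 | 93.621 | 287 | 105.094 | 242 |
| 100 | 5.151 | 163.582 | 274 | 118.836 | 317 | 7.805 | 76.005 | 209 | 144.316 | 242 |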
The usefulness value of the Mondrian solutions is between 1.5 and 1.7 times the usefulness value of the heuristic for all tested k; this indicates the significantly better quality of the results of our heuristic. See Table 1 for details and Figure 1 for an illustration.
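The stated range can be reproduced with a few lines of Python. The sketch below divides the Mondrian usefulness by the heuristic usefulness for every tested k; the numbers are transcribed from the two u columns of Table 1, and the variable names are assumptions for this illustration.

```python
# (k, u of the greedy heuristic, u of Mondrian), transcribed from Table 1
table1_u = [
    (2, 2.062, 3.505), (3, 2.290, 3.782), (4, 2.470, 4.007),
    (5, 2.615, 4.191), (6, 2.738, 4.362), (7, 2.851, 4.498),
    (8, 2.942, 4.622), (9, 3.026, 4.766), (10, 3.106, 4.875),
    (25, 3.840, 6.009), (50, 4.462, 6.729), (75, 4.873, 7.339),
    (100, 5.151, 7.805),
]

# Ratio of Mondrian usefulness to heuristic usefulness per k
ratios = {k: u_mondrian / u_heur for k, u_heur, u_mondrian in table1_u}
smallest, largest = min(ratios.values()), max(ratios.values())
```

The largest ratio occurs for k = 2 and the smallest for k = 75; since smaller usefulness values are better, every ratio above 1 favors the heuristic.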
Adult-2. The solutions for Adult-2 behave similarly to those for Adult. Our greedy heuristic with a maximum running time of five seconds is significantly faster than Mondrian with a maximum running time of 20 min (at least 10 times faster for all tested k). However, the usefulness is quite similar for both algorithms. Mondrian beats the heuristic by less than 1% for k = 50; the heuristic is slightly better for every other tested k. See Table 2 for details.
Nursery. For the Nursery dataset, the heuristic is at least eight times faster than Mondrian. Concerning solution quality, this dataset is the most ambiguous one. Except for k = 5, Mondrian produces better solutions in terms of usefulness, whereas our heuristic performs better in terms of maximum and average output row type size. For the number of output row types, there is no clear winner. See Table 3 for details.
CMC. For the CMC dataset, both algorithms were very fast in computing k-anonymous datasets for every tested k. Mondrian took at most 10 s, and our greedy heuristic took at most 1.2 s and was always faster than Mondrian. As for the solution quality, the heuristic can compete with Mondrian. The usefulness of the heuristic results is always slightly better. The Mondrian results always have at least 20% fewer output row types, and the average output row type size of the heuristic results is always smaller. Only for k = 5, 6, 7, and 8 do the Mondrian results have a lower maximum size of the output row types. See Table 4 for details.
Figure 1. Heuristic vs. Mondrian: Diagrams comparing running time and usefulness for the Adult dataset.
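Table 2. Heuristic vs. Mondrian: Results for the Adult-2 dataset.

| k | Heuristic u | Heuristic r | Heuristic #h | Heuristic h_avg | Heuristic h_max | Mondrian u | Mondrian r | Mondrian #h | Mondrian h_avg | Mondrian h_max |
|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 1.760 | 1.040 | 12,022 | 2.708 | 45 | 1.885 | 1,278.380 | 7,971 | 3.784 | 113 |
| 3 | 1.872 | 1.181 | 7,971 | 4.085 | 45 | 1.992 | 887.699 | 5,543 | 5.441 | 113 |
| 4 | 1.962 | 1.329 | 5,890 | 5.528 | 45 | 2.074 | 693.385 | 4,319 | 6.984 | 113 |
| 5 | 2.037 | 1.504 | 4,609 | 7.065 | 45 | 2.142 | 565.370 | 3,525 | 8.557 | 113 |
| 6 | 2.099 | 1.629 | 3,836 | 8.488 | 45 | 2.201 | 484.160 | 3,020 | 9.987 | 113 |
| 7 | 2.161 | 1.707 | 3,266 | 9.970 | 52 | 2.257 | 417.950 | 2,596 | 11.619 | 113 |
| 8 | 2.212 | 1.796 | 2,837 | 11.477 | 63 | 2.291 | 372.469 | 2,308 | 13.068 | 113 |
| 9 | 2.260 | 1.936 | 2,518 | 12.931 | 63 | 2.325 | 338.958 | 2,095 | 14.397 | 113 |
| 10 | 2.302 | 2.025 | 2,273 | 14.325 | 66 | 2.366 | 308.058 | 1,890 | 15.959 | 113 |
| 25 | 2.722 | 2.926 | 914 | 35.625 | 164 | 2.724 | 139.030 | 801 | 37.655 | 113 |
| 50 | 3.094 | 3.874 | 460 | 70.785 | 349 | 3.070 | 79.263 | 414 | 72.855 | 145 |
| 75 | 3.312 | 4.426 | 310 | 105.035 | 552 | 3.385 | 59.847 | 277 | 108.888 | 200 |
| 100 | 3.434 | 4.928 | 245 | 132.902 | 552 | 3.573 | 49.573 | 210 | 143.629 | 279 |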
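Table 3. Heuristic vs. Mondrian: Results for the Nursery dataset.

| k | Heuristic u | Heuristic r | Heuristic #h | Heuristic h_avg | Heuristic h_max | Mondrian u | Mondrian r | Mondrian #h | Mondrian h_avg | Mondrian h_max |
|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 3.200 | 0.834 | 4,320 | 3.000 | 3 | 2.776 | 484.572 | 6,468 | 2.004 | 3 |
| 3 | 3.200 | 0.285 | 4,320 | 3.000 | 3 | 3.057 | 233.710 | 3,294 | 3.934 | 4 |
| 4 | 3.283 | 0.344 | 3,240 | 4.000 | 4 | 3.072 | 221.731 | 3,186 | 4.068 | 6 |
| 5 | 3.333 | 0.328 | 2,592 | 5.000 | 5 | 3.338 | 122.665 | 1,722 | 7.526 | 8 |
| 6 | 3.867 | 0.320 | 1,440 | 9.000 | 9 | 3.338 | 122.713 | 1,722 | 7.526 | 8 |
| 7 | 3.867 | 0.335 | 1,440 | 9.000 | 9 | 3.396 | 104.568 | 1,518 | 8.538 | 12 |
| 8 | 3.867 | 0.391 | 1,440 | 9.000 | 9 | 3.396 | 104.638 | 1,518 | 8.538 | 12 |
| 9 | 3.867 | 0.319 | 1,440 | 9.000 | 9 | 3.607 | 67.630 | 922 | 14.056 | 16 |
| 10 | 3.950 | 0.432 | 1,080 | 12.000 | 12 | 3.607 | 68.079 | 922 | 14.056 | 16 |
| 25 | 4.533 | 0.846 | 480 | 27.000 | 27 | 4.091 | 28.229 | 334 | 38.802 | 48 |
| 50 | 4.750 | 1.179 | 216 | 60.000 | 60 | 4.493 | 18.330 | 176 | 73.636 | 96 |
| 75 | 4.833 | 1.259 | 162 | 80.000 | 80 | 4.720 | 13.638 | 116 | 111.724 | 144 |
| 100 | 5.283 | 1.608 | 120 | 108.000 | 108 | 4.861 | 13.179 | 100 | 129.600 | 144 |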
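Table 4. Heuristic vs. Mondrian: Results for the CMC dataset.

| k | Heuristic u | Heuristic r | Heuristic #h | Heuristic h_avg | Heuristic h_max | Mondrian u | Mondrian r | Mondrian #h | Mondrian h_avg | Mondrian h_max |
|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 3.274 | 0.068 | 718 | 2.052 | 4 | 3.469 | 9.134 | 599 | 2.459 | 7 |
| 3 | 3.508 | 0.111 | 461 | 3.195 | 7 | 3.814 | 6.375 | 391 | 3.767 | 8 |
| 4 | 3.735 | 0.152 | 334 | 4.410 | 9 | 4.139 | 4.848 | 273 | 5.396 | 10 |
| 5 | 3.934 | 0.178 | 258 | 5.709 | 15 | 4.331 | 4.205 | 223 | 6.605 | 11 |
| 6 | 4.115 | 0.212 | 216 | 6.819 | 17 | 4.538 | 3.729 | 184 | 8.005 | 13 |
| 7 | 4.219 | 0.244 | 183 | 8.049 | 17 | 4.724 | 3.328 | 155 | 9.503 | 16 |
| 8 | 4.410 | 0.251 | 158 | 9.323 | 18 | 4.913 | 3.085 | 135 | 10.911 | 17 |
| 9 | 4.500 | 0.288 | 139 | 10.597 | 18 | 5.023 | 2.914 | 122 | 12.074 | 21 |
| 10 | 4.545 | 0.282 | 127 | 11.598 | 18 | 5.178 | 2.717 | 108 | 13.639 | 21 |
| 25 | 5.641 | 0.447 | 48 | 30.688 | 53 | 6.434 | 1.862 | 43 | 34.256 | 57 |
| 50 | 6.319 | 0.559 | 27 | 54.556 | 77 | 7.272 | 1.556 | 22 | 66.955 | 95 |
| 75 | 6.926 | 0.685 | 17 | 86.647 | 148 | 7.836 | 1.404 | 13 | 113.308 | 148 |
| 100 | 7.271 | 0.752 | 13 | 113.308 | 167 | 7.981 | 1.368 | 10 | 147.300 | 204 |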
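Table 5. Heuristic vs. Mondrian: Results for the Canada dataset.

| k | Heuristic u | Heuristic r | Heuristic #h | Heuristic h_avg | Heuristic h_max | Mondrian u | Mondrian r | Mondrian #h | Mondrian h_avg | Mondrian h_max |
|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 1.403 | 13.018 | 63,140 | 5.139 | 2,448 | 1.476 | 3,504.560 | 15,984 | 2.409 | 9 |
| 3 | 1.415 | 13.599 | 41,408 | 7.836 | 2,448 | 1.577 | 2,196.720 | 10,233 | 3.763 | 11 |
| 4 | 1.429 | 14.679 | 31,652 | 10.252 | 2,448 | 1.654 | 1,600.540 | 7,458 | 5.163 | 12 |
| 5 | 1.442 | 14.870 | 25,852 | 12.552 | 2,448 | 1.714 | 1,252.050 | 5,887 | 6.540 | 13 |
| 6 | 1.456 | 15.318 | 22,150 | 14.650 | 2,448 | 1.766 | 1,040.750 | 4,856 | 7.929 | 15 |
| 7 | 1.469 | 15.434 | 19,399 | 16.727 | 2,448 | 1.803 | 894.510 | 4,139 | 9.302 | 17 |
| 8 | 1.482 | 16.071 | 17,276 | 18.783 | 2,448 | 1.834 | 783.056 | 3,618 | 10.642 | 19 |
| 9 | 1.495 | 16.198 | 15,651 | 20.733 | 2,448 | 1.863 | 694.625 | 3,191 | 12.066 | 22 |
| 10 | 1.508 | 16.631 | 14,248 | 22.774 | 2,448 | 1.888 | 622.441 | 2,840 | 13.557 | 27 |
| 25 | 1.646 | 20.154 | 6,167 | 52.617 | 2,448 | 2.119 | 272.739 | 1,120 | 34.378 | 57 |
| 50 | 1.773 | 23.383 | 2,988 | 108.597 | 2,448 | 2.354 | 158.413 | 563 | 68.389 | 103 |
| 75 | 1.861 | 25.736 | 1,917 | 169.269 | 2,448 | 2.516 | 116.970 | 356 | 108.154 | 154 |
| 100 | 1.929 | 27.600 | 1,393 | 232.942 | 2,838 | 2.595 | 103.402 | 279 | 138.004 | 201 |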
Canada. Again, our heuristic outperforms Mondrian in terms of efficiency (at least three times faster). However, for this dataset, the quality measures are contradictory. Whereas the usefulness of the heuristic results is always slightly better and the number of output row types of the heuristic results is at least four times the number of output row types of Mondrian results, the measures concerning the size of the output row types are significantly better for Mondrian. The reason seems to be that our heuristic always produces one block of at least 2,448 identical rows. See Table 5 for details.
Conclusions for Classical k-Anonymity. We showed that our greedy heuristic is very efficient, even for real-world datasets with more than 100,000 records and $k \le 100$. Especially for smaller degrees of anonymity $k \le 10$, Mondrian is at least ten times slower. Altogether, our heuristic outperforms Mondrian for all datasets, except Nursery, in terms of the quality of the solution. There is no clear winner for the Nursery dataset. Hence, we demonstrated that even when neglecting the feature of pattern-guidedness and simply specifying all possible pattern vectors, our heuristic already produces useful solutions that can at least compete with Mondrian’s solutions.

3.6. Heuristic vs. Exact Solution

In Section 3.5, we showed that our greedy heuristic is very efficient and produces good solutions, even if it is (mis)used to solve the classical k-Anonymity problem.
By design, the heuristic always produces solutions where every output row can be matched to some of the specified pattern vectors. However, the number of suppressions performed may be far from being optimal. Hence, by comparing with the exact solutions of the ILP implementation, we try to answer the question of how far the produced solutions are away from the optimum. We evaluate our experiments using the following criteria:
  • Number s of suppressions;
  • Usefulness value u;
  • Running time r in seconds;
  • Number $\#h$ of output row types;
  • Average size $h_{\mathrm{avg}}$ of the output row types; and
  • Maximum size $h_{\max}$ of the output row types.
Nursery. Our ILP implementation was able to k-anonymize the Nursery dataset for $k \in \{2, \ldots, 10, 25, 50, 75, 100\}$ within two minutes for each input, that is, we could solve k-Anonymity with a minimum number of suppressions. In contrast, the ILP formulation could not k-anonymize the other datasets within 30 min for many values of k.
Surprisingly, the results computed by the heuristic were optimal (in terms of the number of suppressed entries) for all tested k, and many results are better in terms of the other quality measures. The reason seems to be that the ILP implementation tends to find, for a fixed number of suppressions, solutions with a high degree of anonymity. For example, the result of the ILP for k = 6 is already 15-anonymous, whereas the result of the heuristic is 9-anonymous, yielding more and smaller output row types. Summarizing, the heuristic is at least 25 times faster than the ILP implementation and also produces solutions with a minimum number of suppressions, which have a better quality concerning the $\#h$, $h_{\mathrm{avg}}$ and $h_{\max}$ values. See Table 6 for details.
CMC. Consider the scenario where the user is interested in a k-anonymized version of the CMC dataset, where each row has at most two suppressed entries. To fulfill these constraints, we specified all possible pattern vectors with at most two ☆-symbols (plus the all-☆-vector to remove outliers) and applied our greedy heuristic and the ILP implementation for $k \in \{2, \ldots, 10, 25, 50, 75, 100\}$.
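Such a pattern mask is easy to enumerate. The following Python sketch generates all pattern vectors over $m$ columns with at most a given number of ☆-symbols and appends the all-☆ vector. The names `pattern_mask` and `patterns_with_at_most`, as well as the ASCII stand-ins `_` for □ and `*` for ☆, are assumptions for this illustration and not part of the published implementation.

```python
from itertools import combinations

KEEP, STAR = "_", "*"   # ASCII stand-ins for the box and star symbols


def patterns_with_at_most(m, max_stars):
    """All pattern vectors of length m with at most max_stars stars."""
    patterns = []
    for t in range(max_stars + 1):
        for positions in combinations(range(m), t):
            patterns.append(tuple(STAR if i in positions else KEEP
                                  for i in range(m)))
    return patterns


def pattern_mask(m, max_stars):
    """The patterns above plus the all-star vector for outliers."""
    mask = patterns_with_at_most(m, max_stars)
    all_star = (STAR,) * m
    if all_star not in mask:
        mask.append(all_star)
    return mask


cmc_mask = pattern_mask(10, 2)   # CMC has 10 attributes
```

For the ten CMC attributes this yields 1 + 10 + 45 = 56 pattern vectors with at most two suppressions, and 57 including the all-☆ vector, compared with $2^{10} = 1024$ pattern vectors in the unrestricted case.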
As expected, the heuristic works much faster than the ILP implementation (at least by a factor of ten). The solution quality depends on the anonymity degree k. The results of the heuristic get closer to the optimum with increasing k. Whereas for k = 2, the number of suppressions in the heuristic solution is 1.4 times the optimum, for k > 10, the heuristic produces results with a minimum number of suppressions. Most other quality measures behave similarly, but the differences are less pronounced. The usefulness values of the heuristic results are at most as good as those of the ILP results for $k \ge 7$ and $k \le 50$. See Table 7 for details.
Adult-2. Consider a user who is interested in the Adult-2 dataset. Her main goal is to analyze correlations between the income of the individuals and the other attributes (to detect discrimination). To get useful data, she specifies four constraints for an anonymized record.
  • Each record should contain at most two suppressed entries.
  • The attributes “education” and “salary class” should not be suppressed, because she assumes a strong relation between them.
  • One of the attributes, “work class” or “occupation”, alone is useless for her, so either both should be suppressed or none of them.
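Table 6. Heuristic vs. ILP: Results for the Nursery dataset specifying all pattern vectors.

| k | Heuristic s | Heuristic u | Heuristic r | Heuristic #h | Heuristic h_avg | Heuristic h_max | ILP s | ILP u | ILP r | ILP #h | ILP h_avg | ILP h_max |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 12,690 | 3.20 | 0.834 | 4,320 | 3.0 | 3 | 12,960 | 3.20 | 62.859 | 4,320 | 3.0 | 3 |
| 3 | 12,690 | 3.20 | 0.285 | 4,320 | 3.0 | 3 | 12,960 | 3.20 | 62.627 | 4,320 | 3.0 | 3 |
| 4 | 12,690 | 3.28 | 0.344 | 3,240 | 4.0 | 4 | 12,960 | 3.33 | 61.097 | 2,592 | 5.0 | 5 |
| 5 | 12,690 | 3.33 | 0.328 | 2,592 | 5.0 | 5 | 12,960 | 3.33 | 60.543 | 2,592 | 5.0 | 5 |
| 6 | 25,920 | 3.87 | 0.320 | 1,440 | 9.0 | 9 | 25,920 | 4.0 | 60.812 | 864 | 15.0 | 15 |
| 7 | 25,920 | 3.87 | 0.335 | 1,440 | 9.0 | 9 | 25,920 | 4.0 | 47.496 | 864 | 15.0 | 15 |
| 8 | 25,920 | 3.87 | 0.391 | 1,440 | 9.0 | 9 | 25,920 | 4.0 | 59.188 | 864 | 15.0 | 15 |
| 9 | 25,920 | 3.87 | 0.319 | 1,440 | 9.0 | 9 | 25,920 | 4.0 | 59.717 | 864 | 15.0 | 15 |
| 10 | 25,920 | 3.95 | 0.432 | 1,080 | 12.0 | 12 | 25,920 | 4.0 | 59.334 | 864 | 15.0 | 15 |
| 25 | 38,880 | 4.53 | 0.846 | 480 | 27.0 | 27 | 38,880 | 4.75 | 56.221 | 216 | 60.0 | 60 |
| 50 | 38,880 | 4.75 | 1.179 | 216 | 60.0 | 60 | 38,880 | 4.75 | 49.699 | 216 | 60.0 | 60 |
| 75 | 38,880 | 4.83 | 1.259 | 162 | 80.0 | 80 | 38,880 | 4.83 | 45.728 | 162 | 80.0 | 80 |
| 100 | 51,840 | 5.28 | 1.608 | 120 | 108.0 | 108 | 51,840 | 5.5 | 44.085 | 54 | 240.0 | 240 |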
Table 7. Heuristic vs. ILP: Results for the CMC dataset specifying all pattern vectors with costs of at most two.

| k | Heuristic s | Heuristic u | Heuristic r | Heuristic #h | Heuristic h_avg | Heuristic h_max | ILP s | ILP u | ILP r | ILP #h | ILP h_avg | ILP h_max |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 4,112 | 3.18 | 0.057 | 533 | 2.764 | 249 | 2,932 | 3.22 | 2.205 | 653 | 2.256 | 100 |
| 3 | 6,564 | 3.42 | 0.054 | 264 | 5.580 | 501 | 5,216 | 3.46 | 3.668 | 349 | 4.221 | 320 |
| 4 | 8,252 | 3.57 | 0.066 | 153 | 9.627 | 696 | 7,024 | 3.60 | 2.240 | 208 | 7.082 | 528 |
| 5 | 8,952 | 3.69 | 0.069 | 109 | 13.514 | 771 | 8,065 | 3.73 | 3.939 | 146 | 10.089 | 646 |
| 6 | 9,821 | 3.76 | 0.072 | 78 | 18.885 | 874 | 9,012 | 3.80 | 1.655 | 103 | 14.301 | 765 |
| 7 | 10,339 | 3.84 | 0.084 | 61 | 24.148 | 935 | 9,751 | 3.84 | 1.479 | 76 | 19.382 | 856 |
| 8 | 10,878 | 3.95 | 0.073 | 47 | 31.340 | 998 | 10,254 | 3.91 | 1.454 | 60 | 24.550 | 918 |
| 9 | 11,486 | 4.06 | 0.085 | 32 | 46.031 | 1,074 | 11,051 | 4.00 | 1.269 | 44 | 33.477 | 1,016 |
| 10 | 11,678 | 4.08 | 0.081 | 28 | 52.607 | 1,098 | 11,462 | 4.05 | 1.364 | 35 | 42.086 | 1,066 |
| 25 | 13,722 | 5.69 | 0.097 | 4 | 368.25 | 1,347 | 13,722 | 5.37 | 1.106 | 5 | 294.60 | 1,347 |
| 50 | 14,314 | 7.12 | 0.103 | 2 | 736.50 | 1,421 | 14,314 | 7.12 | 1.174 | 2 | 736.50 | 1,421 |
| 75 | 14,730 | 10.0 | 0.108 | 1 | 1,473.0 | 1,473 | 14,730 | 10.0 | 1.169 | 1 | 1,473.0 | 1,473 |
| 100 | 14,730 | 10.0 | 0.097 | 1 | 1,473.0 | 1,473 | 14,730 | 10.0 | 1.146 | 1 | 1,473.0 | 1,473 |
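Table 8. Heuristic vs. ILP: Results for the Adult-2 dataset with user-specified pattern vectors.

| k | Heuristic s | Heuristic u | Heuristic r | Heuristic #h | Heuristic h_avg | Heuristic h_max | ILP s | ILP u | ILP r | ILP #h | ILP h_avg | ILP h_max |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 38,312 | 1.73 | 0.61 | 9,214 | 3.53 | 2,356 | 29,056 | 1.75 | 19.91 | 9,765 | 3.33 | 893 |
| 3 | 55,749 | 1.81 | 0.57 | 5,313 | 6.13 | 3,896 | 43,887 | 1.84 | 41.36 | 6,315 | 5.16 | 1,831 |
| 4 | 67,618 | 1.87 | 0.55 | 3,676 | 8.86 | 5,077 | 54,162 | 1.91 | 53.13 | 4,754 | 6.85 | 2,592 |
| 5 | 76,363 | 1.91 | 0.60 | 2,777 | 11.7 | 5,967 | 61,701 | 1.96 | 163.9 | 4,034 | 8.07 | 3,183 |
| 6 | 83,598 | 1.95 | 0.60 | 2,214 | 14.7 | 6,736 | 68,278 | 2.01 | 183.1 | 3,386 | 9.62 | 3,737 |
| 7 | 89,501 | 1.99 | 0.65 | 1,849 | 17.6 | 7,346 | 74,160 | 2.06 | 322.7 | 2,833 | 11.5 | 4,275 |
| 8 | 94,086 | 2.02 | 0.62 | 1,581 | 20.6 | 7,801 | 79,109 | 2.08 | 110.3 | 2,509 | 13.0 | 4,846 |
| 9 | 98,999 | 2.04 | 0.63 | 1,360 | 23.9 | 8,333 | 84,065 | 2.11 | 68.58 | 2,046 | 15.9 | 5,560 |
| 10 | 103,624 | 2.07 | 0.68 | 1,194 | 27.3 | 8,863 | 88,026 | 2.15 | 52.39 | 1,762 | 18.5 | 5,997 |
| 25 | 141,697 | 2.31 | 0.75 | 395 | 82.4 | 13,237 | 125,233 | 2.43 | 18.43 | 684 | 47.6 | 10,046 |
| 50 | 173,947 | 2.53 | 0.85 | 164 | 198 | 17,110 | 161,083 | 2.62 | 7.067 | 288 | 113 | 14,636 |
| 75 | 196,218 | 2.57 | 0.93 | 97 | 336 | 20,040 | 185,870 | 2.69 | 6.179 | 153 | 213 | 18,063 |
| 100 | 207,417 | 2.57 | 0.93 | 73 | 446 | 21,465 | 197,421 | 2.74 | 6.552 | 102 | 319 | 19,648 |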
As the fourth constraint, since she assumes discrimination because of age, sex, and race, at most one of these attributes should be suppressed.
We generated the set of pattern vectors fulfilling her constraints (plus the all-☆-vector to remove outliers) and applied our greedy heuristic and the ILP implementation for $k \in \{2, 3, \ldots, 10, 25, 50, 75, 100\}$.
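One way to derive such a pattern mask is to enumerate all suppression sets over the nine Adult-2 attributes and keep those that satisfy the four constraints. The Python sketch below does this under our reading of the constraints; the attribute order follows the list in Section 3.1, `_` and `*` are ASCII stand-ins for □ and ☆, and the function names are assumptions for this illustration rather than the code used in the experiments.

```python
from itertools import product

ATTRS = ["age", "work class", "education", "marital status", "occupation",
         "race", "sex", "native country", "salary class"]


def allowed(stars):
    """Check the four user constraints for a set of suppressed attributes."""
    if len(stars) > 2:                                   # at most two stars
        return False
    if stars & {"education", "salary class"}:            # never suppressed
        return False
    if ("work class" in stars) != ("occupation" in stars):  # both or none
        return False
    if len(stars & {"age", "sex", "race"}) > 1:          # at most one of these
        return False
    return True


def adult2_pattern_mask():
    """All admissible pattern vectors plus the all-star vector."""
    mask = []
    for bits in product([False, True], repeat=len(ATTRS)):
        stars = {a for a, b in zip(ATTRS, bits) if b}
        if allowed(stars):
            mask.append(tuple("*" if b else "_" for b in bits))
    mask.append(("*",) * len(ATTRS))                     # removes outliers
    return mask


mask = adult2_pattern_mask()
```

Under this reading, 14 of the $2^9 = 512$ possible pattern vectors are admissible (one without suppressions, five with one suppression, eight with two), giving a pattern mask of 15 vectors including the all-☆ vector.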
The ILP implementation took up to six minutes to compute one single instance, whereas the greedy heuristic always needs less than one second. Moreover, the solution quality of the heuristic results is surprisingly good. The number of suppressed entries is at most 1.31 times the optimum. The ILP is slightly better concerning the measures $\#h$ and $h_{\mathrm{avg}}$. Only the maximum size of the output row types of the heuristic results is, for some k, more than twice the maximum size of the output row types of the ILP results. Surprisingly, the usefulness values are always slightly better for the heuristic results. See Table 8 for details.

4. Conclusions

In three scenarios with real-world datasets, we showed that our greedy heuristic performs well in terms of solution quality compared with the optimal solution produced by the ILP implementation. The results of the heuristic are relatively close to the optimum, and in fact, for many cases, they were optimal, although our heuristic is much more efficient than the exact algorithm (the ILP was, on average, more than 1000 times slower). The heuristic results tend to get closer to the optimal number of suppressions with increasing degree k of anonymity.

5. Outlook

We introduced a promising approach to combinatorial data anonymization by enhancing the basic k-Anonymity problem with user-provided “suppression patterns.” It seems feasible to extend our model with weights on the attributes, thus making user influence on the anonymization process even more specific. A natural next step is to extend our model by replacing k-Anonymity by more refined data privacy concepts, such as domain generalization hierarchies [40], p-sensitivity [41], ℓ-diversity [23] and t-closeness [38].
On the theoretical side, we did no extensive analysis of the polynomial-time approximability of Pattern-Guided k-Anonymity. Are there provably good approximation algorithms for Pattern-Guided k-Anonymity? Concerning exact solutions, are there further polynomial-time solvable special cases beyond Pattern-Guided 2-Anonymity?
On the experimental side, several issues remain to be attacked. For instance, we used integer linear programming in a fairly straightforward way almost without any tuning tricks (e.g., using the heuristic solution or “standard heuristics” for speeding up integer linear program solving). It also remains to perform tests comparing our heuristic algorithm against methods other than Mondrian (unfortunately, for the others, no source code seems to be freely available).

Acknowledgments

We thank our students, Thomas Köhler and Kolja Stahl, for their great support in doing implementations and experiments. We are grateful to anonymous reviewers of Algorithms for constructive and extremely fast feedback (including the spotting of a bug in the proof of Theorem 4) that helped to improve our presentation. Robert Bredereck was supported by the DFG, research project PAWS, NI 369/10.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Fung, B.C.M.; Wang, K.; Chen, R.; Yu, P.S. Privacy-preserving data publishing: A survey of recent developments. ACM Comput. Surv. 2010, 42, 14:1–14:53. [Google Scholar] [CrossRef]
  2. Navarro-Arribas, G.; Torra, V.; Erola, A.; Castellà-Roca, J. User k-anonymity for privacy preserving data mining of query logs. Inf. Process. Manag. 2012, 48, 476–487. [Google Scholar] [CrossRef]
  3. Dwork, C. A firm foundation for private data analysis. Commun. ACM 2011, 54, 86–95. [Google Scholar] [CrossRef]
  4. Bonizzoni, P.; Della Vedova, G.; Dondi, R. Anonymizing binary and small tables is hard to approximate. J. Comb. Optim. 2011, 22, 97–119. [Google Scholar] [CrossRef]
  5. Bonizzoni, P.; Della Vedova, G.; Dondi, R.; Pirola, Y. Parameterized complexity of k-anonymity: Hardness and tractability. J. Comb. Optim. 2013, 26, 19–43. [Google Scholar] [CrossRef]
  6. Chakaravarthy, V.T.; Pandit, V.; Sabharwal, Y. On the complexity of the k-anonymization problem. 2010; arXiv:1004.4729. [Google Scholar]
  7. Blocki, J.; Williams, R. Resolving the Complexity of Some Data Privacy Problems. In Proceedings of the 37th International Colloquium on Automata, Languages and Programming (ICALP ’10), Bordeaux, France, 6–10 July 2010; Springer: Berlin/Heidelberg, Germany, 2010; Volume 6199, LNCS. pp. 393–404. [Google Scholar]
  8. Meyerson, A.; Williams, R. On the Complexity of Optimal k-Anonymity. In Proceedings of the 23rd ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS ’04), Paris, France, 14–16 June 2004; ACM: New York, NY, USA, 2004; pp. 223–228. [Google Scholar]
  9. Campan, A.; Truta, T.M. Data and Structural k-Anonymity in Social Networks. In Proceedings of the 2nd ACM SIGKDD International Workshop on Privacy, Security, and Trust in KDD (PinKDD ’08), Las Vegas, NV, USA, 24 August 2008; Springer: Berlin/Heidelberg, Germany, 2009; Volume 5456, LNCS. pp. 33–54. [Google Scholar]
  10. Gkoulalas-Divanis, A.; Kalnis, P.; Verykios, V.S. Providing k-Anonymity in location based services. ACM SIGKDD Explor. Newslett. 2010, 12, 3–10. [Google Scholar] [CrossRef]
  11. Loukides, G.; Shao, J. Capturing Data Usefulness and Privacy Protection in k-Anonymisation. In Proceedings of the 2007 ACM Symposium on Applied Computing, Seoul, Korea, 11–15 March 2007; ACM: New York, NY, USA, 2007; pp. 370–374. [Google Scholar]
  12. Rastogi, V.; Suciu, D.; Hong, S. The Boundary between Privacy and Utility in Data Publishing. In Proceedings of the 33rd International Conference on Very Large Data Bases, VLDB Endowment, Vienna, Austria, 23–27 September 2007; pp. 531–542.
  13. Bredereck, R.; Nichterlein, A.; Niedermeier, R.; Philip, G. Pattern-Guided Data Anonymization and Clustering. In Proceedings of the 36th International Symposium on Mathematical Foundations of Computer Science (MFCS ’11), Warsaw, Poland, 22–26 August 2011; Springer: Berlin/Heidelberg, Germany, 2011; Volume 6907, LNCS. pp. 182–193. [Google Scholar]
  14. Bredereck, R.; Köhler, T.; Nichterlein, A.; Niedermeier, R.; Philip, G. Using patterns to form homogeneous teams. Algorithmica 2013. [Google Scholar] [CrossRef]
  15. Samarati, P. Protecting respondents identities in microdata release. IEEE Trans. Knowl. Data Eng. 2001, 13, 1010–1027. [Google Scholar] [CrossRef]
  16. Samarati, P.; Sweeney, L. Generalizing Data to Provide Anonymity When Disclosing Information. In Proceedings of the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS ’98), Seattle, WA, USA, 1–3 June 1998; ACM: New York, NY, USA, 1998; pp. 188–188. [Google Scholar]
  17. Sweeney, L. k-anonymity: A model for protecting privacy. Int. J. Uncertain. Fuzziness Knowl.-Based Syst. 2002, 10, 557–570. [Google Scholar] [CrossRef]
  18. Downey, R.G.; Fellows, M.R. Parameterized Complexity; Springer: Berlin/Heidelberg, Germany, 1999. [Google Scholar]
  19. Flum, J.; Grohe, M. Parameterized Complexity Theory; Springer: Berlin/Heidelberg, Germany, 2006. [Google Scholar]
  20. Niedermeier, R. Invitation to Fixed-Parameter Algorithms; Oxford University Press: Oxford, UK, 2006. [Google Scholar]
  21. LeFevre, K.; DeWitt, D.; Ramakrishnan, R. Mondrian Multidimensional k-anonymity. In Proceedings of the IEEE 22nd International Conference on Data Engineering (ICDE ’06), Atlanta, GA, USA, 3–7 April 2006; IEEE Computer Society: Washington, DC, USA, 2006; pp. 25–25. [Google Scholar]
  22. Frank, A.; Asuncion, A. UCI Machine Learning Repository, University of California, Irvine, School of Information and Computer Sciences. 2010. http://archive.ics.uci.edu/ml.
  23. Machanavajjhala, A.; Kifer, D.; Gehrke, J.; Venkitasubramaniam, M. -diversity: Privacy beyond k-anonymity. ACM Trans. Knowl. Discov. Data 2007, 1. [Google Scholar] [CrossRef]
  24. Evans, P.A.; Wareham, T.; Chaytor, R. Fixed-parameter tractability of anonymizing data by suppressing entries. J. Comb. Optim. 2009, 18, 362–375. [Google Scholar] [CrossRef]
  25. Karp, R.M. Reducibility Among Combinatorial Problems. In Complexity of Computer Computations; Miller, R.E., Thatcher, J.W., Eds.; Plenum Press: New York, NY, USA, 1972; pp. 85–103. [Google Scholar]
  26. Anshelevich, E.; Karagiozova, A. Terminal backup, 3D matching, and covering cubic graphs. SIAM J. Comput. 2011, 40, 678–708. [Google Scholar] [CrossRef]
  27. Bredereck, R.; Nichterlein, A.; Niedermeier, R.; Philip, G. The effect of homogeneity on the computational complexity of combinatorial data anonymization. Data Min. Knowl. Discov. 2012. [Google Scholar] [CrossRef]
  28. Fredkin, E. Trie memory. Commun. ACM 1960, 3, 490–499. [Google Scholar] [CrossRef]
  29. Lenstra, H.W. Integer programming with a fixed number of variables. Math. Oper. Res. 1983, 8, 538–548. [Google Scholar] [CrossRef]
  30. Adult dataset. Available online: ftp://ftp.ics.uci.edu/pub/machine-learning-databases/adult/ (accessed on 16 October 2013).
  31. Nursery dataset. Available online: ftp://ftp.ics.uci.edu/pub/machine-learning-databases/nursery/ (accessed on 16 October 2013).
  32. Olave, M.; Rajkovic, V.; Bohanec, M. An application for admission in public school systems. Expert Syst. Public Adm. 1989, 145, 145–160.
  33. CMC dataset. Available online: ftp://ftp.ics.uci.edu/pub/machine-learning-databases/cmc/ (accessed on 16 October 2013).
  34. Canada dataset. Available online: http://www.hc-sc.gc.ca/dhp-mps/medeff/databasdon/index-eng.php (accessed on 16 October 2013).
  35. Canada attribute names. Available online: http://www.hc-sc.gc.ca/dhp-mps/medeff/databasdon/structure-eng.php#a1 (accessed on 16 October 2013).
  36. Mainland, G.; Leshchinskiy, R.; Peyton Jones, S. Exploiting Vector Instructions with Generalized Stream Fusion. In Proceedings of the 18th ACM SIGPLAN International Conference on Functional Programming (ICFP ’13), Boston, MA, USA, 25–27 September 2013; ACM: New York, NY, USA, 2013; pp. 37–48.
  37. Pattern-Guided k-Anonymity heuristic. Available online: http://akt.tu-berlin.de/menue/software/ (accessed on 16 October 2013).
  38. Li, N.; Li, T.; Venkatasubramanian, S. t-Closeness: Privacy Beyond k-Anonymity and l-Diversity. In Proceedings of the IEEE 23rd International Conference on Data Engineering (ICDE ’07), Istanbul, Turkey, 15–20 April 2007; pp. 106–115.
  39. Mondrian implementation. Available online: http://cs.utdallas.edu/dspl/cgi-bin/toolbox/index.php?go=home (accessed on 16 October 2013).
  40. Sweeney, L. Achieving k-anonymity privacy protection using generalization and suppression. Int. J. Uncertain. Fuzziness Knowl.-Based Syst. 2002, 10, 571–588.
  41. Truta, T.M.; Vinay, B. Privacy Protection: p-Sensitive k-Anonymity Property. In Proceedings of the 22nd International Conference on Data Engineering Workshops (ICDE ’06), Atlanta, GA, USA, 3–7 April 2006; IEEE Computer Society: Washington, DC, USA, 2006; p. 94.