Privacy-Preserving Feature Selection with Fully Homomorphic Encryption

For the feature selection problem, we propose an efficient privacy-preserving algorithm. Let $D$, $F$, and $C$ be data, feature, and class sets, respectively, where the feature value $x(F_i)$ and the class label $x(C)$ are given for each $x\in D$ and $F_i \in F$. For a triple $(D,F,C)$, the feature selection problem is to find a consistent and minimal subset $F' \subseteq F$, where `consistent' means that, for any $x,y\in D$, $x(C)=y(C)$ if $x(F_i)=y(F_i)$ for all $F_i\in F'$, and `minimal' means that any proper subset of $F'$ is no longer consistent. On distributed datasets, we consider feature selection as a privacy-preserving problem: assume that semi-honest parties $\textsf A$ and $\textsf B$ have their own private data $D_{\textsf A}$ and $D_{\textsf B}$. The goal is to solve the feature selection problem for $D_{\textsf A}\cup D_{\textsf B}$ without revealing either party's private data. In this paper, we propose a secure and efficient algorithm based on fully homomorphic encryption, and we implement our algorithm to show its effectiveness for various practical data. The proposed algorithm is the first one that can directly simulate the CWC (Combination of Weakest Components) algorithm on ciphertext, which is one of the best performers for the feature selection problem on plaintext.


Motivation
Feature selection is one of the most common problems in machine learning. For example, the human genome contains 3.1 billion base pairs, only a few dozen of which are thought to affect a specific disease. Various machine learning algorithms make use of favorable features extracted from such sparse data.
Consider a data set $D$ associated with a feature set $F$ and a class variable $C$, where all feature values $x(F_i)$ ($F_i \in F$) and the corresponding class label $x(C)$ are defined for each data $x \in D$. Table 1 shows a concrete example. Given a triple $(D, F, C)$, the feature selection problem is to find a minimal $F' \subseteq F$ that is relevant to the class $C$. The relevance of $F'$ is evaluated, for example, by $I(F'; C)$, which measures the mutual information between $F'$ and $C$. On the other hand, $F'$ is minimal if any proper subset of $F'$ is no longer consistent.
To the best of our knowledge, the most common method for identifying favorable features is to choose features that show higher relevance in some statistical measure. Individual feature relevance can be estimated using statistical measures such as mutual information and Bayesian risk. For example, the bottom row of Table 1 gives the mutual information score $I(F_i; C)$ of each feature $F_i$ to the class labels: 0.189, 0.189, 0.049, 0.000, and 0.000 for $F_1, \dots, F_5$, respectively. We can see that $F_1$ is more important than $F_5$, because $I(F_1; C) > I(F_5; C)$. Based on the mutual information score, $F_1$ and $F_2$ of Table 1 will be chosen to explain $C$. However, a closer examination of $D$ reveals that $F_1$ and $F_2$ cannot uniquely determine $C$. In fact, we find $x_2$ and $x_5$ with $x_2(F_1) = x_5(F_1)$ and $x_2(F_2) = x_5(F_2)$ but $x_2(C) \neq x_5(C)$. On the other hand, we can see that $F_4$ and $F_5$ uniquely determine $C$ using the formula $C = F_4 \oplus F_5$, while $I(F_4; C) = I(F_5; C) = 0$. As a result, the traditional method based on the relevance scores of individual features misses the right answer.
Table 1: An example dataset shown in [1].
We therefore concentrate on the concept of consistency: $F' \subseteq F$ is said to be consistent if, for any $x, y \in D$, $x(F_i) = y(F_i)$ for all $F_i \in F'$ implies $x(C) = y(C)$. In machine learning research, consistency-based feature selection has received a lot of attention [2][3][4][5][6]. CWC (Combination of Weakest Components) [2] is the simplest of such consistency-based feature selection algorithms, and even though CWC uses the most rigorous measure, it shows one of the best performances in terms of accuracy as well as computational speed compared to other methods [1].
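The gap between individual relevance and consistency can be checked on a small case. The following is a minimal sketch on hypothetical toy data (not the dataset of Table 1), in which the class is the XOR of the last two of three binary features; the function name `is_consistent` is introduced here for illustration only.

```python
from itertools import combinations, product

def is_consistent(data, labels, subset):
    """Return True if equal values on `subset` always imply equal class labels."""
    seen = {}
    for row, label in zip(data, labels):
        key = tuple(row[i] for i in subset)
        # setdefault stores the first label seen for this key and returns it
        if seen.setdefault(key, label) != label:
            return False
    return True

# Hypothetical toy data: all combinations of three binary features,
# with the class defined as the XOR of the second and third feature.
data = list(product([0, 1], repeat=3))
labels = [row[1] ^ row[2] for row in data]

# Only the pair (second, third) determines the class; no single feature does.
consistent_pairs = [s for s in combinations(range(3), 2)
                    if is_consistent(data, labels, s)]
```

In this toy case neither XOR operand is consistent on its own, so a selection rule that scores features one at a time has no reason to prefer them.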
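To design a secure protocol for feature selection, we focus on the framework of homomorphic encryption. Given a public-key encryption scheme $E$, let $E[m]$ denote a ciphertext of a message $m$. If $E[m + n]$ can be computed from $E[m]$ and $E[n]$ without decrypting them, then $E$ is said to be additive homomorphic. If, in addition, $E[mn]$ can also be computed, then $E$ is said to be fully homomorphic. Furthermore, modern public-key encryption must be probabilistic: when the same message $m$ is encrypted multiple times, the encryption algorithm produces different ciphertexts $E[m]$.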
Various homomorphic encryption schemes have been proposed to satisfy those homomorphic properties over the last two decades. The first additive homomorphic encryption was proposed by Paillier [7]. Somewhat homomorphic encryption that allows a sufficient number of additions and a limited number of multiplications has also been proposed [8][9][10], and we can use these cryptosystems to compute more difficult problems, such as the inner product of two vectors. Gentry [11] proposed the first fully homomorphic encryption (FHE) with an unlimited number of additions and multiplications, and since then, useful libraries for fully homomorphic encryption have been developed, particularly for bitwise operations and floating-point operations. TFHE [12,13] is known as one of the fastest fully homomorphic encryption schemes and is optimized for bitwise operations.
For the private feature selection problem, we use TFHE to design and implement our algorithm. In this case, we assume two semi-honest parties A and B: each party complies with the protocol but tries to infer as much as possible about the secret from the information obtained. The parties have their own private data $D_A$ and $D_B$, and they jointly compute advantageous features for $D_A \cup D_B$ while maintaining their privacy. The goal is to jointly compute the result of the CWC algorithm on $D = D_A \cup D_B$ without revealing any other information.
This is a realistic requirement whenever one wants to draw conclusions from data that are privately distributed over more than one party. Multi-party computation (MPC) can provide effective technical solutions to realize this requirement in many cases. In MPC, a computation that essentially relies on the distributed data is performed through cooperation among the parties. In particular, fully homomorphic encryption (FHE) is one of the critical tools of MPC. One of the most significant advantages of FHE-based MPC is thought to be that FHE realizes outsourced computation in a simple and straightforward manner: parties encrypt their private data with their public keys and send the encrypted data to a single trusted party with sufficient computational power to perform the required computation. Although the computational results by the trusted party may be wrong if some malicious parties send wrong data, honest parties are at least convinced that their private data have not been stolen as long as the cryptosystem used is secure. In contrast, when a party shares his/her secret with other parties to perform MPC, even if it uses a secure secret sharing scheme, collusion of a sufficient number of compromised parties may reveal the party's secret. In general, it is hard to prove the security of MPC protocols for the situation where we cannot deny the existence of active malicious parties, and hence, the security is very often proven assuming that all the parties are at worst semi-honest. In reality, however, even this relaxed assumption is hard to hold. Thus, the property that a party can protect its private data relying only on its own efforts should be counted as an important advantage of FHE-based MPC.
Table 2: Time and space complexities of the baseline and improved algorithms for secure CWC, where $k$ is the number of features and $m, n$ are the numbers of positive and negative data, respectively. We assume that the time of each operation (e.g., encryption/addition/multiplication/comparison) in FHE is $O(1)$.
On the other hand, the current implementations of FHE are thought to be significantly inefficient, and consequently, their range of application is actually limited. This is currently true, but may not be true in the future. The Goldwasser-Micali (GM) cryptosystem [14] is regarded as the first scheme with provable security. Unfortunately, because the GM cryptosystem encrypts data in a bit-wise manner, it has turned out not to have sufficient efficiency in time and memory to be used in the real world. In 2001, however, RSA-OAEP was finally proven to have both provable security and realistic efficiency [15,16], and it is widely used through SSL/TLS. Thus, studying FHE-based MPC does not merely have theoretical meaning, but will also yield significant contributions in terms of application to the real world in the future.
In this paper, we propose an MPC protocol which relies on FHE-based outsourced computation as well as mutual cooperation among parties. The target of our protocol is to perform the computation of CWC, a feature selection algorithm known to be accurate and efficient, while preserving the privacy of the participating parties. If we fully perform CWC by FHE-based outsourced computation, we have to pay unnecessarily large costs in time in the feature sorting phase of CWC. Therefore, in our proposed scheme, the two parties cooperate with each other to sort features efficiently.
Converting CWC into its privacy-preserving version based on different primitives of MPC, for example based on secret sharing techniques, is not only interesting but also useful both in theory and in practice. We will pursue this direction as well in our future work. Table 2 summarizes the complexities of the proposed algorithms in comparison to the original CWC on plaintext. The baseline is a naive algorithm that can simulate the original CWC [2] over ciphertext using TFHE operations. The bottleneck of private feature selection lies in the sorting task over ciphertext, as we mention in the related work below. Our main contribution is the improved algorithm, shown as 'improved', which significantly reduces the time complexity caused by the sorting task. We also implement the improved algorithm and demonstrate its efficiency through experiments in comparison to the baseline.

Our contribution and related work
In this section, we discuss related work on private feature selection as well as the benefits of our method. Rao et al. [17] proposed a homomorphic encryption-based private feature selection algorithm. Their protocol supports the additive homomorphic property only, which invariably leaks statistical information about the data. Anaraki and Samet [18] proposed a different method based on rough set theory, but their method suffers from the same limitations as that of Rao et al., and neither method has been implemented. As a different approach to private feature selection, Banerjee et al. [19] and Sheikhalishahi and Martinelli [20] proposed MPC-based algorithms that guarantee security by decomposing the plaintext into shares while achieving cooperative computation. Li et al. [21] improved the MPC protocol with respect to the aforementioned flaw and demonstrated its effectiveness through experiments.
These methods avoid partial decoding under the assumption that the mean of feature values provides a good criterion for feature selection. This assumption, however, is heavily dependent on the data. The most important task in general feature selection is feature-value-based sorting, and CWC and its variants [1,2,5] demonstrated the effectiveness of sorting with the consistency measure and its superiority over other methods. This study realizes a sorting-based feature selection algorithm (e.g., CWC) on ciphertext.
As another study that employs sorting for private machine learning, we focus on decision tree learning by MPC [22], where the sorting is limited to the comparison of $N$ fixed-length values in $O(N \log^2 N)$ time by a sorting network.
In the case of CWC, however, the algorithm must sort $N$ data points, each of which has a variable length of up to $M$, so a naive method requires $O(MN \log N + N \log^2 N)$ time. Our algorithm reduces this complexity to $O(MN + N \log^2 N)$, which is significantly smaller than that of the naive algorithm, depending on $M$ and $N$. By experiments, we confirm this for various data including real datasets for machine learning.
Table 3: The data consists of two positive data $\{x_1, x_2\}$ and five negative data $\{y_1, y_2, y_3, y_4, y_5\}$.

CWC algorithm over plaintext
We generally assume that the dataset $D$ associated with $F$ and $C$ contains no errors, i.e., if $x(F_i) = y(F_i)$ for all $F_i \in F$, then $x(C) = y(C)$. When $D$ contains such errors, they are removed beforehand, so that $D$ contains no more than one $x \in D$ with the same feature values.
In Algorithm 1, we describe the original algorithm for finding a minimal consistent feature set. Given $D$ with $F$ and $C = \{0, 1\}$, a data $x \in D$ with $x(C) = 1$ is referred to as a positive data and $y \in D$ with $y(C) = 0$ is referred to as a negative data. Let $n$ represent the number of positive data and $m = |D| - n$. Let $x_p$ represent the $p$-th positive data ($1 \le p \le n$) and $y_q$ represent the $q$-th negative data ($1 \le q \le m$). Then, the bit string $B_i$ of length $nm$ is defined by $B_i[m(p-1)+q] = 0$ if $x_p(F_i) = y_q(F_i)$ and $B_i[m(p-1)+q] = 1$ otherwise, for any $p \in [1, n]$ and $q \in [1, m]$. $\|B_i\|$ is defined to be the number of 1s in $B_i$.
For a subset $F' \subseteq F$, $F'$ is said to be consistent if, for any $p \in [1, n]$ and $q \in [1, m]$, there exists $i$ such that $F_i \in F'$ and $B_i[m(p-1)+q] = 1$ hold. CWC uses this to remove irrelevant features from $F$ in order to build a minimal consistent feature set.
Table 3 shows an example of $D$, and Table 4 shows the corresponding $B_i$. Consider the behavior of CWC in this case.
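The elimination procedure described in this section can be sketched on plaintext as follows. This is a minimal illustration rather than the listing of Algorithm 1: it assumes features are visited in increasing order of $\|B_i\|$ with ties broken by index, and the function names `bit_string` and `cwc` are hypothetical.

```python
def bit_string(pos, neg, i):
    """B_i: one bit per (positive, negative) pair, 1 iff the pair differs on feature i."""
    return [int(x[i] != y[i]) for x in pos for y in neg]

def cwc(pos, neg, k):
    """Return the indices of a minimal consistent feature set (plaintext sketch)."""
    B = [bit_string(pos, neg, i) for i in range(k)]
    # Visit the weakest features first: increasing number of 1s in B_i.
    order = sorted(range(k), key=lambda i: sum(B[i]))
    selected = set(range(k))
    for i in order:
        rest = selected - {i}
        # Still consistent iff every pair is separated by some remaining feature.
        if all(any(B[j][t] for j in rest) for t in range(len(B[i]))):
            selected = rest
    return sorted(selected)

# Hypothetical toy data: the class is the XOR of the last two features.
pos = [(0, 0, 1), (0, 1, 0), (1, 0, 1), (1, 1, 0)]
neg = [(0, 0, 0), (0, 1, 1), (1, 0, 0), (1, 1, 1)]
result = cwc(pos, neg, 3)
```

Each feature is tested for removal exactly once, so the loop performs $k$ consistency checks over bit strings of length $nm$.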
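Let $X = \{X_k\}_{k \in \mathbb{N}}$ and $Y = \{Y_k\}_{k \in \mathbb{N}}$ be sequences of random variables such that $X_k$ and $Y_k$ are defined over the same sample space. We say that $X$ and $Y$ are indistinguishable, denoted by $X \equiv_c Y$, if and only if, for any PPT Turing machine $D$, $|\Pr[D(X_k) = 1] - \Pr[D(Y_k) = 1]|$ is a negligible function in $k$.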

Security of multi-party computation (MPC)
Although the discussion of this section can be extended to MPC schemes which involve more than two parties, just for simplicity, we focus on the case where only two parties are involved.
A two-party protocol is a pair $\Pi = (P_1, P_2)$ of PPT Turing machines with input and random tapes. Let $x_i$ be an input of $P_i$ and $y_i$ be an output of $P_i$.
We assume a semi-honest adversary $A$ and consider a protocol $(A, P_2)$, replacing $P_1$ in $\Pi$ by $A$, where $A$ takes $x_1$ as input and apparently follows the protocol. Let $\mathrm{REAL}_{\Pi,A}(x_1, x_2)$ denote the random variable representing the output $(y_1, y_2)$ of $(A, P_2)$, and let $\mathrm{REAL}_{\Pi,A}$ denote the class of these random variables. On the other hand, let $\mathcal{F}$ denote the functionality that the protocol $\Pi$ is trying to realize, i.e., $\mathcal{F}$ is a PPT that simulates the honest $(P_1, P_2)$ so that $\mathcal{F}(x_1, x_2) \equiv (P_1(x_1), P_2(x_2))$. Here, we assume a completely reliable third party, denoted by $\mathcal{F}$. In this ideal world, for this $\mathcal{F}$ and any adversary $B$ acting as $P_1$ with input $x_1$, the random variable $\mathrm{IDEAL}_{\mathcal{F},B}(x_1, x_2)$ and the class $\mathrm{IDEAL}_{\mathcal{F},B}$ are defined analogously. Using such random variables, we define the security of protocol $\Pi$ as follows.
Definition 1 It is said that a protocol $\Pi$ securely realizes a functionality $\mathcal{F}$ if, for any attacker $A$ against $\Pi$, there exists an adversary $B$ such that $\mathrm{REAL}_{\Pi,A} \equiv_c \mathrm{IDEAL}_{\mathcal{F},B}$ holds.
The definitions stated above can be intuitively explained as follows. Exactly conforming to the protocol, a semi-honest adversary $A$ plays the role of $P_1$ to steal any secrets. The information sources which $A$ can take advantage of are the following three: 1. the input tape to $P_1$; 2. the conversation with $P_2$; 3. the execution of the protocol.
While the information that $A$ can obtain from the first and third sources is exactly $x_1$ and $y_1$, respectively, we call the information from the second source a view. To denote it, we use the symbol $\mathrm{View}_{P_1}$.
Since the protocol inevitably requires that $A$ obtains the information of $x_1$ and $y_1$, the security of the protocol concerns what $A$ can obtain in addition to what can be computationally inferred from $x_1$ and $y_1$. If there exists such information, its source must be $\mathrm{View}_{P_1}$.
The security criterion of simulatability requires that $\mathrm{View}_{P_1}$ can be simulated on input of $x_1$ and $y_1$. To be formal, there exists a PPT Turing machine $\mathrm{Sim}$ that outputs a view on input of $x_1$ and $y_1$ such that the output view cannot be distinguished from $\mathrm{View}_{P_1}$ by any PPT Turing machine. When $\mathrm{View}_{P_1}$ is simulatable, $A$ can generate by itself, by running $\mathrm{Sim}$, what $A$ can obtain from $\mathrm{View}_{P_1}$. Therefore, $A$ cannot obtain any information in addition to what $A$ can compute from $x_1$ and $y_1$.

IND-CPA
Indistinguishability against chosen plaintext attack (IND-CPA) is an important criterion for secrecy of a public key cryptosystem. We let Π = (Gen, Enc, Dec) denote a public key cryptosystem consisting of key generation, encryption, and decryption algorithms. To describe IND-CPA, we introduce the IND-CPA game played between an adversary A and an oracle O: A is a PPT Turing machine, and k is the security parameter.
1. $O$ generates a public key pair $(sk, pk) \leftarrow \mathsf{Gen}(1^k)$.
We view $b$ and $b'$ as random variables whose underlying probability space is defined to represent the choices of the public key pair, $b$ and $b'$. The advantage of the adversary $A$ is defined as follows to represent the advantage of $A$ over tossing a fair coin to guess $O$'s secret $b$:

When we let
This definition of the advantage is consistent with the common definition found in many textbooks:
Definition 2 A public key cryptosystem $\Pi$ is secure in the sense of IND-CPA, or simply IND-CPA secure, if $\mathrm{Adv}_A$ as a function in $k$ is a negligible function.
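For reference, the IND-CPA game and the advantage are commonly stated in the following textbook form. This is given as an assumption about the standard formulation, not as a reproduction of the steps and formula of this paper, and the normalization of the advantage may differ by a constant factor.

```latex
\begin{enumerate}
  \item $O$ generates $(sk, pk) \leftarrow \mathsf{Gen}(1^k)$ and gives $pk$ to $A$.
  \item $A$ chooses two messages $m_0, m_1$ of equal length and sends them to $O$.
  \item $O$ selects $b \in \{0,1\}$ uniformly at random and returns
        $c \leftarrow \mathsf{Enc}(pk, m_b)$ to $A$.
  \item $A$ outputs a guess $b' \in \{0,1\}$.
\end{enumerate}
\[
  \mathrm{Adv}_A(k)
    = \left| \Pr[b' = b] - \tfrac{1}{2} \right|
    = \tfrac{1}{2} \left| \Pr[b' = 1 \mid b = 1] - \Pr[b' = 1 \mid b = 0] \right| .
\]
```

The second equality follows from $\Pr[b' = b] = \tfrac{1}{2}\Pr[b' = 1 \mid b = 1] + \tfrac{1}{2}\left(1 - \Pr[b' = 1 \mid b = 0]\right)$, since $b$ is uniform.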

TFHE: a faster fully homomorphic encryption
The proposed private feature selection is based on FHE. We review TFHE [13], one of the fastest libraries for bitwise addition (that is, XOR '$\oplus$') and bitwise multiplication (AND '$\cdot$') over ciphertext. In this section, we go over how to build the adder and comparison operations. Let $x, y$ represent $\ell$-bit integers and $x_i, y_i$ represent the $i$-th bits of $x, y$, respectively. Let $c_i$ represent the $i$-th carry-in bit and $s_i$ the $i$-th bit of the sum $x + y$. Then, we can obtain $E[x + y]$ by bitwise operations on the ciphertexts. Adopting those operations of TFHE, we design a secure multi-party CWC. In this paper, we omit the details of TFHE (see, e.g., [12,13]).
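How an adder and a comparison can be built from XOR and AND alone can be sketched on plaintext bits. The circuits below are one standard construction (a ripple-carry adder and a bit-serial comparator), chosen here as an assumption; they are not necessarily the formulas used in this paper, and no TFHE library call is involved. Under FHE, each `^` and `&` would be replaced by the corresponding homomorphic gate.

```python
def to_bits(v, n):
    """Little-endian list of the n low bits of v."""
    return [(v >> i) & 1 for i in range(n)]

def from_bits(bits):
    return sum(bit << i for i, bit in enumerate(bits))

def add_bits(x, y):
    """Ripple-carry adder on little-endian bit lists using only XOR and AND."""
    carry, out = 0, []
    for xi, yi in zip(x, y):
        t = xi ^ yi
        out.append(t ^ carry)              # s_i = x_i XOR y_i XOR c_i
        carry = (xi & yi) ^ (carry & t)    # c_{i+1}
    return out + [carry]

def less_than(x, y):
    """Return 1 iff x < y; the most significant differing bit decides."""
    lt = 0
    for xi, yi in zip(x, y):               # scan from least to most significant bit
        diff = xi ^ yi
        # If the bits differ, x < y at this position exactly when y_i = 1;
        # otherwise keep the decision made on the lower bits.
        lt = (diff & yi) ^ ((diff ^ 1) & lt)
    return lt
```

Both circuits use a number of gates linear in the bit length, which matches the cost model in which an operation on $b$-bit integers takes $O(b)$ time.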
We should note that the secrecy of TFHE definitely impacts the security of our scheme. In fact, in our two-party feature selection scheme, the party B sends his/her inputs in an encrypted form to the party A, and A performs the computation of feature selection on the encrypted inputs. If the encrypted inputs could be easily cracked, any ingenious devices to secure the scheme would be meaningless.
Therefore, in designing our scheme, it was a matter of course to require our FHE cryptosystem to be IND-CPA secure. In fact, TFHE is known to be IND-CPA secure.

Algorithms

Baseline algorithm
We present the baseline algorithm, a privacy-preserving variant of CWC. In this subsection, we consider a two-party protocol in which a party A has his private data and outsources the CWC computation to another party B, but the baseline algorithm is easily extended to the case of more than two data owners, e.g., parties A and C send their private data to party B. In the baseline algorithm, all inputs are encrypted and they are not decrypted until the computation is completed. Thus, for simplicity, we omit the notation $E$ in the following presentation.

Computing B's
For each feature $F_i$ and each pair of data $x_p$ and $y_q$, we compute $B_i[m(p-1)+q] = (x_p(F_i) \oplus y_q(F_i)) \lor x_p(d) \lor y_q(d)$, where $x_p(d)$ and $y_q(d)$ represent the dummy bits for data $x_p$ and $y_q$, respectively. $(x_p(F_i) \oplus y_q(F_i))$ becomes 0 iff $F_i$ is inconsistent for the pair of $x_p$ and $y_q$.
Since we want to ignore the influence of dummy data, the part "$\lor\, x_p(d) \lor y_q(d)$" is added to make the whole value 1 (meaning that it is consistent) when one of $x_p$ and $y_q$ is a dummy. It takes $O(kmn)$ time and space in total.
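The effect of the dummy bits can be seen in a small plaintext sketch. The record layout (a tuple of binary feature values paired with a dummy bit) and the function name `consistency_bits` are assumptions made for illustration; under FHE, OR can be expressed through XOR and AND as $a \lor b = a \oplus b \oplus (a \cdot b)$.

```python
def consistency_bits(pos, neg, i):
    """B_i for binary feature i; each record is (features, dummy_bit)."""
    return [(xf[i] ^ yf[i]) | xd | yd
            for xf, xd in pos      # positive data x_p
            for yf, yd in neg]     # negative data y_q

# Hypothetical example: the second positive record is a dummy (dummy bit 1).
pos = [((0, 1), 0), ((0, 0), 1)]
neg = [((0, 0), 0), ((1, 1), 0)]

b0 = consistency_bits(pos, neg, 0)
b1 = consistency_bits(pos, neg, 1)
```

Every pair that involves the dummy record is forced to 1, so padding the data with dummies never makes a feature set look inconsistent.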

Sorting B's
We can compute $\|B_i\|$ in encrypted form by summing up the values in $B_i$ in $O(mn \log(mn))$ time (noting that each operation on integers of $\log(mn)$ bits takes $O(\log(mn))$ time). Instead, we can set an upper bound $b_{\max}$ on the number of bits used to store the consistency measure to reduce the time complexity to $O(mn b_{\max})$.
Then, sorting B's in increasing order of consistency measures can be accomplished using any sorting network in which comparison and swap are performed in encrypted form without leaking information about the feature ordering. It should be noted that in this approach, the algorithm must spend $\Theta(mn + \log k)$ time to swap (or pretend to swap) two bit strings and the original feature indices of $\log k$ bits, regardless of whether the two features are actually swapped or not. Because this is the most complex part of our baseline algorithm, we will demonstrate how to improve it. Using the AKS sorting network [23] of size $O(k \log k)$, the total time for sorting $B_i$'s is $O(mn b_{\max} + (mn + b_{\max} + \log k) k \log k)$.
In our experiments, we employ a more practical sorting network, Batcher's odd-even mergesort [24], of size $O(k \log^2 k)$. A simple oblivious radix sort [25] that runs in $O(k \log k)$ time under the assumption that the bit length of each integer is constant was recently proposed.
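The data-independent structure of such a sorting network can be sketched as follows. This is a plaintext illustration of Batcher's odd-even mergesort for input lengths that are powers of two, using a well-known recursive formulation; it is not the authors' implementation, and the function names are hypothetical. The comparator sequence depends only on the input length, and each comparator swaps arithmetically through a selector bit instead of branching on the data.

```python
def oddeven_merge(lo, hi, r):
    """Comparators merging the two sorted halves of v[lo..hi] (hi inclusive)."""
    step = r * 2
    if step < hi - lo:
        yield from oddeven_merge(lo, hi, step)
        yield from oddeven_merge(lo + r, hi, step)
        yield from ((i, i + r) for i in range(lo + r, hi - r, step))
    else:
        yield (lo, lo + r)

def oddeven_merge_sort(lo, hi):
    """Comparators sorting v[lo..hi]; hi - lo + 1 must be a power of two."""
    if hi - lo >= 1:
        mid = lo + (hi - lo) // 2
        yield from oddeven_merge_sort(lo, mid)
        yield from oddeven_merge_sort(mid + 1, hi)
        yield from oddeven_merge(lo, hi, 1)

def sort_obliviously(values):
    v = list(values)
    for i, j in oddeven_merge_sort(0, len(v) - 1):
        c = int(v[i] <= v[j])   # would be an encrypted comparison bit under FHE
        # Selector-based swap: position i receives the smaller value.
        v[i], v[j] = c * v[i] + (1 - c) * v[j], (1 - c) * v[i] + c * v[j]
    return v
```

For example, `sort_obliviously([7, 3, 5, 1])` applies the same five comparators as it would for any other input of length four.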
Assume that we are in the $i$-th iteration of the for loop of Algorithm 1, and that a set of features $\{\dots, F_{\pi(j)}\}$ has been selected at the moment. Note that, at the moment, $F'$ contains the features $\{F_{\pi(i)}, F_{\pi(i+1)}, \dots, F_{\pi(k)}\}$ and the currently selected features. Each iteration takes $O(mn)$ time. Therefore, the total computational time is $O(kmn)$.

Summing up analysis
The sorting step takes $O(mn b_{\max} + (mn + b_{\max} + \log k) k \log k)$ time. Because CWC works with any consistency measure, we do not need to use $\|B_i\|$ in full accuracy, so we assume that $b_{\max}$ is set to be a constant. Under this assumption, we obtain the following theorem.
Theorem 1 For the two-party feature selection problem, we can securely simulate CWC in $O(kmn \log k + k \log^2 k)$ time and $O(kmn)$ space without revealing the private data of the parties under the assumption that TFHE is secure.
Proof. According to the discussion above, computing $B_i$ for all features takes $O(kmn)$ time and space, sorting features takes $O(mn b_{\max} + (mn + b_{\max} + \log k) k \log k) = O(kmn \log k + k \log^2 k)$ time, and selecting features takes $O(kmn)$ time.
Finally, party B computes in $O(k \log k)$ time an integer array $P$ with $P[h] = R[h] \cdot \pi(h)$, which stores the original indices of the selected features. In the outsourcing scenario, party B simply sends $P$ to party A as the result of CWC. In the joint computing scenario, party B randomly shuffles $P$ to conceal $\pi$ from A. As a result, we can securely simulate CWC in $O(kmn \log k + k \log^2 k)$ time and $O(kmn)$ space.

Improvement of secure CWC
Sorting is a major bottleneck for private CWC. The reason for this is that pointers cannot be moved across ciphertexts. For example, consider the case of secure integer sort. Let the variables $x$ and $y$ contain integers $a$ and $b$, respectively. In this case, performing the secure comparison $a \overset{?}{<} b$ yields a logical bit $c \in \{0, 1\}$. Using this bit $c$, we can swap the values of $x$ and $y$ in $O(1)$ time so that $x < y$ holds by the secure operations $x \leftarrow c \cdot a + \bar{c} \cdot b$ and $y \leftarrow \bar{c} \cdot a + c \cdot b$.
In the case of CWC, however, each integer $i$ of feature $F_i$ is associated with the bit string $B_i$. Since no ciphertext can be decrypted, we cannot swap the pointers appropriately. Therefore, the baseline algorithm swaps $B_i$ explicitly. As a result, the computation time for sorting increases to $O(mnk \log^2 k)$. The main contribution of this study is to improve this complexity to $O(mnk + k \log^2 k)$ by reducing the cost of such explicit sorting.
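Why each comparator becomes expensive once a bit string is attached to the key can be sketched on plaintext. The helper names `mux` and `compare_swap_records` are hypothetical; under FHE, the selector bit `c` would stay encrypted and every multiplication would be a homomorphic operation.

```python
def mux(c, a, b):
    """Return a if c == 1 else b, using only arithmetic on the selector bit."""
    return c * a + (1 - c) * b

def compare_swap_records(rec_a, rec_b):
    """Each record is (key, bit_string); returns the two records ordered by key."""
    (ka, ba), (kb, bb) = rec_a, rec_b
    c = int(ka < kb)  # a single comparison on the short keys
    # The selector must be applied to every bit of both payloads,
    # so one comparator costs time proportional to the payload length.
    low = (mux(c, ka, kb), [mux(c, u, v) for u, v in zip(ba, bb)])
    high = (mux(c, kb, ka), [mux(c, v, u) for u, v in zip(ba, bb)])
    return low, high
```

The key comparison is cheap, but the payload of length $mn$ is touched in every comparator whether or not a swap takes place, which is the source of the extra factor in the baseline sorting cost.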
Based on FHE, we propose the improved secure CWC (Algorithm 2), which reduces the time complexity to $O(mnk + k \log^2 k)$. An example run of Algorithm 2 is illustrated in Fig. 1. As shown in this example, party A can securely sort $k$ randomized features in $O(k \log^2 k)$ time using a suitable sorting network, and then, according to the result of sorting, A swaps each associated bit string of length $nm$ in $O(kmn)$ time. Following this preprocessing, the parties securely obtain minimal consistent features by decrypting the output of CWC. Finally, we get the following result.
Theorem 2 Algorithm 2 can simulate CWC in $O(kmn + k \log^2 k + k \log k \log mn)$ time and $O(kmn)$ space under the assumption that FHE executes each bit operation in $O(1)$ time.
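The cost split behind this bound can be illustrated on plaintext: short (key, index) pairs are sorted first, and each long bit string is then moved exactly once. This sketch does not model the random masks $r_i$, the permutation $\pi$, or any encryption, so it says nothing about security; the function name `sort_keys_then_move` is hypothetical.

```python
def sort_keys_then_move(keys, bit_strings):
    """Sort short keys first, then relocate each long payload once."""
    # Step 1: only (key, index) pairs take part in the sorting; with a sorting
    # network under FHE this corresponds to the O(k log^2 k) term.
    order = [i for _, i in sorted((key, i) for i, key in enumerate(keys))]
    # Step 2: each bit string of length nm is moved once, O(kmn) in total.
    return order, [bit_strings[i] for i in order]

# Hypothetical example with three features.
order, moved = sort_keys_then_move([3, 1, 2], [[1, 1, 1], [0, 0, 1], [0, 1, 1]])
```

Compared with swapping payloads inside every comparator, the bit strings are touched once per feature instead of once per comparator.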
Algorithm 2 Improved secure CWC between parties A and B
1: Preprocessing:
Generates $r_i$ for $i = 1, \dots, n$ uniformly at random.
Generates a permutation $\pi \in S_n$ uniformly at random and memorizes it.
Generates $r'_i$ for $i = 1, \dots, n$ uniformly at random.
Proof. Compared to the baseline, additional space is required only for $\pi$, $r_i$, and $r'_i$. Thus, the space complexity remains $O(kmn)$. For the time complexity, the main task is to sort the $k$ triples.
Proof. We show the security by constructing simulators for parties A and B, respectively.
B's view (what B can obtain from A) is the following:
Their probability distributions are uniform and independent of each other. Hence, the simulator for B can replace them with
On the other hand, the sequence $(i_1, \dots, i_n)$ is not explicitly given to A, and A recognizes it through the alignment between
Therefore, we define A's view to be
We evaluate Attack's advantage as follows. We assume D B (c) = x 1 . The probability of this case is 1/2. The probability that D replies with A to the first query or D replies with Sim to the second query is
since the first and second queries are mutually independent.
When assuming D B (c) =
Since we assume that $\alpha$ is not negligible, neither is $\alpha/4$.

Experiments
We implemented the baseline and improved algorithms for secure CWC in C++ using the TFHE library. The experiments were carried out on a machine equipped with an Intel Core i7-6567U (3.30GHz) processor and 16GB of RAM. In the following, $m$ (resp. $n$) is the number of positive (resp. negative) data and $k$ is the number of features.
Table 6 summarizes the running time of the baseline algorithm (a naive implementation of Algorithm 1 using TFHE) for random data generated for $k \in \{10, 50, 100\}$ and $mn \in \{100, 500, 1000\}$. The complexity analysis shows that the running time increases in proportion to $mn$, and the experimental results confirm this. The table clearly shows that the sorting process is the bottleneck.
Table 7 compares the running time of preprocessing in the baseline and improved algorithms. According to the results, the proposed algorithm significantly improves the bottleneck in naive CWC for secure computing. We should note that the baseline and improved algorithms both compute exactly the same solution as CWC on plaintexts. We also show the details of the improved algorithm: 'sorting' means the time for sorting the triples $(F_i, \|B_i\|, i)$ of integers, and 'other task' means the time for the remaining tasks, including generating/adding/subtracting the random noise $r_i$, moving $B_i$, decrypting integers, etc.
Table 8 displays the running time of the improved algorithm for real data available from the UCI Machine Learning Repository. Since these datasets contain more than three feature/class values, we treated them as a binary classification between one feature/class and the other.
We demonstrated that the proposed algorithm works well for real-world multi-level feature selection problems. We only evaluated the running time in this experiment, but the relevance of the extracted features is guaranteed because the secure CWC algorithm produces the same solution as the original [2].

Conclusion
On the basis of fully homomorphic encryption, we proposed a faster private feature selection algorithm that allows us to securely compute functional features from distributed private datasets. Our algorithm can simulate the original CWC algorithm, which chooses favorable features by sorting. In addition to the improvement in computational complexity, the proposed algorithm solves the private feature selection problem in practical time for a variety of real data. One of the remaining challenges is to perform sorting at a lower cost, because CWC does not always require exact sorting. Approximate sorting may therefore reduce the computation time while maintaining solution quality. At present, the proposed algorithm is not applicable to real-valued features, because TFHE is not well suited to floating-point operations. Extending the TFHE library to enable secure feature selection for real-valued data is a future challenge.