sCWC/sLCC: Highly Scalable Feature Selection Algorithms

Abstract: Feature selection is a useful tool for identifying which features, or attributes, of a dataset cause or explain the phenomena that the dataset describes, and for improving the efficiency and accuracy of learning algorithms for discovering such phenomena. Consequently, feature selection has been studied intensively in machine learning research. However, while feature selection algorithms that exhibit excellent accuracy have been developed, they are seldom used for analysis of high-dimensional data because high-dimensional data usually include too many instances and features, which make traditional feature selection algorithms inefficient. To eliminate this limitation, we tried to improve the run-time performance of two of the most accurate feature selection algorithms known in the literature. The result is two accurate and fast algorithms, namely sCWC and sLCC. Multiple experiments with real social media datasets have demonstrated that our algorithms improve on the performance of their original algorithms remarkably. For example, we have two datasets, one with 15,568 instances and 15,741 features, and another with 200,569 instances and 99,672 features. sCWC performed feature selection on these datasets in 1.4 seconds and in 405 seconds, respectively. In addition, sLCC has turned out to be as fast as sCWC on average. This is a remarkable improvement because it is estimated that the original algorithms would need several hours to dozens of days to process the same datasets. We also introduce a fast implementation of our algorithms: sCWC does not require any adjustable parameter, while sLCC requires a threshold parameter, which we can use to control the number of features that the algorithm selects.


Introduction
Accurate and fast feature selection is a useful tool for data analysis. In particular, feature selection on categorical data is important in real-world applications. Features, or attributes, and, in particular, the class labels, which represent the phenomena to explain and/or the targets to predict, are often categorical. In this paper, we propose two new feature selection algorithms that are as accurate as, and drastically faster than, any other methods reported in the literature. In fact, our algorithms are the first accurate feature selection algorithms that scale well to big data.
The importance of feature selection can be demonstrated with an example. Figure 1 depicts the result of clustering tweets posted to Twitter during two different one-hour windows on the day of the Great East Japan Earthquake, which hit Japan at 2:46 p.m. on 11 March 2011 and inflicted catastrophic damage. Figure 1a plots 97,977 authors who posted 351,491 tweets in total between 2:00 p.m. and 3:00 p.m. on the day of the quake (the quake occurred in the midst of this period of time), while Figure 1b plots 161,853 authors who posted 978,155 tweets between 3:00 p.m. and 4:00 p.m. To plot, we used word-count-based distances between authors and a multidimensional scaling algorithm. Moreover, we grouped the authors into different groups using the k-means clustering algorithm based on the same distances. Dot colors visualize that clustering. We observe a big change in clustering between the hour during which the quake occurred and the hour after the quake. Which words characterize each cluster, and what explains the change from Figure 1a to Figure 1b? Answering these questions requires a method for selecting words that best characterize each cluster; in other words, a method for feature selection.
To illustrate, we construct two datasets, one for the time frame represented in Figure 1a and one for the time frame represented in Figure 1b, called dataset A and dataset B, respectively. Each dataset consists of a word-count vector for each author that reflects all words in all of their tweets. Dataset A has 73,543 unique words, and dataset B has 71,345 unique words, so datasets A and B have 73,543 and 71,345 features, respectively. In addition, each author was given a class label reflecting the category he or she was assigned to by the k-means clustering process.
It was our goal to select a relatively small number of features (words) that were relevant to class labels. We say that a set of features is relevant to class labels if the values of the features uniquely determine class labels with high likelihood. Table 1 depicts an example of a dataset for explanation. F1, . . ., F5 are features, and the symbol C denotes a variable that represents class labels. The feature F5, for example, is totally irrelevant to class labels. In fact, we have four instances with F5 = 0, and half of them have the class label 0, while the other half have the class label 1. The same holds true for the case of F5 = 1. Therefore, F5 cannot explain class labels at all and is useless for predicting class labels. In fact, predicting class labels based on F5 has the same success probability as guessing them by tossing a fair coin (the Bayesian risk of F5 with respect to C is 0.5, which is the theoretical worst). On the other hand, F1 is more relevant than F5 because the values of F1 explain 75% of the class labels; in other words, the prediction based on F1 will be right with a probability of 0.75 (that is, the Bayesian risk is 0.25).
The relevance of individual features can be estimated using statistical measures such as mutual information, symmetrical uncertainty, the Bayesian risk and the Matthews correlation coefficient. For example, the bottom row of Table 1 shows the mutual information score I(Fi, C) of each feature Fi with respect to class labels. We see that F1 is more relevant than F5, since I(F1, C) > I(F5, C).
To our knowledge, the most common method deployed in big data analysis to select features that characterize class labels is to select features that show higher relevance in some statistical measure. For example, in the example of Table 1, F1 and F2 will be selected to explain class labels.
However, when we look into the dataset of Table 1 more closely, we understand that F1 and F2 cannot determine class labels uniquely. In fact, we have two instances with F1 = F2 = 1, whose class labels are 0 and 1. On the other hand, F4 and F5 in combination uniquely determine the class labels by the formula C = F4 ⊕ F5, where ⊕ denotes addition modulo two. Therefore, the traditional method based on relevance scores of individual features misses the right answer.

[Table 1. An example dataset. The bottom row shows the mutual information scores I(Fi, C) = 0.189, 0.189, 0.049, 0.000, 0.000.]

This problem is well known as the problem of feature interaction in feature selection research. Feature selection has been intensively studied in machine learning research. The literature describes a class of feature selection algorithms that can solve this problem, referred to as consistency-based feature selection (for example, [1][2][3][4][5]).
This result contains two interesting findings. First, the word ranked 141st is translated as "Mr.", "Mrs.", or "Ms.", which is a polite form of address in Japanese. This form of address is common in Japanese writing, so it seems odd that the word would identify a cluster of authors well. In fact, the relevance of the word is as low as 0.028. However, if we understand the nature of CWC, we can guess that the word must interact with other features to determine which cluster the author falls into. In fact, it turns out that the word interacts with the 125th-ranked word, "worry". Hence, we realize that a portion of those tweets must have been asking about the safety of someone who was not an author's family member; in other words, someone whom the author would have addressed with the polite form of their name.
The second interesting finding is that the words with the highest relevance to class labels have not been selected. For example, the word that means "quake" was ranked at the top but not selected. This is because the word was likely to be used in tweets with other selected words such as words that translate to "tsunami alert" (ranked 19th), "the Hanshin quake" (55th), "fire" (66th), "tsunami" (75th) and "the Chu-Etsu quake" (106th), so that CWC judged the word "quake" to be redundant once the co-occurring words had been selected. Our interpretation is that these co-occurring words represent the contexts in which the word "quake" was used, and selecting these words gave us more information than selecting "quake", which is too general in this case.
Thus, the consistency-based algorithms do not simply select features with higher relevance; instead, they give us knowledge that we cannot obtain from selection based on the relevance of individual features. In spite of these advantages, however, consistency-based feature selection is seldom used in big data analysis. Consistency-based algorithms require heavy computation, and the amount of computation increases so greatly as the size of data increases that application to large datasets becomes unfeasible. This paper's contribution is to improve the run-time performance of two consistency-based algorithms that are known to be the most accurate, namely CWC and LCC (Linear Consistency-Constrained feature selection) [4]. We introduce two algorithms that perform well on big data: sCWC and sLCC. They always select the same features as CWC and LCC, respectively, and, therefore, perform with the same accuracy. sLCC accepts a threshold parameter to control the number of features to select and has turned out to be as fast as sCWC on average in our experiments.

Feature Selection on Categorical Data in Machine Learning Research
In this section, we give a brief review of feature selection research focusing on categorical data. The literature describes three broad approaches: filter, wrapper and embedded. Filter approaches aim to select features based on the intrinsic properties of datasets, leveraging statistics and information theory, while wrapper and embedded approaches aim to optimize the performance of particular classification algorithms. We are interested in the filter approach in this paper. We first introduce a legacy feature selection framework and identify two problems in that framework. Then, we introduce the consistency-based approach to solve these problems. For convenience, we will describe a feature or a feature set that is relevant to class labels simply as relevant.

The Legacy Framework: Sum of Relevance (SR)
In the legacy and fundamental framework of feature selection, which underlies most of the known practical feature selection algorithms, we use sum-of-relevance (SR) functions to evaluate collective relevance of feature sets.

Sum of relevance
Computing the sum of relevance of individual features is an efficient method for estimating the collective relevance.
For example, let I(F, C) denote the mutual information of an individual feature F and the class variable C. To be specific, I(F, C) is defined by

I(F, C) = Σx,y Pr[F = x, C = y] log( Pr[F = x, C = y] / ( Pr[F = x] Pr[C = y] ) ).

The values of x and y range over the sample spaces of F and C. It is well known that, the larger I(F, C) is, the more tightly F and C correlate with each other. If we do not know the population distribution Pr, we use the empirical distribution derived from a dataset. The sum-of-relevance for a feature set {F1, . . ., Fn} based on I is determined by

SR(F1, . . ., Fn) = I(F1, C) + · · · + I(Fn, C)

and estimates the collective relevance of {F1, . . ., Fn}.
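As a concrete illustration, both quantities above can be computed directly from empirical counts. The following Python sketch uses an illustrative toy dataset of our own (not the paper's exact Table 1), constructed so that C = F4 ⊕ F5 and F1 predicts C with probability 0.75:

```python
from collections import Counter
from math import log2

def mutual_information(f, c):
    """Empirical mutual information I(F; C) in bits."""
    n = len(f)
    pf, pc, pfc = Counter(f), Counter(c), Counter(zip(f, c))
    return sum((k / n) * log2((k / n) / ((pf[x] / n) * (pc[y] / n)))
               for (x, y), k in pfc.items())

def sum_of_relevance(features, c):
    """SR(F1, ..., Fn) = sum of the individual relevance scores I(Fi; C)."""
    return sum(mutual_information(f, c) for f in features)

# Illustrative data: C = F4 XOR F5; F1 agrees with C on 75% of instances.
F1 = [0, 0, 0, 1, 1, 1, 0, 1]
F4 = [0, 0, 0, 0, 1, 1, 1, 1]
F5 = [0, 0, 1, 1, 0, 0, 1, 1]
C  = [0, 0, 1, 1, 1, 1, 0, 0]

print(round(mutual_information(F1, C), 3))      # 1 - H(0.25) ≈ 0.189
print(sum_of_relevance([F4, F5], C))            # 0.0 despite perfect joint relevance
```

The last line already hints at the limitation discussed below: a feature set can have zero sum-of-relevance while being perfectly informative as a whole.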
The principle of SR-based feature selection is to find a good balance in the trade-off between the SR value of the selected feature set and the number of features to select. This can be achieved efficiently by computing the relevance of individual features and sorting the features with respect to the computed relevance scores. For example, Table 1 shows a dataset, and we see the relevance of each feature measured by the mutual information at the bottom row. Since I(F1, C) = I(F2, C) ≈ 0.19, I(F3, C) ≈ 0.05 and I(F4, C) = I(F5, C) = 0 hold, if the requirement is to select two features that maximize the SR value, the best choice is definitely F1 and F2. If the requirement is to select the smallest feature set whose SR value is no smaller than 0.25, the answer should be F1 and F2 as well. As a substitute for mutual information, we can use the Bayesian risk, the symmetrical uncertainty and the Matthews correlation coefficient, for example.
RELIEF-F [6] is a well-known example of a feature selection algorithm that relies only on SR functions. For the underlying relevance function, RELIEF-F uses a distance-based randomized function. Since computing this distance-based relevance function requires relatively heavy computation, RELIEF-F is not very fast, but, in general, simple SR-based feature selection scales and can be applied to high-dimensional data.

The Problem of Redundancy
Simple SR-based feature selection has, however, two important problems that harm the collective relevance of selected features. One of them is the problem caused by internal correlation, which is also known as the problem of redundancy. The problem is described as follows.

Problem of redundancy
Feature selection by SR may select features that are highly mutually correlated, and such high internal correlation definitely decreases the collective relevance of the features.
The dataset of Table 2 is obtained by adding the feature F6 to the dataset of Table 1. Essentially, F6 is a copy of F1, and, hence, I(F6, C) = I(F1, C) ≈ 0.19 holds. To select two features that maximize the SR value, we have three answer candidates this time, that is, {F1, F2}, {F1, F6} and {F2, F6}. Among the candidates, {F1, F6} is clearly a wrong answer, since its joint relevance has no gain over the individual relevance of F1 and F6.
This thought experiment inspires us to pay attention to the internal correlation among features. If the internal correlation among features is greater, the features include more redundancy when they determine classes. For example, if we use mutual information to evaluate internal correlation, the internal correlation of {F1, F2} is computed to be I(F1, F2) = 0, that is, F1 and F2 are independent of each other. On the other hand, the internal correlation of {F1, F6} is I(F1, F6) = H(F1) ≈ 0.68. Therefore, the set {F1, F6} includes more redundancy than {F1, F2}, and, hence, we should select {F1, F2} rather than {F1, F6}. The principle of minimizing redundancy (MR) is to design feature selection algorithms so that they avoid selecting features that have high internal correlation.
The algorithm mRMR (Minimum Redundancy and Maximum Relevance) [7] is a well-known greedy forward selection algorithm that maximizes the sum of relevance (SR) with respect to the mutual information and minimizes the internal redundancy, also measured by the mutual information among the selected features. FCBF (Fast Correlation-Based Filter) [8] and CFS (Correlation-based Feature Selection) [9] are also known to be based on the principle of minimizing redundancy.
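The mRMR idea can be sketched in a few lines (our own illustrative Python, not the reference implementation): greedily add the feature that maximizes individual relevance minus the mean redundancy against already-selected features, both measured by mutual information. The data here are hypothetical, with F6 an exact copy of F1:

```python
from collections import Counter
from math import log2

def mi(a, b):
    """Empirical mutual information between two discrete sequences, in bits."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum((k / n) * log2((k / n) / ((pa[x] / n) * (pb[y] / n)))
               for (x, y), k in pab.items())

def mrmr(features, c, k):
    """Greedy forward selection: pick the feature maximizing
    relevance I(F; C) minus mean redundancy against selected features."""
    selected, remaining = [], list(features)
    while remaining and len(selected) < k:
        def score(name):
            red = sum(mi(features[name], features[s]) for s in selected)
            return mi(features[name], c) - (red / len(selected) if selected else 0.0)
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# F6 is an exact copy of F1, so it becomes redundant once F1 is selected.
F1 = [0, 0, 0, 1, 1, 1, 0, 1]
F5 = [0, 0, 1, 1, 0, 0, 1, 1]
C  = [0, 0, 1, 1, 1, 1, 0, 0]
features = {"F1": F1, "F6": list(F1), "F5": F5}
print(mrmr(features, C, 2))   # selects F1 first, then avoids the copy F6
```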
Although the principle of minimizing redundancy definitely improves the actual collective relevance of features to select, it cannot solve the other problem of the SR framework, which we state next.

The Problem of Feature Interaction
We start by describing the problem.

Problem of feature interaction
Feature selection by SR may miss features whose individual relevance is low but which show high collective relevance by interacting with one another.
For the datasets of Tables 1 and 2, F4 and F5 determine the class C by the formula C = F4 ⊕ F5, where ⊕ denotes addition modulo two, and, hence, their collective relevance is the highest. Nevertheless, the sum of relevance for {F4, F5} is zero, and, hence, the feature selection algorithms that we saw in Sections 2.1 and 2.2 have no chance to select {F4, F5}.
This problem is explained by interaction among features: when two or more features that individually show only low relevance exhibit high collective relevance, we say that the features interact with each other. As shown in the example above, neither the SR principle nor the MR principle can incorporate feature interaction into the results of feature selection.
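The interaction can be checked numerically: each of F4 and F5 carries zero mutual information about C, yet the pair determines C exactly. A small sketch (the data are illustrative, constructed so that C = F4 ⊕ F5, not taken from the paper's tables):

```python
from collections import Counter
from math import log2

def mi(a, b):
    """Empirical mutual information between two discrete sequences, in bits."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum((k / n) * log2((k / n) / ((pa[x] / n) * (pb[y] / n)))
               for (x, y), k in pab.items())

F4 = [0, 0, 0, 0, 1, 1, 1, 1]
F5 = [0, 0, 1, 1, 0, 0, 1, 1]
C  = [x ^ y for x, y in zip(F4, F5)]   # C = F4 XOR F5

print(mi(F4, C), mi(F5, C))            # both 0.0: individually irrelevant
print(mi(list(zip(F4, F5)), C))        # 1.0: jointly, they determine C completely
```

Treating the pair (F4, F5) as a single compound feature recovers the full bit of information about C that neither feature shows on its own.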
The literature provides two approaches to solve this problem: rule-based and consistency-based. We will define these approaches here. FRFS (FOIL Rule based Feature subset Selection) [10] is a characteristic example of a rule-based feature selection algorithm. The algorithm first extracts events of feature interaction from a dataset as frequent rules. Each rule is of the form (f1, . . ., fk) ⇒ c such that the antecedents f1, . . ., fk are feature values and the consequent c is a class label. FRFS can perform the rule extraction efficiently by leveraging the First Order Inductive Learner (FOIL) [11]. However, it is still too slow to apply to big data analysis.

Consistency-Based Feature Selection
The consistency-based approach solves the problem of feature interaction by leveraging consistency measures. A consistency measure is a function that takes sets of features as input rather than individual features. Furthermore, a consistency measure represents the collective irrelevance of the input feature set, and, hence, the smaller the value of a consistency measure is, the more relevant the input feature set is.
Moreover, a consistency measure is required to have the determinacy property: its measurement is zero, if, and only if, the input feature set uniquely determines classes.

Definition 1.
A feature set of a dataset is consistent, if, and only if, it uniquely determines classes, that is, any two instances of the dataset that are identical with respect to the values of the features of the feature set have the identical class label as well.
Hence, a consistency measure function returns the value zero, if, and only if, its input is a consistent feature set. An important example of a consistency measure is the Bayesian risk, also known as the inconsistency rate [2]:

Br(F1, . . ., Fn) = 1 − Σx1,...,xn max_y Pr[F1 = x1, . . ., Fn = xn, C = y].

The variable xi ranges over the sample space of Fi, while the variable y ranges over the sample space of C. It is evident that the Bayesian risk is non-negative, and determinacy follows from the fact that Br(F1, . . ., Fn) = 0 holds if, and only if, every value pattern (x1, . . ., xn) determines a unique class label. Another important example of a consistency measure is the binary consistency measure, defined as follows: Bn(F1, . . ., Fn) = 0 if Br(F1, . . ., Fn) = 0, and Bn(F1, . . ., Fn) = 1 otherwise. FOCUS [1], the first consistency-based algorithm in the literature, performs an exhaustive search to find the smallest feature set {F1, . . ., Fn} with Bn(F1, . . ., Fn) = 0.
Clearly, FOCUS cannot be fast in practice. In general, consistency-based feature selection has problems with time-efficiency because of the broadness of the search space. In fact, the search space is the power set of the entire set of features, and its size is an exponential function of the number of features.
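Both measures defined above can be computed from co-occurrence counts in a single pass over the data. A sketch in Python (the dataset is our own illustrative example with C = F4 ⊕ F5, not one of the paper's tables):

```python
from collections import Counter

def bayesian_risk(features, c):
    """Br(F1, ..., Fn) = 1 - sum over value patterns of max_y Pr[pattern, C = y]."""
    n = len(c)
    rows = list(zip(*features)) if features else [()] * n
    joint = Counter(zip(rows, c))
    best = {}
    for (row, y), cnt in joint.items():
        best[row] = max(best.get(row, 0), cnt)
    return 1.0 - sum(best.values()) / n

def bn(features, c):
    """Binary consistency measure: 0 iff the feature set is consistent."""
    return 0 if bayesian_risk(features, c) == 0.0 else 1

F4 = [0, 0, 0, 0, 1, 1, 1, 1]
F5 = [0, 0, 1, 1, 0, 0, 1, 1]
C  = [0, 0, 1, 1, 1, 1, 0, 0]   # C = F4 XOR F5

print(bayesian_risk([F5], C))       # 0.5: the theoretical worst for a balanced binary class
print(bayesian_risk([F4, F5], C))   # 0.0: the pair is consistent
print(bn([F4, F5], C))              # 0
```

The monotonicity property discussed below also holds here: the superset {F4, F5} has Bayesian risk no greater than its subset {F5}.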

Problem of consistency measures
When N features describe a dataset, the number of possible inputs to a consistency measure is as large as 2^N.
The monotonicity property of consistency measures helps to solve this problem. The Bayesian risk, for example, has this property: if F ⊆ G, then Br(F) ≥ Br(G), where F and G are feature subsets of a dataset. Almost all of the known consistency measures, such as the binary consistency measure and the conditional entropy H(C | F) = H(C) − I(F, C), have this property as well. In [5], a consistency measure is formally defined as a non-negative function that has the determinacy and monotonicity properties.
Although some of the algorithms in the literature, such as ABB (Automatic Branch and Bound) [2], took advantage of the monotonicity property to narrow their search space, the real breakthrough came from Zhao and Liu with their algorithm INTERACT [3]. INTERACT uses the combination of the sum-of-relevance function based on the symmetrical uncertainty and the Bayesian risk.
The symmetrical uncertainty is the harmonic mean of the ratios I(F, C)/H(F) and I(F, C)/H(C) and hence turns out to be

SU(F, C) = 2 I(F, C) / (H(F) + H(C)).
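In code, the symmetrical uncertainty is a one-line normalization of the mutual information (a sketch with illustrative data of our own, where both F1 and C happen to have entropy of exactly 1 bit):

```python
from collections import Counter
from math import log2

def entropy(xs):
    """Empirical Shannon entropy in bits."""
    n = len(xs)
    return -sum((k / n) * log2(k / n) for k in Counter(xs).values())

def mi(a, b):
    """Empirical mutual information in bits."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum((k / n) * log2((k / n) / ((pa[x] / n) * (pb[y] / n)))
               for (x, y), k in pab.items())

def su(f, c):
    """Symmetrical uncertainty: SU(F, C) = 2 I(F, C) / (H(F) + H(C))."""
    return 2 * mi(f, c) / (entropy(f) + entropy(c))

F1 = [0, 0, 0, 1, 1, 1, 0, 1]
C  = [0, 0, 1, 1, 1, 1, 0, 0]
print(round(su(F1, C), 3))   # equals I(F1, C) here, since H(F1) = H(C) = 1 bit
```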
The basic idea of INTERACT is to narrow down the search space boldly and to take SR values of the symmetrical uncertainty into account to compensate for the resulting decrease of relevance. Although the search space of INTERACT is very narrow, the combination of the SR function and the consistency measure keeps the accuracy performance good. LCC [4] improves on INTERACT and can exhibit better accuracy. Although INTERACT and LCC are much faster than previous consistency-based algorithms described in the literature, they are not fast enough to apply to large datasets with thousands of instances and features.
CWC [5] is a further improvement and replaces the Bayesian risk with the binary consistency measure, which can be computed faster. CWC is reported to be about 50 times faster than INTERACT and LCC on average. In fact, CWC performs feature selection for a dataset with 800 instances and 100,000 features in 544 s, while LCC does it in 13,906 s. Although the improvement was remarkable, CWC is not fast enough to apply to big data analysis.

Summary
Figure 3 summarizes the progress of feature selection in the literature. The legacy framework of sum-of-relevance (SR) has the problems of redundancy and feature interaction. The principle of minimizing redundancy (MR) in combination with SR solves the problem of redundancy and provides practical algorithms such as mRMR. Furthermore, using consistency measures (CM) solves the problem of feature interaction, but is time-consuming because complete (exhaustive) search (CS) is necessary. On the other hand, the combination of SR and CM allows linear search (LS) and improves the low time-efficiency of consistency-based feature selection dramatically. In particular, CWC is the fastest and most accurate consistency-based algorithm and is comparable with FRFS, which is rule-based. Nevertheless, neither CWC nor FRFS scales well for big data analysis.

sCWC and sLCC
sCWC and sLCC improve the time efficiency of CWC and LCC significantly. The letter "s" in sCWC and sLCC stands for "scalable", "swift" and "superb".

The Algorithms
We start by explaining the CWC algorithm, which Algorithm 1 depicts. Given a dataset described by a feature set {F1, . . ., FN}, CWC aims to output a minimal consistent subset S ⊆ {F1, . . ., FN}.

Definition 2. A minimal consistent subset S satisfies Bn(S) = 0 and Bn(T) > 0 for any proper subset T ⊂ S.
Achieving this goal is, however, impossible if Bn(F1, . . ., FN) > 0 holds, since Bn(S) ≥ Bn(F1, . . ., FN) > 0 always holds by the monotonicity property of Bn. Therefore, the preliminary step of CWC is to remove the cause of Bn(F1, . . ., FN) > 0. To be specific, if Bn(F1, . . ., FN) > 0, there exists at least one inconsistent pair of instances, which are identical with respect to the feature values but have different class labels. The process of denoising is thus to modify the original dataset so that it includes no inconsistent pairs. To denoise, we have two approaches:
1. We can add a dummy feature F to {F1, . . ., FN} and assign a value of F to each instance so that, if the instance is not included in any inconsistent pair, the assigned value is zero; otherwise, the assigned value is determined depending on the class label of the instance.
2. We can eliminate at least a part of the instances that are included in inconsistent pairs.
Although both approaches can result in Bn(F1, . . ., FN) = 0, the former seems better because useful information may be lost by eliminating instances. Fortunately, high-dimensional data usually have the property of Bn(F1, . . ., FN) = 0 from the beginning, since N is very large. When Bn(F1, . . ., FN) = 0, denoising is benign and does nothing.
On the other hand, to incorporate sum-of-relevance into consistency-based feature selection, we sort features in the incremental order of their symmetrical uncertainty scores, that is, we renumber the Fi so that SU(Fi) ≤ SU(Fj) if i < j. The symmetrical uncertainty, however, is not a mandatory choice, and we can use any measure that evaluates the relevance of an individual feature such that, the greater the value of the measure is, the more relevant the feature is. For example, we can replace the symmetrical uncertainty with the mutual information I(F, C).
CWC deploys a backward elimination approach: it first sets a variable S to the entire set {F1, . . ., FN} and then investigates whether each Fi can be eliminated from S without violating the condition Bn(S) = 0. That is, S is updated by S = S \ {Fi}, if, and only if, Bn(S \ {Fi}) = 0. Hence, CWC continues to eliminate features until S becomes a minimal consistent subset. Algorithm 1 describes the algorithm of CWC.
The order of investigating Fi is the incremental order of i, and, hence, the incremental order of SU(Fi). Since Fi is more likely to be eliminated than Fj when i < j, we see that CWC stochastically outputs minimal consistent subsets with higher sum-of-relevance scores.
Algorithm 1 CWC
1: Sort F1, . . ., FN in the incremental order of SU(Fi; C).
2: Set S = {F1, . . ., FN}.
3: for i = 1, . . ., N do
4:   if Bn(S \ {Fi}) = 0 then
5:     update S by S = S \ {Fi}.
6:   end if
7: end for
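The elimination loop above can be sketched in runnable Python (our own illustration, not the authors' implementation; it assumes the columns are already sorted in the incremental order of relevance and that the dataset is consistent):

```python
def consistent(cols, data, labels):
    """Bn(cols) == 0: the chosen columns uniquely determine the class labels."""
    seen = {}
    for row, y in zip(data, labels):
        key = tuple(row[i] for i in cols)
        if seen.setdefault(key, y) != y:
            return False
    return True

def cwc(data, labels):
    """Backward elimination: drop each column whose removal keeps Bn(S) = 0."""
    s = list(range(len(data[0])))
    for i in range(len(data[0])):
        trial = [j for j in s if j != i]
        if consistent(trial, data, labels):
            s = trial
    return s

# Columns in increasing relevance: (F4, F5, F1), with C = F4 XOR F5 (illustrative data).
data = [(0, 0, 0), (0, 0, 0), (0, 1, 0), (0, 1, 1),
        (1, 0, 1), (1, 0, 1), (1, 1, 0), (1, 1, 1)]
labels = [0, 0, 1, 1, 1, 1, 0, 0]
print(cwc(data, labels))   # [0, 1]: the interacting pair (F4, F5) survives; F1 is dropped
```

Note how the individually relevant third column is eliminated because the interacting pair in the first two columns already makes the set consistent.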
To improve the time efficiency of CWC, we restate the algorithm of CWC as follows. To illustrate, let S0 be a snapshot of S immediately after CWC has selected Fk. In the next step, CWC investigates whether Bn(S0 \ {Fk+1}) = 0 holds. If so, CWC investigates whether Bn(S0 \ {Fk+1, Fk+2}) = 0 holds. CWC continues the same procedure until it finds Fℓ with Bn(S0 \ {Fk+1, Fk+2, . . ., Fℓ}) > 0. This time, CWC does not eliminate Fℓ. S is set to S0 \ {Fk+1, Fk+2, . . ., Fℓ−1}, and CWC investigates Fℓ+1 next. Thus, to find the next feature to be selected, CWC solves the problem of finding ℓ such that Bn(S0 \ {Fk+1, . . ., Fℓ−1}) = 0 and Bn(S0 \ {Fk+1, . . ., Fℓ}) > 0. To solve the problem, CWC relies on linear search. On the other hand, the idea of improving CWC is obtained by looking at the same problem from a different direction. By the monotonicity property of Bn, Bn(S0 \ {Fk+1, Fk+2, . . ., Fi}) ≥ Bn(S0 \ {Fk+1, Fk+2, . . ., Fℓ}) > 0 holds for any i ≥ ℓ, and, therefore, ℓ is characterized as the smallest index i with Bn(S0 \ {Fk+1, . . ., Fi}) > 0. This characterization of ℓ indicates that we can take advantage of binary search instead of linear search to find ℓ (Algorithm 2). Since the average time complexity of the binary search is O(log(N − k)), we can expect significant improvement compared with the time complexity of O(N − k) of the linear search used in CWC. Algorithm 3 depicts our improved algorithm, sCWC.

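The binary search idea can be sketched as follows (our own illustrative Python, not the authors' code; `consistent` plays the role of the test Bn = 0, and columns are assumed pre-sorted by relevance):

```python
def consistent(cols, data, labels):
    """True iff the chosen columns uniquely determine the class labels (Bn = 0)."""
    seen = {}
    for row, y in zip(data, labels):
        key = tuple(row[i] for i in cols)
        if seen.setdefault(key, y) != y:
            return False
    return True

def scwc(data, labels):
    """sCWC-style selection: binary search for the next feature that must be kept."""
    n = len(data[0])
    kept, k = [], 0
    while k < n:
        def breaks(i):
            # Would dropping every not-yet-kept column with index <= i break consistency?
            return not consistent(kept + list(range(i + 1, n)), data, labels)
        if not breaks(n - 1):
            break                    # all remaining columns can be dropped at once
        lo, hi = k, n - 1            # smallest i with breaks(i) lies in [lo, hi]
        while lo < hi:
            mid = (lo + hi) // 2
            if breaks(mid):
                hi = mid
            else:
                lo = mid + 1
        kept.append(lo)              # by monotonicity, F_lo cannot be eliminated
        k = lo + 1
    return kept

data = [(0, 0, 0), (0, 0, 0), (0, 1, 0), (0, 1, 1),
        (1, 0, 1), (1, 0, 1), (1, 1, 0), (1, 1, 1)]
labels = [0, 0, 1, 1, 1, 1, 0, 0]
print(scwc(data, labels))   # [0, 1]: the same minimal consistent subset as CWC
```

Each kept feature costs O(log N) consistency checks instead of the O(N) checks of the linear scan, which is where the speed-up described above comes from.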
In addition, Algorithm 4 depicts the algorithm of LCC [4]. In contrast to CWC, LCC accepts a threshold parameter δ ≥ 0. The parameter determines the strictness of its elimination criterion. The greater δ is, the looser the criterion is, and, therefore, the fewer features LCC selects.
1: Sort F1, . . ., FN in the incremental order of SU(Fi; C).
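For intuition, LCC's relaxed elimination test can be sketched by replacing the binary check Bn(S \ {Fi}) = 0 with the condition Br(S \ {Fi}) ≤ δ (our own illustrative Python under that assumption, not the authors' implementation; the data are hypothetical):

```python
from collections import Counter

def bayesian_risk(cols, data, labels):
    """Empirical Br of the chosen columns: 1 - sum of per-pattern majority frequencies."""
    n = len(labels)
    joint = Counter((tuple(row[i] for i in cols), y) for row, y in zip(data, labels))
    best = {}
    for (key, y), cnt in joint.items():
        best[key] = max(best.get(key, 0), cnt)
    return 1.0 - sum(best.values()) / n

def lcc(data, labels, delta):
    """Backward elimination with a relaxed test: keep Br(S) within delta."""
    s = list(range(len(data[0])))
    for i in range(len(data[0])):
        trial = [j for j in s if j != i]
        if bayesian_risk(trial, data, labels) <= delta:
            s = trial
    return s

data = [(0, 0, 0), (0, 0, 0), (0, 1, 0), (0, 1, 1),
        (1, 0, 1), (1, 0, 1), (1, 1, 0), (1, 1, 1)]
labels = [0, 0, 1, 1, 1, 1, 0, 0]
print(lcc(data, labels, 0.0))    # δ = 0 behaves like CWC here
print(lcc(data, labels, 0.25))   # looser δ, fewer features selected
```

Raising δ lets the algorithm tolerate some inconsistency, which is how the threshold controls the number of selected features.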

Complexity Analysis
Let NF and NI denote the number of features and instances of a dataset; the average time complexity of CWC is then estimated as O(NF NI (NF + log NI)) [12]. The first term, O(NF^2 NI), represents the feature selection computation, while the second term, O(NF NI log NI), represents the instance-sorting process. In [12], it is shown that sorting instances at the initial stage of the algorithm is highly effective for investigating Bn(S \ {Fi}) = 0 efficiently (see also Section 6.1). Since sCWC improves the time-efficiency of selecting features by replacing the linear search of CWC with binary search, we can estimate its time complexity as O(NF NI (log NF + log NI)). By the same analysis, we can conclude that the same estimate of time complexity applies to sLCC.
We have verified this estimate of time complexity through experiments using the high-dimensional datasets described in Table 3, whose dimensions vary from 15,741 to 38,822. Figure 4 plots the experimental results: the x-axis represents NF NI (log NF + log NI), while the y-axis represents the run-time of sCWC in milliseconds. We observe that the plots are approximately aligned along a straight line, and, hence, can conclude that the aforementioned estimate is right.

Comparison of Feature Selection Algorithms
We compare sCWC and sLCC with four benchmark algorithms, namely, FRFS, CFS, RELIEF-F and FCBF, with respect to accuracy, run-time and the number of features selected. FRFS [10] is a rule-based algorithm, while CFS [9], RELIEF-F [13] and FCBF [8] are sum-of-relevance-based algorithms. In addition, CFS and RELIEF-F are designed to avoid redundant selection of features.

Datasets to Use
For the comparison, we use 15 relatively large datasets, since the benchmark algorithms are not fast enough to apply to really high-dimensional datasets such as those described in Table 3. Table 4 describes the datasets, and Figure 5 plots the number of features NF (x-axis) and the number of instances NI (y-axis) of each dataset.
To make the comparison fair, ten of the datasets are chosen from the feature selection challenges of Neural Information Processing Systems (NIPS) 2003 [14] and the World Congress on Computational Intelligence (WCCI) 2006 [15]. The datasets of NIPS 2003 emphasize the largeness of the feature number NF, while those of WCCI 2006 emphasize the largeness of the instance number NI. The remaining five datasets are retrieved from the University of California, Irvine (UCI) repository of machine learning databases [16].

Comparison of Accuracy
When the input dataset is described by a consistent feature set, that is, when Bn({F1, . . ., FN}) = 0 holds, sLCC with δ = 0 selects the same features as sCWC does. Hence, we compare the best area under the receiver operating characteristic curve (AUC-ROC) scores of sLCC, obtained when the parameter δ varies from 0 to 0.02 at intervals of 0.002, with the AUC-ROC scores of the other benchmark algorithms. In addition, since many of the benchmark algorithms cannot finish feature selection for the dataset DOROTHEA within a reasonable time allowance, we do not use that dataset for the purpose of comparison in accuracy.

Method
We generate 10 pairs of training and test data subsets from each dataset of Figure 5 by distributing the instances to test and training data subsets at random with a ratio of 4:1, and perform the following for each pair: (1) We run the feature selection algorithms on the training data subset and then reduce the training dataset so that the selected features describe the reduced training data subset. (2) We reduce the test data subset so that the selected features describe the reduced test data subset. For sLCC, we run experiments changing δ from 0 to 0.02 at intervals of 0.002 and select the best scores.

Results and Analysis
Tables 5-10 describe the results of the comparison. For each combination of a classifier and an accuracy measure, we show the raw scores of the six feature selection algorithms in the upper rows and their rankings in the lower rows. Figures 6 and 7 also depict the same information, where sLCC, FRFS, CFS, RELIEF-F and FCBF are displayed in blue, orange, gray, yellow and light blue, respectively.
Remarkably, for all combinations of classifiers and accuracy measures, sLCC and FRFS monopolize the first and second places with respect to both the averaged raw scores and the averaged ranks. Furthermore, sLCC outperforms the others with respect to the averaged raw scores except for the combination of Naïve Bayes and AUC-ROC. With respect to the averaged ranks, sLCC is ranked top for the combinations of SVM and AUC-ROC, SVM and F-score, and Naïve Bayes and F-score, while FRFS outperforms sLCC for the other three combinations.
Table 11 shows, for each feature selection algorithm, its averaged AUC-ROC and F-measure scores across the three classifiers and the 14 datasets. We see that sLCC outperforms the other algorithms for both AUC-ROC and F-scores. Table 11 also shows the averaged ranks of the feature selection algorithms across the two accuracy measures, three classifiers and the 14 datasets. sLCC and FRFS turn out to have the same averaged rank, and they are evidently superior to the other three benchmark algorithms.
To verify the observed superiority of sLCC and FRFS, we conduct non-parametric multiple comparison tests following the recommendation of Demšar [17]. To be specific, we have performed the Friedman test and then the Hommel test. To avoid the type II error of a multiple comparison test, the tests are conducted only once, based on the averaged ranks displayed in Table 11, which are computed across all of the combinations of a classifier, an accuracy measure and a dataset. The results of the tests are described in the left column of Table 11. For the Friedman test, the observed p-value is extremely small and displayed as 0.00, and, therefore, we reject the null hypothesis and conclude that there exists a statistically significant difference among the feature selection algorithms. In the Hommel test, which follows the Friedman test, we use sLCC as a control. The calculated p-values are negligibly small for CFS, RELIEF-F and FCBF, and, hence, we can conclude that the observed superiority of sLCC over CFS, RELIEF-F and FCBF is statistically significant. On the other hand, the results of the Hommel test indicate that sLCC and FRFS are compatible with each other, since the corresponding p-value is as great as 0.981, which is very close to 1.0.
In conclusion, to obtain high accuracy, we recommend using SLCC and FRFS for feature selection. To contrast the two algorithms, SLCC performs better when used with SVM, while FRFS performs better when used with C4.5. For seven of the datasets, including GINA, GISETTE, HIVA, MUSHROOM, NOVA and SILVA, that is, for almost half of the datasets investigated, the AUC-ROC score reaches its maximum when δ = 0. On the other hand, we should note that the results for KR-VS-KP and MADELON show steep drop-offs of the AUC-ROC score at δ = 0. Since SLCC outputs the same results as SCWC in most cases when δ = 0, these results imply a practical way to combine SCWC and SLCC: try SCWC first, and apply SLCC only if SCWC does not yield good results.
Figure 8 also includes plots of the numbers of features selected (plots and lines in blue). In theory, running SLCC with a greater δ results in the selection of a smaller number of features. The experimental results support this, with GINA as the only exception.

Time-Efficiency
Table 12 describes the experimental results on the run-time performance of SLCC, SCWC, FRFS, CFS, RELIEF-F, FCBF, LCC, CWC and INTERACT. INTERACT was not used in the accuracy comparison, since it was shown in [12] that LCC always outperforms INTERACT with respect to accuracy. In this experiment, we use six of the 15 datasets described in Table 4. These six datasets need a relatively long time for feature selection and are therefore appropriate for comparing the time-efficiency of the algorithms. The experiments are run on a MacBook Pro (2016, Apple Inc., Cupertino, CA, USA) with a quad-core i7 2.5 GHz processor and 8 GB of memory. The threshold parameter δ for SLCC and LCC is set to 0.01.
From the results, we see that SLCC and SCWC outperform the others and greatly improve on the performance of CWC and LCC. In particular, the improvement of SLCC over LCC is remarkably greater than that of SCWC over CWC. In fact, from Table 12, we see that, even though there is a significant difference in run-time between LCC and CWC, the run-times of SLCC and SCWC are comparable with each other. This can be explained as follows. With a greater δ, SLCC/LCC eliminates more features; in other words, the intervals between adjacent selected features become wider. This implies that the number of features investigated by binary search decreases, while the number of features investigated by linear search remains the same. Thus, as δ increases, the improvement of SLCC over LCC becomes more significant. The experimental results depicted in Figure 9 support this explanation. Since LCC relies on linear search, every feature is evaluated exactly once regardless of the value of δ; hence, the run-time of LCC remains the same even if δ changes. By contrast, as Figure 9 shows, the run-time of SLCC decreases as δ increases. This is because the number of features selected decreases as δ increases, and, consequently, SLCC investigates fewer features. Looking at Figure 8 from this viewpoint, we see the basic tendency that the number of features selected is a decreasing function of δ (GINA is the only exception).
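The role of binary search in this explanation can be illustrated with a small sketch. Here `risk` is a hypothetical oracle for Br(S \ {F_{k+1}, ..., F_i}), assumed non-decreasing in i (eliminating more features cannot decrease the risk), which is what makes the predicate risk(i) ≤ δ monotone and amenable to binary search. This illustrates the mechanism only; it is not the exact routine inside SLCC.

```python
def find_last_removable(lo, hi, risk, delta):
    """Binary search over feature indices lo..hi for the largest i such
    that eliminating F_{lo}..F_i keeps risk(i) <= delta.

    Assumes risk(i) is non-decreasing in i, so the predicate
    risk(i) <= delta is monotone (True then False) over lo..hi.
    """
    best = lo - 1  # nothing removable yet
    while lo <= hi:
        mid = (lo + hi) // 2
        if risk(mid) <= delta:
            best = mid      # F_{lo}..F_{mid} can all be eliminated
            lo = mid + 1    # try to remove even more
        else:
            hi = mid - 1
    return best

# Toy monotone risk: removing i features costs 0.005 * i.
risk = lambda i: 0.005 * i
largest = find_last_removable(1, 10, risk, 0.01)  # -> 2
```

With a larger δ, the search range shrinks faster per probe, which matches the observation above that SLCC evaluates fewer features as δ grows.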
Table 12 also shows the results of the Hommel test comparing SLCC with FRFS, CWC and INTERACT, which are selected from the algorithms that can finish feature selection within a reasonable time for all six datasets: FRFS is the fastest of the benchmark algorithms; CWC is included to show the degree of improvement achieved by SLCC; and INTERACT is included because it is well known as the first consistency-based feature selection algorithm that is practically efficient. The displayed p-values indicate that the observed superiority of SLCC is statistically significant at the 5% significance level. These datasets are, however, significantly smaller than the data to which we intend to apply SLCC and SCWC. We further investigate the efficiency of SLCC and SCWC in the next section.

Performance of SLCC and SCWC for High-Dimensional Data
In this section, we look into both the accuracy and the time-efficiency of SLCC and SCWC when applied to high-dimensional data. We use 26 real datasets studied in social network analysis, which were also described in Section 1. These datasets were generated from the large volume of tweets sent to Twitter on the day of the Great East Japan Earthquake, which hit Japan at 2:46 p.m. on 11 March 2011 and inflicted catastrophic damage. Each dataset was generated from a collection of tweets posted during a particular time window of an hour in length, and consists of a word-count vector for each Twitter author, reflecting all words in all tweets they posted during that time window. In addition, each author was given a class label reflecting the cluster assigned by a k-means clustering process. We expect that this annotation represents the extent to which authors are related to the Great East Japan Earthquake.
Table 13 shows the AUC-ROC scores of C-SVM classifiers trained on the features selected by SCWC. We measured the scores using the method described in Section 4.2.1. Given time constraints, we use only 18 of the 26 datasets prepared. We see that the scores are significantly high, and that the selected features characterize the classes well.
From Table 3, we observe that the run-times of SCWC on the aforementioned 26 datasets range from 1.397 s to 461.130 s, with an average of 170.017 s. Thus, this experiment shows that the time-efficiency of SCWC is sufficient for high-dimensional data analysis. For a more precise comparison, we compare SCWC with FRFS; Table 14 shows the results. Since running FRFS takes much longer, we test only three datasets and use a more powerful computer with CentOS release 5.11 (The CentOS Project), an Intel Xeon X5690 6-core 3.47 GHz processor and 192 GB of memory (Santa Clara, CA, USA). Although we tested only a few datasets, the superiority of SCWC over FRFS is evident: SCWC is more than twenty times faster than FRFS. In addition, we can conclude that SCWC remarkably improves the time-efficiency of CWC. Running CWC on the smallest dataset in Table 3, with 15,567 instances and 15,741 features, requires several hours. Based on this, and on the time complexity O(N_F N_I (N_F + log N_I)) of CWC, we estimate that it would take up to ten days to process the largest dataset, with 200,569 instances and 99,672 features. Remarkably, SCWC finished feature selection on this dataset in only 405 s.
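The extrapolation above can be reproduced as back-of-the-envelope arithmetic from the complexity formula. The baseline run-time assumed for CWC on the smallest dataset is an illustration only (the text says only "several hours"), so the resulting figure is a rough order-of-magnitude check, not the paper's exact estimate.

```python
import math

def cwc_cost(n_f, n_i):
    # Growth function taken from the complexity O(N_F * N_I * (N_F + log N_I)).
    return n_f * n_i * (n_f + math.log2(n_i))

small = cwc_cost(15_741, 15_567)    # 15,741 features, 15,567 instances
large = cwc_cost(99_672, 200_569)   # 99,672 features, 200,569 instances

baseline_hours = 0.5                # assumed CWC time on the small dataset
est_days = baseline_hours * (large / small) / 24
```

The cost ratio between the two datasets is in the hundreds, so even an optimistic baseline puts CWC's run-time on the largest dataset at around the ten-day mark, against 405 s for SCWC.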
Lastly, we investigate how the parameter δ affects the performance of SLCC. As described in Section 4.3, with a greater δ, SLCC eliminates more features, and, consequently, the run-time decreases. To verify this, we run SLCC on the dataset with 161,425 instances and 38,822 features. Figure 10 plots the results. Indeed, both the number of features selected by SLCC and its run-time decrease as the threshold δ increases. In addition, we see that, although SLCC selects the same features as SCWC when δ = 0, its run-time is greater than that of SCWC. This is because computing the Bayesian risk (Br) is computationally heavier than computing the binary consistency measure (Bn). It is also interesting to note that SLCC becomes faster than SCWC for greater thresholds, so that their averaged run-time performance is comparable.

An Implementation
In this section, we describe an implementation of our algorithms, which is also the one used in the experiments above. The data structure deployed by the implementation is the key ingredient that makes it fast. In addition, the data structure makes parallel computation possible.

The Data Structure
We consider the moment when SCWC has selected F_{k_1}, F_{k_2}, ..., F_{k_{ℓ-1}} in this order and has just decided to select the feature F_k. We let F_{k_ℓ} = F_k and call the sequence (F_{k_ℓ}, F_{k_{ℓ-1}}, ..., F_{k_1}) a prefix. Note that, in the next round, F_{k+1}, F_{k+2}, ..., F_N are the targets of investigation. Hence, SCWC selects one of F_{k+1}, F_{k+2}, ..., F_N, denoted by F_{k_{ℓ+1}}, and eliminates all of F_{k+1}, F_{k+2}, ..., F_{k_{ℓ+1}-1}.
In our implementation of SLCC and SCWC, at this moment, every instance of the dataset is represented as a vector of values for the sequence of features (F_{k_ℓ}, F_{k_{ℓ-1}}, ..., F_{k_1}, F_N, F_{N-1}, ..., F_{k+2}, F_{k+1}), a concatenation of the prefix and the target features, and all instances are aligned in the lexicographical order of their feature values. Figure 11 shows an example. In the example, the prefix is (F_5, F_3, F_2), and the targets in the next round are F_9, F_8, F_7, F_6. For simplicity, we assume that all features and the class are binary variables, taking either 0 or 1 as values. The COUNT column shows the number of instances that are identical in all of the remaining features (F_2, F_3, F_5, F_6, F_7, F_8 and F_9) and the class.
This data structure has the following advantages. To illustrate them, we let S denote the current set of features, that is, the prefix features together with the targets.
1. To find inconsistent pairs of instances with respect to S, we only have to compare adjacent vectors (rows) in the data structure. Two instances compose an inconsistent pair with respect to S if, and only if, they have the same value for every feature in S but differ in their class labels.
2. To evaluate Br(S \ {F_{k+1}, ..., F_i}) and Bn(S \ {F_{k+1}, ..., F_i}), we only have to evaluate the measures on the reduced data structure obtained by simply eliminating the columns that correspond to F_{k+1}, ..., F_i. In the reduced data structure, instances are still aligned in the lexicographical order of values with respect to (F_{k_ℓ}, F_{k_{ℓ-1}}, ..., F_{k_1}, F_N, F_{N-1}, ..., F_{i+2}, F_{i+1}), and, hence, by investigating adjacent vectors (rows), we can evaluate the measures. For the example of Figure 11, Bn(F_5, F_3, F_2, F_9, F_8, F_7, F_6) = 0 is derived, since no adjacent vectors are congruent with respect to the features F_5, F_3, F_2, F_9, F_8, F_7, F_6.
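Advantage 1 can be sketched in a few lines. Each row holds a tuple of feature values, a class label and a COUNT, mirroring the structure above; the convention Bn = 0 for a consistent table (as in the example) is assumed, and the data are illustrative.

```python
# Rows: (feature_values, class_label, count); toy data for illustration.
rows = [
    ((0, 0, 1), 0, 3),
    ((0, 0, 1), 1, 1),  # same feature values as previous row, other class
    ((0, 1, 0), 0, 2),
    ((1, 1, 0), 1, 4),
]
rows.sort(key=lambda r: r[0])  # lexicographic order of feature values

def bn(rows):
    """Binary consistency measure: 1 if some adjacent pair of rows shares
    all feature values but differs in class label, else 0. With rows in
    lexicographic order, only adjacent rows need to be compared."""
    for (f1, c1, _), (f2, c2, _) in zip(rows, rows[1:]):
        if f1 == f2 and c1 != c2:
            return 1
    return 0
```

Because equal feature vectors are adjacent after sorting, the scan is a single linear pass over the rows.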

Figure 11 .
Figure 11. An example of the data structure, with the prefix columns and the target columns.
Next, to evaluate Bn(F_5, F_3, F_2, F_9, F_8), we temporarily eliminate the columns of F_6 and F_7. Figure 12 shows the resulting data structure. By investigating adjacent vectors (rows), we see that the first and second instances are inconsistent with each other with respect to the features F_5, F_3, F_2, F_9, F_8. Hence, we have Bn(F_5, F_3, F_2, F_9, F_8) = 1. Since we can verify Bn(F_5, F_3, F_2, F_9, F_8, F_7) = 0 by the same means, SCWC selects F_7 and eliminates F_6. The left chart of Figure 13 shows the resulting data structure after eliminating F_6 (F_7 is moved to the top of the prefix), while the right chart shows the result of applying bucket sort according to the value of F_7. We should note that the vectors are now aligned in the lexicographical order of the values of (F_7, F_5, F_3, F_2, F_9, F_8), and the data structure is ready for the next round of selection.
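The two operations in this walk-through, temporarily dropping trailing target columns and bucket-sorting by a newly selected feature, can be sketched as follows (the row layout and helper names are illustrative, not the paper's exact implementation). Dropping the rightmost, least-significant columns preserves the lexicographic order of the remaining ones, and a stable one-column bucket sort makes that column the most significant, which is exactly the reordering shown in Figure 13.

```python
from collections import defaultdict

def drop_trailing(rows, m):
    """Temporarily eliminate the last m (least significant) columns.
    Removing trailing columns preserves the lexicographic order of the
    remaining ones, so no re-sort is needed before the adjacent-row scan."""
    return [(vals[:-m], cl) for vals, cl in rows]

def bucket_sort_by(rows, col):
    """Stable bucket sort by one categorical column. The column becomes
    the most significant position of the lexicographic order, with the
    previous order as the tie-breaker."""
    buckets = defaultdict(list)
    for vals, cl in rows:
        buckets[vals[col]].append((vals, cl))
    return [r for v in sorted(buckets) for r in buckets[v]]
```

The stability of the bucket sort is what guarantees that, after sorting by the new prefix head, the rows are lexicographically ordered with respect to the whole new feature sequence.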

The Program spcwc.jar
Instructions for using the program, namely spcwc.jar, are given in Table 15. The program contains implementations of both SCWC and SLCC. Using SLCC requires specifying a threshold value, which should be between Br(F_1, ..., F_N) and Br(∅), where F_1, ..., F_N are the entire set of features of a dataset. Br(∅) is given by Br(∅) = 1 − max_y Pr[C = y].
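The formula for Br(∅) can be checked directly: it is the error rate of always predicting the majority class. A minimal sketch, with toy labels:

```python
from collections import Counter

def br_empty(labels):
    """Br(empty set) = 1 - max_y Pr[C = y]: the Bayesian risk when no
    feature is available, i.e., the majority-class error rate."""
    counts = Counter(labels)
    return 1.0 - max(counts.values()) / len(labels)
```

For labels [0, 0, 0, 1, 1], the majority class has probability 0.6, so br_empty returns 0.4; this is the upper end of the valid range for SLCC's threshold on that class distribution.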
In addition, we may change the measure used to sort features in the first step of SCWC and SLCC; different measures can yield different feature selection results. The measure can be the symmetrical uncertainty (default), the mutual information, the Bayesian risk, or the Matthews correlation coefficient.
The program also outputs a log file with the extension .log. This file records the run-time of the program, the features selected, the numbers of instances and features of the input dataset, and the scores of the individual features in the symmetrical uncertainty, the mutual information, the Bayesian risk and the Matthews correlation coefficient.
The recommended way to use the program is to run it first with only the i option. Features are then sorted according to their symmetrical uncertainty scores, and SCWC selects features. If we are not satisfied with the result, we can try other options; for example, we can run SLCC with an optimized threshold value. To obtain an optimized threshold, we can take advantage of any method for hyper-parameter optimization, such as grid search or Bayesian optimization [18].
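A simple grid search over δ can be sketched as follows. The hooks `run_slcc` and `evaluate` are hypothetical: in practice the former would invoke spcwc.jar with the given threshold and the latter would score the selected features with a downstream classifier (e.g., cross-validated AUC-ROC).

```python
import numpy as np

def tune_delta(run_slcc, evaluate, deltas):
    """Try each threshold and keep the one with the best downstream score.
    `run_slcc` and `evaluate` are hypothetical hooks, not part of spcwc.jar."""
    best_delta, best_score = None, float("-inf")
    for d in deltas:
        score = evaluate(run_slcc(d))
        if score > best_score:
            best_delta, best_score = float(d), score
    return best_delta, best_score

# A grid like the one used in the experiments: 0 to 0.02 in steps of 0.002.
deltas = np.arange(0.0, 0.0201, 0.002)
```

Any other hyper-parameter optimizer (e.g., Bayesian optimization) can replace the loop; only the two hooks matter.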

Parallelization
Another important advantage of the data structure described in Section 6.1 is its suitability for parallel computing. Since the Bayesian risk and the binary consistency measure can be evaluated solely by checking whether adjacent instances are inconsistent with each other, we can partition the entire data structure into multiple partitions and investigate them in parallel. The data structure must be partitioned so that two adjacent instances belonging to different partitions are not congruent with respect to the values of the current features; since non-congruent adjacent instances cannot be inconsistent, no inconsistent pair is missed by such a cut.
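This partitioning rule can be sketched with toy sorted rows and hypothetical helper names. Cuts are placed wherever adjacent rows differ in the current feature values, which trivially satisfies the constraint above; each partition is then checked independently.

```python
from concurrent.futures import ThreadPoolExecutor

def partition(rows):
    """Split lexicographically sorted (feature_values, class_label) rows
    so that congruent neighbours never end up in different partitions:
    a cut is made only between non-congruent adjacent rows."""
    parts, current = [], [rows[0]]
    for prev, row in zip(rows, rows[1:]):
        if row[0] != prev[0]:     # non-congruent neighbours: safe cut point
            parts.append(current)
            current = []
        current.append(row)
    parts.append(current)
    return parts

def has_inconsistency(part):
    return any(f1 == f2 and c1 != c2
               for (f1, c1), (f2, c2) in zip(part, part[1:]))

def bn_parallel(rows):
    """Bn = 1 iff some partition contains an inconsistent adjacent pair;
    the partitions are examined in parallel."""
    with ThreadPoolExecutor() as pool:
        return int(any(pool.map(has_inconsistency, partition(rows))))
```

In a real implementation the partitions would be far fewer and coarser; the sketch only shows why independent, parallel checks give the same answer as a single sequential scan.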
For example, to evaluate Bn(F_5, F_3, F_2, F_9, F_8) in Figure 12, we can partition the data structure of Figure 12 as Figure 14 depicts and investigate the partitions in parallel: Bn(F_5, F_3, F_2, F_9, F_8) = 0 holds if, and only if, no partition includes adjacent instances that are mutually inconsistent. Note that the first three instances of Figure 12 must belong to the same partition because they have identical values for the features F_5, F_3, F_2, F_9, F_8.
The value should be in the interval [0, 1); when the value 0 is specified, SCWC will run even if -a lcc is specified.
-s <measure> A statistical measure to use when sorting features. su: the symmetrical uncertainty (default). mi: the mutual information. br: the Bayesian risk. mc: the Matthews correlation coefficient.

Conclusions
Feature selection is a useful tool for data analysis and, in particular, for interpreting phenomena found in data. Consequently, feature selection has been studied intensively in machine learning research, and multiple algorithms with excellent accuracy have been developed. Nevertheless, such algorithms are seldom used for analyzing huge datasets because they usually take too much time. In this paper, we have introduced two new feature selection algorithms, namely SCWC and SLCC, that scale well to huge data. They are based on algorithms that exhibited excellent accuracy in the literature and do not harm the accuracy of the original algorithms. We have also introduced an implementation of our new algorithms and described its recommended usage.

Figure 4 .
Figure 4. The relationship between N_F N_I (log N_F + log N_I) (the x-axis) and the run-time of SCWC (the y-axis).

Figure 5 .
Figure 5. The fifteen datasets used in the experiment. The blue plots (•) represent the five datasets used in the feature selection challenge of NIPS 2003 [14], while the red plots (•) represent those used in the challenge of WCCI 2006 [15]. The other five are retrieved from the UCI repository [16].

(3) We train three classifiers with the reduced training data subset. The classifiers used are the C-Support Vector Machine with the Radial Basis Function kernel (RBF-kernel-C-SVM), Naïve Bayes and C4.5. Optimal values for the γ and C parameters of the RBF-kernel-C-SVM and for the confidence factor of C4.5 are chosen through grid search with ten-fold cross-validation on the reduced training data subset.
(4) We make the trained classifiers predict class labels for all instances of the reduced test data subset and compute scores for the accuracy measures AUC-ROC (Area Under the ROC Curve) and F-measure by comparing the obtained predictions with the true class labels.
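Steps (3) and (4) can be sketched with scikit-learn as a stand-in for the original toolchain. The synthetic data and the small parameter grids are assumptions for illustration; the paper's exact grids and datasets are not reproduced here.

```python
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import roc_auc_score, f1_score
from sklearn.datasets import make_classification

# Synthetic stand-in for a reduced (feature-selected) dataset.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Step (3): grid search with ten-fold CV for the RBF-kernel C-SVM's C and gamma.
grid = GridSearchCV(SVC(kernel="rbf", probability=True),
                    {"C": [0.1, 1, 10], "gamma": ["scale", 0.1]},
                    cv=10)
grid.fit(X_tr, y_tr)

# Step (4): predict on the held-out subset and score AUC-ROC and F-measure.
auc = roc_auc_score(y_te, grid.predict_proba(X_te)[:, 1])
f1 = f1_score(y_te, grid.predict(X_te))
```

The C4.5 and Naïve Bayes arms of the protocol follow the same fit/predict/score pattern with their own hyper-parameter grids.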

Figure 8
Figure 8 shows the results of experiments running SLCC while changing the value of δ from 0 to 0.02 at intervals of 0.002: the AUC-ROC scores obtained when using the RBF-kernel-C-SVM as the classifier are displayed in orange for each dataset.

Figure 8 .
Figure 8. The Area Under a Receiver Operating Characteristic Curve (AUC-ROC) scores and the numbers of features selected by SLCC while changing δ from 0.0 to 0.02 at intervals of 0.002. The lines and plots in blue represent the numbers of features, while those in orange represent the AUC-ROC scores.

Figure 9 .
Figure 9. The relation between the run-time of SLCC and the value of δ. The x-axis represents the value of δ, while the y-axis represents the run-time of SLCC in milliseconds.
(a) The number of features selected by SLCC with different values of the parameter δ. (b) The run-time of SLCC in milliseconds with different values of the parameter δ.

Figure 10 .
Figure 10. The effects of different values of the threshold parameter δ. The x-axis represents the value of δ, while the y-axis represents (a) the number of features selected by SLCC and (b) the run-time of SLCC. The orange lines indicate the corresponding values for SCWC.

Figure 13 .
Figure 13. Eliminating F_6 and sorting with respect to the value of F_7.

Figure 14 .
Figure 14. An example of the data structure (prefix: F_5, F_3, F_2; targets: F_9, F_8, F_7, F_6).

Table 2 .
An example dataset.

Table 3 .
Run-time of SCWC when applied to real high-dimensional data.

Table 4 .
Attributes of the 15 datasets used in the experiment for comparison of accuracy.

Table 5 .
Support Vector Machine (SVM) and the Area Under a Receiver Operating Characteristic Curve (AUC-ROC). Av. denotes averaged values.

Table 11 .
Overall comparison of the feature selection algorithms. The scores of the Area Under a Receiver Operating Characteristic Curve (AUC-ROC) and F-Score are the averaged values across the three classifiers and the 14 datasets. On the other hand, the averaged ranks are computed across all combinations of classifiers, accuracy measures and datasets. The Friedman and Hommel tests are conducted based on the averaged ranks computed here.

Table 12 .
Comparison of run-time (seconds) with relatively large datasets.

Table 13 .
AUC-ROC of SCWC when applied to real high-dimensional data.

Table 14 .
Comparison of run-time between SCWC and FRFS (# of Instances, # of Features, SCWC (s), FRFS (s), Ratio).
3. Assume that the algorithm selects F_{k_{ℓ+1}} and eliminates the features F_{k+1}, F_{k+2}, ..., F_{k_{ℓ+1}-1}. The necessary update of the data structure can be carried out in time linear in the number of instances: first, we simply eliminate the columns corresponding to F_{k+1}, F_{k+2}, ..., F_{k_{ℓ+1}-1}; in the reduced data structure, instances remain aligned in the lexicographical order of values with respect to (F_{k_ℓ}, F_{k_{ℓ-1}}, ..., F_{k_1}, F_N, F_{N-1}, ..., F_{k_{ℓ+1}}); to update the prefix from (F_{k_ℓ}, F_{k_{ℓ-1}}, ..., F_{k_1}) to (F_{k_{ℓ+1}}, F_{k_ℓ}, ..., F_{k_1}), we only have to apply bucket sort with respect to the value of F_{k_{ℓ+1}}.