Article

New Online Streaming Feature Selection Based on Neighborhood Rough Set for Medical Data

1 School of Business, Central South University, Changsha 410083, China
2 Third Xiangya Hospital, Central South University, Changsha 410013, China
* Author to whom correspondence should be addressed.
Symmetry 2020, 12(10), 1635; https://doi.org/10.3390/sym12101635
Submission received: 7 August 2020 / Revised: 13 September 2020 / Accepted: 22 September 2020 / Published: 3 October 2020
(This article belongs to the Section Computer)

Abstract

In many real-world applications, such as medical diagnosis and fraud detection, not all features are available from the start; instead, they are generated and arrive individually over time. Online streaming feature selection (OSFS) has recently attracted much attention due to its ability to select the best feature subset as features grow. Rough set theory is widely used as an effective tool for feature selection, specifically the neighborhood rough set. However, the two main neighborhood relations, namely the k-nearest neighborhood and the δ neighborhood, cannot efficiently deal with unevenly distributed data, and the traditional method of dependency calculation does not take into account the structure of the neighborhood covering. In this study, a novel neighborhood relation combining the k-nearest neighborhood and δ neighborhood relations is first defined. Then, we propose a weighted dependency degree computation method that considers the structure of the neighborhood relation. In addition, we propose a new OSFS approach named OSFS-KW that addresses the challenge of learning from class-imbalanced data. OSFS-KW has no adjustable parameters and no pretraining requirements. The experimental results on 19 datasets demonstrate that OSFS-KW not only outperforms traditional methods but also exceeds the state-of-the-art OSFS approaches.

1. Introduction

The number of features increases with the growth of data. A large feature space can provide much information that is useful for decision-making [1,2,3], but it also includes many irrelevant or redundant features that are useless for a given concept. Removing the irrelevant features is necessary to relieve the curse of dimensionality, which motivates research on feature selection methods. Feature selection, as a significant preprocessing step of data mining, can select a small subset containing the most significant and discriminative condition features [4]. Traditional methods are developed based on the assumption that all features are available. Many typical approaches exist, such as ReliefF [5], Fisher Score [6], mutual information (MI) [4], Laplacian Score [7], LASSO [8], and so on [9]. The main benefits of feature selection include speeding up model training, avoiding overfitting, and reducing the impact of dimensionality during data analysis [4].
However, in many real-world applications, features are generated individually over time, and traditional feature selection can no longer meet the required efficiency as the volume of features grows. For example, in the medical field, a doctor cannot easily obtain all the features of a patient at once. In bioinformatics and clinical medicine, acquiring the entire feature space is expensive and inefficient because of high-cost laboratory experiments [10]. In addition, for the task of medical image segmentation, acquiring all features is infeasible due to the infinite number of possible filters [11]. Furthermore, a patient's symptoms change continually over the course of treatment, and judging whether a newly emerged feature contains useful information is essential for identifying the patient's disease [12]. In these cases, waiting until the entire feature set is available before performing feature selection is impractical.
Online streaming feature selection (OSFS), which provides a feasible way to handle streaming features online, has recently attracted wide attention [13]. An OSFS method must meet the following three criteria [14]: (1) not all features are available in advance, (2) an efficient incremental updating process for the selected features is essential, and (3) high accuracy must be maintained at each timestamp.
Many previous studies have proposed different OSFS methods. For example, the grafting algorithm [15] employs a stagewise gradient descent approach to feature selection, in which a conjugate gradient procedure is used to optimize its parameters. However, like the grafting algorithm, both fast OSFS [16] and the scalable and accurate online approach (SAOLA) [13] require some parameters to be specified, which demands domain knowledge in advance. Rough set (RS) theory [17], an effective mathematical tool for feature selection, rule extraction, and knowledge acquisition [18], needs no domain knowledge other than the given datasets [19]. In the real world, we usually encounter many numerical features in datasets, such as medical datasets, and in this case the neighborhood rough set can analyze both discrete and continuous data [20,21]. Nevertheless, all these methods have some adjustable parameters. Considering that selecting unified and optimal values for all different datasets is unrealistic [22], an OSFS method based on an adapted neighborhood rough set was proposed, in which the number of neighbors for each object is determined by its surrounding instance distribution [22]. Furthermore, from the multi-granulation view, multi-granulation rough sets have been used to compute the neighborhood of each sample and determine the neighborhood size [23]. For the above OSFS methods based on neighborhood relations, the dependency degree calculation is a key step; however, very little work has considered the neighborhood structure from the granulation view during this calculation. In addition, uneven data distributions are common, including in medical data, and few works focus on this challenge.
In this paper, focusing on the strengths and weaknesses of the neighborhood rough set, we proposed a novel neighborhood relation. Further, a weighted dependency degree was developed by considering the neighborhood structure of each object. Finally, our approach, named OSFS-KW, was established. Our contributions were as follows:
(1)
We proposed a novel neighborhood relation, and on this basis, we developed a weighted dependency computation method.
(2)
We developed an OSFS framework, named OSFS-KW, which can select a small subset made up of the most significant and discriminative features.
(3)
The OSFS-KW was established based on [24] and can deal with the class imbalance problem.
(4)
The results indicate that OSFS-KW can not only obtain better performance than traditional feature selection methods but also outperform the state-of-the-art OSFS methods.
The remainder of the paper is organized as follows. In Section 2, we briefly review the main concepts of neighborhood RS theory. Section 3 discusses our new neighborhood relations and proposes the OSFS-KW. Then, Section 4 performs some experiments and discusses the experimental results. Finally, Section 5 concludes the paper.

2. Background

Neighborhood RS has been proposed to deal with numerical or heterogeneous data. In general, a decision table (DT) for a classification problem can be represented as $IS = \langle U, R, V, f \rangle$ [25], where $U$ is a nonempty finite set of samples; $R$ can be divided into condition attributes $C$ and decision attributes $D$, with $C \cap D = \emptyset$; $V = \{V_r \mid r \in R\}$ is the set of attribute domains, where $V_r$ denotes the domain of attribute $r$; and, for each $r \in R$ and $x \in U$, the mapping $f: U \times R \rightarrow V$ denotes an information function.
There are two main kinds of neighborhood relations: (1) the k-nearest neighborhood relation shown in Figure 1a and (2) the δ neighborhood relation shown in Figure 1b.
Definition 1 [26]. Given DT, a metric $\Delta$ is a distance function, where $\Delta(x, y)$ represents the distance between $x$ and $y$. Then, for any $x, y, z \in U$, it must satisfy the following:
(1) $\Delta(x, y) \ge 0$, and $\Delta(x, y) = 0$ if and only if $x = y$;
(2) $\Delta(x, y) = \Delta(y, x)$; and
(3) $\Delta(x, z) \le \Delta(x, y) + \Delta(y, z)$.
Definition 2 (δ neighborhood [22]). Given DT and a feature subset $B \subseteq C$, the neighborhood $\delta_B^r(x)$ of any object $x \in U$ is defined as follows:
$$\delta_B^r(x) = \{ y \mid \Delta_B(x, y) \le r, \; y \in U, \; y \ne x \}$$
where $r$ is the distance radius, and $\delta_B^r(x)$ satisfies:
(1) $y \in \delta_B^r(x) \Leftrightarrow x \in \delta_B^r(y)$;
(2) $1 \le card(\delta_B^r(x)) \le card(U)$, where $card(\cdot)$ denotes the number of elements in a set; and
(3) $\bigcup_{x \in U} \delta_B^r(x) = U$.
Definition 3 (k-nearest neighborhood [22]). Given DT and $B \subseteq C$, the k-nearest neighborhood $K_B^k(x)$ of any object $x \in U$ on the feature subset $B$ is defined as follows:
$$K_B^k(x) = \{ y \mid y \in MIN_k\{\Delta_B(x, y)\}, \; y \in U, \; y \ne x \}$$
where $MIN_k\{\Delta_B(x, y)\}$ represents the $k$ neighbors closest to $x$ on the subset $B$, and $K_B^k(x)$ satisfies:
(1) $K_B^k(x) \ne \emptyset$;
(2) $card(K_B^k(x)) = k$; and
(3) $\bigcup_{x \in U} K_B^k(x) = U$.
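To make Definitions 2 and 3 concrete, the following Python sketch (an illustration of ours, not the authors' code; the function names and toy data are assumptions) computes both neighborhoods from a pairwise distance matrix.
```python
import numpy as np

def delta_neighborhood(dist, i, r):
    """δ neighborhood (Definition 2): objects j != i with distance <= r."""
    return {j for j in range(dist.shape[0]) if j != i and dist[i, j] <= r}

def k_nearest_neighborhood(dist, i, k):
    """k-nearest neighborhood (Definition 3): the k objects closest to i."""
    order = np.argsort(dist[i])
    order = order[order != i]            # exclude the object itself
    return set(order[:k].tolist())

# Toy example: 5 objects described by 2 numerical features.
X = np.array([[0.1, 0.2], [0.15, 0.22], [0.9, 0.8], [0.85, 0.9], [0.5, 0.5]])
dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)   # pairwise distances
print(delta_neighborhood(dist, 0, r=0.3))      # {1}
print(k_nearest_neighborhood(dist, 0, k=2))    # {1, 4}
```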
Then, the concepts of the lower and upper approximations of these two neighborhood relations are defined as follows:
Definition 4. Given DT, for any $X \subseteq U$, two subsets of objects, called the lower and upper approximations of $X$ with regard to the δ neighborhood relation, are defined as follows [27]:
$$\underline{B_\delta}X = \{x_i \mid \delta(x_i) \subseteq X, \; x_i \in U\}$$
$$\overline{B_\delta}X = \{x_i \mid \delta(x_i) \cap X \ne \emptyset, \; x_i \in U\}$$
If $x \in \underline{B_\delta}X$, then $x$ certainly belongs to $X$; but if $x \in \overline{B_\delta}X$, then it may or may not belong to $X$.
Definition 5. Given DT, for any $X \subseteq U$, the lower and upper approximations concerning the k-nearest neighborhood relation are defined as [24]:
$$\underline{B_k}X = \{x_i \mid K(x_i) \subseteq X, \; x_i \in U\}$$
$$\overline{B_k}X = \{x_i \mid K(x_i) \cap X \ne \emptyset, \; x_i \in U\}$$
Figure 1a shows that the k-nearest neighbor (k = 4) samples of $x_1$, $x_2$, and $x_3$ have different class labels. In detail, the k-nearest neighborhood samples of $x_1$ are from class $C_2$ (marked “·”) and class $C_3$ (marked “🟌”); the k-nearest neighborhood samples of $x_2$ are from classes $C_2$, $C_3$, and $C_1$ (marked “★”); and the k-nearest neighbor samples of $x_3$ are from classes $C_1$ and $C_2$. Figure 1b depicts that all δ neighbor samples of $x_1$, $x_2$, and $x_3$ also come from different class labels. We regard $x_1$, $x_2$, and $x_3$ as boundary objects. The size of the boundary region increases the uncertainty in DT, because it reflects the roughness of $X$ in the approximation space.
By Definition 5, the object space $X$ can be partitioned into positive, boundary, and negative regions [28], which are defined, respectively, as follows:
$$POS_B(X) = \underline{B}X$$
$$BOU_B(X) = \overline{B}X - \underline{B}X$$
$$NEG_B(X) = U - \overline{B}X$$
In the data analysis, computing dependencies between attributes is an important issue. We give the definition of the dependency degree as follows:
Definition 6. Given DT, for any $B \subseteq C$, the dependency degree of $B$ to the decision attribute set $D$ is defined as [22]:
$$\gamma_B(D) = \frac{CARD(POS_B(D))}{CARD(U)}$$
The aim of feature selection is to select a subset $B$ from $C$ that attains the maximal dependency degree of $B$ to $D$. Since features become available one-by-one over time, it is necessary to measure the importance of each candidate feature.
Definition 7. Given DT, for $B \subseteq C$ and $D$, the significance of a feature $c$ ($c \in B$) to $B$ is defined as follows [22]:
$$\sigma_B(D, c) = \gamma_B(D) - \gamma_{B \setminus \{c\}}(D)$$
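As a hypothetical numerical illustration (the numbers are ours, not from the paper): suppose $U$ contains 10 samples and, under $B$, 7 of them fall into the positive region, so $\gamma_B(D) = 7/10 = 0.7$. If removing a feature $c$ from $B$ shrinks the positive region to 5 samples, then $\gamma_{B \setminus \{c\}}(D) = 0.5$ and $\sigma_B(D, c) = 0.7 - 0.5 = 0.2$, indicating that $c$ is significant to $B$; a significance of 0 would mean $c$ is redundant.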
In real applications, specifically in the medical field, the instances are often unevenly distributed in the feature space; that is, the distribution around some sample points is sparse, while the distribution around others is tight. Neither the k-nearest neighborhood relation nor the δ neighborhood relation can portray sample category information well, since settings of the parameters $r$ and $k$ can hardly suit both sparse and tight distributions. For example, the feature space in Figure 2 has two classes, red and green, marked with pentacle and hexagon symbols, respectively. Around sample point $x_1$, the distribution is sparse, and the three points nearest to $x_1$ all have a different class from $x_1$; if the k-nearest neighborhood relation (k = 3) is applied, $x_1$ will be misclassified. However, if we employ the δ neighborhood relation, the category of $x_1$ is consistent with that of most samples in its neighborhood. On the other hand, the distribution around point $x_2$ is tight, and samples from both classes fall inside its δ neighborhood (denoted by the red circle), so $x_2$ will be misclassified when the δ neighborhood relation is applied. In fact, if the k-nearest neighborhood relation (k = 3) is applied, $x_2$ will be classified correctly. Therefore, in Section 3, we propose a novel neighborhood rough set combining the advantages of the k-nearest neighborhood rough set and the δ neighborhood rough set.

3. Method

In this section, we initially introduce a definition of OSFS. Then, we propose a new neighborhood relation and an approach of a weighted dependency degree. Based on three evaluation criteria—namely, maximal dependency, maximal relevance, and maximal significance—our new method OSFS-KW is presented finally.

3.1. Problem Statement

$DS^t = (U^t, A^t, V^t)$ denotes the decision system (DS) at time $t$, where $A^t = C^t \cup D^t$ is the feature space, including the condition features $C^t$ and the decision feature $D^t$, and $U^t = \{x_1, x_2, \ldots, x_{n_t}\}$ is a nonempty finite set of objects. A new feature arrives individually at each timestamp, while the number of objects in $U^t$ is fixed. OSFS aims to derive a mapping $f^t: x_i \rightarrow D$ ($x_i \in U$) at each timestamp $t$, built on an optimal subset of the features available so far.
Contrary to the traditional feature selection methods, we cannot access the full feature space in OSFS scenarios. Moreover, the two main neighborhood relations cannot compensate for the problems caused by unevenly distributed data. In addition, class imbalance is common in medical data; for example, abnormal cases attract more attention than normal ones in medical diagnosis. It is therefore also crucial for the proposed framework to handle the class imbalance problem.

3.2. Our New Neighborhood Relation

3.2.1. k δ Neighborhood Relation

The standardized Euclidean distance is applied to eliminate the effect of variance on the distances among samples. Given any samples $x_i$ and $x_j$ and an attribute subset $B \subseteq C$, the distance between $x_i$ and $x_j$ is defined as follows:
$$\Delta_B(x_i, x_j) = \sqrt{\sum_{c \in B} \left( \frac{f(x_i, c) - f(x_j, c)}{\sigma_c} \right)^2}$$
where $f(x_i, c)$ denotes the value of $x_i$ on attribute $c$, and $\sigma_c$ represents the standard deviation of attribute $c$.
To overcome the challenge of unevenly distributed medical data, we propose a novel neighborhood relation as follows.
Definition 8. Given a decision system $DS = \{U, C, D, f\}$, where $U = \{x_1, x_2, x_3, \ldots, x_n\}$ is the finite sample set, $B \subseteq C$ is a condition attribute subset, and $D$ is the decision attribute set, the kδ neighborhood relation is defined as follows:
$$\kappa_B(x_i) = \{x_j \in U \mid x_j \in \delta_B^r(x_i) \cap K_B^k(x_i)\}$$
where $\delta_B^r(x_i)$ is given by Definition 2, $K_B^k(x_i)$ is given by Definition 3, and $k = 0.15n$, where $n$ is the number of instances.
Meanwhile, based on [22], $r = 1.5\,G_{mean}$, where $G_{mean}$ is defined as follows:
$$G_{mean} = \frac{G_{\max} - G_{\min}}{n - 1}$$
More specifically, $G_{\max}$ represents the maximum distance from $x_i$ to its neighbors, and $G_{\min}$ denotes the minimum distance from $x_i$ to its neighbors.
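The following Python sketch illustrates one possible reading of Definition 8 and the radius rule above; it is our illustration, not the authors' code, and the intersection of the two neighborhoods, the rounding of $k$, and the per-object radius are assumptions.
```python
import numpy as np

def kdelta_neighborhood(dist, i, k=None, r=None):
    """kδ neighborhood of object i, read here as the intersection of its
    δ neighborhood (radius r) and its k-nearest neighborhood."""
    n = dist.shape[0]
    if k is None:
        k = max(1, int(round(0.15 * n)))                 # k = 0.15n from Definition 8
    d_i = np.delete(dist[i], i)                          # distances to the other objects
    if r is None:
        g_mean = (d_i.max() - d_i.min()) / (n - 1)       # Gmean as reconstructed above
        r = 1.5 * g_mean                                 # r = 1.5 * Gmean, following [22]
    order = np.argsort(dist[i])
    order = order[order != i]
    knn = set(order[:k].tolist())                        # K_B^k(x_i)
    delta = {j for j in range(n) if j != i and dist[i, j] <= r}   # δ_B^r(x_i)
    return knn & delta
```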
Definition 9. Given $DS = \{U, C, D, f\}$ and its neighborhood relation $R$ on $U$, for $X \subseteq U$, the lower and upper approximations of $X$ in terms of the kδ neighborhood relation are defined as follows:
$$\underline{B_\kappa}X = \{x_i \mid \kappa_B(x_i) \subseteq X, \; x_i \in U\}$$
$$\overline{B_\kappa}X = \{x_i \mid \kappa_B(x_i) \cap X \ne \emptyset, \; x_i \in U\}$$
Similar to Definition 4, $\underline{B_\kappa}X$ is also called the positive region, denoted by $POS_B^\kappa(D)$.

3.2.2. Weighted Dependency Computation

The traditional dependency degree only considers the samples that are correctly classified and ignores the structure of the neighborhood covering. To solve this problem, we propose a weighted dependency degree that considers the granular information of the features.
Definition 10. Given $DS = \{U, C, D, f\}$, the weighted dependency of $B \subseteq C$ is defined as follows:
$$\tau_B^\kappa(D) = w(B) \cdot \gamma_B^\kappa(D)$$
where
$$w(B) = \log_2(2 - \varphi(B))$$
$$\gamma_B^\kappa(D) = \frac{CARD(POS_B^\kappa(D))}{CARD(U)}$$
$$\varphi(B) = \sum_{i=1}^{|U|} \frac{CARD(\kappa_B(x_i))}{CARD(U)^2}$$
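As a hypothetical worked example (our numbers, for illustration only): with $|U| = 10$ and an average neighborhood size of 4, $\varphi(B) = (10 \times 4)/10^2 = 0.4$, so $w(B) = \log_2(2 - 0.4) = \log_2 1.6 \approx 0.678$; if the positive region contains 7 samples, then $\gamma_B^\kappa(D) = 0.7$ and $\tau_B^\kappa(D) \approx 0.678 \times 0.7 \approx 0.47$. Larger neighborhoods (coarser granules) thus shrink the weight and penalize the raw dependency.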
Theorem 1. Given $DS = \{U, C, D, f\}$ and attribute subsets $P \subseteq O \subseteq C$, the weight $w(\cdot)$ is monotonic:
$$w(P) \le w(O)$$
Proof. According to the monotonicity of the neighborhood relation defined in Definition 8, $\kappa_O(x_i) \subseteq \kappa_P(x_i)$ and $CARD(\kappa_O(x_i)) \le CARD(\kappa_P(x_i))$. According to Equation (20), $\varphi(O) \le \varphi(P)$. Considering Definition 10, $0 < \varphi(O) \le 1$, $0 < \varphi(P) \le 1$, and $0 \le w(P) = \log_2(2 - \varphi(P)) \le w(O) = \log_2(2 - \varphi(O)) < 1$. □
Theorem 2. Given $DS = \{U, C, D, f\}$ and $B \subseteq C$, $\tau_B^\kappa(D) = \tau_C^\kappa(D)$ holds if and only if $\gamma_B^\kappa(D) = \gamma_C^\kappa(D)$ and $\log_2(2 - \varphi(B)) = \log_2(2 - \varphi(C))$.
The proof of Theorem 2 follows directly from the monotonicity of the neighborhood relation and Theorem 1.
Definition 11. Given $B \subseteq C$ and a decision attribute set $D$, the significance of a feature $c$ ($c \in B$) to $B$ can be rewritten as follows:
$$\sigma_B(D, c) = \tau_B^\kappa(D) - \tau_{B \setminus \{c\}}^\kappa(D)$$

3.3. Three Evaluation Criteria

During the OSFS process, many irrelevant and redundant features must be removed from high-dimensional datasets. Three evaluation criteria are used during this process, namely, max-dependency, max-relevance, and max-significance.

3.3.1. Max-Dependency

$C = \{c_1, c_2, \ldots, c_m\}$ denotes the set of $m$ condition attributes. The task of OSFS is to find a feature subset $B \subseteq C$ that has the maximal dependency $\mathcal{D}$ on the decision attribute set $D$; at the same time, the number $d$ of selected features should be as small as possible.
$$\max \mathcal{D}(B, D), \quad \mathcal{D} = \tau_B(D)$$
where $\tau_B(D)$ denotes the weighted dependency between the attribute subset $B$ and the target class label $D$. The dependency $\mathcal{D}$ can be rewritten as $\tau_{B_t}(D)$, where $B_t = \{B_{t-1}, c_t\}$. Hence, the incremental search algorithm optimizes the following problem for selecting the $t$-th feature from the attribute set $\{C - B_{t-1}\}$:
$$\max_{c_i \in \{C - B_{t-1}\}} \left\{ \tau_{\{B_{t-1}, c_i\}}(D) \right\}$$
which is also equivalent to optimizing the following problem:
$$\max_{c_i \in \{C - B_{t-1}\}} \left\{ \tau_{\{B_{t-1}, c_i\}}(D) - \tau_{B_{t-1}}(D) \right\} = \max_{c_i \in \{C - B_{t-1}\}} \left\{ \sigma_{B_t}(D, c_i) \right\}$$
The max-dependency criterion maximizes either the joint dependency between the selected feature subset and the decision attribute or the significance of the candidate feature to the already-selected features. However, the high-dimensional space has two limitations that lead to failure in generating the resultant equivalence classes: (1) the number of samples is often insufficient, and (2) during the multivariate density estimation process, computing the inverse of the high-dimensional covariance matrix is generally an ill-posed problem [29]. These problems are especially evident for continuous feature variables in real-life applications, such as in the medical field. In addition, the computation of max-dependency is slow. Meanwhile, max-dependency alone is inappropriate for OSFS, because only one feature is known at each timestamp rather than the entire feature space in advance.

3.3.2. Max-Relevance

Max-relevance is introduced as an alternative for selecting features, because implementing max-dependency is hard. The max-relevance criterion approximates $\mathcal{D}(B, D)$ in Equation (23) with the mean of the dependency values between each individual feature $c_i$ and the decision attribute $D$:
$$\max \mathcal{R}(B, D), \quad \mathcal{R} = \frac{1}{|B|} \sum_{c_i \in B} \tau_{c_i}(D)$$
where $B$ is the already-selected feature subset.
Rich redundancy likely exists among the features selected according to max-relevance. For example, if two features $c_i$ and $c_j$ in the large feature space highly depend on each other, then after removing either one of them, the class differentiation ability of the other would not substantially change. Therefore, the following max-significance criterion is added to solve the redundancy problem by selecting mutually exclusive features.

3.3.3. Max-Significance

Based on Equation (22), the importance of each candidate feature can be calculated. Max-significance can select mutually exclusive features as follows:
$$\max\{S\}, \quad S = \frac{1}{|B|} \sum_{c_i \in B} \sigma_B(D, c_i)$$
In OSFS, features flow in individually over time, so testing all combinations of the candidate features to maximize the dependency of the selected feature set is not feasible. However, we can first employ the “max-relevance” criterion to remove irrelevant features. Then, we employ the “max-significance” criterion to remove unimportant features from the selected feature set. Finally, the “max-dependency” criterion is used to select the feature set with the maximal dependency. Based on the three criteria mentioned previously, a novel online feature selection framework is proposed in the next subsection.

3.4. OSFS-KW Framework

The proposed weighted dependency computation method based on the kδ neighborhood RS is shown in Algorithm 1. First, we calculate the card value of each sample $x_i$ and accumulate the sum for the final weighted dependency at steps 5–14. The value $card(\cdot) \in [0, 1]$ denotes the consistency between the decision attribute of $x_i$ and the decision attributes of its neighbors. The kδ neighborhood relation is used to calculate the dependency of the attribute subset $B$. The value of $\tau_B^\kappa(D)$ reveals not only the distribution of labels near $x_i$ but also the granular structure information around $x_i$.
In the real world, we often encounter high-dimensional, class-imbalanced data, particularly in medical diagnosis. We therefore employ the method proposed in [24], named the class imbalance function, as shown in Algorithm 2. For imbalanced medical data, we apply Algorithm 2 to compute $card(\kappa_B(x_i))$ at step 9 of Algorithm 1.
Algorithm 1 Weighted dependency computation
Require:
   B: the target attribute subset;
   $X_B$: sample values on B;
Ensure:
1: $\tau_B^\kappa(D)$: dependency between B and decision attribute D;
2: $card(B)$: the number of positive samples on B, initialized to 0;
3: $|U|$: the number of instances in universe U;
4: $S_B$: the accumulated number of neighbor instances on B, initialized to 0;
5: for each $x_i$ in $X_B$
6:    Calculate the distance from $x_i$ to the other instances;
7:    Sort the neighbors of $x_i$ from the nearest to the farthest;
8:    Find the neighbor samples of $x_i$ as $\kappa_B(x_i)$;
9:    Calculate the card value of $x_i$ as $card(\kappa_B(x_i))$;
10:   $card(B) = card(B) + card(\kappa_B(x_i))$;
11:   Calculate the number of neighbors in $\kappa_B(x_i)$ with the same class label as $x_i$, denoted $S_{x_i}$;
12:   $S_B = S_B + S_{x_i}$;
13: end
14: $\tau_B^\kappa(D) = \log_2(2 - S_B/|U|^2) \cdot card(B)/|U|$;
15: return $\tau_B^\kappa(D)$
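Below is a minimal Python sketch of Algorithm 1, written by us for illustration (the experiments themselves were run in MATLAB); the function names are assumptions, and the card value is taken here as the same-class ratio of the neighborhood, whereas for imbalanced data it would instead come from Algorithm 2.
```python
import numpy as np

def weighted_dependency(X_B, y, neighborhood_fn):
    """Sketch of Algorithm 1: weighted dependency of an attribute subset B.

    X_B: (n, |B|) array of sample values on B; y: class labels;
    neighborhood_fn(dist, i): returns the kδ neighborhood (set of indices) of object i.
    """
    n = len(y)
    dist = np.linalg.norm(X_B[:, None, :] - X_B[None, :, :], axis=-1)
    card_B, S_B = 0.0, 0.0
    for i in range(n):                                   # steps 5-13
        nbrs = neighborhood_fn(dist, i)                  # steps 6-8
        same = sum(1 for j in nbrs if y[j] == y[i])      # step 11
        S_B += same
        card_B += same / max(len(nbrs), 1)               # step 9: card value in [0, 1]
    return np.log2(2 - S_B / n**2) * (card_B / n)        # step 14
```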
In Algorithm 2, $D_{large}$ denotes the large class, while $D_{small}$ denotes the small class. Samples from $D_{large}$ are treated differently from samples from $D_{small}$ at steps 3–11. For $x_i$ in the large class, if the number of neighbors with the same class label is at least 95% of its total number of neighbors, then $card(\kappa_B(x_i))$ is set to 1; otherwise, it is set to 0. For $x_i$ in the small class, we take the ratio of the number of neighbors from $D_{small}$ to the total number of neighbors as $card(\kappa_B(x_i))$. This method strengthens the consistency constraint on $D_{large}$ and weakens the consistency constraint on $D_{small}$, so $D_{small}$ is prevented from being overwhelmed by the samples in $D_{large}$.
Algorithm 2 Class imbalance function
Require:
   $D_{x_i}$: the class label of $x_i$;
   B: the target attribute subset;
Ensure:
   $card(\kappa_B(x_i))$: the card value of $x_i$ on B;
1: $N_B$: the number of neighbors in $\kappa_B(x_i)$ with the same class label as $x_i$;
2: $N_R$: the number of neighbors of $x_i$ on B;
3: if $D_{x_i} == D_{large}$ then
4:    if $N_B(x_i) \ge 0.95 N_R$ then
5:       $card(\kappa_B(x_i)) = 1$;
6:    else
7:       $card(\kappa_B(x_i)) = 0$;
8:    end
9: else
10:   $card(\kappa_B(x_i)) = N_B / N_R$;
11: end
12: return $card(\kappa_B(x_i))$;
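A hedged Python sketch of Algorithm 2 follows (our illustration; the function name, the `large_label` argument, and the empty-neighborhood guard are assumptions).
```python
def card_imbalance(nbrs, y, i, large_label):
    """Sketch of Algorithm 2: card value of x_i on B for class-imbalanced data.

    nbrs: indices of the kδ neighbors of x_i; y: class labels;
    large_label: the label of the large (majority) class.
    """
    n_r = len(nbrs)                                  # N_R: total number of neighbors
    if n_r == 0:
        return 0.0                                   # guard for an empty neighborhood
    n_b = sum(1 for j in nbrs if y[j] == y[i])       # N_B: same-class neighbors
    if y[i] == large_label:                          # steps 3-8: sample from the large class
        return 1.0 if n_b >= 0.95 * n_r else 0.0
    return n_b / n_r                                 # step 10: sample from the small class
```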
Based on the kδ neighborhood relation and the weighted dependency computation method mentioned above, we introduce our novel OSFS method, named “OSFS-KW”, as shown in Algorithm 3. The main aim of OSFS-KW is to maximize $\mathcal{D}(B, D)$ with the minimal number of selected features.
Algorithm 3 OSFS-KW
Require:
   C: the condition attribute set;
   D: the decision attribute;
Ensure:
   B: the selected attribute set
1: B initialized to $\emptyset$;
2: $\tau_B^\kappa(D)$: the dependency between B and D, initialized to 0;
3: $Mean\tau_B^\kappa(D)$: the mean dependency of the attributes in B, initialized to 0;
4: repeat
5:   Get a new attribute $c_i$ of C at timestamp i;
6:   Calculate the dependency of $c_i$ as $\tau_{c_i}^\kappa(D)$ according to Algorithm 1;
7:   if $\tau_{c_i}^\kappa(D) < Mean\tau_B^\kappa(D)$ then
8:      Discard attribute $c_i$ and go to step 25;
9:   end
10:  if $\tau_{B \cup c_i}^\kappa(D) > \tau_B^\kappa(D)$ then
11:     $B = B \cup c_i$;
12:     $\tau_B^\kappa(D) = \tau_{B \cup c_i}^\kappa(D)$;
13:     $Mean\tau_B^\kappa(D) = \frac{\sum_{c_i \in B} \tau_{c_i}^\kappa(D)}{card(B)}$;
14:  else if $\tau_{B \cup c_i}^\kappa(D) == \tau_B^\kappa(D)$ then
15:     $B = B \cup c_i$;
16:     Randomize the feature order in B;
17:     for each attribute $c_j$ in B
18:        Calculate the significance of $c_j$ as $\sigma_B(D, c_j)$;
19:        if $\sigma_B(D, c_j) == 0$
20:           $B = B - \{c_j\}$;
21:           $Mean\tau_B^\kappa(D) = \frac{1}{card(B)} \sum_{c_i \in B} \tau_{c_i}^\kappa(D)$;
22:        end
23:     end
24:  end
25: until no attributes are available;
26: return B;
Specifically, when a new attribute $c_i$ arrives at timestamp $i$, its dependency is calculated by Algorithm 1. The dependency of $c_i$ is then compared with the mean dependency of the selected attribute subset $B$ at step 7. If $\tau_{c_i}^\kappa(D) \ge Mean\tau_B^\kappa(D)$, the algorithm proceeds to step 10; otherwise, $c_i$ is discarded and the algorithm goes to step 25, due to the “max-relevance” constraint.
When $c_i$ satisfies the “max-relevance” constraint, i.e., $\tau_{c_i}^\kappa(D) \ge Mean\tau_B^\kappa(D)$, the algorithm goes to step 10 and compares the dependency of the current attribute subset $B$ with that of $B \cup c_i$. If $\tau_{B \cup c_i}^\kappa(D) > \tau_B^\kappa(D)$, then adding attribute $c_i$ into $B$ increases the dependency of $B$, so $c_i$ is added under the “max-dependency” constraint; that is, $B = B \cup c_i$. On the other hand, if $\tau_{B \cup c_i}^\kappa(D) = \tau_B^\kappa(D)$, then some redundant attributes exist in $B \cup c_i$. In this case, we first add $c_i$ into $B$ and then remove the redundant attributes in steps 16–24. Under the “max-significance” constraint, we randomly order the attributes in $B$ and compute each attribute's significance according to Equation (22); attributes with a significance equal to 0 are removed from $B$. Ultimately, we obtain the best feature subset for decision-making through the aforementioned three evaluation constraints, as sketched below.
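The following Python sketch summarizes the OSFS-KW loop of Algorithm 3; it is our illustration under stated assumptions (the `dependency_fn` and `significance_fn` callbacks, the caching of single-feature dependencies, and the empty-subset guard are ours, not part of the original implementation).
```python
import random

def osfs_kw(feature_stream, dependency_fn, significance_fn):
    """Sketch of Algorithm 3 (OSFS-KW).

    feature_stream yields feature names one at a time;
    dependency_fn(subset) returns the weighted dependency of a feature subset (Algorithm 1);
    significance_fn(subset, c) returns sigma_B(D, c) as in Definition 11.
    """
    B, tau_B, mean_tau = [], 0.0, 0.0
    tau_single = {}                                    # cached single-feature dependencies
    for name in feature_stream:                        # a new feature arrives at each timestamp
        tau_c = dependency_fn([name])                  # step 6 (uses Algorithm 1)
        tau_single[name] = tau_c
        if tau_c < mean_tau:                           # step 7: max-relevance filter
            continue
        tau_Bc = dependency_fn(B + [name])
        if tau_Bc > tau_B:                             # steps 10-13: max-dependency
            B.append(name)
            tau_B = tau_Bc
        elif tau_Bc == tau_B:                          # steps 14-23: max-significance
            B.append(name)
            random.shuffle(B)
            for c in list(B):
                if significance_fn(B, c) == 0:         # remove redundant attributes
                    B.remove(c)
        else:
            continue                                   # adding c_i would not help
        mean_tau = sum(tau_single[f] for f in B) / len(B) if B else 0.0
    return B
```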

3.5. Time Complexity of OSFS-KW

In the process of OSFS-KW, the weighted dependency degree computation shown in Algorithm 1 is a substantially important step. Let the number of examples in DS be $n$ and the number of attributes in $C$ be $m$. Table 1 shows the time complexity of the different steps of OSFS-KW. In Algorithm 1, we compute the distance between $x_i$ and the other instances for each sample $x_i \in U$; the time complexity of this process is $O(n)$, where $n = card(U)$. Sorting all neighbors of $x_i$ by distance is essential to find the neighbors of $x_i$, and the time complexity of the quick-sort process is $O(n \log n)$. Thus, the time complexity of Algorithm 1 is $O(n^2 \log n)$.
At timestamp $i$, when a new attribute $c_i$ is presented to OSFS-KW, the time complexity of steps 6–9 is $O(n^2 \log n)$. If the dependency of $c_i$ is smaller than $Mean\tau_B^\kappa(D)$, then $c_i$ is discarded; otherwise, comparing the dependency of $B \cup c_i$ with that of $B$ also takes $O(n^2 \log n)$. If $\tau_{B \cup c_i}^\kappa(D) > \tau_B^\kappa(D)$, then $c_i$ is added into $B$ and the loop continues at step 25. However, if $\tau_{B \cup c_i}^\kappa(D) = \tau_B^\kappa(D)$, then the time complexity of steps 14–24 is $O(card(B) \cdot n^2 \log n)$. Thus, the worst-case complexity of OSFS-KW is $O(m^2 n^2 \log n)$. Since selecting all features of a real-world dataset never happens in practice, the actual time complexity is smaller than $O(m^2 n^2 \log n)$.

4. Experiments

4.1. Data and Preprocessing

We use high-dimensional medical datasets as our test bench to compare the performance of the proposed OSFS-KW with existing streaming feature selection algorithms. Table 2 summarizes the 19 high-dimensional datasets used in our experiments.
In Table 2, the BREAST CANCER and OVARIAN CANCER datasets are biomedical datasets [30]. The LYMPHOMA and SIDO0 datasets are from the WCCI 2008 Performance Prediction Challenge [31]. MADELON and ARCENE are from the NIPS 2003 feature selection challenge [16]. WDBC, HILL, HILL (NOISE), and COLON TUMOR are four UCI datasets, available at https://archive.ics.uci.edu/ml/index.php. DLBCL, CAR, LUNG-STD, GLIOMA, LEU, LUNG, MLL, PROSTATE, and SRBCT are nine microarray datasets [32,33].
In our experiments, we employ K-nearest neighbor (KNN), support vector machines (SVM), and random forest (RF) as the basic classifiers to evaluate a selected feature subset. The radial basis function kernel is used in SVM, and the Gini coefficient is used to measure variable importance in RF. Furthermore, grid search with cross-validation is applied to train and optimize these three classifiers for the best prediction results. The search ranges of the adjustable parameters for each basic classifier are shown in Table 3.
As listed below, three key metrics are employed to compare OSFS-KW with the other streaming feature selection methods.
(1)
Compactness: the number of selected features,
(2)
Time: the running time of each algorithm,
(3)
Prediction accuracy: the percentage of the correctly classified test samples.
The results are collected on the MATLAB 2017b platform with Windows 10, an Intel(R) Core(TM) i5-8265U 1.8 GHz CPU, and 8 GB of memory. In addition, we applied the Friedman test at a 95% significance level under the null hypothesis to validate whether OSFS-KW and its rivals have a significant difference in prediction accuracy, compactness, and running time [34]. Accepting the null hypothesis means that the performance of OSFS-KW has no significant difference from its rivals. If the null hypothesis is rejected, then follow-up inspections are necessary; in that case, we employed the Nemenyi test [35], under which the performances of two methods are significantly different if the difference between their average ranks (AR) is greater than the critical difference (CD).
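For illustration, a minimal Python sketch of such a Friedman test is shown below (the accuracy arrays are placeholders, not the reported results; SciPy's `friedmanchisquare` is used here rather than the paper's MATLAB implementation).
```python
from scipy.stats import friedmanchisquare

# Each list holds one method's prediction accuracy over the same datasets
# (placeholder values; the actual numbers are reported in the Appendix tables).
acc_osfs_kw = [0.98, 0.57, 0.56, 0.88, 0.92]
acc_method2 = [0.97, 0.56, 0.57, 0.93, 0.95]
acc_method3 = [0.97, 0.56, 0.56, 0.90, 0.99]

stat, p_value = friedmanchisquare(acc_osfs_kw, acc_method2, acc_method3)
# Reject the null hypothesis of equal performance when p_value < 0.05;
# a post-hoc Nemenyi test would then compare average ranks against the CD.
print(stat, p_value)
```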

4.2. Experiments and Discussions

4.2.1. OSFS-KW versus k-Nearest Neighborhood

In this section, we compare OSFS-KW with the k-nearest neighborhood relation. We employ the same algorithm framework for both neighborhood relations to reduce the impact of other factors. In addition, for the k-nearest neighborhood relation, the value of k varies from 3 to 13 in the experiments.
The experimental results are shown in Appendix A. Table A1 and Table A2 show the compactness and running time; the p-values of the Friedman test are 5.07 × 10−9 and 5.47 × 10−10, respectively. In addition, Table A3, Table A4 and Table A5 show the experimental results for the prediction accuracy on these datasets; the p-values on KNN, SVM, and RF are 0.6949, 0.9884, and 0.5388, respectively. Table A6 shows the test results of OSFS-KW versus the k-nearest neighborhood. Therefore, a significant difference exists among these 19 datasets on compactness and running time. On the contrary, no significant difference is observed in accuracy with KNN, SVM, and RF. According to the Nemenyi test, the value of the CD is 3.8215, and we have the following observations from Table A1, Table A2, Table A3, Table A4 and Table A5.
In terms of compactness, a significant difference is observed between OSFS-KW and the k-nearest neighborhood only when k = 10, 11, 12, and 13, but OSFS-KW selects the smallest average number of features. Regarding the running time, there is a significant difference between OSFS-KW and the k-nearest neighborhood when k = 3, 4, 5, 6, 7, and 8. In general, the k-nearest neighborhood is faster than OSFS-KW, mainly because the number of neighbors is uncertain for OSFS-KW but fixed for the k-nearest neighborhood. According to the values of AR and CD, there is no significant difference between OSFS-KW and the k-nearest neighborhood in the prediction accuracy of the three basic classifiers for values of k from 3 to 13. On some datasets, such as COLON TUMOR, DLBCL, CAR, LYMPHOMA, and LUNG-STD, if a proper k is chosen, the k-nearest neighborhood can achieve a higher prediction accuracy than OSFS-KW with KNN, SVM, and RF. This finding means that the k-nearest neighborhood can perform well with a properly chosen parameter k.

4.2.2. OSFS-KW versus δ Neighborhood

In this section, OSFS-KW is compared with the δ neighborhood relation. We employ the same algorithm framework for both neighborhood relations for fairness. In addition, we set $\delta = r \times D_{\max}$ and conduct experiments with values of $r$ from 0.1 to 0.5 in steps of 0.05. The experimental results can be seen in Appendix B. Table A7 shows the compactness of the different methods on the 19 datasets, and the p-value of the Friedman test is 2.92 × 10−15. Table A8 shows the running time on these datasets, and the p-value is 3.65 × 10−10. Table A9, Table A10 and Table A11 show the results of the prediction accuracy of OSFS-KW versus the δ neighborhood; the p-values on KNN, SVM, and RF are 0.0275, 0.7815, and 0.6683, respectively, as shown in Table A12. There is a significant difference among the different algorithms on compactness, running time, and prediction accuracy using KNN, but no significant difference exists in the prediction accuracy of SVM and RF. In addition, the value of CD is 3.1049.
Regarding the number of selected features shown in Table A7, a significant difference is observed between OSFS-KW and the δ neighborhood when r = 0.4, 0.45, and 0.5. Our proposed method OSFS-KW selects the smallest mean number of features, whereas the number of features selected by the δ neighborhood increases as r increases. In terms of the running time shown in Table A8, a significant difference exists between OSFS-KW and the δ neighborhood when r = 0.3~0.5, and no significant difference is found when r = 0.1~0.3; OSFS-KW has the smallest mean running time. On the average ranks shown in Table A9, Table A10 and Table A11, relative to the value of CD, no significant difference is observed with KNN when r = 0.2 and 0.25 or with RF when r = 0.1, 0.15, 0.2, 0.25, 0.35, 0.4, and 0.5; in particular, no significant difference exists with RF under any value of r. In terms of prediction accuracy, OSFS-KW has the highest mean prediction accuracy among these datasets compared with the δ neighborhood. The δ neighborhood can also obtain the highest prediction accuracy with particular r values on some datasets, such as DLBCL and LUNG-STD; however, it is impossible for the δ neighborhood relation to use a uniform parameter setting on all kinds of datasets.

4.2.3. Influence of Feature Stream Order

In this section, we carry out experiments on OSFS-KW with three types of feature stream orders: original, inverse, and random. Figure 3 depicts the compactness of OSFS-KW on the datasets, and Figure 4, Figure 5 and Figure 6 show the prediction accuracy with KNN, SVM, and RF, respectively.
In addition, we execute the Friedman test at a 95% significance level under the null hypothesis to verify whether there is a significant difference in compactness, running time, and predictive accuracy. Table A12 in Appendix C shows the calculated p-values. It is clear that there is no significant difference, except for the running time with the random order and the prediction accuracy using KNN with the random order. The number of features in the feature space has a remarkable impact on the difference in running time between the original and random orders, specifically when the number of features is very large. For example, ARCENE has 10,000 features, and the difference in running time on this dataset between the original and random orders is 157.2334 s.
Figure 3, Figure 4, Figure 5 and Figure 6 show minor fluctuations on some datasets. However, the three orders have no significant difference from each other on most of the datasets. This result indicates that the feature stream order has a limited impact on OSFS-KW.

4.2.4. OSFS-KW versus Traditional Feature Selection Methods

In this section, 11 representative traditional feature selection methods are compared with OSFS-KW, including Fisher Score [36], spectral feature selection (SPEC) [37], the Pearson correlation coefficient (PCC) [38], ReliefF [39], Laplacian Score [7], an unsupervised feature selection method with ordinal locality (UFSOL) [40], mutual information (MI) [41], the infinite latent feature selection method (ILFS) [42], lasso regression (Lasso) [43], a fast correlation-based filter method (FCBF) [44], and a correlation-based feature selection approach (CFS) [45].
We implement all these algorithms in MATLAB. The k value of ReliefF is set to 7 for its best performance. Because these 11 traditional feature selection methods cannot be applied in the OSFS scenario, we rank all features and select the same number of features as OSFS-KW. In addition, we employ three basic classifiers, namely KNN, SVM, and RF. The prediction accuracy of the three classifiers with five-fold cross-validation is used to evaluate OSFS-KW and all competing methods.
The experimental results are shown in Appendix D. Table A14, Table A15 and Table A16 show the prediction accuracy of the three basic classifiers. The p-values on the accuracy with KNN, SVM, and RF are 1.20 × 10−15, 3.81 × 10−12, and 5.99 × 10−13, respectively. Table A17 shows the test results. Thus, a significant difference is observed between OSFS-KW and the compared algorithms on the prediction accuracy with the three classifiers. According to the value of CD, which is 3.8215, we can make the following observations from Table A14, Table A15 and Table A16.
(1)
OSFS-KW versus Fisher. According to the values of AR and CD, no significant difference is found between these two methods on the prediction accuracy at a 95% significance level. However, OSFS-KW has a better performance than Fisher on most datasets with the three classifiers.
(2)
OSFS-KW versus SPEC. A significant difference exists between these two algorithms on the prediction accuracy with KNN, SVM, and RF. Furthermore, OSFS-KW outperforms SPEC on most of the datasets. On the whole, SPEC cannot handle some datasets well.
(3)
OSFS-KW versus PCC. A significant difference is found between OSFS-KW and PCC on the prediction accuracy with KNN and RF but not with SVM. On many datasets, OSFS-KW outperforms PCC.
(4)
OSFS-KW versus ReliefF. No significant difference is observed between OSFS-KW and ReliefF on the accuracy with KNN, SVM, and RF. The performance of ReliefF decreases when data are scarce, as ReliefF cannot distinguish among redundant features.
(5)
OSFS-KW versus MI. No significant difference is observed between these two algorithms with KNN, SVM, and RF.
(6)
OSFS-KW versus Laplacian. The Laplacian Score is an unsupervised feature selection algorithm that uses no class information of the instances; it computes the locality-preserving power of a feature to evaluate its importance. A significant difference is found between these two algorithms with KNN, SVM, and RF. OSFS-KW outperforms the Laplacian Score on all datasets.
(7)
OSFS-KW versus UFSOL. UFSOL is an unsupervised feature selection method that preserves topology information known as ordinal locality. A significant difference is observed between these two algorithms on the prediction accuracy with KNN, SVM, and RF.
(8)
OSFS-KW versus ILFS. ILFS is a feature selection algorithm based on a probabilistic latent graph. A significant difference exists between these two algorithms on the prediction accuracy, and OSFS-KW outperforms ILFS on most datasets. ILFS has a prediction accuracy of approximately 0.6439, 0.6677, and 0.7152 on the OVARIAN CANCER dataset with KNN, SVM, and RF, respectively; by contrast, OSFS-KW achieves 0.996, 0.9923, and 0.9881, exceeding ILFS by over 30% in prediction accuracy on this dataset.
(9)
OSFS-KW versus Lasso. Lasso is a regularization method for linear regression that assigns a weight coefficient to each feature. No significant difference is observed between these two algorithms. On the GLIOMA and MADELON datasets, the prediction accuracy of OSFS-KW is nearly 22% higher than that of Lasso with KNN, SVM, and RF.
(10)
OSFS-KW versus FCBF. FCBF, which explicitly addresses the correlation between features, ranks the features according to their MI with the class to be predicted. No significant difference is found between OSFS-KW and FCBF on the prediction accuracy with KNN, SVM, and RF.
(11)
OSFS-KW versus CFS. CFS is a correlation-based feature selection method. No significant difference exists between OSFS-KW and CFS, but OSFS-KW outperforms CFS on most of the 19 datasets. On some datasets, such as MADELON and ARCENE, the prediction accuracy of CFS is lower than that of OSFS-KW by nearly 30%.
Overall, OSFS-KW not only performs best among the 19 datasets but also has the highest average prediction accuracy with KNN, SVM, and RF.

4.2.5. OSFS-KW versus OSFS Methods

In this section, our algorithm is compared to five state-of-the-art OSFS methods—namely, OSFS-A3M [22], OSFS [16], Alpha-investing [46], Fast-OSFS [47], and SAOLA [13].
We implement all aforementioned algorithms in MATLAB [48], and the significance level α is set to 0.05 for the above five algorithms. The threshold a and the wealth w of Alpha-investing are set to 0.5. As shown in Appendix E, Table A18 and Table A19 report the compactness and running time of OSFS-KW against the other five algorithms; the p-values of the Friedman test on these two metrics are 0.0248 and 3.62 × 10−23, respectively. Table A20, Table A21 and Table A22 summarize the prediction accuracy on the 19 datasets using the KNN, SVM, and RF classifiers, with p-values of 0.0337, 0.0032, and 0.0533, respectively. Table A23 shows the test results. A significant difference is found between the six algorithms on the number of selected features, the running time, and the prediction accuracy using KNN and SVM, but no significant difference is observed using RF. According to the value of CD, which is 1.7296, we can make the following observations from Table A18, Table A19, Table A20 and Table A21.
(1)
In terms of compactness, no significant difference is observed between OSFS-KW and the other competing algorithms. Fast-OSFS has the smallest mean number of selected features. In addition, for SAOLA, the number of selected features is remarkably large on some datasets but zero on others, which demonstrates that SAOLA cannot handle some types of datasets well.
(2)
On the running time, Alpha-investing is the fastest algorithm among all these six algorithms and has the smallest mean running time among these datasets. According to the values of AR and CD, a significant difference exists among OSFS-KW, Alpha-investing, Fast-OSFS, and SAOLA. The difference between OSFS-KW and OFS-A3M on the running time is small.
(3)
According to the prediction accuracy, OSFS-KW has the highest mean prediction accuracy on these datasets using all three classifiers. OSFS-KW outperforms the five competing algorithms. No significant difference is observed between OSFS-KW and the other competing methods, except for SAOLA.
In summary, although our method, OSFS-KW, is slower than some competing methods, including Fast-OSFS and SAOLA, OSFS-KW is superior among the six methods in prediction accuracy of the 19 datasets.

5. Conclusions

Most of the existing OSFS methods cannot deal well with unevenly distributed data. In this study, we defined a new kδ neighborhood relation that combines the advantages of the k-nearest neighborhood relation and the δ neighborhood relation. Then, we proposed a weighted dependency degree that considers the structure of the neighborhood covering. Finally, we proposed a new OSFS framework named OSFS-KW, which does not require any parameters to be specified in advance and can also handle the class imbalance problem in medical datasets. With three evaluation criteria, this approach can select the optimal feature subset for mapping to the decision attributes. Finally, we used KNN, SVM, and RF as the basic classifiers in the experiments to validate the effectiveness of our method. The results of the Friedman test indicate that a significant difference exists between OSFS-KW and the other neighborhood relations on compactness and running time, but there is no significant difference in predictive accuracy. Moreover, when compared with the 11 traditional feature selection methods and five existing OSFS algorithms, OSFS-KW performs better than the traditional feature selection methods and outperforms the state-of-the-art OSFS methods. However, we focused only on the challenges of medical data and used only medical datasets to verify the validity of our approach; in fact, our method can generally be applied to other similar fields. In the future, we will test and evaluate this method on multidisciplinary datasets.

Author Contributions

Conceptualization, D.L. and P.L.; methodology, D.L.; software, D.L. and P.L.; validation, J.H. and Y.Y.; formal analysis, D.L., P.L. and J.H.; investigation, Y.Y.; resources, Y.Y.; data curation, D.L. and P.L.; writing—original draft preparation, D.L. and Y.Y.; writing—review and editing, P.L. and J.H.; visualization, D.L.; supervision, J.H.; project administration, D.L., P.L., J.H. and Y.Y.; and funding acquisition, J.H. and Y.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported partially by the Hunan Provincial Natural Science Foundation of China under grant number 2017JJ3472 and the National Natural Science Foundation of China under grant number 71871229.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. The Results of OSFS-KW versus k-Nearest Neighborhood

It is noteworthy that the values in bold are the minimum obtained by the different techniques on the same dataset.
Table A1. OSFS-KW versus k-nearest neighborhood (compactness).
Data SetOSFS-KWk = 3k = 4k = 5k = 6k = 7k = 8k = 9k = 10k = 11k = 12k = 13
WDBC182322212419181921222220
HILL562521135326
HILL (NOISE)1133399156151241
COLON TUMOR242226273339263839273342
DLBCL132330324045436047715658
CAR359696969696969696969696
LYMPHOMA73434465247485270595861
LUNG-STD122632423335584358777367
GLIOMA131416149101287945
LEU629446055615466711077280
LUNG166265849111089115130140137151
MLL102523262038302836333346
PROSTATE253231424246445045457146
SRBCT15108810101099876
ARCENE5152489286978810894119140132
MADELON221121212121110141515
BREAST CANCER34111111392783661
OVARIAN CANCER65755811158599128147156148156
SIDO01811441343123219
AVERAGE16.8927.2628.3236.6338.6340.1139.3246.4747.6357.2654.6856.21
RANKS3.894.473.765.636.116.535.877.377.879.118.399.00
Table A2. OSFS-KW versus k-nearest neighborhood (running time).
Data SetOSFS-KWk = 3k = 4k = 5k = 6k = 7k = 8k = 9k = 10k = 11k = 12k = 13
WDBC1.79341.10391.06731.02731.01801.70061.76631.91281.90201.83951.74371.6871
HILL5.38305.67825.38935.94086.73866.28225.68355.97135.45915.91745.43915.6756
HILL (NOISE)3.93423.43053.39363.33103.34273.34143.37153.43863.66143.66243.63083.6760
COLON TUMOR2.06760.955661.04151.04331.03391.06211.05341.13511.10911.11151.21671.1767
DLBCL10.54984.36844.39314.43884.74614.87284.54784.71164.60095.24605.12644.7404
CAR200.672651.86953.30435.399423.826340.258335.084929.17433.193137.783533.35733.2482
LYMPHOMA12.42785.3355.2625.12065.20074.86785.29635.33486.27015.60555.88686.1197
LUNG-STD55.996836.43448.274939.947745.057943.170647.422945.08543.572951.413049.91548.3410
GLIOMA4.81911.33121.32291.27281.28661.26001.31111.25881.26401.33881.28521.3367
LEU4.63642.24952.31312.40162.39332.49482.46152.54963.11123.40173.21123.4583
LUNG31.980424.499730.03531.563432.724034.00432.178134.82534.734435.023734.94831.5197
MLL11.26424.37275.75695.24125.20534.56914.42515.26105.110785.15465.11955.1029
PROSTATE24.23649.66869.022710.44848.140707.97147.91918.71417.89637.89658.05177.8299
SRBCT3.44750.94480.89831.00841.07220.94490.91820.90201.19231.20721.21021.2988
ARCENE87.955236.480740.04843.038340.569944.318943.074046.62645.876647.302748.93549.1771
MADELON538.4690563.6911563.0505586.00479994.3255562.1598651.1277953.06221002.01251021.341920.22131009.9971
BREAST CANCER114.8965110.914110.857110.859110.336111.137110.530112.239111.258115.893112.236112.990
OVARIAN CANCER80.832470.04170.30872.13575.09572.29673.67876.29778.25378.40878.02578.547
SIDO064.282260.645462.64265.304165.360463.809965.870365.27063.032466.095563.21950.2334
AVERAGE66.3052.3253.6053.98548.8153.1957.7773.8876.5078.7272.7876.64
RANKS10.003.954.424.895.535.265.377.166.589.477.797.58
Table A3. Predictive accuracy of OSFS-KW versus k-nearest neighborhood (KNN).
Data SetOSFS-KWk = 3k = 4k = 5k = 6k = 7k = 8k = 9k = 10k = 11k = 12k = 13
WDBC0.97790.97190.96850.95440.97540.96360.97370.97020.97180.97190.97200.9720
HILL0.57100.56440.55780.56270.57270.54940.54940.56280.564440.56280.53960.5478
HILL (NOISE)0.55940.56750.56240.56590.54950.54440.56100.55270.56090.56100.57240.5380
COLON TUMOR0.88330.93460.90260.96790.92050.90130.87180.88590.87050.91670.88720.8859
DLBCL0.92230.94640.98750.9751.00.98750.98751.00.9751.00.98750.975
CAR0.88990.89690.87440.89510.90220.88540.90280.89250.90910.89630.90240.8838
LYMPHOMA0.9700.96750.98570.981810.9818111111
LUNG-STD0.95030.99441111111111
GLIOMA0.86550.90050.88730.86230.84410.90550.90550.88050.86230.84410.86230.8641
LEU10.98670.97240.98570.98570.98570.98570.98570.985710.98570.9857
LUNG0.95550.96060.95490.97510.95060.96460.970.95060.96470.96990.95060.9597
MLL10.98461111111111
PROSTATE0.93140.92140.9510.95140.9510.95140.9610.9310.96050.96050.96050.9605
SRBCT0.94040.98820.9778110.98890.98890.98890.98890.97780.97780.9408
ARCENE0.90560.86020.90040.90940.88970.93010.89570.90060.89960.91510.91040.9101
MADELON0.88850.54810.54120.89580.89580.89580.89580.89080.91150.87380.87770.88
BREAST CANCER0.73070.66850.70620.68170.70640.68170.70640.71670.72380.71950.73020.7656
OVARIAN CANCER0.99610.9960.996110.9960.9960.996110.996
SIDO00.9480.9720.59220.9680.9680.9640.970.9680.970.96010.990.994
AVERAGE0.89 0.88 0.86 0.900.900.900.900.900.900.900.900.90
RANKS7.55 7.24 8.24 6.34 5.766.63 5.426.92 5.87 5.45 5.71 6.87
Table A4. Predictive accuracy of OSFS-KW versus k-nearest neighborhood (SVM).
Data SetOSFS-KWk = 3k = 4k = 5k = 6k = 7k = 8k = 9k = 10k = 11k = 12k = 13
WDBC0.97540.97010.97540.97890.97370.95430.96320.97720.97530.97540.97540.9736
HILL0.51810.50650.51650.51310.51320.50980.50980.51310.50650.50990.50660.5082
HILL (NOISE)0.52310.51320.50000.50990.52480.51820.53140.52480.52640.52310.50820.5049
COLON TUMOR0.91790.87050.86920.88590.90510.87050.87050.87050.86920.86920.86920.8846
DLBCL0.97320.93570.97750.9750.98750.98750.9750.9750.9751.00.9750.9625
CAR0.95340.92290.91210.91650.92240.91660.91720.91180.91840.93360.93980.9103
LYMPHOMA0.98580.95320.96750.981810.98181110.981811
LUNG-STD0.977810.9890.99441110.994410.994411
GLIOMA0.92550.89860.90550.86050.80410.86230.89860.86230.82410.78770.88050.8823
LEU0.98570.98570.98570.98570.98570.97140.98570.98570.98570.98570.98570.9857
LUNG0.97510.97010.95470.97970.96490.970.97530.94540.96510.96060.94540.9504
MLL0.98670.98461.01.01.01.01.01.00.98461.01.01.0
PROSTATE0.94140.9410.9510.9410.9410.9510.9410.9210.9510.9510.9410.951
SRBCT0.95190.98820.988910.988910.98890.98890.98890.98890.97780.9402
ARCENE0.91020.89540.89010.91960.91020.94510.91520.92030.940.95010.940.8992
MADELON0.88720.55460.55650.87690.87690.87690.87690.87120.88150.86880.8750.8785
BREAST CANCER0.70910.67490.67490.67490.67490.67490.67490.75140.66770.73070.75470.7447
OVARIAN CANCER0.9921.00.9961.01.01.00.9960.9961.00.9961.01.0
SIDO00.9720.9240.57150.9660.9660.8820.970.9720.970.970.990.99
AVERAGE0.89800.8679 0.8517 0.8926 0.8915 0.8880 0.8942 0.8937 0.8910 0.8935 0.89800.8930
RANKS5.73688.2368 7.8158 6.0789 5.7895 6.3947 6.0526 6.5000 6.3684 6.2368 6.1579 6.6316
Table A5. Predictive accuracy of OSFS-KW versus k-nearest neighborhood (RF).
Data SetOSFS-KWk = 3k = 4k = 5k = 6k = 7k = 8k = 9k = 10k = 11k = 12k = 13
WDBC0.97790.95270.95440.96140.96310.97190.97370.95450.95450.93860.96140.9544
HILL0.54610.56930.54120.55770.54950.50490.50990.53630.57270.55110.56430.5528
HILL (NOISE)0.56420.55120.55770.55950.54780.52800.58400.55100.52780.56430.56420.5017
COLON TUMOR0.82050.77050.87050.85380.77690.82180.81920.86920.82310.87050.83720.8859
DLBCL0.84300.87230.92320.92320.93570.92320.90980.89730.93750.94820.93660.9375
CAR0.83270.80.81100.85440.79200.80620.81930.83470.82260.81020.80830.7905
LYMPHOMA0.95100.94980.80880.85540.97140.85560.91990.90930.98180.89150.9390.9041
LUNG-STD0.96680.96680.96140.97790.97780.95570.96160.97240.98890.97780.9890.9557
GLIOMA0.83770.82910.90730.78770.80590.84230.76590.78770.78090.77140.88050.8241
LEU0.98570.94380.90290.92860.92950.90290.87520.94290.90290.91620.91620.9571
LUNG0.91190.89120.92010.90220.93630.93470.93070.94020.94070.95050.93580.9157
MLL0.88620.960.90160.94160.87290.97330.94160.94370.97030.90260.94460.9457
PROSTATE0.88190.9010.9010.92190.88290.91190.93140.92050.90190.93140.88290.941
SRBCT0.85390.95220.97570.98750.95210.95150.97570.97570.97640.98750.97640.9513
ARCENE0.8050.82580.84380.83970.78590.83020.78010.81580.83490.82980.84040.7953
MADELON0.74360.51420.51880.86350.86190.87120.86850.86770.87310.85380.85460.8569
BREAST CANCER0.69270.64750.68620.64590.65430.6610.66820.64720.69190.640.67170.7031
OVARIAN CANCER0.98811.00.97220.98410.98810.99620.99620.97620.9960.992110.996
SIDO00.9740.990.56450.9520.960.94990.9440.960.9560.96590.9820.982
AVERAGE0.8454 0.8362 0.8170 0.8578 0.8497 0.8522 0.8513 0.8580 0.86490.8575 0.8676 0.8606
RANKS6.8684 7.0789 7.6579 6.2105 7.4211 7.0526 7.1316 6.6579 4.84216.0789 4.6579 6.3421
Table A6. Test results of OFS-KW versus k-nearest neighborhood.
Evaluation CriteriaFriedman Test
Compactness5.07 × 10−9
Running time5.47 × 10−10
Accuracy (KNN)0.6949
Accuracy (SVM)0.9884
Accuracy (RF)0.5388

Appendix B. The Results of OSFS-KW versus δ Neighborhood

Table A7. OFS-KW versus δ neighborhood (compactness).
Data SetOFS-KW r = 0.1 r = 0.15 r = 0.2 r = 0.25 r = 0.3 r = 0.35 r = 0.4 r = 0.45 r = 0.5
WDBC18161516161514141010
HILL51896813571013
HILL (NOISE)11181811817711818
COLON TUMOR2417924384770828172
DLBCL13121013141917141635
CAR35323237373832364147
LYMPHOMA756696881311
LUNG-STD1264798891419
GLIOMA13141310161715204337
LEU6651210910131216
LUNG16201323251826233339
MLL101111119811121617
PROSTATE25142724201910654142170
SRBCT1581411151718202224
ARCENE5130303345566359138139
MADELON281339354051615655
BREAST CANCER34614250494679558481
OVARIAN CANCER66875911119241364
SIDO01825222068968998107166
AVERAGE16.8917.21 15.84 18.95 22.95 26.21 33.68 37.63 57.21 70.16
RANKS3.713.74 3.26 4.21 4.89 5.39 5.74 6.87 8.13 9.05
Table A8. OFS-KW versus δ neighborhood (running time).
Data SetOFS-KW r = 0.1 r = 0.15 r = 0.2 r = 0.25 r = 0.3 r = 0.35 r = 0.4 r = 0.45 r = 0.5
WDBC1.79341.77351.91802.15492.05732.16591.94391.94392.21822.1838
HILL5.38307.57297.11065.69357.15777.26136.159210.32839.78087.6833
HILL (NOISE)3.93424.79004.83394.81704.64664.85164.73994.89344.90305.0067
COLON TUMOR2.06763.78796.10313.45233.46331.83282.07351.86381.93341.8536
DLBCL10.54989.49049.273610.98968.953012.019711.670714.159213.443011.0618
CAR200.672682.088074.8928100.9478134.9345170.0618232.9729230.3482231.4008263.2982
LYMPHOMA12.427815.399015.583215.840918.800122.4250121.237923.150517.342117.3496
LUNG-STD55.996894.1136110.2884112.2191140.1124128.4688153.3367171.3579184.0983168.2461
GLIOMA4.81915.74258.70118.29087.38628.86519.755010.093210.35507.9124
LEU4.63645.18406.59447.01137.10388.16518.420810.270011.888213.2705
LUNG  31.9804  74.043  68.184  48.724  61.818  83.154  113.622  85.821  105.910  122.239
MLL11.264218.410019.174316.9152022.466424.599629.918825.833431.073522.0068
PROSTATE24.236432.840225.798429.971939.519540.478421.456753.014116.690923.4978
SRBCT3.44753.65176.462888.96068.809610.626411.569713.786214.185915.2989
ARCENE87.9552119.1581101.0211107.7780167.190191.7981316.8370226.6963143.9225393.4693
MADELON538.46901218.59291056.91291040.2067487.8704551.4037581.3661666.5058692.4043520.1171
BREAST CANCER114.8965133.9355126.5506130.3485130.3591126.8821139.4627128.0552131.3735131.5025
OVARIAN CANCER80.8324100.160582.1879282.4396783.0164278.1723697.85803118.5893107.6156101.5973
SIDO064.2822118.2797122.3051130.559286.0615106.537992.595374.262873.642487.7044
AVERAGE  66.30  107.84  97.57  98.28  74.83  83.15  97.74  98.47  94.96  100.81
RANKS  2.00  4.68  4.32  4.79  4.74  5.89  6.61  7.50  7.42  7.05
Table A9. Predictive accuracy of OSFS-KW versus δ neighborhood (KNN).
Data Set  OSFS-KW  r = 0.1  r = 0.15  r = 0.2  r = 0.25  r = 0.3  r = 0.35  r = 0.4  r = 0.45  r = 0.5
WDBC0.97790.97020.97190.96840.95250.93860.96670.97010.97010.9701
HILL0.57100.55450.55280.55110.55440.56270.56100.55430.55930.5692
HILL (NOISE)0.55940.55750.54750.54620.52810.54610.53800.55450.55610.5430
COLON TUMOR0.88330.80510.83850.82310.90380.83970.88720.8410.85640.8718
DLBCL0.92230.92320.91160.88750.84470.92410.92330.950.950.95
CAR0.88990.87630.88210.88250.82040.87260.84220.85090.8520.8535
LYMPHOMA0.9701.01.00.95260.95260.95261.01.00.98181.0
LUNG-STD0.95030.99461.00.98891.00.99441.00.98891.00.9944
GLIOMA0.86550.91860.88050.90550.72950.79410.85860.86730.84410.8241
LEU1.00.95810.97140.98570.95810.98670.97240.97240.98570.9581
LUNG0.95550.950.95020.91070.94070.96070.93510.97490.94520.9702
MLL1.00.97130.94460.92920.97330.94670.92841.00.97240.9857
PROSTATE0.93140.92190.79480.82380.89290.95190.87240.93190.9410.9314
SRBCT0.94040.88950.87840.91620.91870.96460.94030.96380.9660.9653
ARCENE0.90560.83540.87080.83480.87470.88040.84490.84530.84480.8841
MADELON0.88850.51920.52650.77540.59770.61770.72190.710.72850.5992
BREAST CANCER0.73070.70260.68180.70260.68490.68560.68520.7270.68790.7026
OVARIAN CANCER0.9961.01.00.9960.9961.01.01.00.9960.992
SIDO00.9480.960.960.9880.94210.9620.93610.9260.9160.9201
AVERAGE  0.89  0.86  0.85  0.86  0.85  0.86  0.86  0.88  0.87  0.87
RANKS  3.55  5.55  6.00  6.66  7.26  5.13  6.11  4.61  5.03  5.11
Table A10. Predictive accuracy of OSFS-KW versus δ neighborhood (SVM).
Data Set  OSFS-KW  r = 0.1  r = 0.15  r = 0.2  r = 0.25  r = 0.3  r = 0.35  r = 0.4  r = 0.45  r = 0.5
WDBC0.97540.97720.97900.97890.97710.97540.97550.97370.97370.9737
HILL0.51810.52140.49830.50660.51320.51480.50490.50990.51980.5000
HILL (NOISE)0.52310.52980.52810.52310.53140.52480.52650.52310.51160.5198
COLON TUMOR0.91790.83850.82310.85510.90130.87050.85380.86920.85380.8692
DLBCL0.97320.92330.950.91250.88570.9250.98750.950.98750.9625
CAR0.95340.92120.92380.90980.89470.89610.91360.89660.94170.9346
LYMPHOMA0.98580.9857 1.00.93590.9560.95320.9691.00.98571.0
LUNG-STD0.97780.99461.01.01.00.99441.01.01.01.0
GLIOMA0.92550.88230.86230.92050.92730.86230.87360.86410.82410.8605
LEU0.98570.95810.98670.95710.97140.98670.97240.97240.97140.9295
LUNG0.97510.95490.990.95550.94060.95020.950.97550.96030.9504
MLL0.98670.95790.95790.95790.95791.01.01.01.01.0
PROSTATE0.94140.95050.85380.88240.92140.95140.92140.93140.9410.941
SRBCT0.95190.91610.97570.9750.95350.98890.9271.00.94170.9425
ARCENE0.91020.91510.87480.86990.91010.90020.90010.88490.88030.8248
MADELON0.88720.49310.52580.78460.62650.65420.73420.73770.73350.6246
BREAST CANCER0.70910.67510.73080.69590.67780.69210.67460.72710.70570.6883
OVARIAN CANCER0.9921.01.01.01.01.01.01.01.00.996
SIDO00.9720.96790.9680.990.9760.9840.9820.9860.960.9699
AVERAGE  0.8980  0.8612  0.8646  0.8742  0.8696  0.8750  0.8772  0.8843  0.8785  0.8678
RANKS  4.2368  5.9737  5.3421  5.8947  5.7895  5.2105  5.7105  4.5263  5.6842  6.6316
Table A11. Predictive accuracy of OSFS-KW versus δ neighborhood (RF).
Data Set  OSFS-KW  r = 0.1  r = 0.15  r = 0.2  r = 0.25  r = 0.3  r = 0.35  r = 0.4  r = 0.45  r = 0.5
WDBC0.97790.95440.95970.95260.93860.96680.94390.95440.95970.9509
HILL0.54610.55120.51490.53290.52970.58410.54610.58080.53790.5725
HILL (NOISE)0.56420.53800.56090.54800.49980.52640.52790.55440.54130.5527
COLON TUMOR0.82050.83590.85130.7410.73850.82180.80380.81920.83460.8526
DLBCL0.84300.86160.91160.88480.85800.88750.91250.89730.92320.9125
CAR0.83270.790170.76470.79540.71760.76750.81140.72080.81350.7131
LYMPHOMA0.95100.96750.89260.85780.90970.87160.92860.91750.84520.8671
LUNG-STD0.96680.95590.97250.97220.98890.98360.97780.98330.99440.9833
GLIOMA0.83770.69770.57140.74410.78450.720.73590.63950.68140.6077
LEU0.98570.90290.91710.83330.88760.9590.91620.90380.9590.901
LUNG0.91190.94080.88730.89140.92640.89580.90210.88780.9160.9011
MLL0.88620.86050.91590.97130.94460.91790.9170.94570.90670.9138
PROSTATE0.88190.88240.77380.78430.85240.85330.81520.89140.88240.8819
SRBCT0.85390.86580.89470.91510.90420.90330.90550.92980.90280.8943
ARCENE0.8050.79570.76040.81540.84980.79580.75460.72620.79510.736
MADELON0.74360.51080.51040.80150.62460.67230.77190.78190.790.6535
BREAST CANCER0.69270.65040.66810.66060.66380.67840.59830.64620.66420.7062
OVARIAN CANCER0.98810.97620.97630.99610.99221.00.99210.98850.9920.9725
SIDO00.9740.9740.9620.9820.9660.9840.9680.9540.9420.948
AVERAGE  0.8454  0.8164  0.8035  0.8253  0.8198  0.8310  0.8278  0.8275  0.8359  0.8169
RANKS  4.5000  5.9737  6.5526  5.4737  6.0526  4.3947  5.5789  5.3684  4.8158  6.2895
Table A12. Comparison results of OSFS-KW versus δ neighborhood.
Evaluation Criteria    Friedman Test (p-Values)
Compactness            2.92 × 10−15
Running time           3.65 × 10−10
Accuracy (KNN)         0.0275
Accuracy (SVM)         0.7815
Accuracy (RF)          0.6683

Appendix C. The Results of Three Different Feature Stream Orders

Table A13. Comparison results of the three feature stream orders.
Original  Inverse  Random
Compactness     0.1027  0.1027
Running time    0.4562  0.0654
Accuracy (KNN)  0.4563  0.0631
Accuracy (SVM)  0.0554  0.0555
Accuracy (RF)   0.0554  0.1027
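Appendix C compares three arrival orders of the same feature stream (original, inverse, and random); none of the p-values in Table A13 falls below 0.05, suggesting that OSFS-KW is largely insensitive to the order in which features arrive. A minimal sketch of how such orders can be generated for a streaming experiment is given below; `osfs_kw_update`, standing in for one incremental step of the selector, is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def stream_features(X, order="original"):
    """Yield (feature_index, feature_column) in the requested arrival order."""
    n_features = X.shape[1]
    if order == "original":
        idx = np.arange(n_features)
    elif order == "inverse":
        idx = np.arange(n_features)[::-1]
    elif order == "random":
        idx = rng.permutation(n_features)
    else:
        raise ValueError(f"unknown order: {order}")
    for j in idx:
        yield j, X[:, j]

# Hypothetical driver: feed one feature at a time to an online selector.
# selected = []
# for j, column in stream_features(X, order="random"):
#     selected = osfs_kw_update(selected, j, column, y)   # osfs_kw_update is hypothetical
```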

Appendix D. OSFS-KW versus Traditional Feature Selection Methods

Table A14. Prediction accuracy of OSFS-KW versus traditional feature selection methods (KNN).
Data Set  OSFS-KW  Fisher  SPEC  PCC  ReliefF  MI  Laplacian  UFSOL  ILFS  Lasso  FCBF  CFS
WDBC0.97790.97020.94720.93130.97010.95260.97370.96490.96140.95070.95770.9701
HILL0.57100.52490.52140.56100.54950.53790.53790.55120.54120.57590.53970.5297
HILL (NOISE)0.55940.50000.53450.51160.54950.54290.54110.53940.54460.52480.51660.5560
COLON TUMOR0.88330.85380.77690.82180.85380.82440.72820.66540.61280.78850.90260.8859
DLBCL0.92230.91160.75380.90800.950.93750.88300.75460.72700.97320.951.0
CAR0.88990.743880.21800.80180.92770.85210.56210.53350.54000.80340.91390.8350
LYMPHOMA0.9700.98570.68870.95210.91081.00.93030.81940.96520.98571.00.8578
LUNG-STD0.95030.96670.95560.98890.98350.98890.99440.9060.98330.99440.98891.0
GLIOMA0.86550.76230.27910.36680.75910.68820.60140.50680.82590.660.88050.7827
LEU1.00.97140.70860.97140.94380.92860.95810.66950.63620.98570.95710.9162
LUNG0.95550.85220.77890.88150.88150.92540.82810.80310.90120.89130.95530.8861
MLL1.00.95790.36170.89030.97130.85960.93130.66190.95790.90160.98570.9303
PROSTATE0.93140.9310.510.89240.9410.95050.6390.64810.5010.9610.95050.9324
SRBCT0.94041.00.70740.85930.90110.7830.68790.69480.87910.90550.98750.966
ARCENE0.90560.74990.560.71970.78080.55490.76490.830.65030.70570.84520
MADELON0.88850.57190.51850.57270.59460.64040.51190.63350.50190.56040.58040.5638
BREAST CANCER0.73070.72710.67810.68180.70230.69940.65730.63650.6680.81760.72010.6648
OVARIAN CANCER0.9960.97220.82690.95280.97220.97220.74660.67240.64390.94880.99210.9686
SIDO00.9480.990.9240.9360.980.9920.8820.8820.95810.930.9960.996
AVERAGE  0.8887  0.8391  0.6447  0.8001  0.8486  0.8227  0.7557  0.7038  0.7368  0.8350  0.8747  0.8022
RANKS  3.1053  5.8947  10.4737  7.4737  4.8947  5.7632  8.1842  9.2368  8.1842  5.5789  3.5789  5.6316
Table A15. Prediction accuracy of OSFS-KW versus traditional feature selection methods (SVM).
Data Set  OSFS-KW  Fisher  SPEC  PCC  ReliefF  MI  Laplacian  UFSOL  ILFS  Lasso  FCBF  CFS
WDBC0.97540.97890.95960.94890.97190.96480.97720.94910.97370.95780.96830.9719
HILL0.51810.51150.51320.51320.50330.50160.51320.51150.51320.51150.50990.5065
HILL (NOISE)0.52310.50830.52150.51320.53130.51980.51820.52640.50990.50820.49830.4852
COLON TUMOR0.91790.86920.70770.80510.83850.82440.71030.63210.64620.85260.88590.8846
DLBCL0.97320.89910.75380.9250.91250.96250.84830.76630.72520.96070.951.0
CAR0.95340.74890.22590.81920.92520.90660.59490.60290.56570.87220.92850.9095
LYMPHOMA0.98581.00.74340.93790.94181.00.91210.81940.98180.98571.00.8669
LUNG-STD0.97780.97780.97221.01.00.99440.98890.94441.01.00.99440.989
GLIOMA0.92550.74410.33860.49680.78270.71140.66140.49360.78770.70.88050.8291
LEU0.98570.97140.74760.95710.94380.94380.94480.65240.65240.98570.94380.9457
LUNG0.97510.82680.78320.88250.91070.93010.83260.79420.90120.9110.95040.9065
MLL0.98670.94260.38840.90370.94260.85960.960.6270.94260.88840.97240.917
PROSTATE0.94140.93050.59860.92140.9310.93140.69670.68760.50950.96050.9410.9514
SRBCT0.95190.98890.70230.94110.97640.92620.79040.81860.84440.96311.00.9771
ARCENE0.91020.740.57520.73990.83620.560.76480.79540.59570.95050.89460
MADELON0.88720.61770.54620.61770.62310.60880.48190.62690.51460.55380.62040.5712
BREAST CANCER0.70910.7270.67490.730.73670.6990.67490.67490.67490.85640.76550.6749
OVARIAN CANCER0.99230.97220.85010.94480.97220.97220.71530.66860.66770.95260.99210.9565
SIDO00.9720.9960.9320.9440.9920.9940.8820.8820.9740.94610.9920.992
AVERAGE  0.8980  0.8395  0.6597  0.8180  0.8564  0.8321  0.7615  0.7091  0.7358  0.8588  0.8783  0.8071
RANKS  2.8947  5.4737  10.0263  7.0263  4.9474  6.2895  8.0526  9.3684  8.1316  5.4737  4.0789  6.2368
Table A16. Prediction accuracy of OSFS-KW versus traditional feature selection methods (RF).
Data Set  OSFS-KW  Fisher  SPEC  PCC  ReliefF  MI  Laplacian  UFSOL  ILFS  Lasso  FCBF  CFS
WDBC0.97790.95790.94390.91740.96320.92640.94920.94390.94740.94210.95440.9438
HILL0.54610.52630.52150.51650.50160.49010.48830.53460.52480.54780.51970.5032
HILL (NOISE)0.56420.48670.51460.48340.51320.50810.52800.55260.49330.49330.46040.5131
COLON TUMOR0.82050.86790.74230.7410.78970.85380.66280.69490.51540.72180.88590.9013
DLBCL0.84300.88570.75380.91160.88660.93750.76900.70290.66270.83220.93750.9107
CAR0.83270.96750.69470.90150.93810.97140.88480.77290.93570.9340.97140.7753
LYMPHOMA0.95100.97790.96110.98330.97790.95020.97780.89490.98890.98890.98890.989
LUNG-STD0.96680.98890.98350.96670.97790.99440.97780.92250.98330.94490.97780.9889
GLIOMA0.83770.60140.32230.40820.70950.67320.53820.48360.67320.58820.94550.7427
LEU0.98570.94380.640.91620.92950.88670.94570.58380.60950.95710.94380.8876
LUNG0.91190.83140.77330.85820.88020.90580.78690.76240.84660.89210.92010.8483
MLL0.88620.94260.30630.86270.91180.8740.82270.59230.94260.89030.94570.9016
PROSTATE0.88190.91140.55860.90190.92140.88290.61760.57760.66710.90290.9410.9219
SRBCT0.85390.92840.74780.85540.89250.85680.5680.79210.69920.89730.98890.9165
ARCENE0.8050.67420.560.5950.78510.55490.75490.72820.56020.82020.87460
MADELON0.74360.52460.51350.52580.56960.61270.50770.60270.49190.53770.52960.5369
BREAST CANCER0.69270.66350.66470.68830.70910.68930.60410.63260.63680.75850.76590.6817
OVARIAN CANCER0.98810.96820.81070.94890.96820.96820.73130.6290.71520.96080.92510.8732
SIDO00.9740.9820.93410.93410.9860.9840.8740.87210.9620.92410.9880.9939
AVERAGE  0.8454  0.8226  0.6814  0.7851  0.8322  0.8169  0.7363  0.6987  0.7293  0.8176  0.8665  0.7805
RANKS  4.5789  5.2632  9.0526  7.7105  4.6579  5.8684  8.7632  9.3421  8.0263  5.7632  3.4737  5.5000
Table A17. Test results of OSFS-KW versus traditional feature selection methods.
Evaluation Criteria    Friedman Test (p-Values)
Accuracy (KNN)         1.20 × 10−15
Accuracy (SVM)         3.38 × 10−13
Accuracy (RF)          1.22 × 10−10

Appendix E. OSFS-KW versus OSFS Methods

Table A18. Compactness.
Data Sets  OSFS-KW  OFS-A3M  OSFS  Alpha-Investing  Fast-OSFS  SAOLA
WDBC18171019424
HILL51215578
HILL (NOISE)111223776
COLON TUMOR24267440
DLBCL13166118148
CAR353810301458
LYMPHOMA771318869
LUNG-STD1281777110
GLIOMA131815671224
LEU61411231478
LUNG1625353922
MLL10132312713
PROSTATE253061951
SRBCT151643280
ARCENE51555129100
MADELON2227140
BREAST CANCER3440113160
OVARIAN CANCER69106862
SIDO01849275137
AVERAGE  16.8947  21.4211  11.8947  25.7895  7.8947  114.7368
RANKS  3.5526  4.4737  3.0263  4.0789  2.5000  3.3684
Table A19. Running time (seconds).
Data Set  OSFS-KW  OFS-A3M  OSFS  Alpha-Investing  Fast-OSFS  SAOLA
WDBC1.79342.84270.87450.21170.42460.1438
HILL5.38306.69463.74730.01800.09110.0778
HILL (NOISE)3.93424.53082.37380.00520.46700.0750
COLON TUMOR2.06761.87490.92960.27780.74270.0916
DLBCL10.549817.14394.73190.87602.620111.1175
CAR200.672668.555213.59821.140515.57935.1563
LYMPHOMA12.42786.77901.78970.57565.71055.9966
LUNG-STD55.996854.686017.69932.299416.96190.1351
GLIOMA4.81916.29931.78510.40962.66539.6878
LEU4.63644.43161.27310.29260.39081.5318
LUNG31.980471.202613.66620.88231.24040.3039
MLL11.264212.54123.19750.57383.788113
PROSTATE24.23647.51233.55310.60091.55470.0987
SRBCT3.44754.31231.11550.16551.06740.0394
ARCENE87.9552153.508965.95775.14788.42690.5413
MADELON538.46901098.8598822.91960.085972.25860.1640
BREAST CANCER114.8965117.067671.16012.81907.8651260.3178
OVARIAN CANCER80.832481.210533.77455.02534.4316890.3477
SIDO064.282274.820820.66890.62731.98910.0982
AVERAGE  66.2971  94.4671  57.0956  1.1597  4.1198  2.5751
RANKS  5.1053  5.5789  3.5789  1.5789  2.8947  2.2632
Table A20. Predictive accuracy using KNN.
Data Sets  OSFS-KW  OFS-A3M  OSFS  Alpha-Investing  Fast-OSFS  SAOLA
WDBC0.97790.96850.94910.96320.97360.9666
HILL0.57100.55610.54460.56270.55610.5593
HILL (NOISE)0.55940.55940.53950.55770.56440.5510
COLON TUMOR0.88330.86920.88460.61540.88590
DLBCL0.92230.88400.98750.93750.9750.8072
CAR0.88990.867170.714850.73880.74960.8010
LYMPHOMA0.9700.98571.01.01.01.0
LUNG-STD0.95030.98330.99440.97240.99440
GLIOMA0.86550.84730.72950.61640.86730.8441
LEU1.00.9590.97240.90190.62670.9857
LUNG0.95550.9560.95620.95010.68560.725
MLL1.00.98670.97130.95710.8892
PROSTATE0.93140.93240.93240.93240.93240.9324
SRBCT0.94040.9660.9660.9660.9660
ARCENE0.90560.85520.87080.840.82980
MADELON0.88850.51960.51960.60230.58310
BREAST CANCER0.73070.73070.66820.72750.73040
OVARIAN CANCER0.9960.97221.00.99220.99220.8936
SIDO00.9480.9520.9420.9620.9940.958
AVERAGE  0.8887  0.8606  0.8496  0.8313  0.8372  0.5744
RANKS  2.7632  3.3947  3.5789  3.7632  2.8421  4.6579
Table A21. Predictive accuracy using SVM.
Data Set  OSFS-KW  OFS-A3M  OSFS  Alpha-Investing  Fast-OSFS  SAOLA
WDBC0.97540.97720.95260.98060.96830.9736
HILL0.51810.51480.51480.50980.50820.5412
HILL (NOISE)0.52310.52310.50650.49340.52480.5579
COLON TUMOR0.91790.87180.85260.64620.85260
DLBCL0.97320.96250.98750.93750.98750.8580
CAR0.95340.927370.75570.82120.798200.6453
LYMPHOMA0.98580.96751.00.98571.00.6792
LUNG-STD0.97780.99440.99440.99441.00
GLIOMA0.92550.96360.81270.65140.84910.4391
LEU0.98570.98670.95710.98570.65240.7505
LUNG0.97510.96030.96010.94520.68530.7151
MLL0.98670.98570.94260.98571.00.9016
PROSTATE0.94140.95140.95140.95140.95140.9514
SRBCT0.95190.97710.97710.97710.97710
ARCENE0.91020.88960.90490.89490.81970
MADELON0.88720.47580.47580.6150.62650
BREAST CANCER0.70910.70910.67490.7170.77560
OVARIAN CANCER0.9920.98421.00.99620.9960.8893
SIDO00.9720.9760.9120.990.9940.958
AVERAGE  0.8980  0.8736  0.8491  0.8462  0.8404  0.5190
RANKS  2.8158  3.0526  3.6316  3.3947  3.0526  5.0526
Table A22. Predictive accuracy using RF.
Data Set  OSFS-KW  OFS-A3M  OSFS  Alpha-Investing  Fast-OSFS  SAOLA
WDBC0.97790.95780.94030.95090.94910.9526
HILL0.54610.55770.52800.56750.55770.5610
HILL (NOISE)0.56420.58250.55940.53480.53470.5346
COLON TUMOR0.82050.75770.86790.58080.82180
DLBCL0.84300.92410.92410.88480.93570.8064
CAR0.83270.760510.685230.67780.719300.7486
LYMPHOMA0.95100.93660.98570.91880.90020.9197
LUNG-STD0.96680.93890.98330.98350.98890
GLIOMA0.83770.6550.57450.50860.77410.7191
LEU0.98570.9590.95810.860.58480.7895
LUNG0.91190.9010.94540.8760.66080.6553
MLL0.88620.82260.95590.90190.95790.917
PROSTATE0.88190.91140.94050.94050.91190.96
SRBCT0.85390.94240.90710.94250.93060
ARCENE0.8050.78490.830.79480.8150
MADELON0.74360.52040.51690.60620.57880
BREAST CANCER0.69270.65370.58020.6890.76880
OVARIAN CANCER0.98810.980.9920.98421.00.8973
SIDO00.9740.9680.96790.98790.9880.962
AVERAGE  0.8454  0.8165  0.8233  0.7995  0.8094  0.5486
RANKS  2.8947  3.5263  3.3158  3.5526  2.9737  4.7368
Table A23. Comparison results.
Evaluation Criteria    Friedman Test (p-Values)
Compactness            0.0248
Running time           3.62 × 10−23
Accuracy (KNN)         0.0337
Accuracy (SVM)         0.0032
Accuracy (RF)          0.0533

References

1. Zhou, H.; Wang, J.; Zhang, H. Stochastic multicriteria decision-making approach based on SMAA-ELECTRE with extended gray numbers. Int. Trans. Oper. Res. 2019, 26, 2032–2052.
2. Tian, Z.-P.; Wang, J.; Wang, J.-Q.; Chen, X.-H. Multicriteria decision-making approach based on gray linguistic weighted Bonferroni mean operator. Int. Trans. Oper. Res. 2018, 25, 1635–1658.
3. Tian, Z.-P.; Wang, J.; Wang, J.; Zhang, H.-Y. Simplified Neutrosophic Linguistic Multi-criteria Group Decision-Making Approach to Green Product Development. Group Decis. Negot. 2017, 26, 597–627.
4. Cang, S.; Yu, H. Mutual information based input feature selection for classification problems. Decis. Support Syst. 2012, 54, 691–698.
5. Wang, Z.; Zhang, Y.; Chen, Z.; Yang, H.; Sun, Y.; Kang, J.; Yang, Y.; Liang, X. Application of ReliefF algorithm to selecting feature sets for classification of high resolution remote sensing image. In Proceedings of the 2016 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Beijing, China, 10–15 July 2016; Institute of Electrical and Electronics Engineers (IEEE): Piscataway, NJ, USA, 2016; pp. 755–758.
6. Saqlain, S.M.; Sher, M.; Shah, F.A.; Khan, I.; Ashraf, M.U.; Awais, M.; Ghani, A. Fisher score and Matthews correlation coefficient-based feature subset selection for heart disease diagnosis using support vector machines. Knowl. Inf. Syst. 2018, 58, 139–167.
7. Benabdeslem, K.; Elghazel, H.; Hindawi, M. Ensemble constrained Laplacian score for efficient and robust semi-supervised feature selection. Knowl. Inf. Syst. 2015, 49, 1161–1185.
8. Tibshirani, R. Regression Shrinkage and Selection Via the Lasso. J. R. Stat. Soc. Ser. B (Statal Methodol.) 2011, 73, 273–282.
9. Kumar, V.; Minz, S. Multi-view ensemble learning: An optimal feature set partitioning for high-dimensional data classification. Knowl. Inf. Syst. 2015, 49, 1–59.
10. Wang, J.; Zhao, P.; Hoi, S.C.H.; Jin, R. Online Feature Selection and Its Applications. IEEE Trans. Knowl. Data Eng. 2013, 26, 698–710.
11. Glocer, K.; Eads, D.; Theiler, J. Online feature selection for pixel classification. In Proceedings of the 22nd International Conference on Software Engineering: ICSE 2000, the New Millennium, Limerick, Ireland, 4–11 June 2000; Association for Computing Machinery (ACM): New York, NY, USA, 2005; pp. 249–256.
12. Javidi, M.M.; Eskandari, S. Online streaming feature selection: A minimum redundancy, maximum significance approach. Pattern Anal. Appl. 2018, 22, 949–963.
13. Yu, K.; Wu, X.; Ding, W.; Pei, J. Scalable and Accurate Online Feature Selection for Big Data. ACM Trans. Knowl. Discov. Data 2016, 11, 1–39.
14. Eskandari, S.; Javidi, M.M. Online streaming feature selection using rough sets. Int. J. Approx. Reason. 2016, 69, 35–57.
15. Perkins, S.; Theiler, J. Online feature selection using grafting. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), Los Alamos, NM, USA, 21–24 August 2003; pp. 592–599.
16. Wu, X.; Yu, K.; Ding, W.; Wang, H.; Zhu, X. Online Feature Selection with Streaming Features. IEEE Trans. Pattern Anal. Mach. Intell. 2012, 35, 1178–1192.
17. Pawlak, Z. Rough sets. Int. J. Comput. Inf. Sci. 1982, 11, 341–356.
18. Yao, Y.; She, Y. Rough set models in multigranulation spaces. Inf. Sci. 2016, 327, 40–56.
19. Javidi, M.M.; Eskandari, S. Streamwise feature selection: A rough set method. Int. J. Mach. Learn. Cybern. 2016, 9, 667–676.
20. Kumar, S.U.; Inbarani, H.H. PSO-based feature selection and neighborhood rough set-based classification for BCI multiclass motor imagery task. Neural Comput. Appl. 2016, 28, 3239–3258.
21. Zhang, J.; Li, T.; Ruan, D.; Liu, D. Neighborhood rough sets for dynamic data mining. Int. J. Intell. Syst. 2012, 27, 317–342.
22. Zhou, P.; Hu, X.-G.; Li, P.; Wu, X. Online streaming feature selection using adapted Neighborhood Rough Set. Inf. Sci. 2019, 481, 258–279.
23. Lin, Y.; Li, J.; Lin, P.; Lin, G.; Chen, J. Feature selection via neighborhood multi-granulation fusion. Knowl. Based Syst. 2014, 67, 162–168.
24. Zhou, P.; Hu, X.; Li, P.; Wu, X. Online feature selection for high-dimensional class-imbalanced data. Knowl. Based Syst. 2017, 136, 187–199.
25. Pawlak, Z. Rough sets and intelligent data analysis. Inf. Sci. 2002, 147, 1–12.
26. Hu, Q.; Yu, D.; Liu, J.; Wu, C. Neighborhood rough set based heterogeneous feature subset selection. Inf. Sci. 2008, 178, 3577–3594.
27. Mac Parthalain, N.; Shen, Q.; Jensen, R. A Distance Measure Approach to Exploring the Rough Set Boundary Region for Attribute Reduction. IEEE Trans. Knowl. Data Eng. 2009, 22, 305–317.
28. Maciá-Pérez, F.; Berna-Martinez, J.V.; Oliva, A.F.; Ortega, M.A.A. Algorithm for the detection of outliers based on the theory of rough sets. Decis. Support Syst. 2015, 75, 63–75.
29. Peng, H.; Long, F.; Ding, C. Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans. Pattern Anal. Mach. Intell. 2005, 27, 1226–1238.
30. Wang, Y.; Klijn, J.G.; Zhang, Y.A.; Sieuwerts, M.; Look, M.P.; Yang, F.; Talantov, D.; Timmermans, M.; Meijer-van Gelder, M.E.; Yu, J. Gene-expression profiles to predict distant metastasis of lymph-node-negative primary breast cancer. Lancet 2005, 365, 671–679.
31. Rosenwald, A.; Wright, G.; Chan, W.C.; Connors, J.M.; Campo, E.; Fisher, R.I.; Gascoyne, R.D.; Müller-Hermelink, H.K.; Smeland, E.B.; Giltnane, J.M.; et al. The Use of Molecular Profiling to Predict Survival after Chemotherapy for Diffuse Large-B-Cell Lymphoma. N. Engl. J. Med. 2002, 346, 1937–1947.
32. Yu, L.; Ding, C.; Loscalzo, S. Stable feature selection via dense feature groups. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '08), Las Vegas, NV, USA, 24–27 August 2008; Association for Computing Machinery (ACM): New York, NY, USA, 2008.
33. Yang, K.; Cai, Z.; Li, J.; Lin, G. A stable gene selection in microarray data analysis. BMC Bioinform. 2006, 7, 228.
34. Richardson, A.M. Nonparametric Statistics for Non-Statisticians: A Step-by-Step Approach by Gregory W. Corder, Dale I. Foreman. Int. Stat. Rev. 2010, 78, 451–452.
35. Demšar, J. Statistical comparisons of classifiers over multiple data sets. J. Mach. Learn. Res. 2006, 7, 1–30.
36. Gu, Q.; Li, Z.; Han, J. Generalized Fisher Score for Feature Selection. arXiv 2012, arXiv:1202.3725.
37. Zhao, Z.; Liu, H. Spectral feature selection for supervised and unsupervised learning. In Proceedings of the 24th International Conference on Real-Time Networks and Systems (RTNS '16), Brest, France, 19–21 October 2016; Association for Computing Machinery (ACM): New York, NY, USA, 2017; pp. 1151–1157.
38. Wasikowski, M.; Chen, X.-W. Combating the Small Sample Class Imbalance Problem Using Feature Selection. IEEE Trans. Knowl. Data Eng. 2009, 22, 1388–1400.
39. Robnik-Šikonja, M.; Kononenko, I. Theoretical and Empirical Analysis of ReliefF and RReliefF. Mach. Learn. 2003, 53, 23–69.
40. Guo, J.; Guo, Y.; Kong, X.; He, R.; Quo, Y. Unsupervised feature selection with ordinal locality. In Proceedings of the 2017 IEEE International Conference on Multimedia and Expo (ICME), Hong Kong, China, 10–14 July 2017; pp. 1213–1218.
41. Li, F.; Miao, D.; Pedrycz, W. Granular multi-label feature selection based on mutual information. Pattern Recognit. 2017, 67, 410–423.
42. Roffo, G.; Melzi, S.; Castellani, U.; Vinciarelli, A. Infinite Latent Feature Selection: A Probabilistic Latent Graph-Based Ranking Approach. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 1407–1415.
43. Fonti, V.; Belitser, E. Feature selection using lasso. VU Amst. Res. Pap. Bus. Anal. 2017, 30, 1–25.
44. Yu, L.; Liu, H. Feature selection for high-dimensional data: A fast correlation-based filter solution. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), Washington, DC, USA, 21–24 August 2003; pp. 856–863.
45. Chutia, D.; Bhattacharyya, D.K.; Sarma, J.; Raju, P.N.L. An effective ensemble classification framework using random forests and a correlation based feature selection technique. Trans. GIS 2017, 21, 1165–1178.
46. Zhou, J.; Foster, D.P.; Stine, R.A.; Ungar, L.H. Streamwise feature selection. J. Mach. Learn. Res. 2006, 7, 1861–1885.
47. Wu, X.; Yu, K.; Wang, H.; Ding, W. Online streaming feature selection. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), Haifa, Israel, 21–24 June 2010; pp. 1159–1166.
48. Yu, K.; Ding, W.; Wu, X. LOFS: A library of online streaming feature selection. Knowl. Based Syst. 2016, 113, 1–3.
Figure 1. Two kinds of neighborhood relations.
Figure 2. Distribution of the two class examples.
Figure 3. Compactness of the three feature stream orders.
Figure 4. Prediction accuracy of the three feature streaming orders (KNN).
Figure 5. Prediction accuracy of the three different feature streaming orders (SVM).
Figure 6. Prediction accuracy of the three different feature streaming orders (RF).
Table 1. Time complexity of the online streaming feature selection (OSFS)-KW framework.
Description               Algorithm    Line     Complexity
Compute the distance      1            6        O(n)
Sort all the neighbors    1            7        O(n log n)
Repeat loop               1            5–13     O(n² log n)
Compare the dependency    3            10       O(n² log n)
Compare the dependency    3            14–24    O(card(B)·n² log n)
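Table 1 attributes the dominant cost to computing all pairwise distances and sorting them once per sample, which gives O(n² log n) over the whole loop before the dependency comparisons. The sketch below illustrates that per-sample step for a combined k/δ neighborhood; the specific combination rule shown (intersection of the k nearest neighbors with the δ ball, on features scaled to [0, 1]) is only an assumed reading of the paper's combined relation and may differ from its exact definition.

```python
import numpy as np

def combined_neighborhood(X, i, k=5, delta=0.2):
    """Indices in an assumed combined k/delta neighborhood of sample i."""
    dist = np.linalg.norm(X - X[i], axis=1)        # O(n) distance computations (Table 1)
    order = np.argsort(dist)                       # O(n log n) sort of all neighbors (Table 1)
    k_neighbors = set(order[1:k + 1])              # k nearest, excluding the sample itself
    delta_neighbors = set(np.where(dist <= delta)[0]) - {i}
    return k_neighbors & delta_neighbors           # illustrative combination rule

# Repeating this for every sample yields the O(n^2 log n) loop of Table 1.
X = np.random.rand(100, 8)                         # toy data, features already scaled to [0, 1]
neighborhoods = [combined_neighborhood(X, i) for i in range(len(X))]
```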
Table 2. Nineteen experimental datasets.
Dataset            Instances    Features    Classes
WDBC               569          30          2
HILL               606          100         2
HILL (NOISE)       606          100         2
COLON TUMOR        60           2000        2
DLBCL              77           6285        2
CAR                174          9182        11
LYMPHOMA           62           4026        3
LUNG-STD           181          5000        2
GLIOMA             50           4433        4
LEU                72           7129        3
LUNG               203          3312        2
MLL                72           5848        3
PROSTATE           102          6033        2
SRBCT              83           2308        4
ARCENE             200          10,000      2
MADELON            500          2600        2
BREAST CANCER      286          17,816      2
OVARIAN CANCER     253          15,154      2
SIDO0              500          999         2
Table 3. The search ranges of all adjustable parameters. RF: random forest.
Classifiers    Parameters                                                 Search Ranges
KNN            The number of neighbors                                    (3, 12)
SVM            The kernel coefficient σ for the radial basis function     (0.1, 3)
               Penalty parameter                                          (0.1, 3)
RF             The number of trees in the forest                          (2, 15)
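Table 3 lists the ranges over which the three classifiers' hyperparameters are searched. A minimal scikit-learn sketch of such a search on a selected feature subset is given below; mapping the RBF coefficient σ to scikit-learn's gamma via gamma = 1/(2σ²) is our assumption, since the paper does not state which parameterization it used.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

sigmas = np.linspace(0.1, 3, 10)                       # sigma in (0.1, 3)

search_spaces = {
    "KNN": (KNeighborsClassifier(),
            {"n_neighbors": list(range(3, 13))}),      # (3, 12)
    "SVM": (SVC(kernel="rbf"),
            {"gamma": 1.0 / (2 * sigmas ** 2),         # assumed mapping from sigma
             "C": np.linspace(0.1, 3, 10)}),           # penalty parameter in (0.1, 3)
    "RF":  (RandomForestClassifier(random_state=0),
            {"n_estimators": list(range(2, 16))}),     # (2, 15)
}

def tune(X, y):
    """Grid-search each classifier on a selected feature subset X with labels y."""
    best = {}
    for name, (estimator, grid) in search_spaces.items():
        search = GridSearchCV(estimator, grid, cv=5, scoring="accuracy")
        search.fit(X, y)
        best[name] = (search.best_params_, search.best_score_)
    return best
```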
