New Online Streaming Feature Selection Based on Neighborhood Rough Set for Medical Data

Abstract: Not all features in many real-world applications, such as medical diagnosis and fraud detection, are available from the start; they are generated and flow in individually over time. Online streaming feature selection (OSFS) has recently attracted much attention due to its ability to select the best feature subset with growing features. Rough set theory is widely used as an effective tool for feature selection, specifically the neighborhood rough set. However, the two main neighborhood relations, namely the k-nearest neighborhood and the δ neighborhood, cannot efficiently deal with the uneven distribution of data, and the traditional method of dependency calculation does not take into account the structure of the neighborhood covering. In this study, a novel neighborhood relation combining the k-nearest neighborhood and δ neighborhood relations is first defined. Then, we propose a weighted dependency degree computation method that considers the structure of the neighborhood relation. In addition, we propose a new OSFS approach named OSFS-KW that addresses the challenge of learning from class-imbalanced data. OSFS-KW has no adjustable parameters and no pretraining requirements. The experimental results on 19 datasets demonstrate that OSFS-KW not only outperforms traditional methods but also exceeds state-of-the-art OSFS approaches.

Author Contributions: formal analysis, D.L., P.L. and J.H.; investigation, Y.Y.; resources, Y.Y.; data curation, D.L. and P.L.; writing (original draft preparation), D.L. and Y.Y.; writing (review and editing), P.L. and J.H.; visualization, D.L.; supervision, J.H.; project administration, D.L., P.L., J.H. and Y.Y.; and funding acquisition, J.H.


Introduction
The number of features increases with the growth of data. A large feature space can provide much information that is useful for decision-making [1][2][3], but such a feature space also includes many irrelevant or redundant features that are useless for a given concept. It is necessary to remove the irrelevant features so that the curse of dimensionality can be relieved, which motivates research into feature selection methods. Feature selection, as a significant preprocessing step of data mining, can select a small subset that includes the most significant and discriminative condition features [4]. Traditional methods are developed based on the assumption that all features are available in advance. Many typical approaches exist, such as ReliefF [5], Fisher Score [6], mutual information (MI) [4], Laplacian Score [7], LASSO [8], and so on [9]. The main benefits of feature selection include speeding up model training, avoiding overfitting, and reducing the impact of dimensionality during the process of data analysis [4].
However, features in many real-world applications are generated individually, one by one, over time. Traditional feature selection can no longer meet the required efficiency with the growing volume of features. For example, in the medical field, a doctor cannot easily obtain all the features of a patient at once. In bioinformatics and clinical medicine, acquiring the entire feature space is expensive and inefficient because of high-cost laboratory experiments [10]. In addition, for the task of medical image segmentation, acquiring the entire feature set is infeasible due to the infinite number of possible filters [11]. Furthermore, a patient's symptoms change continually over time during treatment, and judging whether a newly emerged feature contains useful information is essential for identifying the patient's disease [12]. In these cases, waiting a long time until the entire feature set is available and only then performing feature selection is impractical.
Online streaming feature selection (OSFS), which offers a feasible way to handle streaming features in an online manner, has recently attracted wide attention [13]. An OSFS method must meet the following three criteria [14]: (1) not all features are available in advance, (2) an efficient incremental updating process for the selected features is essential, and (3) accuracy at each timestamp is vital.
Many previous studies have proposed different OSFS methods. For example, the grafting algorithm [15] employed a stagewise gradient descent approach to feature selection, during which a conjugate gradient procedure was used to tune its parameters. However, like the grafting algorithm, both Fast-OSFS [16] and the scalable and accurate online approach (SAOLA) [13] need some parameters to be specified, which requires domain information in advance. Rough set (RS) theory [17], an effective mathematical tool for feature selection, rule extraction, and knowledge acquisition [18], needs no domain knowledge other than the given datasets [19]. In the real world, we usually encounter many numerical features in datasets, such as medical datasets. Under this circumstance, the neighborhood rough set is feasible for analyzing both discrete and continuous data [20,21]. Nevertheless, all these methods have some adjustable parameters. Considering that selecting unified and optimal parameter values for all different datasets is unrealistic [22], an OSFS method based on an adapted neighborhood rough set was proposed, in which the number of neighbors for each object is determined by its surrounding instance distribution [22]. Furthermore, from the view of multi-granulation, multi-granulation rough sets have been used to compute the neighborhoods of each sample and to extract the neighborhood size [23]. For the above OSFS methods based on neighborhood relations, dependency degree calculation is a key step; however, very little work has considered the neighborhood structure from the granulation view during this calculation. In addition, the uneven distribution of some data, including medical data, is a common phenomenon, and few works focus on this challenge.
In this paper, focusing on the strengths and weaknesses of the neighborhood rough set, we propose a novel neighborhood relation. Further, a weighted dependency degree is developed by considering the neighborhood structure of each object. Finally, our approach, named OSFS-KW, is established. Our contributions are as follows: (1) We propose a novel neighborhood relation and, on this basis, develop a weighted dependency computation method. (2) We develop an OSFS framework, named OSFS-KW, which can select a small subset made up of the most significant and discriminative features. (3) OSFS-KW builds on [24] and can deal with the class imbalance problem. (4) The results indicate that OSFS-KW not only obtains better performance than traditional feature selection methods but also outperforms state-of-the-art OSFS methods.
The remainder of the paper is organized as follows. In Section 2, we briefly review the main concepts of neighborhood RS theory. Section 3 discusses our new neighborhood relations and proposes the OSFS-KW. Then, Section 4 performs some experiments and discusses the experimental results. Finally, Section 5 concludes the paper.

Background
The neighborhood RS has been proposed to deal with numerical or heterogeneous data. In general, a decision table (DT) for a classification problem can be represented as $DT = \langle U, C \cup D \rangle$, where $U$ is the nonempty set of samples (the universe), $C$ is the set of condition attributes, and $D$ is the set of decision attributes. There are two main kinds of neighborhood relations: (1) the k-nearest neighborhood relation, shown in Figure 1a, and (2) the δ neighborhood relation, shown in Figure 1b.

Definition 1 [26]. Given DT, a metric $\Delta$ is a distance function, and $\Delta(x, y)$ represents the distance between $x$ and $y$. Then, for all $x, y, z \in U$, it must satisfy the following: (1) $\Delta(x, y) \geq 0$, with $\Delta(x, y) = 0$ if and only if $x = y$; (2) $\Delta(x, y) = \Delta(y, x)$; and (3) $\Delta(x, z) \leq \Delta(x, y) + \Delta(y, z)$.
Definition 2 (δ neighborhood [22]). Given DT, an attribute subset $B \subseteq C$, and a sample $x \in U$, the δ neighborhood of $x$ is defined as $\delta_B(x) = \{ y \in U \mid \Delta_B(x, y) \leq \delta \}$, where $\delta \geq 0$ is the neighborhood radius.

Definition 3 (k-nearest neighborhood). Given DT, $B \subseteq C$, and $x \in U$, the k-nearest neighborhood $N_B^k(x)$ consists of the $k$ samples of $U$ closest to $x$ under $\Delta_B$.

Then, the concepts of the lower and upper approximations of these two neighborhood relations are defined as follows.

Definition 4. Given DT, for any $X \subseteq U$, two subsets of objects, called the lower and upper approximations of $X$ with regard to the δ neighborhood relation, are defined as follows [27]:
$\underline{N}_B X = \{ x \in U \mid \delta_B(x) \subseteq X \}$, $\overline{N}_B X = \{ x \in U \mid \delta_B(x) \cap X \neq \emptyset \}$.
If $x \in \underline{N}_B X$, then $x$ certainly belongs to $X$, but if $x \in \overline{N}_B X$, then it may or may not belong to $X$.

Definition 5. Given DT, for any $X \subseteq U$, the lower and upper approximations concerning the k-nearest neighborhood relation are defined as [24]:
$\underline{N}_B^k X = \{ x \in U \mid N_B^k(x) \subseteq X \}$, $\overline{N}_B^k X = \{ x \in U \mid N_B^k(x) \cap X \neq \emptyset \}$.

Figure 1a shows that the k-nearest neighbor (k = 4) samples of $x_1$, $x_2$, and $x_3$ have different class labels. In detail, the k-nearest neighborhood samples of $x_1$ come from classes $C_2$ and $C_3$; the k-nearest neighborhood samples of $x_2$ come from classes $C_1$, $C_2$, and $C_3$; and the k-nearest neighbor samples of $x_3$ come from classes $C_1$ and $C_2$. Figure 1b depicts that all the δ neighbors of $x_1$, $x_2$, and $x_3$ also come from different class labels. We call samples such as $x_1$, $x_2$, and $x_3$ boundary objects. The size of the boundary area increases the uncertainty in DT, because it reflects the roughness of $X$ in the approximation space.
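To make the two classical relations concrete, the following minimal Python sketch (our illustration, not the authors' implementation) computes both neighborhoods of a sample; the plain Euclidean metric and the toy data are assumptions for demonstration only.

```python
# Illustrative only: the two classical neighborhoods of Definitions 2-5.
import numpy as np

def knn_neighborhood(X, i, k):
    """Indices of the k nearest neighbors of sample i (Definitions 3 and 5)."""
    d = np.linalg.norm(X - X[i], axis=1)
    d[i] = np.inf                       # exclude the sample itself
    return np.argsort(d)[:k]

def delta_neighborhood(X, i, delta):
    """Indices of all samples within distance delta of sample i (Definition 2)."""
    d = np.linalg.norm(X - X[i], axis=1)
    return np.flatnonzero((d <= delta) & (np.arange(len(X)) != i))

X = np.array([[0.10, 0.20], [0.15, 0.22], [0.90, 0.80], [0.85, 0.75]])
print(knn_neighborhood(X, 0, k=2))      # -> [1 3]
print(delta_neighborhood(X, 0, 0.30))   # -> [1]
```

As the toy output shows, the two relations can return very different neighbor sets for the same sample, which is exactly the source of the parameter-sensitivity discussed below.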
Based on the lower and upper approximations, the object space can be partitioned into positive, boundary, and negative regions [28], which are defined, respectively, as:
$POS_B(X) = \underline{N}_B X$, $BND_B(X) = \overline{N}_B X - \underline{N}_B X$, $NEG_B(X) = U - \overline{N}_B X$.

In data analysis, computing dependencies between attributes is an important issue. We give the definition of the dependency degree as follows.

Definition 6. Given DT, for any $B \subseteq C$, the dependency degree of $B$ to the decision attribute set $D$ is defined as [22]
$\gamma_B(D) = \frac{|POS_B(D)|}{|U|}$,
where $POS_B(D)$ is the union of the lower approximations of all decision classes. The aim of feature selection is to select a subset $B$ from $C$ that gains the maximal dependency degree [22]:
$B = \arg\max_{B \subseteq C} \gamma_B(D)$.

In real applications, specifically in the medical field, the instances are often unevenly distributed in the feature space; that is, the distribution around some sample points is sparse, while the distribution around others is tight. Neither the k-nearest neighborhood relation nor the δ neighborhood relation can portray sample category information well, since settings of parameters such as δ and k can hardly suit both sparse and tight distributions. For example, suppose the feature space has two classes, as shown in Figure 2, red and green, marked with pentacle and hexagon symbols, respectively. Around sample point $x_1$, the sample distribution is sparse, and the three points nearest to $x_1$ all have a different class from $x_1$; if the k-nearest neighborhood relation (k = 3) is applied, $x_1$ will be misclassified. However, if we employ the δ neighborhood relation, the category of $x_1$ is consistent with that of most samples in its neighborhood. On the other hand, the sample distribution around point $x_2$ is tight, two classes of samples are included in its δ neighborhood (denoted by the red circle), and $x_2$ will be misclassified when the δ neighborhood relation is applied. In fact, if the k-nearest neighborhood relation (k = 3) is applied, $x_2$ will be classified correctly.
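Definition 6 can be read directly as code. The sketch below (ours, for illustration) counts the samples whose whole δ neighborhood agrees with their own label, i.e., the positive region, and divides by |U|; the Euclidean metric and the treatment of empty neighborhoods are assumptions.

```python
# Dependency degree of Definition 6 under the delta-neighborhood relation:
# gamma = |POS| / |U|, where POS holds samples whose neighborhood is label-pure.
import numpy as np

def dependency_degree(X, y, delta):
    n = len(X)
    pos = 0
    for i in range(n):
        d = np.linalg.norm(X - X[i], axis=1)
        nbrs = np.flatnonzero((d <= delta) & (np.arange(n) != i))
        # lower-approximation test: every neighbor shares the label of x_i
        if nbrs.size and np.all(y[nbrs] == y[i]):
            pos += 1
    return pos / n
```

Feature selection then amounts to searching for the subset of columns of X that maximizes this quantity.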
Therefore, in Section 3, we propose a novel neighborhood rough set combining the advantages of the k-nearest neighborhood rough set and the δ neighborhood rough set.

Method
In this section, we first introduce a definition of OSFS. Then, we propose a new neighborhood relation and a weighted dependency degree approach. Based on three evaluation criteria, namely maximal dependency, maximal relevance, and maximal significance, our new method OSFS-KW is finally presented.

Problem Statement
Definition 7 (OSFS). Given a feature space whose features arrive one by one over time, OSFS performs feature selection at each timestamp t and obtains an optimal subset of the features available so far.
Contrary to traditional feature selection methods, we cannot access the full feature space in OSFS scenarios. Moreover, the two main neighborhood relations cannot compensate for the shortcomings caused by unevenly distributed data. In addition, class imbalance is common in medical data; for example, abnormal cases attract more attention than normal ones in medical diagnosis. It is thus also crucial for the proposed framework to handle the class imbalance problem.

Our New Neighborhood Relation
The standardized Euclidean distance is applied to eliminate the effect of variance on the distance between samples. Given any samples $x, y \in U$ and an attribute subset $B \subseteq C$, the standardized Euclidean distance is
$\Delta_B(x, y) = \sqrt{ \sum_{a \in B} \left( \frac{f(x, a) - f(y, a)}{\sigma_a} \right)^2 }$,
where $f(x, a)$ is the value of sample $x$ on attribute $a$ and $\sigma_a$ is the standard deviation of attribute $a$. To overcome the challenge of the uneven distribution of medical data, we propose a novel neighborhood rough set as follows.

Definition 8. Given a decision system DT and a sample $x \in U$, the kδ-neighborhood of $x$ combines the k-nearest neighborhood $N_B^k(x)$ with the δ neighborhood $\delta_B(x)$, so that the number of neighbors of each sample adapts to the distribution of the samples around it; one possible reading is sketched below.
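Since the formula of Definition 8 is not reproduced above, the following sketch shows one plausible reading under explicitly stated assumptions of ours: the combined neighborhood keeps the k nearest neighbors of a sample together with every sample inside an adaptive radius (taken here as the mean distance to those k neighbors), using the standardized Euclidean distance of this subsection. Both the union rule and the radius choice are our assumptions, not the paper's exact definition.

```python
# Hypothetical reading of the combined k/delta neighborhood (Definition 8 is
# not reproduced in this text; the union rule and adaptive radius are OUR
# assumptions for illustration).
import numpy as np

def standardized_euclidean(X):
    """Pairwise standardized Euclidean distances among all samples."""
    sigma = X.std(axis=0)
    sigma[sigma == 0] = 1.0             # guard against constant attributes
    Z = X / sigma
    diff = Z[:, None, :] - Z[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def k_delta_neighborhood(D, i, k=3):
    """k nearest neighbors of sample i plus all samples within an adaptive
    radius, so the neighbor count follows the local sample distribution."""
    d = D[i].copy()
    d[i] = np.inf                       # exclude the sample itself
    knn = np.argsort(d)[:k]
    delta = d[knn].mean()               # adaptive radius (assumption)
    return np.union1d(knn, np.flatnonzero(d <= delta))
```

Under this reading, sparse regions fall back on the k nearest neighbors while dense regions admit every close sample, which matches the motivation of Figure 2.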

Weighted Dependency Computation
The traditional dependency degree only considers whether samples are correctly classified, without considering the structure of the neighborhood covering. To solve this problem, we propose a weighted dependency degree approach that takes the granular information of the features into account.
Proof. The claim follows from the monotonicity of the neighborhood relation defined in Definition 8. The proof of Theorem 2 follows directly from this monotonicity together with Theorem 1.
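The exact weighting formula is not reproduced above, so the following sketch only illustrates the general idea under an assumption of ours: instead of a hard 0/1 membership in the positive region, each sample contributes the purity of its neighborhood covering, so the granular structure around every sample influences the dependency degree.

```python
# Hypothetical weighted dependency degree (the paper's exact weight is not
# shown here; neighborhood purity serves as a stand-in weight).
import numpy as np

def weighted_dependency(neigh, y):
    """neigh[i]: neighbor indices of sample i under some neighborhood relation."""
    w = 0.0
    for i, nbrs in enumerate(neigh):
        if len(nbrs):
            w += np.mean(y[nbrs] == y[i])   # share of same-label neighbors
    return w / len(neigh)
```

A fully consistent sample contributes 1, exactly as in Definition 6, while boundary samples contribute partially rather than being discarded outright.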

Definition 11. Given $B \subseteq C$ and a decision attribute set $D$, the significance of a feature $c$ ($c \in B$) to $B$ can be written as follows:
$SIG(c, B, D) = \gamma_B(D) - \gamma_{B \setminus \{c\}}(D)$. (22)
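Reading Definition 11 as code, the significance of a selected feature is the drop in dependency caused by removing it. In the sketch below, `gamma` is a caller-supplied dependency routine (for instance the weighted one above wrapped over a feature subset); that framing is our illustration.

```python
# Significance of a feature c inside subset B (Equation (22) as written above):
# SIG(c, B, D) = gamma_B(D) - gamma_{B \ {c}}(D).
def significance(c, B, gamma):
    """gamma(features) -> dependency degree of that feature tuple; c must be in B."""
    reduced = tuple(f for f in B if f != c)
    return gamma(tuple(B)) - gamma(reduced)
```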

Three Evaluation Criteria
During the OSFS process, many irrelevant and redundant features should be removed for high-dimensional datasets. Three evaluation criteria are used during the process: max-dependency, max-relevance, and max-significance.

Max-Dependency
Max-dependency seeks the feature subset with the largest joint dependency on the decision attribute, which is equivalent to optimizing the following problem:
$\max_{B \subseteq C} \gamma(B, D)$. (23)
Max-dependency maximizes either the joint dependency between the selected feature subset and the decision attribute or the significance of a candidate feature to the already selected features. However, the high-dimensional space has two limitations that lead to failure in generating the resultant equivalence classes: (1) the number of samples is often insufficient, and (2) during multivariate density estimation, computing the inverse of the high-dimensional covariance matrix is generally an ill-posed problem [29]. These problems are particularly evident for continuous feature variables in real-life applications, such as the medical field. In addition, the computation of max-dependency is slow. Max-dependency is also inappropriate for OSFS, because at each timestamp only one new feature is known rather than the entire feature space.

Max-Relevance
Max-relevance is introduced as an alternative criterion for selecting features, since implementing max-dependency is hard. Max-relevance approximates $\gamma(B, D)$ in Equation (23) with the mean of the dependency degrees between each individual feature and the decision attribute:
$\max \frac{1}{|B|} \sum_{c_i \in B} \gamma(c_i, D)$, (26)
where $B$ is the already selected feature subset. A rich redundancy likely exists among the features selected according to max-relevance. For example, if two features $c_i$ and $c_j$ in a large feature space depend highly on each other, then after removing either one of them, the class differentiation ability of the other would not substantially change. Therefore, the following max-significance criterion is added to solve the redundancy problem by selecting mutually exclusive features.

Max-Significance
Based on Equation (22), the importance of each candidate feature can be calculated, and max-significance selects mutually exclusive features by maximizing the significance of each selected feature with respect to the others. For OSFS, features flow in individually over time, so testing all combinations of candidate features to maximize the dependency of the selected feature set is not practical. Instead, we first employ the max-relevance criterion to remove irrelevant features; then we employ the max-significance criterion to remove unimportant features from the selected feature set; finally, the max-dependency criterion is used to keep the feature set with the maximal dependency. Based on these three criteria, a novel online feature selection framework is proposed in the next subsection; a small sketch of this flow is given below.
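The following sketch paraphrases this flow for a single arriving feature; it is our reading of the three criteria, not the published pseudocode, and `gamma` is again a caller-supplied dependency routine.

```python
# Our paraphrase of the three-criteria flow for one arriving feature.
def process_new_feature(c, selected, gamma):
    # max-relevance: discard a feature irrelevant to the decision attribute
    if gamma((c,)) == 0.0:
        return selected
    selected = selected + [c]           # tentatively accept the new feature
    # max-significance: drop features whose removal costs no dependency
    for f in list(selected):
        rest = tuple(x for x in selected if x != f)
        if rest and gamma(tuple(selected)) - gamma(rest) <= 0.0:
            selected.remove(f)
    return selected                     # subset with maximal dependency kept
```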

OSFS-KW Framework
The weighted dependency computation method based on the kδ-neighborhood RS proposed in this study is shown in Algorithm 1. First, we calculate the card value of each sample, which reflects not only the distribution of labels near $x_i$ but also the granular structure information around $x_i$. In the real world, we often encounter high-dimensional, class-imbalanced data, specifically in medical diagnosis. We therefore employ the method proposed in [24], named the class imbalance function, as shown in Algorithm 2. For imbalanced medical data, we apply Algorithm 2 at step 9 of Algorithm 1; a simple example of such a weighting is sketched below.
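Algorithm 2 itself is not reproduced here. A common way to express such a class imbalance function, shown below only as an assumption in its spirit (not the exact function from [24]), is to weight each class inversely to its frequency so that rare (e.g., abnormal) cases are not dominated by the majority class.

```python
# Hypothetical class-imbalance weighting in the spirit of Algorithm 2:
# samples from rare classes receive proportionally larger weights.
import numpy as np

def class_weights(y):
    classes, counts = np.unique(y, return_counts=True)
    w = len(y) / (len(classes) * counts)    # inverse-frequency weights
    return dict(zip(classes, w))
```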

Algorithm 1 Weighted dependency computation
Based on the kδ-neighborhood relation and the weighted dependency computation method described above, we introduce our novel OSFS method, named "OSFS-KW", as shown in Algorithm 3. The main aim of OSFS-KW is to maximize $\gamma(B, D)$ with the minimal number of selected features. Under the max-significance constraint, we randomly select an attribute from B and compute its significance according to Equation (22); attributes with a significance equal to 0 are removed from B. Ultimately, the best feature subset for decision-making is obtained through the three evaluation constraints mentioned above.
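Putting the pieces together, a minimal streaming driver (our illustration of the overall loop, not Algorithm 3 itself) feeds features one at a time into the routine sketched earlier:

```python
# Illustrative streaming loop: features arrive one by one and the selected
# subset is updated online, as in the OSFS setting of Definition 7.
def osfs_kw_stream(feature_stream, gamma):
    selected = []
    for c in feature_stream:            # one new feature per timestamp
        selected = process_new_feature(c, selected, gamma)
    return selected
```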

Time Complexity of OSFS-KW

In the process of OSFS-KW, the weighted dependency degree computation shown in Algorithm 1 is the most important step. Let n be the number of examples in DS and m the number of condition attributes in C. Computing the neighborhood of every sample requires the pairwise distances among the n samples, so one run of Algorithm 1 costs O(n²) time, and the overall cost of OSFS-KW grows with the number m of arriving features.

Experiments

Data and Preprocessing
We use high-dimensional medical datasets as our test bench to compare the performance of the proposed OSFS-KW with existing streaming feature selection algorithms. Table 2 summarizes the 19 high-dimensional datasets used in our experiments.
In Table 2, the BREAST CANCER and OVARIAN CANCER datasets are biomedical datasets [30]. The LYMPHOMA and SIDO0 datasets are from the WCCI 2008 Performance Prediction Challenges [31]. MADELON and ARCENE are from the NIPS 2003 feature selection challenge [16]. WDBC, HILL, HILL (NOISE), and COLON TUMOR are four UCI datasets, available at https://archive.ics.uci.edu/ml/index.php. DLBCL, CAR, LUNG-STD, GLIOMA, LEU, LUNG, MLL, PROSTATE, and SRBCT are nine microarray datasets [32,33].

In our experiments, we employ k-nearest neighbor (KNN), support vector machine (SVM), and random forest (RF) as the basic classifiers to evaluate a selected feature subset. The radial basis function kernel is used in SVM, and the Gini coefficient is used to measure the importance of all variables in RF. Furthermore, grid-search cross-validation is applied to train and optimize these three classifiers for the best prediction results; the search ranges of the adjustable parameters of each basic classifier are shown in Table 3. The results are collected on the MATLAB 2017b platform with Windows 10, an Intel(R) Core(TM) i5-8265U 1.8 GHz CPU, and 8 GB memory.

In addition, we applied the Friedman test at a 95% significance level under the null hypothesis to validate whether OSFS-KW and its rivals differ significantly in prediction accuracy, compactness, and running time [34]. Accepting the null hypothesis means that the performance of OSFS-KW has no significant difference from its rivals; if the null hypothesis is rejected, follow-up inspection is necessary. In that case, we employed the Nemenyi test [35], under which the performances of two methods are significantly different if the difference of their average rankings (AR) is greater than the critical difference (CD).
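The evaluation protocol can be mirrored outside MATLAB; the sketch below is a rough scikit-learn analogue of our own making, and the parameter grids are placeholders rather than the exact search ranges of Table 3.

```python
# Rough Python analogue of the evaluation protocol (the paper uses MATLAB;
# the grids below are placeholders, not the exact ranges of Table 3).
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from scipy.stats import friedmanchisquare

def evaluate(X_selected, y):
    """Grid-search each basic classifier on the selected feature subset."""
    models = {
        "KNN": GridSearchCV(KNeighborsClassifier(), {"n_neighbors": [3, 5, 7]}, cv=5),
        "SVM": GridSearchCV(SVC(kernel="rbf"), {"C": [0.1, 1, 10]}, cv=5),
        "RF": GridSearchCV(RandomForestClassifier(), {"n_estimators": [100, 300]}, cv=5),
    }
    return {name: m.fit(X_selected, y).best_score_ for name, m in models.items()}

# Friedman test across methods: each argument is one method's scores over the
# 19 datasets, e.g. friedmanchisquare(acc_kw, acc_knn, acc_delta).
```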

OSFS-KW versus k-Nearest Neighborhood
In this section, we compare OSFS-KW with the k-nearest neighborhood relation. We employ the same algorithm framework for both neighborhood relations to reduce the impact of other factors. In addition, for the k-nearest neighborhood relation, the value of k varies from 3 to 13 in the experiments.
The experimental results are shown in Appendix A. Tables A1 and A2 show the compactness and running time; the p-values of the Friedman test are 5.07 × 10⁻⁹ and 5.47 × 10⁻¹⁰, respectively. In addition, Tables A3-A5 show the prediction accuracy on these datasets; the p-values on KNN, SVM, and RF are 0.6949, 0.9884, and 0.5388, respectively. Table A6 shows the test results of OSFS-KW versus the k-nearest neighborhood. Therefore, a significant difference exists among these 19 datasets on compactness and running time; on the contrary, no significant difference is observed on accuracy with KNN, SVM, and RF. According to the Nemenyi test, the value of the CD is 3.8215, and we have the following observations from Tables A1-A5.
In terms of compactness, a significant difference is observed between OSFS-KW and the k-nearest neighborhood only when k = 10, 11, 12, and 13, but OSFS-KW selects the smallest average number of features. Regarding running time, there is a significant difference between OSFS-KW and the k-nearest neighborhood when k = 3, 4, 5, 6, 7, and 8. In general, the k-nearest neighborhood is faster than OSFS-KW, mainly because the number of neighbors is fixed for the k-nearest neighborhood but varies for OSFS-KW. According to the values of AR and CD, there is no significant difference between OSFS-KW and the k-nearest neighborhood in the prediction accuracy of the three basic classifiers for values of k from 3 to 13. On some datasets, such as COLON TUMOR, DLBCL, CAR, LYMPHOMA, and LUNG-STD, if a proper k is chosen, the k-nearest neighborhood achieves a higher prediction accuracy than OSFS-KW with KNN, SVM, and RF. This finding means that the k-nearest neighborhood can perform well given the proper parameter k.

OSFS-KW versus δ Neighborhood
In this section, OSFS-KW is compared with the δ neighborhood relation. We employ the same algorithm framework for both neighborhood relations for fairness, and we vary the value of δ in the experiments; the test results are shown in Table A12. There is a significant difference among the different algorithms on compactness, running time, and prediction accuracy using KNN, but no significant difference exists in the prediction accuracy of SVM and RF. In addition, the value of CD is 3.1049.
On the number of selected features, shown in Table A7, OSFS-KW selects a compact feature subset, and it achieves a higher mean prediction accuracy than the δ neighborhood across these datasets. The δ neighborhood can also obtain the highest prediction accuracy with particular δ values on some datasets, such as DLBCL and LUNG-STD. However, it is impossible for the δ neighborhood relation to use uniform parameter values across all kinds of datasets.

Influence of Feature Stream Order
In this section, we carry out experiments on OSFS-KW with three types of feature stream orders: original, inverse, and random. Figure 3 depicts the compactness of OSFS-KW on the datasets, and Figures 4-6 show the prediction accuracy with KNN, SVM, and RF, respectively.
In addition, we execute the Friedman test at a 95% significance level under the null hypothesis to verify whether there is a significant difference in compactness, running time, and predictive accuracy; Table A12 in Appendix C shows the calculated p-values. There is no significant difference, except for the running time with the random order and the prediction accuracy using KNN with the random order. The number of features in the feature space has a remarkable impact on the running time between the original and random orders, specifically when the number of features is very large. For example, ARCENE has 10,000 features, and the difference in running time between the original and random orders on this dataset is 157.2334 s. Figures 3-6 show minor fluctuations on some datasets; however, the three orders have no significant difference from each other on most of the datasets. This result indicates that the feature stream order has a limited impact on OSFS-KW.
OSFS-KW versus Traditional Feature Selection Methods
We implement all these algorithms in MATLAB. The k value of ReliefF is set to 7 for the best performance. Considering that the 11 traditional feature selection methods cannot be applied in an OSFS scenario, we rank all features with each of them and select the same number of features as OSFS-KW. In addition, we employ the three basic classifiers, namely KNN, SVM, and RF; the prediction accuracies of the three classifiers with five-fold cross-validation are used to evaluate OSFS-KW and all competing methods.
The experimental results are shown in Appendix D. Tables A14-A16 show the prediction accuracy of the three basic classifiers; the p-values on the accuracy with KNN, SVM, and RF are 1.20 × 10⁻¹⁵, 3.81 × 10⁻¹², and 5.99 × 10⁻¹³, respectively. Table A17 shows the test results. Thus, a significant difference is observed between OSFS-KW and the compared algorithms on the prediction accuracy with the three classifiers. According to the value of CD, which is 3.8215, we can observe the following results from Tables A14-A16.
(1) OSFS-KW versus Fisher Score. According to the values of AR and CD, no significant difference is found between these two methods in prediction accuracy at a 95% significance level. However, OSFS-KW performs better than Fisher Score on most datasets with the three classifiers. Overall, OSFS-KW not only performs best across the 19 datasets but also has the highest average prediction accuracy with KNN, SVM, and RF.
OSFS-KW versus State-of-the-Art OSFS Methods
We implement all the aforementioned algorithms in MATLAB [48], and the significance level α is set to 0.05 for the five competing algorithms. The threshold a and the wealth w of Alpha-investing are set to 0.5. As shown in Appendix E, Tables A18 and A19 show the compactness and running time of OSFS-KW against the other five algorithms; the p-values of the Friedman test are 0.0248 and 3.62 × 10⁻²³, respectively. Tables A20-A22 summarize the prediction accuracy on the 19 datasets using the KNN, SVM, and RF classifiers, with p-values of 0.0337, 0.0032, and 0.0533, respectively. Table A23 shows the test results. A significant difference is found among the six algorithms in the number of selected features, the running time, and the prediction accuracy using KNN and SVM, but no significant difference is observed using RF. According to the value of CD, which is 1.7296, we can observe the following results from Tables A18-A22.
(1) In terms of compactness, no significant difference is observed between OSFS-KW and the other competing algorithms. Fast-OSFS has the smallest mean number of selected features. In addition, for SAOLA, the number of selected features is remarkably large on some datasets but zero on others, which demonstrates that SAOLA cannot handle some types of datasets well. (2) On running time, Alpha-investing is the fastest of the six algorithms and has the smallest mean running time on these datasets. According to the values of AR and CD, a significant difference exists between OSFS-KW and Alpha-investing, Fast-OSFS, and SAOLA, while the difference between OSFS-KW and OFS-A3M in running time is small.
(3) In terms of prediction accuracy, OSFS-KW has the highest mean prediction accuracy on these datasets using all three classifiers and outperforms the five competing algorithms. No significant difference is observed between OSFS-KW and the other competing methods, except for SAOLA.
In summary, although OSFS-KW is slower than some competing methods, including Fast-OSFS and SAOLA, it is superior to the other five methods in prediction accuracy on the 19 datasets.

Conclusions
Most of the existing OSFS methods cannot deal well with unevenly distributed data. In this study, we defined a new kδ-neighborhood relation that combines the advantages of the k-nearest neighborhood relation and the δ neighborhood relation. Then, we proposed a weighted dependency degree that considers the structure of the neighborhood covering. Finally, we proposed a new OSFS framework named OSFS-KW, which does not need any parameters to be specified in advance and can also handle the class imbalance problem in medical datasets. With three evaluation criteria, this approach can select the optimal feature subset mapping the decision attributes. We used KNN, SVM, and RF as the basic classifiers in the experiments to validate the effectiveness of our method. The results of the Friedman test indicate that a significant difference exists between OSFS-KW and the other neighborhood relations on compactness and running time, but there is no significant difference in predictive accuracy. Moreover, in comparison with 11 traditional feature selection methods and five existing OSFS algorithms, OSFS-KW performs better than the traditional feature selection methods and outperforms the state-of-the-art OSFS methods. In this work, we focused on the challenges of medical data and used only medical datasets to verify the validity of our approach; however, the method can be applied to other similar fields. In the future, we will test and evaluate it on multidisciplinary datasets.

Conflicts of Interest:
The authors declare no conflict of interest.