Article

Combination of Reduction Detection Using TOPSIS for Gene Expression Data Analysis

by Jogeswar Tripathy 1, Rasmita Dash 1,*, Binod Kumar Pattanayak 1, Sambit Kumar Mishra 2,*, Tapas Kumar Mishra 2 and Deepak Puthal 3,*

1 ITER, Siksha ‘O’ Anusandhan Deemed to be University, Bhubaneswar 751030, India
2 Department of Computer Science and Engineering, SRM University-AP, Amaravati 522502, India
3 Department of Electrical Engineering and Computer Science, Khalifa University, Abu Dhabi 127788, United Arab Emirates
* Authors to whom correspondence should be addressed.
Big Data Cogn. Comput. 2022, 6(1), 24; https://doi.org/10.3390/bdcc6010024
Submission received: 27 December 2021 / Revised: 13 February 2022 / Accepted: 16 February 2022 / Published: 23 February 2022
(This article belongs to the Special Issue Data, Structure, and Information in Artificial Intelligence)

Abstract

In high-dimensional data analysis, Feature Selection (FS) is one of the most fundamental issues in machine learning and requires the attention of researchers. These datasets are characterized by a huge feature space, out of which only a few features are significant for analysis. Thus, extraction of significant features is crucial. Various techniques are available for feature selection; among them, filter techniques are significant in this community, as they can be used with any type of learning algorithm, drastically lower the running time of optimization algorithms, and improve the performance of the model. Furthermore, the suitability of a filter approach depends on the characteristics of the dataset as well as on the machine learning model. Thus, to avoid these issues, in this research a combination of feature reduction (CFR) is considered by designing a pipeline of filter approaches for high-dimensional microarray data classification. Considering four filter approaches, sixteen combinations of pipelines are generated. The feature subset is reduced at different levels, and ultimately, the significant feature set is evaluated. The pipelined filter techniques are Correlation-Based Feature Selection (CBFS), Chi-Square Test (CST), Information Gain (InG), and Relief Feature Selection (RFS), and the classification techniques are Decision Tree (DT), Logistic Regression (LR), Random Forest (RF), and k-Nearest Neighbor (k-NN). The performance of CFR depends highly on the datasets as well as on the classifiers. Thereafter, the Technique for Order of Preference by Similarity to Ideal Solution (TOPSIS) method is used for ranking all reduction combinations and identifying the superior filter combination among all.

1. Introduction

Over the years, researchers have been using microarray technology to track gene expression on a genomic scale. Cancer diagnosis and classification are possible through examining the expression of genes. The use of microarray technology to analyze gene expression has opened up a world of possibilities for studying cell and organism biology [1]. Nowadays, researchers primarily focus on the behavior of genes across the experimental conditions studied; recently, biomedical applications have fueled both the use of available technologies and the efficient implementation of new analytical tools to deal with these complex data. Microarray data analysis yields useful results that aid in the resolution of gene expression problems. Cancer categorization is one of the most significant uses of microarray data analysis, reflecting variations in the levels of expression of various genes. However, categorizing gene expression profiles is a difficult process that has been classified as an NP-Hard issue. Moreover, not all genes have a role in the development of cancer; in clinical diagnosis, a large percentage of genes are insignificant [2].
Researchers were able to examine hundreds of gene expression patterns concurrently using microarray technology, which is useful in a variety of disciplines, particularly in medicine, mainly for the detection of cancer; in biomedical research, categorizing patients’ gene expression profiles has become a regular topic. The main issue is dealing with the dimensionality of microarray data [3]. As microarrays have such a vast dimension, efficient algorithm exploration becomes too difficult for analyzing gene expression features. Many irrelevant characteristics are present in the dataset, due to which the algorithm’s accuracy suffers considerably [4]. For this reason, in the pre-processing stage, feature selection approaches are used to extract meaningful information. The goal of the feature selection method is to identify the most significant characteristics from the microarray data to reduce the feature set and enhance classification accuracy. Using feature selection and classification approaches, gene expression analysis for cancer diagnosis has been carried out. However, combining an efficient feature selection method and classifier is a critical task to avoid incorrect drug selection [5].
In a machine learning pool, there exist three feature selection techniques: the filter approach, wrapper approach, and embedded approach. Filter techniques are an essential element of strategically selecting features due to the cheap cost of computation, making them suitable when data sizes are too big for a learning algorithm or when resources are limited. Filter methods may be split into two groups based on how they work. First, there are univariate methods: each characteristic is assessed independently in this category. The “relation” between a feature and the class label is taken into account here. Features are graded based on their “relationship” with other feature-class pairs. Mutual Information (MI) and Chi-square are a few examples of this category. Second, there are multivariate methods: in this scenario, characteristics are assessed in sets to see how well the sets can distinguish across classes. The sets that can discriminate better are more likely to offer a more accurate classification. Ranking techniques focusing on score-based feature subset selection are the most used filter approaches. In a microarray dataset, the ranking approach may be thought of as a crucial mechanism for picking the k most relevant genes. Since the number of features in microarray data might be quite enormous, the learning algorithm’s accuracy is severely harmed. As a result, selecting the top k genes from microarray data is an important pre-processing step [6]. In typical microarray research, the high number of features and the relatively limited number of observations (samples) offer numerous statistical difficulties, which are referred to as the ‘‘curse of dimensionality” in machine learning. As a result, after normalizing and pre-filtering the original datasets, we use several feature selection techniques to extract compact sets of discriminative features before using classification algorithms [7]. However, choosing a filter approach for gene selection is critical, because one technique may produce the best results for one dataset while another produces the best results for another.
Inspired by the above analysis, which is discussed by several researchers, this paper proposes a pipeline of reduction combinations using filter approaches. Four feature ranking algorithms are taken into account in this model for obtaining a better feature subset from datasets, as shown in Table 1: Correlation-Based Feature Selection (CBFS), Chi-Square Test (CST), Information Gain (InG), and Relief Feature Selection (RFS). Then, the classification techniques such as Decision Tree (DT), Logistic Regression (LR), Random Forest (RF), and k-Nearest Neighbor (k-NN) are used to classify the microarray databases. Forming the combination of four FS approaches, 16 pipelines are built up to reduce the feature subset in four phases to come up with more useful features. Considering 10 or fewer features, the final optimal feature subset is generated. With these reduced feature subsets, the performance of each pipeline is measured with the considered four classification techniques. In certain circumstances, it may become difficult to decide on a single parameter; thus, lastly, all reduction combinations are compared on different parameters such as accuracy, sensitivity, jaccard, specificity, and gmean. Thereafter, to finalize which reduction combination in a pipeline is stable irrespective of the dataset, the TOPSIS approach is used for optimal decision making.
The organization of this analysis is as follows. Section 1 introduces a generalization of the research, and Section 2 describes the recent related work about the sequence of feature ranking methods with different classifiers and TOPSIS, i.e., multi-criteria selection techniques used on microarray data. Section 3 describes the proposed model, and Section 4 includes the methodologies used with a detailed explanation; evaluation metrics are also discussed here along with the Multi-Criteria Decision Making (MCDM) technique. Section 5 includes the complete experimental work with the datasets used and the result analysis. In Section 6, a discussion of the contributions is presented, and finally, the conclusion and future work are given in Section 7.

2. Related Work

Feature ranking approaches are now used everywhere, including analytical techniques, summary extraction, sequential data processing, multidimensional data processing, and many more. Several studies use various filter techniques for feature selection. Hence, it is very difficult to identify a filter approach that can extract superior features from the datasets of the respective application. Again, the classification performance depends a lot on the extracted features. Therefore, rather than sticking to a single filter approach, combinations of two or more filter approaches are applied in the pre-processing stage. Thus, the literature survey that follows focuses on feature reduction in different stages, implementing filter approaches. A hybrid feature selection approach for illness identification has been presented by Namrata Singh et al. [8]. They use cross-validation for partitioning and multiple filter approaches for feature ranking with weighted scores. Furthermore, the sequential forward selection process is also employed as a wrapper technique to find the subset of features. Compared with benchmark classifiers such as Naive Bayes, Support Vector Machine with Radial Basis Function, Random Forest, and k-Nearest Neighbor, the four-step hybrid ensemble filter selection strategy is found to outperform fourteen feature selection algorithms. The empirical results clearly show that the suggested hybrid approach surpasses the competing methods in terms of accuracy, sensitivity, specificity, F1-score, area under curve evaluation measures, and the number of selected features.
Andrea Bommert et al. compared the most advanced feature selection strategies on high-dimensional datasets in their work [9]. They compared 22 filter techniques from several toolboxes on 16 high-dimensional classification datasets from distinct fields as well as the methods that select the features of a dataset in the same order. They concluded that some filter methods appear to perform better than others but failed to identify the highly reliable filter methods. Furthermore, they suggested that filter methods are dependent on the dataset.
Cosmin Lazar et al. [10] emphasized gene prioritization and filter-based feature selection techniques for informative feature extraction in Gene Expression Microarray (GEM) analysis. Rasmita Dash et al. [4] employed a pipelining of ranking approaches that addresses the difficulties associated with the filter approach. A few of the lower-ranked features are deleted at each level of the pipeline, resulting in a fairly good subset of features being retained at the end. The sequence of ranking approaches applied in the pipeline, on the other hand, is critical to ensuring that the significant genes are kept in the final subset. Out of four gene ranking methodologies, twenty-four separate pipeline models are developed during this experimental investigation. To discover the best pipeline for a given task, these pipelines are tested against seven distinct microarray databases. The Nemenyi post hoc hypothesis test confirms the grading system’s finding that one pipeline model is noteworthy.
Rasmita Dash et al. [11] offer an approach for microarray data Multi-Objective Feature Selection and Classifier Ensemble (MOFSCE), which works in two phases. The first phase is a pre-processing phase in which the Pareto front is utilized to identify relevant genes using a bi-objective optimization approach. In their study, 21 Bi-Objective Feature Selection (BOFS) models are created using seven feature ranking methodologies. The BOFS model’s performance varies based on the dataset. As a consequence, the grading system is used to determine the stability of the BOFS models. The construction of a classifier ensemble, which obtains the selected characteristics from the identified BOFS model, is the second phase.
Mitsunori Ogihara et al. [12] have presented a comparative analysis of feature selection on gene expression data and multiclass categorization. The research offers eight feature selection approaches, which according to the findings are information gain, the twoing rule, sum minority, max minority, Gini index, the sum of variances, one-dimensional SVM, and t-statistics. They have evaluated a feature’s usefulness by evaluating the level of class predictability when the prediction is made by splitting a gene’s whole range of expression into two sections. The results are typically satisfactory for datasets with a modest number of classes. Prediction accuracy is much worse for datasets with a high number of classes.
In another study, Mehdi Pirooznia et al. [13] examined the effectiveness of classification algorithms such as SVM, RBF Neural Nets, MLP Neural Nets, Bayesian, Decision Tree, and Random Forest. k-fold cross-validation was used to calculate the accuracy. The efficacy of certain standard clustering approaches, such as K-means, DBC, and expectation maximization (EM) clustering, has been tested on the datasets. Support Vector Machine Recursive Feature Elimination (SVM-RFE), Chi-squared, and CFS have been used to compare the efficiency of feature selection approaches. In each example, these approaches are applied to eight separate binary-class microarray datasets. Increasing performance is also observed in the work completed by Rasmita Dash et al. [11], who used a three-stage dimensionality reduction strategy on microarray databases along with four distinct classifiers. In the first stage, statistical techniques are utilized to filter out irrelevant genes from the database; thus, approximately 90% of the less significant characteristics are deleted. SNR is employed in the second step to drop a group of very noisy genes. Finally, the PCA approach is utilized to further reduce the dimension in the final stage. Then, these compressed data are evaluated using ANN, MLR, naïve Bayesian, and k-NN classifiers.
Alok Sharma et al. [14] also worked on feature selection using transcriptome data for a classification problem. The findings of comparison investigations conducted by Changjing Shang and Qiang Shen [15] highlight the relevance of appropriate feature selection in the creation of classifiers for use in high-dimensional domains. After a huge corpus of systematic tests, the optimal classification performance for k-NN and Naive Bayes classifiers is attained using a subset of features determined by information gain ranking. Naive Bayes may also perform well with a modest collection of linearly processed primary features in categorizing this challenging dataset. In addition, feature selection enhances classification accuracy while improving computing efficiency.
Mahnaz Behroozi and Ashkan Sami [16] looked at a dataset with a variety of sound recordings. The key contribution is to suggest a new separate classification framework that uses a unique classifier for each type of voice sample and identifies which vocal tests are more representative. They employed pre-processed data that were classified using four different algorithms: k-NN, SVM, discriminant analysis, and Naive Bayes. The k-NN classifier was built using the Euclidean distance metric, with k values of 1, 3, 5, and 7. With a scaling factor sigma (σ) of 3 and a penalty parameter (C) of 1, the SVM classifier was utilized with linear and radial basis function (RBF) kernels.
Huijuan Lu et al. [17] devised a hybrid feature selection approach that combines mutual information maximization with a genetic algorithm. According to experimental results, the suggested MIMAGA-Selection approach greatly decreases the dimension of the gene expression dataset. The reduced gene expression dataset delivers superior classification accuracy when compared to traditional feature selection techniques. Thanyaluk Jirapech-Umpai and S. Aitken [18] proposed a technique to create classification models from microarray data using both supervised and unsupervised classifiers. The study focuses on the supervised classification problem in which data samples are assigned to a known class. The k-NN classifier is used in this investigation. k-NN classification is based on a distance function determined for pairs of samples in N-dimensional space, such as the Euclidean distance or Pearson’s correlation. The class memberships of each sample’s k closest neighbors, as calculated by the distance function, are used to classify it. Motivated by the above literature survey, different ideas are taken for feature ranking, classification, and multi-criteria decision-making methods on the different datasets. Hence, the purpose of this study is to assess classification accuracy by finding out which feature ranking order works better in sequencing the feature selection process for the classification of microarray samples. Here, sequencing of the feature selection is taken into consideration, which is a prerequisite for classification [19]. This reduction combination is evaluated with respect to multiple performance metrics. In the final stage of implementation, TOPSIS is used to prove the outcome of this analysis.

3. Proposed Work

A rank-based approach is one of the dominant feature selection approaches in high-dimensional data analysis. This approach awards a rank, based on one mathematical score, to all the features in the original data. Top-ranked features are assumed to be highly informative, and a few of them are selected in the descending order of their rank. Each ranking algorithm is unique, computing its score based on its own ranking criterion. However, feature selection based on a single ranking criterion for any dataset may not result in satisfactory prediction. Hence, under such score-based feature evaluation schemes, different ranking techniques evaluate the features of gene-based sequencing datasets differently [20]. As a result, the informative gene sequence obtained with one approach may not be the same as with another. Thus, a filter technique that is useful in some problem spaces may not be successful for all datasets that refer to different applications. Hence, one ranking technique may outperform others for a specific type of problem. Thus, in this study, in place of extracting a few top-ranked features using a single rank-based filter technique, the merits of sequencing various filter techniques are considered. The genes are passed through a sequence of filter ranking approaches; in every stage of filtration, a few highly ranked features are taken to the next level, and the rest are dropped.
The proposed model, which is shown in Figure 1, presents the high-level overall workflow starting from the high-dimensional microarray database. It is scaled or normalized and validated with k-fold cross-validation, having (k − 1) folds for training and 1 fold for testing. After validating the data properly, they are passed through a block where the proposed sequence of the feature ranking process with different classifiers is performed. That block of the proposed model is described in detail in Figure 2, where sixteen feature reduction (FR) sequences, known as reduction combinations (RC), i.e., RC1–RC16, are designed from the four feature ranking techniques. For the sequencing of feature ranking, four ranking techniques are taken into consideration and represented as follows: FR1: Pearson Correlation Coefficient-Based Feature Selection, FR2: Chi-Square Test, FR3: Information Gain, and FR4: Relief Method. In every stage of reduction, the 80% lowest-ranked genes are dropped, and thus, the reduced datasets at different levels are formulated. It is very difficult to evaluate which ranking technique works well on a dataset for a specific classifier. Hence, to identify the most superior RC for high-dimensional gene expression data, a multi-criteria decision-making technique was applied, in which TOPSIS is implemented considering Accuracy, Specificity, Sensitivity, Jaccard, and Gmean as five evaluation criteria. The proposed model is presented in two ways: Figure 1 explains the high-level overall flow and shows the overall model design, and Figure 2 explains the details of the proposed work with the Multi-Attribute Decision-Making (MADM) process for obtaining a better sequence and classifier combination.
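To make the reduction pipeline concrete, the following minimal sketch (in Python, assuming numpy arrays and scikit-learn scoring functions; all function and variable names are illustrative, not the exact implementation used in this study) applies ranking stages in sequence, keeping the top 20% of the surviving features at each stage and returning the final feature subset of at most 10 genes.

```python
# Sketch of one reduction combination (RC): ranking stages applied in sequence;
# each stage keeps the top 20% of the surviving features, and the 10 best-ranked
# genes of the last stage form the final feature subset.
import numpy as np
from sklearn.feature_selection import chi2, mutual_info_classif

def chi_square_scores(X, y):
    stats, _ = chi2(np.clip(X, 0, None), y)   # chi2 requires non-negative values
    return stats

def info_gain_scores(X, y):
    return mutual_info_classif(X, y, random_state=0)

def reduction_combination(X, y, rankers, keep_ratio=0.2, final_k=10):
    """Apply the scoring functions in order, dropping 80% of features per stage."""
    surviving = np.arange(X.shape[1])                            # surviving gene indices
    for score in rankers:
        ranks = np.argsort(score(X[:, surviving], y))[::-1]      # best-ranked first
        keep = max(final_k, int(len(surviving) * keep_ratio))
        surviving = surviving[ranks[:keep]]
    return surviving[:final_k]                                   # final optimal subset

# Example ordering with two of the stages (chi-square followed by information gain):
# selected = reduction_combination(X, y, [chi_square_scores, info_gain_scores])
```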

4. Methodology

This section provides a brief overview of the machine learning techniques and tools used for data pre-processing, data classifications, performance metrics, and the TOPSIS approach.

4.1. Normalization

Normalization is often necessary when dealing with attributes on multiple scales [21]; otherwise, it may lead to a weakening in the impact of an equally essential attribute (on a smaller scale) due to other qualities having values on a greater scale. Generally, normalization is of three types: min–max normalization, Z-score normalization, and decimal scaling. In this research, min–max normalization is used and is discussed as follows.
The min–max normalization method is used for normalizing the data. When multiple attributes exist with different scale values, performing data mining operations directly on them can result in a poor data model; min–max normalization, as shown in Equation (1), rescales all attributes to a common range. So, this technique is used in the implementation of this work for normalizing the data.
$new\_V_{i,j} = \frac{V_{i,j} - min\_A_j}{max\_A_j - min\_A_j}$ (1)
where $V_{i,j}$ is the original value of attribute $j$ for record $i$, $new\_V_{i,j}$ is the new value, $min\_A_j$ is the minimum value of attribute $j$ in the original dataset (A), and $max\_A_j$ is the maximum value of attribute $j$ in the original dataset (A).
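A minimal sketch of Equation (1), assuming the data are held in a numpy array with samples as rows and genes as columns (variable names are illustrative):

```python
import numpy as np

def min_max_normalize(X):
    """Column-wise min-max scaling of a samples-by-genes matrix into [0, 1]."""
    col_min = X.min(axis=0)
    col_max = X.max(axis=0)
    span = np.where(col_max > col_min, col_max - col_min, 1.0)  # guard against constant columns
    return (X - col_min) / span
```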

4.2. Feature Ranking Techniques Used

Methods for extracting a subset of features from a larger dataset are known as feature selection methods. Feature selection methods provide a feature subset from the raw dataset and are divided into three types, i.e., the filter approach, wrapper approach, and embedded approach. Feature ranking involves rating each feature using a specific approach and then picking genes based on their weights [22]. Each method employed allows for the selection of a limited number of features. Several papers have used different successful feature ranking techniques to good effect [4,11]. Filtering techniques use an easy-to-calculate measure to quickly rank features, with the highest-ranking features picked [23]. Here, four feature selection techniques are taken into consideration, as in Table 1, i.e., CBFS, CST, InG, and RFS. CBFS [2] is a multivariate filter technique that ranks features based on correlation-based performance assessment functions. It begins with a full set of features (genes). CBFS focuses on decreasing feature-to-feature correlations while improving feature-to-class correlations. The Chi-squared feature evaluation [24] merely displays the relative relevance of the original characteristics. Then, the user may choose which features to keep and which to reject based on this information. In Chi-squared feature selection, the test statistic between the feature and the target class is used to establish the relevance of a feature. Equation (2) gives the Chi-squared statistic under the assumption that the feature and the class are independent.
$X^2 = \sum_i \frac{(A_i - E_i)^2}{E_i}$ (2)
where $A_i$ is the observed value and $E_i$ is the expected value. The information gain serves as a simple initial filter for screening features. The Information Gain $InG(C,B)$ [25] of a feature $B$, relative to a set of data $C$, is defined as:
$InG(C,B) = E(C) - \sum_{x \in Y(B)} \frac{|C_x|}{|C|} E(C_x)$ (3)
In Equation (3), $Y(B)$ is the set of all possible values of feature $B$, and $C_x$ is the subset of $C$ for which feature $B$ has the value $x$. The first term is the entropy of the entire set, and $E(C_x)$ is the entropy of the subset $C_x$. Given a randomly picked instance, Relief [26] searches for k of its nearest neighbors from the same class, which are referred to as nearest hits H, as well as k nearby neighbors from each of the distinct classes, which are referred to as nearest misses M. Relief selection computes feature relevance by showing the relationships between features and class labels. This approach, similar to nearest neighbor algorithms, applies weights to features based on the same-class and different-class examples that are nearest to every sample in the dataset. The adaptive formula for finding feature relevance is shown in Equation (4).
$X_j = X_{j-1} - (Y_j - NH_j)^2 + (Y_j - NM_j)^2$ (4)
where $X$ is an n-dimensional weight vector over the n features, the closest same-class and different-class samples are represented by $NH$ (‘NearHit’) and $NM$ (‘NearMiss’), and $j$ denotes the iteration number of the algorithm.
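Of the four filters, scikit-learn provides Chi-square and mutual-information scoring directly (as used in the pipeline sketch above); the Relief update of Equation (4) is sketched below for a single nearest hit and miss per iteration. This is a simplification, assuming every class contains at least two samples; a full ReliefF implementation would also average over k neighbors per class.

```python
import numpy as np

def relief_weights(X, y, n_iter=100, seed=0):
    """Accumulate the Equation (4) update over n_iter randomly picked samples."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    w = np.zeros(m)
    for _ in range(n_iter):
        i = rng.integers(n)
        hits = np.where(y == y[i])[0]
        hits = hits[hits != i]                           # exclude the sample itself
        misses = np.where(y != y[i])[0]
        dist = lambda idx: np.linalg.norm(X[idx] - X[i], axis=1)
        near_hit = X[hits[np.argmin(dist(hits))]]
        near_miss = X[misses[np.argmin(dist(misses))]]
        w += -(X[i] - near_hit) ** 2 + (X[i] - near_miss) ** 2
    return w / n_iter                                    # higher weight = more relevant
```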

4.3. Classification Model

In this subsection, the authors present k-NN, LR, DT, and RF as classifiers [27] to classify the four cancer datasets. A summary of the classification techniques and the performance evaluation metrics is given in the following sections.

4.3.1. k-Nearest Neighbor (k-NN)

k-NN [28,29] chooses the class value of a new instance by examining the set of the k closest instances in the training set, as measured by the distance in Equation (5), and selecting the most frequent class value among them; here, k is set to five, and the Euclidean distance metric is used to calculate the similarity between two points. It stores the training data and classifies query data based on a similarity measure. k-NN parameter tuning is performed to improve performance by selecting an appropriate value of k.
$D(x, y) = \sqrt{\sum_{i=1}^{m} (x_i - y_i)^2}$ (5)
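A short sketch of the distance in Equation (5) and the k-NN configuration described above (k = 5, Euclidean metric) using scikit-learn; this is illustrative rather than the exact setup of the experiments:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def euclidean_distance(x, y):
    """Equation (5): straight-line distance between two m-dimensional points."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sqrt(np.sum((x - y) ** 2))

knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
# knn.fit(X_train, y_train); predictions = knn.predict(X_test)
```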

4.3.2. Logistic Regression (LR)

The LR classification model is a prominent option for modeling binary classifications [29]. LR creates a predictor variable by linearly combining the feature variables. Then, a logistic function is used to convert the values of this predictor variable into probabilities. Generally, this method is used for binary class prediction. It can also be applied to multiclass problems. This classification model’s logistic equation is:
$Y_i = \ln\left(\frac{x_i}{1 - x_i}\right)$ (6)
where $x_i$ is the probability of the occurrence of event $i$.

4.3.3. Decision Tree (DT)

The DT can handle both classification and regression issues; here, it is used to solve classification problems [30]. It has two advantages:
  • Decision Trees are made to resemble human decision-making abilities, making them simple to understand.
  • Due to the tree-like structure of the decision tree, the logic behind the concept can be easily understood.
The Decision Tree is made up of three types of nodes: the first one is for decision making (commonly represented by a square), the second is for shaping the options (commonly represented by a circular pattern), and the last one is for representing the action (commonly represented by a triangle).

4.3.4. Random Forest (RF)

RF is a classification process that employs an ensemble method, utilizing multiple decision trees to classify data. It generates bootstrap samples from the original data, and for each bootstrap sample, it grows an unpruned classification or regression tree. Rather than selecting only the best predictor at every node, it employs a random predictor selection and chooses the best split among them. The parameter n-split, which specifies the number of splitting points to be evaluated for each feature, is required by RF [31]; higher values of n-split may result in more accurate predictions at the expense of increased computational load. We choose three values while leaving the other parameters at their default settings.

4.4. k-Fold Cross-Validation Method

The error rate of the classification algorithm is used to evaluate a classifier’s performance. The error rate on the data used to train the classifier (the training set) is not a trustworthy criterion; indeed, such a method could cause the classifier to overfit the training data. To anticipate a classifier’s performance, we must examine its error rate on a separate dataset that was not included in the training process (i.e., the test set). The k-fold cross-validation method is based on dividing the dataset into k parts at random. One part is used for testing, while the remaining k − 1 parts are used for training. This procedure is performed k times to ensure that each part is tested exactly once. Then, the k error estimates are averaged to produce a reliable overall error estimate. Different k-fold cross-validation trials can result in different classification error rates due to the random selection of folds [32]. As a result, we repeat the k-fold cross-validation process k times to increase the error rate’s reliability.
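A minimal sketch of this validation scheme with scikit-learn, assuming 10 folds repeated 10 times and a Random Forest as the illustrative classifier; the data below are synthetic placeholders, not the microarray datasets:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Synthetic placeholder data: 60 samples, 50 features, balanced binary labels.
rng = np.random.default_rng(0)
X = rng.random((60, 50))
y = np.tile([0, 1], 30)

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=cv)
error_rate = 1.0 - scores.mean()   # error estimate averaged over all folds and repeats
```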

4.5. Performance Evaluation Criteria

At the final stage, the performance of each method was evaluated to determine which method could produce the best results [11]. To evaluate each of the methods used in this study, we used parameters such as accuracy (Equation (7)), sensitivity (Equation (8)), specificity (Equation (9)), jaccard (Equation (10)), and gmean (Equation (11)) from the confusion matrix. The confusion matrix includes the terms FN (False Negative), FP (False Positive), TN (True Negative), and TP (True Positive). The definitions of all performance metrics are as follows:
$Accuracy = \left(\frac{TP_i + TN_i}{total\ number\ of\ samples}\right) \times 100$ (7)
$Sensitivity = \left(\frac{TP_i}{TP_i + FP_i}\right) \times 100$ (8)
$Specificity = \left(\frac{TN_i}{TN_i + FN_i}\right) \times 100$ (9)
$Jaccard = \left(\frac{TP_i}{TP_i + FP_i + FN_i}\right) \times 100$ (10)
$Gmean = \sqrt{Specificity \times Sensitivity}$ (11)
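The five criteria can be computed directly from confusion-matrix counts exactly as written in Equations (7)–(11); the sketch below assumes per-class counts tp, tn, fp, and fn are already available:

```python
import math

def evaluation_criteria(tp, tn, fp, fn):
    """Equations (7)-(11), following the definitions used in this study."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total * 100
    sensitivity = tp / (tp + fp) * 100            # Equation (8) as defined above
    specificity = tn / (tn + fn) * 100            # Equation (9) as defined above
    jaccard = tp / (tp + fp + fn) * 100
    gmean = math.sqrt(specificity * sensitivity)
    return accuracy, sensitivity, specificity, jaccard, gmean
```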

4.6. Multi-Criteria Decision Making (MCDM)

There are six steps involved in multi-criteria decision-making: (i) problem formulation, (ii) identification of requirements, (iii) goal setting, (iv) identification of various alternatives, (v) development of criteria, and (vi) identification and application of decision-making techniques. Some frequently referred-to multi-criteria techniques (also known as Multi-Attribute Decision Making (MADM)) are Simple Additive Weighting (SAW), the Gray Relational Model (GRM), and TOPSIS. TOPSIS is used to recommend one or more options from a large set of alternatives. The TOPSIS ranking was calculated using Microsoft Excel as a tool. TOPSIS can be used in situations where there are multiple feature selection algorithms to choose from, each with its own set of criteria such as accuracy, sensitivity, specificity, number of features, and so on [33]. The TOPSIS steps used for measurement are as follows:
Step 1: Firstly, create a matrix $M_{i,j}$ with m rows corresponding to the feature reduction sequences and n columns corresponding to the performance evaluation criteria and the classifiers used.
Step 2: For calculating normalized matrix N M i , j :
$NM_{i,j} = \frac{M_{i,j}}{\sqrt{\sum_{i=1}^{m} M_{i,j}^2}}, \quad j = 1, \ldots, J$ (12)
Step 3: For calculating weighted normalized matrix W N M i , j :
$WNM_{i,j} = NM_{i,j} \times W_j, \quad j = 1, \ldots, J$ (13)
where $W_j$ is the weight of criterion $j$ and $\sum_{j=1}^{J} W_j = 1$; weights can be assigned randomly or according to the criteria.
Step 4: Then, calculate the ideal best solution $V_j^+$ from the combination of the best performance values and the ideal worst solution $V_j^-$ from the combination of the worst performance values using the following formulas:
$V_j^+ = \{WNM_1^+, \ldots, WNM_J^+\} = \{(\max_i WNM_{i,j} \mid j \in h), (\min_i WNM_{i,j} \mid j \in l)\}$ (14)
$V_j^- = \{WNM_1^-, \ldots, WNM_J^-\} = \{(\min_i WNM_{i,j} \mid j \in h), (\max_i WNM_{i,j} \mid j \in l)\}$ (15)
where h is the set of benefit-type criteria (best performance at the maximum value) and l is the set of cost-type criteria (best performance at the minimum value).
Step 5: After calculating the Ideal Best, calculate the separation measure from the Ideal Best:
$S_i^+ = \sqrt{\sum_{j=1}^{J} (WNM_{i,j} - WNM_j^+)^2}, \quad i = 1, \ldots, m$ (16)
After calculating the Ideal Worst, calculate the separation measure from the Ideal Worst:
$S_i^- = \sqrt{\sum_{j=1}^{J} (WNM_{i,j} - WNM_j^-)^2}, \quad i = 1, \ldots, m$ (17)
Step 6: Finally, calculate the relative closeness to the ideal solution performance score as follows:
$P_i = \frac{S_i^-}{S_i^+ + S_i^-}, \quad i = 1, \ldots, m$ (18)
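A compact numpy sketch of Steps 1–6, assuming all criteria are benefit-type (higher is better), which matches the five evaluation measures used here; the decision matrix and weights in the example are illustrative:

```python
import numpy as np

def topsis_scores(M, weights):
    """Return the closeness score P_i for every row (alternative) of M."""
    M = np.asarray(M, dtype=float)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                                        # weights sum to 1
    NM = M / np.sqrt((M ** 2).sum(axis=0))                 # Step 2: vector normalization
    WNM = NM * w                                           # Step 3: weighted matrix
    v_plus, v_minus = WNM.max(axis=0), WNM.min(axis=0)     # Step 4: ideal best / worst
    s_plus = np.sqrt(((WNM - v_plus) ** 2).sum(axis=1))    # Step 5: separation measures
    s_minus = np.sqrt(((WNM - v_minus) ** 2).sum(axis=1))
    return s_minus / (s_plus + s_minus)                    # Step 6: performance score

# Example: two reduction combinations scored on three equally weighted criteria.
p = topsis_scores([[0.95, 0.91, 0.93], [0.97, 0.89, 0.94]], [1, 1, 1])
ranking = np.argsort(-p) + 1   # 1-based indices of alternatives, best first
```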

5. Experimental Results

Simulation is conducted considering a maximum of 10 genes for classification purposes, and an optimal set is extracted. Considering the reduced data, the 10-fold cross-validation process is implemented to produce the training and testing data inputs for the classifiers. For this analysis, a few classification techniques that are successful on high-dimensional data are taken, such as k-NN, LR, DT, and RF [34]. From the classification output, it is observed that the performance of the RCs varies with the order of the sequence, as these ranking techniques perform differently for different datasets and classifiers.

5.1. Dataset Description

Microarray data are high-dimensional data with a small number of samples or observations compared to the number of genes or attributes. The number of samples is in the range of hundreds, and the number of attributes is in the range of thousands. Among the four datasets, Colon Cancer (D2) and Adenoma Cancer (D4) have two class levels, whereas Brain Tumor (D1) has five class levels and Breast Cancer (D3) has three class levels. Table 2 contains the descriptions of the databases.

5.2. Result Analysis

Before going to feature sequencing, the datasets are first normalized by min–max normalization, so the values of each feature are converted to the range between 0 and 1. After that, the feature set is reduced to 20% of its previous size (i.e., the 80% lowest-ranked features are dropped) in each of the four steps adopted in the sequence. Further simulation is conducted to come up with an optimal set of features, fixing its size at no more than 10. Then, the reduced feature sequences are validated by the 10-fold cross-validation approach by separating the training and testing data with respect to the considered classifiers; the classification outputs are presented in Table 3 and Table 4. From the analysis of Table 3 and Table 4, it is clear that none of the classifiers obtain the optimal results across all metrics, and the ranking of the top-performing model differs depending on the performance assessment measurement chosen. Hence, further analysis of the classifiers and the FR sequencing approaches is performed by TOPSIS.
Table 5 presents the result of the TOPSIS approach implemented on the Brain Tumor dataset, where rows represent the 16 feature reduction sequences FR1 to FR16 and columns represent the classifiers used (k-NN, LR, DT, and RF) with five performance criteria: accuracy (CR1), sensitivity (CR2), specificity (CR3), jaccard (CR4), and gmean (CR5). The classification results on each dataset, given in Table 3 and Table 4, are converted to the corresponding weighted normalized matrix by using Equation (13) for each dataset, as shown in Table 5, Table 6, Table 7 and Table 8. Then, the ideal best value $V^+$ is chosen from the set of combinations of best performance values for each dataset individually, and the ideal worst value $V^-$ is chosen from the set of worst performance values, as given in Equations (14) and (15). The separation measures (Euclidean distances) from the Ideal Best and Ideal Worst solutions are calculated using Equations (16) and (17), and their results are shown in the columns $S_i^+$ and $S_i^-$ individually for each dataset in Table 5, Table 6, Table 7 and Table 8. Finally, the ideal solution performance score is found using Equation (18) and is represented as $P_i$ for ranking each feature reduction sequence individually with reference to the datasets, as shown in Table 5, Table 6, Table 7 and Table 8. As described above, Table 6, Table 7 and Table 8 present the TOPSIS result analysis for the Colon Cancer, Breast Cancer, and Adenoma Cancer datasets.
After obtaining the ranks of each FR, the performances of all FRs are compared in Table 9, where the top three ranked feature reduction sequences with respect to all classifiers are taken into consideration. As it is difficult to choose which sequence performs better, the occurrences of the few top-ranking FRs are also counted in Table 10, from which it is observed that FR5 is superior. FR5 comes first in rank except for one classifier (i.e., LR), where its rank is second. Finally, in Figure 3, the number of occurrences of all FRs coming within the top three ranks is shown. Here, it is also found that out of 48 occurrences (as in Table 9), FR5 comes under the top three ranks in 16 occurrences, whereas for the other FRs, the occurrence is quite nominal. Hence, it can be said that the FR5 model works better as the feature reduction sequence across the four datasets, as shown in Figure 4.

6. Discussion

The following are the contributions of this proposed work:
  • This research work focuses on the feature selection and classification approaches for the gene expression cancer data analysis.
  • After going through several research contributions from the last five years, it is observed that filter approaches are successful for high-dimensional data, as they are simple to implement and have low computational cost.
  • When a single filter approach is applied, the selection approach may not be able to drop all redundant and insignificant features from the data. This may be due to the data characteristics and score function used in the selection process.
  • Thus, a series of filter approaches is applied to rank the features, and thus, few top-ranked features are extracted.
  • When a series of filter approaches is applied, it is necessary to evaluate which sequence will generate the significant features. Thus, rather than relying on a single classifier, multiple classifiers are used to find the optimal sequence.
  • Furthermore, this analysis is statistically proven to come up with optimal results.
  • The method is generalizable to other diseases (especially for high-dimensional data) if the research challenge is similar to that of microarray datasets. Since the features differ between datasets, with a little modification this approach can also handle data such as RNA sequence and methylation data. If the dimensionality is low, then four-stage feature selection may not be required.

7. Conclusions and Future Work

After analyzing all the experimental studies and result analyses individually, it is concluded that the FR5 model works better on the given datasets with all presented classifiers. Finally, it can be concluded that the feature reduction sequence FR5, i.e., (Correlation-Based Feature Selection → Chi-Square Test → Relief Feature Selection → Information Gain), is superior to the other feature reduction combination techniques. This analysis is implemented on high-dimensional medical data. As filter approaches are quite successful for huge data, this strategy can boost classifier efficiency while requiring little computing work. Future work on this research analysis can be extended using a few more successful feature selection approaches, such as other filter approaches, wrapper techniques, and embedded approaches. In this research study, the key challenge is significant feature selection, as the data are huge. Thus, alternatively, this approach can also be successfully applied in application areas where similar challenges are seen.

Author Contributions

Conceptualization, J.T., R.D. and B.K.P.; methodology, J.T., R.D. and S.K.M.; software, J.T., R.D. and T.K.M.; validation, J.T., R.D., T.K.M. and S.K.M.; formal analysis, R.D. and D.P.; investigation, J.T., R.D. and B.K.P.; resources, J.T.; data curation, J.T., R.D. and B.K.P.; writing—original draft preparation, J.T., R.D. and S.K.M.; writing—review and editing, J.T., R.D. and S.K.M.; visualization, R.D., T.K.M. and D.P.; supervision, R.D., B.K.P. and D.P.; project administration, R.D., B.K.P. and D.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not Applicable.

Informed Consent Statement

Not Applicable.

Data Availability Statement

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Herrero, J.; Vaquerizas, J.M.; Al-Shahrour, F.; Conde, L.; Mateos, A.; Díaz-Uriarte, J.S.R.; Dopazo, J. New challenges in gene expression data analysis and the extended GEPAS. Nucleic Acids Res. 2004, 32, 485–491. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  2. Almugren, N.; Alshamlan, H. A survey on hybrid feature selection methods in microarray gene expression data for cancer classification. IEEE Access 2019, 7, 78533–78548. [Google Scholar] [CrossRef]
  3. Singh, R.K.; Sivabalakrishnan, M.J.P.C.S. Feature selection of gene expression data for cancer classification: A review. Procedia Comput. Sci. 2015, 50, 52–57. [Google Scholar] [CrossRef] [Green Version]
  4. Dash, R.; Misra, B.B. Pipelining the ranking techniques for microarray data classification: A case study. Appl. Soft Comput. 2016, 48, 298–316. [Google Scholar] [CrossRef]
  5. Glaab, E.; Bacardit, J.; Garibaldi, J.M.; Krasnogor, N. Using rule-based machine learning for candidate disease gene prioritization and sample classification of cancer gene expression data. PLoS ONE 2012, 7, 39932. [Google Scholar] [CrossRef] [Green Version]
  6. Ghosh, K.K.; Begum, S.; Sardar, A.; Adhikary, S.; Ghosh, M.; Kumar, M.; Sarkar, R. Theoretical and empirical analysis of filter ranking methods: Experimental study on benchmark DNA microarray data. Expert Syst. Appl. 2021, 169, 114485. [Google Scholar] [CrossRef]
  7. Sahu, B.; Dehuri, S.; Jagadev, A.K. Feature selection model based on clustering and ranking in pipeline for microarray data. Inform. Med. Unlocked 2017, 9, 107–122. [Google Scholar] [CrossRef]
  8. Singh, N.; Singh, P. A hybrid ensemble-filter wrapper feature selection approach for medical data classification. Chemom. Intell. Lab. Syst. 2021, 217, 104396. [Google Scholar] [CrossRef]
  9. Bommert, A.; Sun, X.; Bischl, B.; Rahnenführer, J.; Lang, M. Benchmark for filter methods for feature selection in high-dimensional classification data. Comput. Stat. Data Anal. 2020, 143, 106839. [Google Scholar] [CrossRef]
  10. Lazar, C.; Taminau, J.; Meganck, S.; Steenhoff, D.; Coletta, A.; Molter, C.; de Schaetzen, V.; Duque, R.; Bersini, H.; Nowe, A. A survey on filter techniques for feature selection in gene expression microarray analysis. IEEE/ACM Trans. Comput. Biol. Bioinform. 2012, 9, 1106–1119. [Google Scholar] [CrossRef]
  11. Dash, R.; Misra, B.B. A multi-objective feature selection and classifier ensemble technique for microarray data analysis. Int. J. Data Min. Bioinform. 2018, 20, 123–160. [Google Scholar] [CrossRef]
  12. Li, T.; Zhang, C.; Ogihara, M. A comparative study of feature selection and multiclass classification methods for tissue classification based on gene expression. Bioinformatics 2004, 20, 2429–2437. [Google Scholar] [CrossRef] [PubMed]
  13. Pirooznia, M.; Yang, J.Y.; Yang, M.Q.; Deng, Y. A comparative study of different machine learning methods on microarray gene expression data. BMC Genom. 2008, 9, 1–13. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  14. Sharma, A.; Imoto, S.; Miyano, S. A top-r feature selection algorithm for microarray gene expression data. IEEE/ACM Trans. Comput. Biol. Bioinform. 2011, 9, 754–764. [Google Scholar]
  15. Shen, Q.; Shang, C. Aiding classification of gene expression data with feature selection: A comparative study. J. Comput. Intell. Res. (IJCIR) 2006, 1, 68–76. [Google Scholar]
  16. Behroozi, M.; Sami, A. A multiple-classifier framework for Parkinson’s disease detection based on various vocal tests. Int. J. Telemed. Appl. 2016, 2016, 6837498. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  17. Lu, H.; Chen, J.; Yan, K.; Jin, Q.; Xue, Y.; Gao, Z. A hybrid feature selection algorithm for gene expression data classification. Neurocomputing 2017, 256, 56–62. [Google Scholar] [CrossRef]
  18. Jirapech-Umpai, T.; Aitken, S. Feature selection and classification for microarray data analysis: Evolutionary methods for identifying predictive genes. BMC Bioinform. 2005, 6, 148. [Google Scholar] [CrossRef] [Green Version]
  19. Dash, R. A two stage grading approach for feature selection and classification of microarray data using Pareto based feature ranking techniques: A case study. J. King Saud-Univ.-Comput. Inf. Sci. 2020, 32, 232–247. [Google Scholar] [CrossRef]
  20. Singh, R.; Kumar, H.; Singla, R.K. TOPSIS based multi-criteria decision making of feature selection techniques for network traffic dataset. Int. J. Eng. Technol. 2014, 5, 4598–4604. [Google Scholar]
  21. GeeksforGeeks. Data Normalization in Data Mining. 2019. Available online: https://www.geeksforgeeks.org/data-normalization-in-data-mining/ (accessed on 6 March 2018).
  22. Abusamra, H. A comparative study of feature selection and classification methods for gene expression data of glioma. Procedia Comput. Sci. 2013, 23, 5–14. [Google Scholar] [CrossRef] [Green Version]
  23. Hemphill, E.; Lindsay, J.; Lee, C.M.; Oiu, I.I.; Nelson, C.E. Feature selection and classifier performance on diverse bio-logical datasets. BMC Bioinform. 2014, 15, S4. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  24. Spencer, R.; Thabtah, F.; Abdelhamid, N.; Thompson, M. Exploring feature selection and classification methods for predicting heart disease. Digit. Health 2020, 6. [Google Scholar] [CrossRef] [Green Version]
  25. Liu, S.; Xu, C.; Zhang, Y.; Liu, J.; Yu, B.; Liu, X.; Dehmer, M. Feature selection of gene expression data for cancer classification using double RBF-kernels. BMC Bioinform. 2018, 19, 396. [Google Scholar] [CrossRef] [Green Version]
  26. Gunduz, H. An efficient dimensionality reduction method using filter-based feature selection and variational autoencoders on Parkinson’s disease classification. Biomed. Signal Process. Control. 2021, 66, 102452. [Google Scholar] [CrossRef]
  27. Mohapatra, D.; Tripathy, J.; Patra, T.K. Rice Disease Detection and Monitoring Using CNN and Naive Bayes Classification. In Soft Computing Techniques and Applications; Springer: Singapore, 2021; pp. 11–29. [Google Scholar]
  28. Nazir, S.; Yousaf, M.H.; Velastin, S.A. Evaluating a bag-of-visual features approach using spatio-temporal features for action recognition. Comput. Electr. Eng. 2018, 72, 660–669. [Google Scholar] [CrossRef]
  29. Assiri, A.S.; Nazir, S.; Velastin, S.A. Breast tumor classification using an ensemble machine learning method. J. Imaging 2020, 6, 39. [Google Scholar] [CrossRef]
  30. Criminisi, A. Machine learning for medical images analysis. Med Image Anal. 2016, 33, 91–93. [Google Scholar] [CrossRef]
  31. Ko, B.C.; Kim, S.H.; Nam, J.Y. X-ray image classification using random forests with local wavelet-based CS-local binary patterns. J. Digit. Imaging 2011, 24, 1141–1151. [Google Scholar] [CrossRef] [Green Version]
  32. Tripathy, J.; Dash, R.; Pattanayak, B.K.; Mohanty, B. Automated Phrase Mining Using POST: The Best Approach. In Proceedings of the IEEE International Conference on Electrical Power Engineering, Communication and Computing Technology (ODICON), Odisha, India; pp. 1–6.
  33. Dash, R.; Samal, S.; Dash, R.; Rautray, R. An integrated TOPSIS crow search based classifier ensemble: In application to stock index price movement prediction. Appl. Soft Comput. 2019, 85, 105784. [Google Scholar] [CrossRef]
  34. Microarray Datasets. Available online: http://csse.szu.edu.cn/staff/zhuzx/Datasets.html (accessed on 6 March 2018).
  35. Brain Tumor Dataset. 2020. Available online: https://www.kaggle.com/ahmedhamada0/brain-tumor-detection/metadata (accessed on 6 March 2018).
  36. Adenoma Datasets. Available online: http://biogps.org/dataset/tag/adenoma/ (accessed on 6 March 2018).
Figure 1. High-Level Flow Diagram of the Proposed Model.
Figure 2. Detailed Flow Diagram of the Proposed Model.
Figure 3. Occurrence Analysis for Feature Reduction Technique.
Figure 4. FR5 Model for Microarray Data.
Table 1. Acronym and Description.

ANN: Artificial Neural Network
BOFS: Bi-Objective Feature Selection
CBFS: Correlation-Based Feature Selection
CFR: Combination of Feature Reduction
RC: Reduction Combination
CST: Chi-Square Test
DBC: Distance-Based Clustering
DT: Decision Tree
EM: Expectation Maximization
FN: False Negative
FP: False Positive
FR: Feature Reduction
FS: Feature Selection
GEM: Gene Expression Microarray
GRM: Gray Relational Model
InG: Information Gain
k-NN: k-Nearest Neighbor
LR: Logistic Regression
MADM: Multi-Attribute Decision Making
MCDM: Multi-Criteria Decision Making
MI: Mutual Information
MIMAGA: Mutual Information Maximization and Genetic Algorithm
MLP Neural Nets: Multilayer Perceptron Neural Networks
MOFSCE: Multi-Objective Feature Selection and Classifier Ensemble
PCA: Principal Component Analysis
RBF: Radial Basis Function
RF: Random Forest
RFS: Relief Feature Selection
SAW: Simple Additive Weighting
SVM: Support Vector Machine
SVM-RFE: Support Vector Machine Recursive Feature Elimination
TN: True Negative
TOPSIS: Technique for Order of Preference by Similarity to Ideal Solution
TP: True Positive
Table 2. Description of microarray databases.

Dataset | Total No. of Samples | No. of Samples in Each Class | No. of Genes | No. of Classes
Brain Tumor [35] | 40 | 10, 10, 10, 4, 6 | 7129 | 5
Colon Cancer [34] | 62 | 40, 22 | 2000 | 2
Breast Cancer [34] | 98 | 11, 51, 36 | 1213 | 3
Adenoma Cancer [36] | 8 | 4, 4 | 7086 | 2
Table 3. Classification Results of Four Datasets By Using k-NN, LR, DT, and RF Classifier with Different Feature Ranking Techniques.
        CR1               CR2               CR3               CR4               CR5
        k-NN LR DT RF     k-NN LR DT RF     k-NN LR DT RF     k-NN LR DT RF     k-NN LR DT RF
D1FR10.890.920.910.89 0.780.910.920.91 0.970.910.850.93 0.890.940.910.94 0.870.910.880.92
FR20.910.940.920.88 0.90.920.940.91 0.950.880.970.89 0.950.860.920.9 0.920.900.950.90
FR30.940.930.920.92 0.880.920.90.91 0.90.920.930.94 0.940.930.910.9 0.890.920.910.92
FR40.920.90.910.9 0.910.950.940.89 0.920.780.870.88 0.910.890.910.89 0.910.860.900.88
FR50.970.950.911 0.990.9610.91 0.980.950.991 10.990.970.98 0.980.950.990.95
FR60.960.940.930.92 0.970.910.880.78 0.910.920.890.9 0.980.890.920.91 0.940.910.880.84
FR70.90.910.910.89 0.920.930.890.91 0.930.970.980.82 0.9710.90.89 0.920.950.930.86
FR80.950.910.880.9 0.910.890.920.93 0.910.870.890.91 0.930.940.960.97 0.910.880.900.92
FR90.940.910.920.93 0.920.910.930.88 0.910.930.91 0.910.950.99 0.910.950.930.89
FR100.970.910.890.87 0.980.920.880.91 0.940.920.80.83 0.930.940.890.98 0.960.920.840.87
FR110.950.940.930.9 0.930.910.850.93 0.920.950.920.87 0.940.960.910.92 0.920.930.880.90
FR120.960.970.910.89 0.960.980.910.92 0.940.960.930.92 0.910.970.930.93 0.950.970.920.92
FR130.990.930.920.93 10.970.930.95 0.890.880.990.89 0.940.950.970.9 0.940.920.960.92
FR1410.890.950.91 10.90.970.96 0.950.90.940.94 0.950.890.911 0.970.900.950.95
FR150.970.940.920.94 0.90.910.890.92 0.880.90.960.97 0.890.930.950.98 0.890.900.920.94
FR160.960.950.890.93 0.970.940.90.91 0.90.920.80.91 0.970.960.950.94 0.930.930.850.91
D2FR10.980.940.890.91 0.910.940.890.79 0.890.920.910.89 0.910.880.920.93 0.900.930.900.84
FR20.970.950.90.92 0.950.910.910.89 0.920.950.930.88 0.970.950.940.93 0.930.930.920.88
FR30.920.920.910.89 0.90.890.880.95 0.930.910.90.91 0.890.920.910.93 0.910.900.890.93
FR40.990.930.870.93 0.910.940.910.93 0.950.920.880.87 0.930.940.910.89 0.930.930.890.90
FR510.990.920.93 110.990.98 0.980.990.991 0.99110.99 0.990.990.990.99
FR610.980.880.93 0.950.930.920.91 0.950.930.940.89 0.910.970.930.95 0.950.930.930.90
FR70.960.950.930.91 0.860.880.870.93 0.910.940.930.92 0.920.930.950.98 0.880.910.900.92
FR80.990.970.880.89 0.940.910.910.92 0.960.90.870.95 0.910.920.970.94 0.950.900.890.93
FR90.990.960.920.92 0.950.930.920.89 0.950.940.970.9 0.890.960.970.91 0.950.930.940.89
FR100.980.980.940.91 0.950.930.920.93 0.970.980.910.89 0.950.960.920.93 0.960.950.910.91
FR110.970.940.910.89 0.980.9710.94 0.90.950.940.96 0.920.960.970.89 0.940.960.970.95
FR120.990.980.890.92 0.950.960.940.91 0.960.980.890.92 0.960.950.920.89 0.950.970.910.91
FR130.960.950.930.94 0.920.780.920.91 0.990.980.930.89 0.930.960.890.91 0.950.870.920.90
FR140.970.960.930.78 10.970.940.92 0.990.910.970.98 0.990.950.931 0.990.940.950.95
FR150.9810.920.87 0.960.970.960.95 0.980.960.950.93 0.890.910.940.96 0.970.960.950.94
FR1610.990.940.89 0.960.950.890.94 0.950.910.930.94 0.980.970.950.93 0.950.930.910.94
Table 4. Classification Results of Four Datasets By Using k-NN, LR, DT, and RF Classifier with Different Feature Ranking Techniques cont.
        CR1               CR2               CR3               CR4               CR5
        k-NN LR DT RF     k-NN LR DT RF     k-NN LR DT RF     k-NN LR DT RF     k-NN LR DT RF
D3FR10.940.930.880.91 0.910.910.90.89 0.920.920.910.95 0.930.910.90.89 0.910.910.900.92
FR20.930.920.910.9 0.940.960.980.97 0.910.940.970.96 0.930.910.980.9 0.920.950.970.96
FR30.890.910.920.87 0.920.910.90.92 0.930.890.960.91 0.940.930.940.93 0.920.900.930.91
FR40.940.930.880.92 0.970.960.890.9 0.910.930.940.92 0.960.980.920.89 0.940.940.910.91
FR50.950.920.910.9 110.980.99 0.99110.98 110.990.98 0.991.000.990.98
FR60.970.930.890.89 0.960.90.950.97 0.910.990.960.97 0.980.910.930.95 0.930.940.950.97
FR70.940.890.90.91 0.930.940.90.91 0.980.950.960.97 0.980.970.950.94 0.950.940.930.94
FR80.950.910.920.89 0.950.910.880.92 0.940.950.970.92 0.920.890.950.94 0.940.930.920.92
FR90.960.970.890.91 0.950.980.970.94 0.940.910.940.95 0.970.930.950.93 0.940.940.950.94
FR100.970.950.930.92 0.940.980.890.92 0.970.930.940.92 0.950.960.930.92 0.950.950.910.92
FR110.960.880.860.93 0.980.970.940.96 0.950.940.930.9 0.950.940.920.91 0.960.950.930.93
FR120.940.930.920.91 0.99110.9 0.960.970.910.89 0.90.890.930.94 0.970.980.950.89
FR130.920.970.930.93 0.990.930.910.99 0.990.930.920.93 0.930.940.960.96 0.990.930.910.96
FR140.970.920.930.95 0.960.890.910.92 10.9510.91 0.970.980.921 0.980.920.950.91
FR150.970.920.910.93 0.930.920.950.94 0.970.940.920.94 0.950.910.93 0.950.930.930.94
FR160.950.920.910.88 0.950.930.950.94 10.950.890.93 0.960.970.920.91 0.970.940.920.93
Dataset D4:
FR1 | 0.96 0.94 0.91 0.89 | 0.92 0.94 0.90 0.94 | 0.91 0.93 0.90 0.92 | 0.93 0.95 0.89 0.91 | 0.91 0.93 0.90 0.93
FR2 | 0.96 0.94 0.92 0.90 | 0.92 0.93 0.94 0.89 | 0.99 0.96 0.88 0.89 | 0.91 0.94 0.95 0.95 | 0.95 0.94 0.91 0.89
FR3 | 0.98 0.97 0.91 0.89 | 0.95 0.99 0.92 0.97 | 0.93 0.96 0.95 0.97 | 0.97 0.96 0.93 0.94 | 0.94 0.97 0.93 0.97
FR4 | 1.00 0.98 0.94 0.92 | 0.94 0.89 0.89 0.90 | 0.91 0.97 0.93 0.95 | 0.95 0.93 0.92 0.89 | 0.92 0.93 0.91 0.92
FR5 | 1.00 0.98 0.92 0.92 | 0.97 0.99 1.00 1.00 | 1.00 0.99 0.99 1.00 | 0.99 1.00 0.98 0.98 | 0.98 0.99 0.99 1.00
FR6 | 1.00 0.99 0.93 0.94 | 0.93 0.96 0.94 0.92 | 0.98 0.93 0.96 0.94 | 0.92 0.93 0.94 0.95 | 0.95 0.94 0.95 0.93
FR7 | 0.98 0.97 0.93 0.93 | 0.94 0.95 0.97 0.92 | 0.99 0.95 1.00 0.96 | 0.96 0.99 0.98 0.95 | 0.96 0.95 0.98 0.94
FR8 | 0.99 0.95 0.92 0.91 | 0.93 0.95 0.94 0.91 | 0.92 0.94 0.91 0.90 | 0.95 0.94 0.94 0.92 | 0.92 0.94 0.92 0.90
FR9 | 0.98 0.99 0.94 0.93 | 0.95 0.89 0.94 0.95 | 0.95 0.97 0.95 0.93 | 0.96 1.00 0.96 0.98 | 0.95 0.93 0.94 0.94
FR10 | 0.98 0.97 0.96 0.92 | 0.96 0.95 0.94 0.89 | 0.88 0.89 0.96 0.94 | 0.91 0.92 0.93 0.95 | 0.92 0.92 0.95 0.91
FR11 | 0.95 0.96 0.93 0.89 | 0.97 1.00 0.92 0.91 | 0.93 0.95 0.92 0.91 | 0.92 0.95 0.96 0.96 | 0.95 0.97 0.92 0.91
FR12 | 0.98 0.93 0.89 0.90 | 0.92 0.96 0.92 0.94 | 0.96 0.92 0.93 0.95 | 0.92 0.93 0.97 0.94 | 0.94 0.94 0.92 0.94
FR13 | 0.96 0.97 0.92 0.91 | 0.99 0.93 0.92 0.93 | 1.00 0.95 0.97 0.94 | 0.89 0.92 0.98 0.91 | 0.99 0.94 0.94 0.93
FR14 | 0.98 0.94 0.91 0.93 | 1.00 0.96 0.93 0.89 | 1.00 0.91 0.98 0.95 | 0.90 0.93 0.93 1.00 | 1.00 0.93 0.95 0.92
FR15 | 0.99 0.91 0.93 0.93 | 0.98 0.95 0.94 0.91 | 0.97 0.95 0.97 0.95 | 0.94 0.95 0.92 0.96 | 0.97 0.95 0.95 0.93
FR16 | 1.00 0.98 0.94 0.91 | 0.96 0.94 0.95 0.92 | 0.95 0.97 0.95 0.94 | 0.91 0.94 0.92 0.96 | 0.95 0.95 0.95 0.93
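Each FR row in Table 4 is one pipelined combination of filter methods whose reduced feature set is then classified with k-NN, LR, DT, and RF. The snippet below is only an illustrative sketch of how such a two-stage filter pipeline can be scored (here a Chi-Square stage followed by a mutual-information stage as a stand-in for Information Gain, run on synthetic data); the exact mapping of FR1–FR16 to filter orderings, the reduction levels, and the data preparation are those defined earlier in the paper, not the values chosen here.

# Illustrative two-stage filter pipeline evaluated with the four classifiers
# (a sketch under the assumptions stated above, not the experimental code).
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import cross_val_score

# Small synthetic stand-in for a microarray dataset: few samples, many features.
X, y = make_classification(n_samples=60, n_features=2000, n_informative=40, random_state=0)

classifiers = {
    "k-NN": KNeighborsClassifier(),
    "LR": LogisticRegression(max_iter=1000),
    "DT": DecisionTreeClassifier(random_state=0),
    "RF": RandomForestClassifier(random_state=0),
}
for name, clf in classifiers.items():
    pipe = Pipeline([
        ("scale", MinMaxScaler()),                            # chi2 requires non-negative inputs
        ("filter1", SelectKBest(chi2, k=500)),                # first reduction level
        ("filter2", SelectKBest(mutual_info_classif, k=50)),  # second reduction level
        ("clf", clf),
    ])
    print(name, cross_val_score(pipe, X, y, cv=5).mean().round(2))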
Table 5. Results of TOPSIS for Brain Tumor Dataset with Ranking of Feature Reduction Sequence using Dataset D1.
k-NN || LR
FR | CR1 CR2 CR3 CR4 CR5 | Si+ | Si− | Pi | Rank || CR1 CR2 CR3 CR4 CR5 | Si+ | Si− | Pi | Rank
FR1 | 0.234 0.209 0.262 0.237 0.234 | 0.078 | 0.024 | 0.237 | 16 || 0.248 0.245 0.248 0.250 0.247 | 0.041 | 0.045 | 0.522 | 10
FR2 | 0.240 0.241 0.257 0.253 0.249 | 0.042 | 0.043 | 0.507 | 10 || 0.253 0.248 0.240 0.229 0.244 | 0.056 | 0.033 | 0.372 | 15
FR3 | 0.248 0.236 0.243 0.251 0.240 | 0.052 | 0.034 | 0.394 | 14 || 0.251 0.248 0.251 0.247 0.250 | 0.037 | 0.047 | 0.560 | 7
FR4 | 0.242 0.244 0.249 0.243 0.246 | 0.047 | 0.040 | 0.457 | 12 || 0.243 0.256 0.213 0.236 0.234 | 0.076 | 0.018 | 0.194 | 16
FR5 | 0.255 0.265 0.265 0.267 0.265 | 0.008 | 0.078 | 0.904 | 1 || 0.256 0.259 0.259 0.263 0.259 | 0.016 | 0.068 | 0.806 | 2
FR6 | 0.253 0.260 0.246 0.261 0.253 | 0.027 | 0.063 | 0.701 | 4 || 0.253 0.245 0.251 0.236 0.248 | 0.044 | 0.044 | 0.498 | 11
FR7 | 0.237 0.246 0.251 0.259 0.249 | 0.041 | 0.048 | 0.539 | 9 || 0.245 0.251 0.265 0.266 0.258 | 0.023 | 0.069 | 0.749 | 4
FR8 | 0.250 0.244 0.246 0.248 0.245 | 0.043 | 0.042 | 0.492 | 11 || 0.245 0.240 0.238 0.250 0.239 | 0.054 | 0.033 | 0.380 | 14
FR9 | 0.248 0.246 0.243 0.240 0.245 | 0.048 | 0.027 | 0.357 | 15 || 0.245 0.245 0.273 0.266 0.259 | 0.025 | 0.075 | 0.749 | 3
FR10 | 0.255 0.262 0.254 0.248 0.258 | 0.024 | 0.065 | 0.735 | 3 || 0.245 0.248 0.251 0.250 0.250 | 0.038 | 0.048 | 0.557 | 8
FR11 | 0.250 0.249 0.249 0.251 0.249 | 0.025 | 0.049 | 0.665 | 7 || 0.253 0.245 0.259 0.255 0.252 | 0.029 | 0.058 | 0.669 | 5
FR12 | 0.253 0.257 0.254 0.243 0.256 | 0.032 | 0.058 | 0.648 | 8 || 0.261 0.264 0.262 0.258 0.263 | 0.014 | 0.072 | 0.842 | 1
FR13 | 0.261 0.268 0.241 0.251 0.254 | 0.031 | 0.069 | 0.687 | 5 || 0.251 0.262 0.240 0.252 0.251 | 0.039 | 0.047 | 0.545 | 9
FR14 | 0.263 0.268 0.257 0.253 0.262 | 0.016 | 0.076 | 0.827 | 2 || 0.240 0.243 0.246 0.236 0.244 | 0.054 | 0.035 | 0.397 | 13
FR15 | 0.255 0.241 0.238 0.237 0.240 | 0.055 | 0.039 | 0.414 | 13 || 0.253 0.245 0.246 0.247 0.246 | 0.043 | 0.042 | 0.497 | 12
FR16 | 0.253 0.260 0.243 0.259 0.252 | 0.030 | 0.061 | 0.671 | 6 || 0.256 0.253 0.251 0.255 0.253 | 0.029 | 0.054 | 0.651 | 6
V+ | 0.263 0.268 0.265 0.267 0.265 || 0.261 0.264 0.273 0.266 0.263
V− | 0.234 0.209 0.238 0.237 0.234 || 0.240 0.240 0.213 0.229 0.234
DT || RF
FR | CR1 CR2 CR3 CR4 CR5 | Si+ | Si− | Pi | Rank || CR1 CR2 CR3 CR4 CR5 | Si+ | Si− | Pi | Rank
FR1 | 0.249 0.251 0.232 0.245 0.241 | 0.057 | 0.028 | 0.332 | 14 || 0.244 0.250 0.256 0.250 0.253 | 0.042 | 0.054 | 0.565 | 8
FR2 | 0.252 0.256 0.264 0.248 0.261 | 0.026 | 0.063 | 0.709 | 4 || 0.241 0.250 0.245 0.239 0.248 | 0.054 | 0.044 | 0.451 | 12
FR3 | 0.252 0.246 0.254 0.245 0.250 | 0.043 | 0.045 | 0.513 | 9 || 0.252 0.250 0.259 0.239 0.255 | 0.041 | 0.056 | 0.580 | 6
FR4 | 0.249 0.256 0.237 0.245 0.247 | 0.048 | 0.037 | 0.435 | 11 || 0.246 0.245 0.242 0.237 0.244 | 0.055 | 0.038 | 0.405 | 14
FR5 | 0.249 0.273 0.270 0.261 0.272 | 0.011 | 0.082 | 0.882 | 1 || 0.274 0.250 0.275 0.261 0.263 | 0.015 | 0.081 | 0.846 | 1
FR6 | 0.254 0.240 0.243 0.248 0.242 | 0.054 | 0.033 | 0.378 | 13 || 0.252 0.215 0.248 0.242 0.231 | 0.065 | 0.026 | 0.289 | 16
FR7 | 0.249 0.243 0.267 0.242 0.255 | 0.041 | 0.057 | 0.584 | 7 || 0.244 0.250 0.226 0.237 0.238 | 0.066 | 0.037 | 0.357 | 15
FR8 | 0.241 0.251 0.243 0.258 0.247 | 0.047 | 0.041 | 0.464 | 10 || 0.246 0.256 0.251 0.258 0.253 | 0.039 | 0.058 | 0.599 | 4
FR9 | 0.252 0.254 0.254 0.256 0.254 | 0.032 | 0.052 | 0.618 | 5 || 0.255 0.242 0.251 0.263 0.246 | 0.038 | 0.051 | 0.570 | 7
FR10 | 0.243 0.240 0.218 0.239 0.229 | 0.079 | 0.009 | 0.098 | 16 || 0.238 0.250 0.229 0.261 0.239 | 0.061 | 0.044 | 0.420 | 13
FR11 | 0.254 0.232 0.251 0.245 0.241 | 0.057 | 0.038 | 0.400 | 12 || 0.246 0.256 0.240 0.245 0.248 | 0.051 | 0.048 | 0.488 | 11
FR12 | 0.249 0.248 0.254 0.250 0.251 | 0.039 | 0.047 | 0.545 | 8 || 0.244 0.253 0.253 0.247 0.253 | 0.043 | 0.054 | 0.555 | 10
FR13 | 0.252 0.254 0.270 0.261 0.262 | 0.023 | 0.069 | 0.752 | 2 || 0.255 0.261 0.245 0.239 0.253 | 0.045 | 0.058 | 0.564 | 9
FR14 | 0.260 0.265 0.256 0.245 0.261 | 0.025 | 0.063 | 0.713 | 3 || 0.249 0.264 0.259 0.266 0.262 | 0.030 | 0.074 | 0.714 | 3
FR15 | 0.252 0.243 0.262 0.256 0.252 | 0.038 | 0.054 | 0.589 | 6 || 0.257 0.253 0.267 0.261 0.260 | 0.022 | 0.071 | 0.762 | 2
FR16 | 0.243 0.246 0.218 0.256 0.232 | 0.073 | 0.021 | 0.227 | 15 || 0.255 0.250 0.251 0.250 0.251 | 0.038 | 0.052 | 0.581 | 5
V+ | 0.260 0.273 0.270 0.261 0.272 || 0.274 0.264 0.275 0.266 0.263
V− | 0.241 0.232 0.218 0.239 0.229 || 0.238 0.215 0.226 0.237 0.231
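The Si+, Si−, Pi, and Rank columns in Tables 5–8 follow the standard TOPSIS steps applied to the weighted normalized score matrix of the sixteen feature-reduction combinations. A minimal sketch in Python, assuming equal weights for the five criteria CR1–CR5 and treating all of them as benefit-type criteria (it is not the code used for the experiments), is:

# Minimal TOPSIS sketch (illustrative only). Rows of the decision matrix are the
# 16 filter combinations FR1-FR16, columns are the criteria CR1-CR5; all criteria
# are assumed benefit-type (higher is better) and equally weighted.
import numpy as np

def topsis_rank(decision_matrix, weights=None):
    X = np.asarray(decision_matrix, dtype=float)   # shape: (alternatives, criteria)
    w = np.full(X.shape[1], 1.0 / X.shape[1]) if weights is None else np.asarray(weights, dtype=float)

    # 1. Vector-normalize each criterion column, then apply the weights.
    V = w * X / np.linalg.norm(X, axis=0)

    # 2. Ideal best (V+) and ideal worst (V-) value per criterion.
    v_pos, v_neg = V.max(axis=0), V.min(axis=0)

    # 3. Euclidean separation of each alternative from V+ (Si+) and V- (Si-).
    s_pos = np.linalg.norm(V - v_pos, axis=1)
    s_neg = np.linalg.norm(V - v_neg, axis=1)

    # 4. Closeness coefficient Pi and rank (1 = closest to the ideal solution).
    pi = s_neg / (s_pos + s_neg)
    rank = pi.argsort()[::-1].argsort() + 1
    return s_pos, s_neg, pi, rank

Applying topsis_rank to the 16 × 5 score matrix of a single classifier should reproduce, up to rounding, the corresponding block of the tables above.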
Table 6. Results of TOPSIS for Colon Cancer Dataset with Ranking of Feature Reduction Sequence using Dataset D2.
k-NN || LR
FR | CR1 CR2 CR3 CR4 CR5 | Si+ | Si− | Pi | Rank || CR1 CR2 CR3 CR4 CR5 | Si+ | Si− | Pi | Rank
FR1 | 0.250 0.241 0.234 0.243 0.238 | 0.049 | 0.021 | 0.304 | 14 || 0.244 0.253 0.244 0.233 0.249 | 0.046 | 0.046 | 0.498 | 12
FR2 | 0.248 0.252 0.242 0.260 0.247 | 0.029 | 0.038 | 0.564 | 8 || 0.247 0.245 0.252 0.251 0.249 | 0.037 | 0.045 | 0.551 | 11
FR3 | 0.235 0.238 0.245 0.238 0.242 | 0.050 | 0.017 | 0.252 | 15 || 0.239 0.239 0.241 0.243 0.241 | 0.053 | 0.032 | 0.377 | 15
FR4 | 0.253 0.241 0.250 0.249 0.246 | 0.035 | 0.032 | 0.473 | 12 || 0.242 0.253 0.244 0.248 0.249 | 0.039 | 0.049 | 0.557 | 10
FR5 | 0.256 0.265 0.258 0.265 0.262 | 0.003 | 0.062 | 0.955 | 1 || 0.257 0.269 0.263 0.264 0.266 | 0.003 | 0.080 | 0.969 | 1
FR6 | 0.256 0.252 0.250 0.243 0.251 | 0.030 | 0.040 | 0.570 | 7 || 0.255 0.250 0.247 0.256 0.249 | 0.032 | 0.052 | 0.623 | 8
FR7 | 0.245 0.228 0.240 0.246 0.234 | 0.056 | 0.014 | 0.201 | 16 || 0.247 0.237 0.249 0.246 0.243 | 0.047 | 0.034 | 0.418 | 14
FR8 | 0.253 0.249 0.253 0.243 0.251 | 0.030 | 0.038 | 0.555 | 10 || 0.252 0.245 0.239 0.243 0.242 | 0.047 | 0.040 | 0.456 | 13
FR9 | 0.253 0.252 0.250 0.238 0.251 | 0.034 | 0.027 | 0.445 | 13 || 0.249 0.250 0.249 0.254 0.250 | 0.032 | 0.051 | 0.614 | 9
FR10 | 0.250 0.252 0.255 0.254 0.254 | 0.021 | 0.044 | 0.678 | 5 || 0.255 0.250 0.260 0.254 0.255 | 0.025 | 0.057 | 0.696 | 4
FR11 | 0.248 0.260 0.237 0.246 0.248 | 0.035 | 0.038 | 0.522 | 11 || 0.244 0.261 0.252 0.254 0.257 | 0.025 | 0.062 | 0.712 | 3
FR12 | 0.253 0.252 0.253 0.257 0.252 | 0.021 | 0.044 | 0.681 | 4 || 0.255 0.258 0.260 0.251 0.259 | 0.019 | 0.063 | 0.768 | 2
FR13 | 0.245 0.244 0.261 0.249 0.252 | 0.030 | 0.039 | 0.560 | 9 || 0.247 0.210 0.260 0.254 0.234 | 0.069 | 0.031 | 0.308 | 16
FR14 | 0.248 0.265 0.261 0.265 0.263 | 0.008 | 0.062 | 0.889 | 2 || 0.249 0.261 0.241 0.251 0.251 | 0.032 | 0.058 | 0.645 | 6
FR15 | 0.250 0.254 0.258 0.238 0.256 | 0.030 | 0.045 | 0.598 | 6 || 0.260 0.261 0.255 0.240 0.258 | 0.028 | 0.063 | 0.695 | 5
FR16 | 0.256 0.254 0.250 0.262 0.252 | 0.018 | 0.048 | 0.721 | 3 || 0.257 0.255 0.241 0.256 0.249 | 0.032 | 0.057 | 0.641 | 7
V+ | 0.256 0.265 0.261 0.265 0.263 || 0.260 0.269 0.263 0.264 0.266
V− | 0.235 0.228 0.234 0.238 0.234 || 0.239 0.210 0.239 0.233 0.234
DT || RF
FR | CR1 CR2 CR3 CR4 CR5 | Si+ | Si− | Pi | Rank || CR1 CR2 CR3 CR4 CR5 | Si+ | Si− | Pi | Rank
FR1 | 0.244 0.241 0.245 0.245 0.243 | 0.051 | 0.016 | 0.236 | 14 || 0.252 0.215 0.242 0.249 0.228 | 0.075 | 0.038 | 0.335 | 16
FR2 | 0.247 0.246 0.251 0.250 0.248 | 0.040 | 0.026 | 0.397 | 8 || 0.255 0.242 0.239 0.249 0.241 | 0.053 | 0.050 | 0.485 | 15
FR3 | 0.250 0.238 0.242 0.242 0.240 | 0.055 | 0.015 | 0.213 | 15 || 0.247 0.258 0.247 0.249 0.253 | 0.038 | 0.061 | 0.612 | 7
FR4 | 0.239 0.246 0.237 0.242 0.242 | 0.055 | 0.012 | 0.183 | 16 || 0.258 0.253 0.236 0.238 0.245 | 0.054 | 0.059 | 0.521 | 13
FR5 | 0.253 0.268 0.267 0.266 0.267 | 0.006 | 0.062 | 0.910 | 1 || 0.258 0.267 0.272 0.265 0.269 | 0.004 | 0.090 | 0.959 | 1
FR6 | 0.242 0.249 0.253 0.248 0.251 | 0.039 | 0.028 | 0.416 | 7 || 0.258 0.248 0.242 0.254 0.245 | 0.045 | 0.058 | 0.562 | 9
FR7 | 0.255 0.235 0.251 0.253 0.243 | 0.048 | 0.028 | 0.371 | 11 || 0.252 0.253 0.250 0.262 0.252 | 0.033 | 0.064 | 0.661 | 2
FR8 | 0.242 0.246 0.234 0.258 0.240 | 0.052 | 0.024 | 0.316 | 13 || 0.247 0.250 0.258 0.251 0.254 | 0.034 | 0.059 | 0.639 | 5
FR9 | 0.253 0.249 0.261 0.258 0.255 | 0.027 | 0.042 | 0.607 | 5 || 0.255 0.242 0.244 0.243 0.243 | 0.051 | 0.051 | 0.498 | 14
FR10 | 0.258 0.249 0.245 0.245 0.247 | 0.042 | 0.028 | 0.397 | 9 || 0.252 0.253 0.242 0.249 0.247 | 0.044 | 0.057 | 0.563 | 8
FR11 | 0.250 0.271 0.253 0.258 0.262 | 0.019 | 0.051 | 0.735 | 2 || 0.247 0.256 0.261 0.238 0.258 | 0.038 | 0.064 | 0.631 | 6
FR12 | 0.244 0.254 0.240 0.245 0.247 | 0.045 | 0.023 | 0.337 | 12 || 0.255 0.248 0.250 0.238 0.249 | 0.046 | 0.056 | 0.549 | 11
FR13 | 0.255 0.249 0.251 0.237 0.250 | 0.044 | 0.028 | 0.394 | 10 || 0.260 0.248 0.242 0.243 0.245 | 0.049 | 0.058 | 0.541 | 12
FR14 | 0.255 0.254 0.261 0.248 0.258 | 0.027 | 0.042 | 0.608 | 4 || 0.216 0.250 0.266 0.267 0.258 | 0.049 | 0.063 | 0.562 | 10
FR15 | 0.253 0.260 0.256 0.250 0.258 | 0.025 | 0.042 | 0.628 | 3 || 0.241 0.258 0.253 0.257 0.256 | 0.033 | 0.062 | 0.653 | 3
FR16 | 0.258 0.241 0.251 0.253 0.246 | 0.042 | 0.031 | 0.420 | 6 || 0.247 0.256 0.255 0.249 0.256 | 0.033 | 0.062 | 0.650 | 4
V+ | 0.258 0.271 0.267 0.266 0.267 || 0.260 0.267 0.272 0.267 0.269
V− | 0.239 0.235 0.234 0.237 0.240 || 0.216 0.215 0.236 0.238 0.228
Table 7. Results of TOPSIS for Breast Cancer Dataset with Ranking of Feature Reduction Sequence using Dataset D3.
k-NN || LR
FR | CR1 CR2 CR3 CR4 CR5 | Si+ | Si− | Pi | Rank || CR1 CR2 CR3 CR4 CR5 | Si+ | Si− | Pi | Rank
FR1 | 0.248 0.238 0.241 0.244 0.240 | 0.043 | 0.016 | 0.267 | 15 || 0.251 0.241 0.244 0.242 0.243 | 0.047 | 0.018 | 0.275 | 15
FR2 | 0.245 0.246 0.238 0.244 0.242 | 0.040 | 0.016 | 0.281 | 14 || 0.249 0.254 0.249 0.242 0.252 | 0.036 | 0.029 | 0.446 | 11
FR3 | 0.235 0.241 0.243 0.247 0.242 | 0.042 | 0.012 | 0.225 | 16 || 0.246 0.241 0.236 0.248 0.239 | 0.052 | 0.014 | 0.216 | 16
FR4 | 0.248 0.254 0.238 0.252 0.246 | 0.032 | 0.027 | 0.457 | 11 || 0.251 0.254 0.246 0.261 0.250 | 0.029 | 0.037 | 0.563 | 4
FR5 | 0.251 0.262 0.259 0.263 0.261 | 0.006 | 0.049 | 0.892 | 1 || 0.249 0.265 0.265 0.266 0.265 | 0.014 | 0.058 | 0.811 | 1
FR6 | 0.256 0.251 0.238 0.257 0.245 | 0.031 | 0.033 | 0.518 | 9 || 0.251 0.238 0.262 0.242 0.250 | 0.040 | 0.032 | 0.447 | 10
FR7 | 0.248 0.244 0.257 0.257 0.250 | 0.024 | 0.033 | 0.581 | 6 || 0.240 0.249 0.252 0.258 0.250 | 0.034 | 0.032 | 0.484 | 7
FR8 | 0.251 0.249 0.246 0.242 0.248 | 0.033 | 0.023 | 0.410 | 12 || 0.246 0.241 0.252 0.237 0.246 | 0.047 | 0.020 | 0.301 | 14
FR9 | 0.253 0.249 0.246 0.255 0.248 | 0.026 | 0.017 | 0.405 | 13 || 0.262 0.260 0.241 0.248 0.250 | 0.034 | 0.038 | 0.527 | 5
FR10 | 0.256 0.246 0.254 0.250 0.250 | 0.024 | 0.032 | 0.570 | 7 || 0.257 0.260 0.246 0.256 0.253 | 0.026 | 0.040 | 0.609 | 2
FR11 | 0.253 0.257 0.249 0.250 0.253 | 0.021 | 0.034 | 0.616 | 5 || 0.238 0.257 0.249 0.250 0.253 | 0.036 | 0.032 | 0.468 | 9
FR12 | 0.248 0.259 0.251 0.236 0.255 | 0.030 | 0.032 | 0.517 | 10 || 0.251 0.265 0.257 0.237 0.261 | 0.032 | 0.045 | 0.579 | 3
FR13 | 0.243 0.259 0.259 0.244 0.259 | 0.023 | 0.037 | 0.619 | 4 || 0.262 0.246 0.246 0.250 0.246 | 0.036 | 0.033 | 0.475 | 8
FR14 | 0.256 0.251 0.262 0.255 0.257 | 0.014 | 0.042 | 0.756 | 2 || 0.249 0.236 0.252 0.261 0.244 | 0.041 | 0.031 | 0.431 | 12
FR15 | 0.256 0.244 0.254 0.250 0.249 | 0.027 | 0.031 | 0.540 | 8 || 0.249 0.244 0.249 0.240 0.246 | 0.044 | 0.021 | 0.319 | 13
FR16 | 0.251 0.249 0.262 0.252 0.255 | 0.018 | 0.038 | 0.671 | 3 || 0.249 0.246 0.252 0.258 0.249 | 0.032 | 0.032 | 0.504 | 6
V+ | 0.256 0.262 0.262 0.263 0.261 || 0.262 0.265 0.265 0.266 0.265
V− | 0.235 0.238 0.238 0.236 0.240 || 0.238 0.236 0.236 0.237 0.239
DT || RF
FR | CR1 CR2 CR3 CR4 CR5 | Si+ | Si− | Pi | Rank || CR1 CR2 CR3 CR4 CR5 | Si+ | Si− | Pi | Rank
FR1 | 0.243 0.241 0.241 0.238 0.241 | 0.052 | 0.009 | 0.153 | 16 || 0.250 0.238 0.254 0.238 0.246 | 0.046 | 0.021 | 0.311 | 13
FR2 | 0.251 0.263 0.256 0.260 0.260 | 0.013 | 0.046 | 0.783 | 2 || 0.247 0.259 0.257 0.241 0.258 | 0.032 | 0.035 | 0.527 | 5
FR3 | 0.254 0.241 0.254 0.249 0.248 | 0.037 | 0.028 | 0.435 | 8 || 0.239 0.246 0.243 0.249 0.245 | 0.043 | 0.015 | 0.261 | 16
FR4 | 0.243 0.239 0.249 0.244 0.244 | 0.047 | 0.016 | 0.253 | 15 || 0.253 0.240 0.246 0.238 0.243 | 0.047 | 0.017 | 0.263 | 15
FR5 | 0.251 0.263 0.264 0.262 0.264 | 0.008 | 0.053 | 0.867 | 1 || 0.247 0.264 0.262 0.263 0.263 | 0.015 | 0.050 | 0.773 | 1
FR6 | 0.246 0.255 0.254 0.246 0.255 | 0.029 | 0.032 | 0.522 | 7 || 0.245 0.259 0.259 0.255 0.259 | 0.022 | 0.040 | 0.641 | 3
FR7 | 0.248 0.241 0.254 0.252 0.248 | 0.037 | 0.027 | 0.422 | 10 || 0.250 0.243 0.259 0.252 0.251 | 0.031 | 0.030 | 0.492 | 8
FR8 | 0.254 0.236 0.256 0.252 0.246 | 0.040 | 0.030 | 0.432 | 9 || 0.245 0.246 0.246 0.252 0.246 | 0.038 | 0.020 | 0.340 | 12
FR9 | 0.246 0.260 0.249 0.252 0.254 | 0.026 | 0.034 | 0.565 | 3 || 0.250 0.251 0.254 0.249 0.253 | 0.029 | 0.029 | 0.503 | 7
FR10 | 0.257 0.239 0.249 0.246 0.244 | 0.043 | 0.025 | 0.366 | 12 || 0.253 0.246 0.246 0.247 0.246 | 0.038 | 0.021 | 0.353 | 11
FR11 | 0.237 0.252 0.246 0.244 0.249 | 0.040 | 0.022 | 0.347 | 14 || 0.256 0.256 0.241 0.244 0.248 | 0.037 | 0.027 | 0.425 | 9
FR12 | 0.254 0.268 0.241 0.246 0.254 | 0.032 | 0.040 | 0.555 | 5 || 0.250 0.240 0.238 0.252 0.239 | 0.046 | 0.018 | 0.276 | 14
FR13 | 0.257 0.244 0.243 0.254 0.244 | 0.039 | 0.028 | 0.413 | 11 || 0.256 0.264 0.249 0.257 0.256 | 0.019 | 0.042 | 0.685 | 2
FR14 | 0.257 0.244 0.264 0.244 0.254 | 0.034 | 0.038 | 0.534 | 6 || 0.261 0.246 0.243 0.268 0.245 | 0.032 | 0.038 | 0.542 | 4
FR15 | 0.251 0.255 0.243 0.265 0.249 | 0.030 | 0.037 | 0.556 | 4 || 0.256 0.251 0.251 0.249 0.251 | 0.029 | 0.030 | 0.510 | 6
FR16 | 0.251 0.255 0.235 0.244 0.245 | 0.043 | 0.024 | 0.360 | 13 || 0.242 0.251 0.249 0.244 0.250 | 0.039 | 0.021 | 0.353 | 10
V+ | 0.257 0.268 0.264 0.265 0.264 || 0.261 0.264 0.262 0.268 0.263
V− | 0.237 0.236 0.235 0.238 0.241 || 0.239 0.238 0.238 0.238 0.239
Table 8. Results of TOPSIS for Adenoma Cancer Dataset with Ranking of Feature Reduction Sequence using Dataset D4.
k-NN || LR
FR | CR1 CR2 CR3 CR4 CR5 | Si+ | Si− | Pi | Rank || CR1 CR2 CR3 CR4 CR5 | Si+ | Si− | Pi | Rank
FR1 | 0.245 0.242 0.238 0.249 0.240 | 0.043 | 0.014 | 0.239 | 15 || 0.245 0.248 0.246 0.250 0.247 | 0.032 | 0.021 | 0.388 | 13
FR2 | 0.245 0.242 0.259 0.244 0.250 | 0.034 | 0.031 | 0.478 | 9 || 0.245 0.245 0.254 0.248 0.249 | 0.031 | 0.024 | 0.438 | 10
FR3 | 0.250 0.249 0.243 0.260 0.247 | 0.028 | 0.028 | 0.497 | 7 || 0.252 0.261 0.254 0.253 0.257 | 0.015 | 0.040 | 0.728 | 2
FR4 | 0.255 0.247 0.238 0.254 0.243 | 0.036 | 0.023 | 0.386 | 13 || 0.255 0.234 0.256 0.245 0.245 | 0.038 | 0.028 | 0.423 | 12
FR5 | 0.255 0.255 0.262 0.265 0.258 | 0.009 | 0.049 | 0.847 | 1 || 0.255 0.261 0.261 0.263 0.261 | 0.004 | 0.050 | 0.931 | 1
FR6 | 0.255 0.244 0.257 0.246 0.250 | 0.029 | 0.032 | 0.523 | 6 || 0.258 0.253 0.246 0.245 0.249 | 0.029 | 0.031 | 0.513 | 7
FR7 | 0.250 0.247 0.259 0.257 0.253 | 0.021 | 0.038 | 0.646 | 3 || 0.252 0.250 0.251 0.261 0.251 | 0.021 | 0.034 | 0.620 | 4
FR8 | 0.252 0.244 0.241 0.254 0.243 | 0.036 | 0.022 | 0.381 | 14 || 0.247 0.250 0.248 0.248 0.249 | 0.029 | 0.025 | 0.458 | 8
FR9 | 0.250 0.249 0.249 0.257 0.249 | 0.025 | 0.016 | 0.399 | 12 || 0.258 0.234 0.256 0.263 0.245 | 0.034 | 0.036 | 0.521 | 6
FR10 | 0.250 0.252 0.230 0.244 0.241 | 0.045 | 0.014 | 0.238 | 16 || 0.252 0.250 0.235 0.242 0.243 | 0.041 | 0.022 | 0.351 | 16
FR11 | 0.242 0.255 0.243 0.246 0.249 | 0.033 | 0.022 | 0.402 | 11 || 0.250 0.263 0.251 0.250 0.257 | 0.019 | 0.039 | 0.673 | 3
FR12 | 0.250 0.242 0.251 0.246 0.246 | 0.034 | 0.025 | 0.417 | 10 || 0.242 0.253 0.243 0.245 0.248 | 0.035 | 0.022 | 0.383 | 14
FR13 | 0.245 0.260 0.262 0.238 0.261 | 0.029 | 0.042 | 0.594 | 5 || 0.252 0.245 0.251 0.242 0.248 | 0.033 | 0.025 | 0.432 | 11
FR14 | 0.250 0.263 0.262 0.241 0.262 | 0.025 | 0.045 | 0.644 | 4 || 0.245 0.253 0.240 0.245 0.247 | 0.036 | 0.021 | 0.373 | 15
FR15 | 0.252 0.257 0.254 0.252 0.256 | 0.018 | 0.037 | 0.672 | 2 || 0.237 0.250 0.251 0.250 0.251 | 0.032 | 0.025 | 0.442 | 9
FR16 | 0.255 0.252 0.249 0.244 0.250 | 0.030 | 0.027 | 0.480 | 8 || 0.255 0.248 0.256 0.248 0.252 | 0.025 | 0.033 | 0.567 | 5
V+ | 0.255 0.263 0.262 0.265 0.262 || 0.258 0.263 0.261 0.263 0.261
V− | 0.242 0.242 0.230 0.238 0.240 || 0.237 0.234 0.235 0.242 0.243
DT || RF
FR | CR1 CR2 CR3 CR4 CR5 | Si+ | Si− | Pi | Rank || CR1 CR2 CR3 CR4 CR5 | Si+ | Si− | Pi | Rank
FR1 | 0.246 0.241 0.237 0.236 0.239 | 0.053 | 0.008 | 0.132 | 16 || 0.243 0.254 0.245 0.240 0.249 | 0.043 | 0.020 | 0.316 | 12
FR2 | 0.249 0.251 0.232 0.252 0.242 | 0.044 | 0.022 | 0.337 | 14 || 0.246 0.241 0.237 0.251 0.239 | 0.054 | 0.016 | 0.230 | 15
FR3 | 0.246 0.246 0.251 0.246 0.248 | 0.035 | 0.025 | 0.416 | 10 || 0.243 0.262 0.258 0.248 0.260 | 0.025 | 0.039 | 0.611 | 2
FR4 | 0.254 0.238 0.245 0.244 0.242 | 0.045 | 0.021 | 0.316 | 15 || 0.252 0.243 0.253 0.235 0.248 | 0.047 | 0.020 | 0.303 | 14
FR5 | 0.249 0.267 0.261 0.260 0.264 | 0.011 | 0.055 | 0.831 | 1 || 0.252 0.270 0.266 0.259 0.268 | 0.008 | 0.057 | 0.882 | 1
FR6 | 0.251 0.251 0.253 0.249 0.252 | 0.026 | 0.033 | 0.557 | 6 || 0.257 0.249 0.250 0.251 0.249 | 0.035 | 0.028 | 0.444 | 8
FR7 | 0.251 0.259 0.264 0.260 0.262 | 0.012 | 0.052 | 0.815 | 2 || 0.254 0.249 0.255 0.251 0.252 | 0.032 | 0.031 | 0.492 | 4
FR8 | 0.249 0.251 0.240 0.249 0.246 | 0.037 | 0.023 | 0.380 | 13 || 0.249 0.246 0.239 0.243 0.243 | 0.050 | 0.012 | 0.195 | 16
FR9 | 0.254 0.251 0.251 0.254 0.251 | 0.026 | 0.034 | 0.572 | 4 || 0.254 0.257 0.247 0.259 0.252 | 0.029 | 0.035 | 0.551 | 3
FR10 | 0.259 0.251 0.253 0.246 0.252 | 0.026 | 0.036 | 0.576 | 3 || 0.252 0.241 0.250 0.251 0.245 | 0.043 | 0.023 | 0.349 | 11
FR11 | 0.251 0.246 0.243 0.254 0.244 | 0.037 | 0.026 | 0.408 | 12 || 0.243 0.246 0.242 0.253 0.244 | 0.045 | 0.021 | 0.314 | 13
FR12 | 0.241 0.246 0.245 0.257 0.246 | 0.039 | 0.027 | 0.410 | 11 || 0.246 0.254 0.253 0.248 0.253 | 0.032 | 0.029 | 0.474 | 6
FR13 | 0.249 0.246 0.256 0.260 0.251 | 0.029 | 0.037 | 0.567 | 5 || 0.249 0.251 0.250 0.240 0.251 | 0.039 | 0.022 | 0.362 | 10
FR14 | 0.246 0.249 0.259 0.246 0.254 | 0.029 | 0.034 | 0.539 | 8 || 0.254 0.241 0.253 0.264 0.247 | 0.039 | 0.036 | 0.477 | 5
FR15 | 0.251 0.251 0.256 0.244 0.254 | 0.027 | 0.034 | 0.551 | 7 || 0.254 0.246 0.253 0.253 0.249 | 0.035 | 0.029 | 0.454 | 7
FR16 | 0.254 0.254 0.251 0.244 0.252 | 0.028 | 0.032 | 0.534 | 9 || 0.249 0.249 0.250 0.253 0.249 | 0.035 | 0.027 | 0.433 | 9
V+ | 0.259 0.267 0.264 0.260 0.264 || 0.257 0.270 0.266 0.264 0.268
V− | 0.241 0.238 0.232 0.236 0.239 || 0.243 0.241 0.237 0.235 0.239
Table 9. Ranking and Selection of Top Three FR.
Dataset | Classifier | Rank 1 | Rank 2 | Rank 3
D1 | k-NN | FR5 | FR14 | FR10
D1 | LR | FR12 | FR5 | FR9
D1 | DT | FR5 | FR13 | FR14
D1 | RF | FR5 | FR13 | FR14
D2 | k-NN | FR5 | FR14 | FR16
D2 | LR | FR5 | FR12 | FR11
D2 | DT | FR5 | FR11 | FR15
D2 | RF | FR5 | FR7 | FR15
D3 | k-NN | FR5 | FR14 | FR16
D3 | LR | FR5 | FR10 | FR12
D3 | DT | FR5 | FR2 | FR9
D3 | RF | FR5 | FR13 | FR6
D4 | k-NN | FR5 | FR15 | FR7
D4 | LR | FR5 | FR3 | FR11
D4 | DT | FR5 | FR7 | FR10
D4 | RF | FR5 | FR3 | FR9
Table 10. Occurrence of Few Top-Ranking FR.
No. of Occurrences of Feature Reductions w.r.t. Rank
Feature Reduction | Rank 1 | Rank 2 | Rank 3
FR5 | 15 | 1 | –
FR14 | – | 3 | 2
FR10 | – | 1 | 2
FR11 | – | 1 | 2
FR15 | – | 1 | 1
FR7 | – | 2 | 1
FR12 | 1 | 1 | 1
FR9 | – | – | 3
FR13 | – | 3 | –
FR16 | – | – | 2
FR3 | – | 2 | –
FR6 | – | – | 1
FR2 | – | 1 | –
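Table 10 is obtained by counting, over the sixteen dataset–classifier pairs of Table 9, how often each feature-reduction combination appears at Rank 1, Rank 2, and Rank 3. A small illustrative tally (the dictionary below is a hypothetical helper filled from Table 9, not code from the paper) is:

# Tally of top-3 appearances per feature-reduction combination, derived from Table 9.
from collections import Counter

top3 = {
    ("D1", "k-NN"): ["FR5", "FR14", "FR10"], ("D1", "LR"): ["FR12", "FR5", "FR9"],
    ("D1", "DT"): ["FR5", "FR13", "FR14"],   ("D1", "RF"): ["FR5", "FR13", "FR14"],
    # ... remaining dataset/classifier pairs from Table 9
}

# counts[r] maps each FR to the number of times it holds rank r+1.
counts = {r: Counter(v[r] for v in top3.values()) for r in range(3)}
for fr in sorted({fr for v in top3.values() for fr in v}):
    print(fr, [counts[r].get(fr, 0) for r in range(3)])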
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
