Applied Sciences
  • Article
  • Open Access

21 April 2022

Red Fox Optimizer with Data-Science-Enabled Microarray Gene Expression Classification Model

1 Department of Computer Science, College of Computer Engineering and Sciences, Prince Sattam Bin Abdulaziz University, Al Kharj 11942, Saudi Arabia
2 Department of Computer Science, College of Computer Science and Engineering, Taibah University, Madinah 42353, Saudi Arabia
3 Centre for Advanced Data Science, Vellore Institute of Technology, Chennai 600127, India
4 Department of Electrical Engineering, University of Sharjah, Sharjah P.O. Box 27272, United Arab Emirates
This article belongs to the Special Issue Integrated Artificial Intelligence in Data Science

Abstract

Microarray data analysis is a relatively new technology that aims to determine the proper treatment for various diseases and a precise medical diagnosis by analyzing a massive number of genes under various experimental conditions. Conventional data classification techniques suffer from overfitting and the high dimensionality of gene expression data. Therefore, the feature (gene) selection approach plays a vital role in handling this high dimensionality. Data science concepts can be widely employed in several data classification problems to identify different class labels. In this aspect, we developed a novel red fox optimizer with deep-learning-enabled microarray gene expression classification (RFODL-MGEC) model. The presented RFODL-MGEC model aims to improve classification performance by selecting appropriate features. The RFODL-MGEC model uses a novel red fox optimizer (RFO)-based feature selection approach for deriving an optimal subset of features. Moreover, the RFODL-MGEC model involves a bidirectional cascaded deep neural network (BCDNN) for data classification. The parameters involved in the BCDNN technique were tuned using the chaos game optimization (CGO) algorithm. Comprehensive experiments on benchmark datasets indicated that the RFODL-MGEC model accomplished superior results for subtype classification. Therefore, the RFODL-MGEC model was found to be effective for the identification of various classes in high-dimensional, small-sample microarray data.

1. Introduction

DNA microarray technology makes it simpler to monitor a huge number of genes simultaneously [1]. Earlier works indicated that DNA microarray technology could be useful in the classification of cancer [2]. Several techniques and methods with satisfactory outcomes have been introduced for classifying microarray gene expression [3]. In a microarray dataset, the gene expression values are organized as a matrix, where rows are samples and columns are genes (features). Each gene expression value is a real number that defines the expression level of a gene under certain criteria [4]. Because gene expression data contain a limited number of samples with an enormous number of features, conventional machine learning (ML) techniques do not work well for cancer classification [5].
A microarray experiment produces many gene expression values for an individual sample. The ratio of the number of genes (features) to the number of patients (samples) is heavily skewed, leading to the well-known curse-of-dimensionality problem [6]. Furthermore, this imposes limitations on modeling methods: (i) processing all the information may not be possible, and (ii) processing a subset of the data might lead to overfitting, local maxima, and loss of information. These two problems affect the reliability and accuracy of machine learning techniques. Several studies have been conducted to identify an effective feature set [7]. Statistical and evolutionary approaches have been introduced for this purpose. Feature subset selection (FSS) methods such as joint mutual information (JMI), joint mutual information maximization (JMIM), and minimum redundancy maximum relevance (mRMR) are among the main statistical methods [8].
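As an illustration of this statistical FSS family, an mRMR-style greedy selector can be sketched in pure Python. This is a minimal sketch, not the implementation from the cited works: it assumes expression values have already been discretized, and the function and variable names are ours.

```python
import math
from collections import Counter

def mutual_information(x, y):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    mi = 0.0
    for (a, b), c in pxy.items():
        # p(a,b) * log( p(a,b) / (p(a) p(b)) ), written with raw counts
        mi += (c / n) * math.log(c * n / (px[a] * py[b]))
    return mi

def mrmr_select(features, labels, k):
    """Greedy mRMR: maximize relevance to the labels minus mean
    redundancy with the genes already chosen.

    `features` maps gene name -> discretized expression list."""
    relevance = {g: mutual_information(v, labels) for g, v in features.items()}
    selected = [max(relevance, key=relevance.get)]
    while len(selected) < k:
        def score(g):
            redundancy = sum(mutual_information(features[g], features[s])
                             for s in selected) / len(selected)
            return relevance[g] - redundancy
        remaining = [g for g in features if g not in selected]
        selected.append(max(remaining, key=score))
    return selected
```

JMI and JMIM follow the same greedy pattern but score candidates with joint rather than pairwise mutual information.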
Literature reviews show that recent innovative technologies such as genetic algorithms (GA), mining techniques, transfer learning, deep neural networks (DNN), particle swarm optimization (PSO), and so on, generate precise results [9]. The classification of microarray data is generally performed in two stages. Feature selection (FS) focuses on choosing the most important characteristics from a large dataset to decrease computational overhead, overfitting, and noise. Classifier training then constructs a model on the selected features to accurately categorize a microarray sample. Innovative technologies such as convolutional neural networks (CNN), image processing, ant miner, transfer learning, and experimental methods were introduced in a previous study [10]. Even though innovative technologies for FS and classifier training can produce higher accuracy, they should be tuned on the underlying dataset in a controlled setup to accomplish better outcomes.
We developed a novel red fox optimizer with deep-learning-enabled microarray gene expression classification (RFODL-MGEC) model. The presented RFODL-MGEC model uses a novel RFO-based feature selection (FS) approach to derive an optimum subset of features. Moreover, the proposed RFODL-MGEC model involves a bidirectional cascaded deep neural network (BCDNN) for data classification. The parameters involved in the BCDNN method were optimally tuned using a chaos game optimization (CGO) algorithm.

3. The Proposed Model

This study proposes a novel RFODL-MGEC model for microarray gene expression classification. The presented RFODL-MGEC model primarily employed an RFO-FS approach for deriving the optimum subset of features. Next, the BCDNN model was utilized for data classification, and the parameters involving the BCDNN technique were optimally tuned using a CGO algorithm. Figure 1 demonstrates the overall block diagram of our proposed RFODL-MGEC technique.
Figure 1. Block diagram of RFODL-MGEC technique.

3.1. Data Preprocessing

In the initial phase, the z-score normalization approach was applied, which computes the standard deviation and arithmetic mean of the provided gene data. This normalization approach performs effectively when prior knowledge about the average score and score variation is available. The normalized scores were obtained using the following:
s′_k = (s_k − μ) / σ
where σ implies the standard deviation and μ indicates the arithmetic mean of the provided data. In this study, the normalization of the smoothed data was carried out via z-score normalization.
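The normalization step above can be sketched directly with the standard library; the function name is ours:

```python
from statistics import mean, pstdev

def z_score_normalize(values):
    """Shift a list of gene expression values to zero mean and unit variance."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]
```

Applied to each gene column of the expression matrix, this removes scale differences between genes before feature selection.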

3.2. Design of RFO-Based Feature Selection Approach

During the feature selection process, the RFO-FS model was executed to choose the optimum set of features. The RFO is a recently proposed metaheuristic based on the hunting behavior of red foxes. Initially, red foxes seek food across territories [18]; this is modeled as the exploration term for global search. Next, they move through the territory to get close to their prey before attacking; this stage is modeled as the exploitation term for local search. The process is initiated with a fixed number of random candidates, each determining a point x̄ = (x_0, x_1, …, x_{n−1}), where n is the number of coordinates. To discriminate each fox in iteration t, where i indicates the fox number in the population and j the coordinate, we introduce the notation (x̄_j^(i))^t. The criterion function f of n variables depends on the dimension of the search space, and (x̄)^(i) = [(x̄_0)^(i), (x̄_1)^(i), …, (x̄_{n−1})^(i)] indicates a point in the space [a, b]^n with a, b ∈ ℝ. A point (x̄)^(i) is the optimum solution when f((x̄)^(i)) attains the global optimum on [a, b]^n. The candidates are first sorted according to the fitness criterion, and for the best candidate (x̄^best)^t, the Euclidean distance to each candidate is estimated as follows:
d(((x̄)^(i))^t, ((x̄)^best)^t) = ‖((x̄)^(i))^t − ((x̄)^best)^t‖
and each candidate moves towards the optimal individual as:
((x̄)^(i))^t = ((x̄)^(i))^t + α · sgn(((x̄)^best)^t − ((x̄)^(i))^t)
where α is a random number drawn from (0, d(((x̄)^(i))^t, ((x̄)^best)^t)).
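A minimal sketch of this global-search move, assuming each fox is a plain list of coordinates (the function name and representation are illustrative, not from the original RFO implementation):

```python
import math
import random

def rfo_global_move(fox, best):
    """One exploration step: move a fox toward the current best candidate.

    The step scale alpha is drawn uniformly from (0, d(fox, best)),
    matching the equations above."""
    d = math.dist(fox, best)                 # Euclidean distance to the best fox
    alpha = random.uniform(0.0, d)
    sign = lambda v: (v > 0) - (v < 0)
    return [x + alpha * sign(b - x) for x, b in zip(fox, best)]
```

For feature selection, each coordinate is typically thresholded afterwards to a 0/1 gene-inclusion decision.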
In the RFO approach, the local search phase simulates how a fox moves and observes to delude its prey while hunting. To simulate the probability of a fox approaching the prey, a random number γ ∈ [0, 1] is drawn in each iteration for each candidate:
{ move closer if γ > 3/4
  stay and hide if γ ≤ 3/4 }
Figure 2 depicts the steps involved in RFO.
Figure 2. Steps involved in RFO technique.
The observation radius r depends on a, a random number within 0 and 0.2, and φ_0, a random number within 0 and 2π which defines the fox observation angle:
r = { a · sin(φ_0)/φ_0 if φ_0 ≠ 0
      β if φ_0 = 0 }
where β represents a random number within 0 and 1. The approaching movement of the fox is modeled as follows:
x_0^new = a·r·cos(φ_1) + x_0^actual
x_1^new = a·r·sin(φ_1) + a·r·cos(φ_2) + x_1^actual
x_2^new = a·r·sin(φ_1) + a·r·sin(φ_2) + a·r·cos(φ_3) + x_2^actual
⋮
x_{n−2}^new = a·r·Σ_{k=1}^{n−2} sin(φ_k) + a·r·cos(φ_{n−1}) + x_{n−2}^actual
x_{n−1}^new = a·r·sin(φ_1) + a·r·sin(φ_2) + … + a·r·sin(φ_{n−1}) + x_{n−1}^actual
Five percent of the worst candidates are removed and replaced with new candidates. Likewise, the two best individuals, (X^(1))^t and (X^(2))^t, are selected as the alpha couple in iteration t. The habitat center is expressed mathematically as:
H_c^t = ((X^(1))^t + (X^(2))^t) / 2
Moreover, the habitat diameter is accomplished using the Euclidean distance, as in Equation (8):
H_d^t = √(‖(X^(1))^t − (X^(2))^t‖)
A random number θ ∈ [0, 1] determines how each replacement candidate is generated:
{ new nomadic candidate if θ > 0.45
  reproduction of the alpha couple if θ ≤ 0.45 }
In the latter case, the new candidate is produced by the alpha couple as:
(X^rep)^t = (θ/2) · ((X^(1))^t + (X^(2))^t)
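The habitat and reproduction rules above can be sketched as follows. This is a simplified illustration of the replacement step only, not the complete RFO, and the function names are ours:

```python
import math
import random

def rfo_replace_worst(population, fitness):
    """Replace the worst 5% of foxes either with nomads placed near the
    habitat or with offspring of the alpha couple (theta rule above).

    `population` is a list of coordinate lists; lower fitness is better."""
    ranked = sorted(population, key=fitness)
    alpha1, alpha2 = ranked[0], ranked[1]                  # the alpha couple
    center = [(a + b) / 2 for a, b in zip(alpha1, alpha2)] # habitat center
    diameter = math.sqrt(math.dist(alpha1, alpha2))        # habitat diameter
    n_worst = max(1, int(0.05 * len(population)))
    survivors = ranked[:len(population) - n_worst]
    for _ in range(n_worst):
        theta = random.random()
        if theta > 0.45:    # new nomadic candidate around the habitat
            new = [c + random.uniform(-1, 1) * diameter for c in center]
        else:               # reproduction of the alpha couple
            new = [theta / 2 * (a + b) for a, b in zip(alpha1, alpha2)]
        survivors.append(new)
    return survivors
```

In the RFO-FS setting, `fitness` would combine classification error with the size of the selected gene subset.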

3.3. Process Involved in BCDNN-Based Classification

The BCDNN model was developed for microarray gene expression classification [19]. The DNN is separated into an encoder, a decoder, a translator, and a simulator. Let T represent the amplitude and phase response obtained from the finite-difference time-domain (FDTD) methodology and T̂ the forecast from the simulator. Once the module is trained, the simulator predicts T̂ for an input meta-atom structure image far more rapidly than its numerical counterpart. For backward calculations, T with dimensions of 82 × 1 is converted to an image with dimensions of 40 × 40, which means there are fewer input parameters than output parameters for the regression process. This large divergence makes it problematic for a network to generalize and converge well, particularly when the input spectra vary strongly near the resonant frequency. The authors of the aforementioned study avoided this problem by including a generative adversarial network or a bilinear tensor layer. Initially, every meta-atom is characterized by a lower-dimensional eigenvector with dimensions of 82 × 1 through a pretrained autoencoder. The size of each tensor throughout the network is indicated below the corresponding blocks. Different layers of the CNN are interconnected by convolution operations: the kernel multiplies the values of the tensor within the kernel region and then sums them into a new tensor value. Two fully connected (FC) layers (dimensions given below) were attached to the CNN to estimate a spectral tensor. A leaky ReLU with α = 0.2 was employed for all convolution layers, and tanh for all FC layers. Each convolution layer maps the input tensor x_k to the output tensor x_{k+1}:
x_{k+1} = LeakyReLU[CONV_k(x_k)]
where LeakyReLU(·) represents the rectified linear unit activation and CONV_k denotes the convolutional operator (including bias terms). The subscript indicates the number of kernels per layer; in the simulator, these are 32, 32, 64, 64, 128, and 128. Strides of two are employed in the second, fourth, and sixth convolutions to replace max-pooling layers. A dropout layer with a rate of 0.1 is applied after all the FC layers except the output layer to prevent overfitting. The mean absolute error (MAE) was adopted for calculating the weights and gradients. MAE is determined by:
MAE = Σ_i |T_i^predicted − T_i^simulated| / N
where N indicates the number of entries of T^predicted. As a cost function, MAE is insensitive to outliers; however, it is unconducive to convergence. To guarantee module stability, the learning rate declines with the number of iterations.
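The activation and loss described above can be written out directly. This is a minimal sketch of the two formulas only (the function names are ours); the negative slope 0.2 is the α quoted in the text:

```python
def leaky_relu(x, alpha=0.2):
    """Leaky ReLU used after each convolution layer: identity for positive
    inputs, a small slope alpha for negative inputs."""
    return x if x > 0 else alpha * x

def mae(predicted, simulated):
    """Mean absolute error between predicted and FDTD-simulated spectra."""
    assert len(predicted) == len(simulated)
    return sum(abs(p - s) for p, s in zip(predicted, simulated)) / len(predicted)
```

In training, `mae` plays the role of the cost function whose gradient updates the network weights.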

3.4. Parameter Optimization Using CGO Algorithm

In order to optimally tune the parameters involved in the BCDNN method, the CGO approach was employed [20]. The CGO approach was designed based on principles of chaos theory; key concepts from fractals and chaos games were utilized to formulate its mathematical model. The CGO approach considers a number of solution candidates S, each representing an eligible seed inside a Sierpinski triangle. The mathematical form of this population is as follows:
S = [S_1, S_2, …, S_n]^T, with S_i = [S_i^1, S_i^2, …, S_i^d], i = 1, 2, …, n, j = 1, 2, …, d
where n signifies the number of eligible seeds (candidate solutions) inside the Sierpinski triangle (search space), and d defines the dimension of each seed. The initial positions of these eligible seeds are determined randomly within the search space as:
S_i^j(0) = S_{i,min}^j + R · (S_{i,max}^j − S_{i,min}^j)
where R implies a random number in the interval of zero and one. The scheme for the first seed is represented as:
Seed_i^1 = S_i + x_i · (y_i · GlobalBest − z_i · MeanValue)
where x_i, y_i, and z_i are random integers equal to 0 or 1, representing the possibilities of rolling a die. The scheme for the second seed is defined as:
Seed_i^2 = GlobalBest + x_i · (y_i · S_i − z_i · MeanValue)
The schemes for the third and fourth seeds are described as:
Seed_i^3 = MeanValue + x_i · (y_i · S_i − z_i · GlobalBest)
Seed_i^4 = S_i with S_i^k = S_i^k + Rand
in which k signifies a randomly chosen dimension index and Rand a random number in the interval of zero and one. In the CGO approach, different formulations are presented for x_i, which controls the displacement of the seeds:
x_i = { rand; 2 · rand; (Ψ · rand) + 1; (Ω · rand) + (1 − Ω) }
where rand implies a uniformly distributed random number in the interval of zero and one, and Ψ and Ω are random integers equal to 0 or 1. For selecting better parameters of the BCDNN technique, the CGO method is applied with a fitness function representing the classification performance. During this process, the error rate is used as the fitness function, and the solution with the lowest error is regarded as the optimum one. It can be defined as:
fitness(x_i) = ClassifierErrorRate(x_i) = (number of misclassified samples / total number of samples) × 100
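The four seed-generation equations can be sketched per candidate as follows. This is a simplified illustration with names of our own choosing; as a simplification, the die rolls x, y, z are drawn once per call rather than independently for each seed:

```python
import random

def cgo_seeds(s_i, global_best, mean_value):
    """Produce the four CGO seeds for one candidate solution S_i.

    All arguments are coordinate lists of equal length d."""
    x, y, z = (random.randint(0, 1) for _ in range(3))  # 0/1 "die rolls"
    seed1 = [s + x * (y * g - z * m)
             for s, g, m in zip(s_i, global_best, mean_value)]
    seed2 = [g + x * (y * s - z * m)
             for s, g, m in zip(s_i, global_best, mean_value)]
    seed3 = [m + x * (y * s - z * g)
             for s, g, m in zip(s_i, global_best, mean_value)]
    seed4 = list(s_i)                    # fourth seed: perturb one dimension
    k = random.randrange(len(s_i))
    seed4[k] += random.random()
    return seed1, seed2, seed3, seed4
```

Each iteration, the seeds are evaluated with the fitness function above (the BCDNN classification error rate), and better seeds replace their parents.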

4. Experimental Validation

The performance validation of the RFODL-MGEC model was tested using three benchmark datasets [21], namely, prostate cancer, colon tumor, and ovarian cancer datasets. The details related to the datasets are provided in Table 1. The proposed model selected a set of 6145, 984, and 8424 features for prostate, colon, and ovarian cancer datasets, respectively.
Table 1. Dataset details.

4.1. Resulting Analysis of RFODL-MGEC Technique on Prostate Cancer Dataset

Figure 3 illustrates a set of confusion matrices generated by the RFODL-MGEC model on the test prostate cancer dataset. For the entire dataset, the RFODL-MGEC model categorized 47 images as tumor and 49 images as normal. Similarly, for 70% of the training dataset, the RFODL-MGEC model categorized 32 images as tumor and 34 images as normal. In addition, for 30% of the testing dataset, the RFODL-MGEC model categorized 15 images as tumor and 15 images as normal.
Figure 3. Confusion matrices of RFODL-MGEC technique for prostate cancer dataset. (a) Entire dataset, (b) 70% of training dataset, and (c) 30% of testing dataset.
Table 2 shows a brief classification performance report for the RFODL-MGEC model on the prostate cancer dataset. The experimental results indicated that the RFODL-MGEC model demonstrated effective results on the test dataset. For instance, with the entire dataset, the RFODL-MGEC model obtained an average a c c u y , p r e c n , r e c a l , and F s c o r e of 94.12%, 94.33%, 94.19%, and 94.12%, respectively. Moreover, with 70% of the training dataset, the RFODL-MGEC technique obtained an average a c c u y , p r e c n , r e c a l , and F s c o r e of 92.96%, 93.22%, 93.02%, and 92.95%, respectively. With 30% of the testing dataset, the RFODL-MGEC system obtained an average a c c u y , p r e c n , r e c a l , and F s c o r e of 96.77%, 96.88%, 96.88%, and 96.77%, respectively.
Table 2. Resulting analysis of RFODL-MGEC technique with various measures on prostate cancer dataset.
Figure 4 illustrates the training and validation accuracy inspection of the RFODL-MGEC model with the prostate cancer dataset. Figure 4 conveys that the RFODL-MGEC model offered maximum training/validation accuracy for the classification process.
Figure 4. Accuracy analysis of RFODL-MGEC technique on prostate cancer dataset.
Figure 5 exemplifies the training and validation loss inspection of the RFODL-MGEC model with the prostate cancer dataset. Figure 5 shows that the RFODL-MGEC model offered reduced training/validation loss for the classification process of the test data.
Figure 5. Loss analysis of RFODL-MGEC technique on prostate cancer dataset.

4.2. Resulting Analysis of RFODL-MGEC Technique on Colon Tumor Dataset

Figure 6 demonstrates a set of confusion matrices generated by the RFODL-MGEC model for the test colon tumor dataset. For the entire dataset, the RFODL-MGEC technique categorized 38 images as negative and 21 images as positive. Likewise, for 70% of the training dataset, the RFODL-MGEC approach categorized 27 images as negative and 14 images as positive. Furthermore, with 30% of the testing dataset, the RFODL-MGEC model categorized 11 images as negative and 7 images as positive.
Figure 6. Confusion matrices of RFODL-MGEC technique on colon tumor dataset. (a) Entire dataset, (b) 70% of training dataset, and (c) 30% of testing dataset.
Table 3 demonstrates a brief classification performance report on the RFODL-MGEC model with the colon tumor dataset. The experimental results indicated that the RFODL-MGEC model demonstrated effective results with the test dataset. For instance, with the entire dataset, the RFODL-MGEC model obtained an average a c c u y , p r e c n , r e c a l , and F s c o r e of 95.16%, 94.37%, 95.23%, and 94.77%, respectively. With 70% of the training dataset, the RFODL-MGEC method attained an average a c c u y , p r e c n , r e c a l , and F s c o r e of 95.35%, 93.75%, 96.55%, and 94.88%, respectively. Additionally, with 30% of the testing dataset, the RFODL-MGEC algorithm obtained an average a c c u y , p r e c n , r e c a l , and F s c o r e of 94.74%, 95.83%, 93.75%, and 94.49%, respectively.
Table 3. Resulting analysis of RFODL-MGEC technique with various measures on colon tumor dataset.
Figure 7 demonstrates the training and validation accuracy inspection of the RFODL-MGEC model on the colon tumor dataset. The figure conveys that the RFODL-MGEC technique offered maximal training/validation accuracy for the classification process.
Figure 7. Accuracy analysis of RFODL-MGEC technique on colon tumor dataset.
Figure 8 illustrates the training and validation loss inspection of the RFODL-MGEC model on the colon tumor dataset. The figure shows that the RFODL-MGEC approach offered lower training/validation loss for the classification process of the test data.
Figure 8. Loss analysis of RFODL-MGEC technique on colon tumor dataset.

4.3. Resulting Analysis of RFODL-MGEC Technique on Ovarian Cancer Dataset

Figure 9 illustrates a set of confusion matrices generated by the RFODL-MGEC algorithm on the test ovarian cancer dataset. For the entire dataset, the RFODL-MGEC technique categorized 159 images as ovarian and 87 images as normal. With 70% of the training dataset, the RFODL-MGEC algorithm categorized 102 images as ovarian and 69 images as normal. For 30% of the testing dataset, the RFODL-MGEC technique categorized 57 images as ovarian and 18 images as normal.
Figure 9. Confusion matrices of RFODL-MGEC technique on ovarian cancer dataset. (a) Entire dataset, (b) 70% of training dataset, and (c) 30% of testing dataset.
Table 4 shows a brief classification performance report on the RFODL-MGEC technique with the ovarian cancer dataset. The experimental results indicated that the RFODL-MGEC technique demonstrated effective results on the test dataset. For instance, with the entire dataset, the RFODL-MGEC system obtained an average a c c u y , p r e c n , r e c a l , and F s c o r e of 97.23%, 97.11%, 96.88%, and 96.99%, respectively. With 70% of the training dataset, the RFODL-MGEC algorithm obtained an average a c c u y , p r e c n , r e c a l , and F s c o r e of 96.61%, 96.49%, 96.49%, and 96.49%, respectively. Eventually, with 30% of the testing dataset, the RFODL-MGEC algorithm obtained an average a c c u y , p r e c n , r e c a l , and F s c o r e of 98.68%, 99.14%, 97.37%, and 98.21%, respectively.
Table 4. Resulting analysis of RFODL-MGEC technique with various measures on ovarian cancer dataset.
Figure 10 illustrates the training and validation accuracy inspection of the RFODL-MGEC algorithm with the ovarian cancer dataset. The figure conveys that the RFODL-MGEC technique offered maximum training/validation accuracy for the classification process.
Figure 10. Accuracy analysis of RFODL-MGEC technique with ovarian cancer dataset.
Figure 11 exemplifies the training and validation loss inspection of the RFODL-MGEC technique on the ovarian cancer dataset. The figure shows that the RFODL-MGEC system offered reduced training/validation loss for the classification process of the test data.
Figure 11. Loss analysis of RFODL-MGEC technique on ovarian cancer dataset.

4.4. Discussion

A detailed comparative examination of the RFODL-MGEC model with recent approaches [15] for prostate cancer is provided in Table 5 and Figure 12. The experimental outcomes indicated that the FFSDL and ESADL models reached lower classification outcomes than other approaches. At the same time, the SVM and RF models accomplished slightly enhanced classification outcomes compared with the FFSDL and ESADL models. Along with that, the ABC-SVM and PSO-SVM models accomplished closer classification performances, with an a c c u y of 96.06% and 93.71%, respectively.
Table 5. Comparative analysis of RFODL-MGEC technique with recent algorithms for prostate cancer dataset.
Figure 12. Comparative analysis of RFODL-MGEC technique with prostate cancer dataset.
The proposed RFODL-MGEC model resulted in maximum classification efficiency, with an a c c u y , p r e c n , and r e c a l of 96.77%, 96.88%, and 96.88% respectively.
A brief comparative examination of the RFODL-MGEC approach with recent approaches for colon tumors is given in Table 6 and Figure 13. The experimental outcomes indicated that the FFSDL and ESADL approaches reached lower classification outcomes than the other approaches. Likewise, the SVM and RF approaches accomplished somewhat enhanced classification outcomes compared with the FFSDL and ESADL approaches.
Table 6. Comparative analysis of RFODL-MGEC technique with recent algorithms for colon tumor dataset.
Figure 13. Comparative analysis of RFODL-MGEC technique with colon tumor dataset.
Along with that, the ABC-SVM and PSO-SVM models accomplished closer classification performances, with an a c c u y of 93.94% and 93.80%, respectively. Finally, the RFODL-MGEC model resulted in higher classification efficiency with an a c c u y , p r e c n , and r e c a l of 94.74%, 95.83%, and 93.75% respectively.
A detailed comparative examination of the RFODL-MGEC algorithm with recent approaches for ovarian cancer is given in Table 7 and Figure 14. The experimental outcomes indicated that the FFSDL and ESADL methods reached lower classification outcomes than the other approaches.
Table 7. Comparative analysis of RFODL-MGEC technique with recent algorithms for ovarian cancer dataset.
Figure 14. Comparative analysis of RFODL-MGEC technique with ovarian cancer dataset.
The SVM and RF models accomplished somewhat enhanced classification outcomes compared with the FFSDL and ESADL models. This was followed by the ABC-SVM and PSO-SVM techniques, which accomplished closer classification performances with an a c c u y of 95.42% and 95.81%, respectively. Finally, the RFODL-MGEC methodology resulted in maximum classification efficiency, with an a c c u y , p r e c n , and r e c a l of 98.68%, 99.11%, and 97.37%, respectively.
Finally, a computation time (CT) examination of the RFODL-MGEC technique with recent models for the three distinct datasets is provided in Table 8. The experimental results indicated that the RFODL-MGEC technique showed a lower CT compared with the other methods. The proposed RFODL-MGEC technique required a lower CT of 1.231, 0.432, and 1.542 s with the test prostate cancer, colon tumor, and ovarian cancer datasets, respectively.
Table 8. Comparative CT analysis of RFODL-MGEC technique with recent algorithms.
After examining the aforementioned tables and figures, we noted that the RFODL-MGEC model was able to maximize classification performance compared with the other methods.

5. Conclusions

In this study, a novel RFODL-MGEC model was established for microarray gene expression classification. The presented RFODL-MGEC model primarily employed an RFO-FS technique for deriving an optimal subset of features. Next, the BCDNN model was utilized for data classification, and the parameters involved in the BCDNN technique were optimally tuned by utilizing a CGO algorithm. Comprehensive experiments on benchmark datasets showed that the RFODL-MGEC model accomplished superior results for subtype classifications. Therefore, the RFODL-MGEC model was found to be effective for the identification of different classes for high-dimensional and small-scale microarray data. Future directions involve the use of data clustering and feature reduction approaches to enhance classification performance. The proposed model should be tested on large-scale datasets.

Author Contributions

Conceptualization, T.V. and H.A.; methodology, T.V. and L.; software and investigation, T.V. and H.A.; validation, L., E.A. and S.S.; data curation, H.A.; writing—T.V., L. and S.S.; review and editing, H.A., E.A. and A.H.; funding acquisition, H.A. and A.H. All authors have read and agreed to the published version of the manuscript.

Funding

This work is funded by Prince Sattam bin Abdulaziz University, KSA under grant number: 2020/01/1174.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

Data sharing is not applicable to this article as no datasets were generated during this study.

Acknowledgments

The authors would like to thank Prince Sattam Bin Abdulaziz University for providing technical support during this research work. This project was supported by the Deanship of Scientific Research at Prince Sattam Bin Abdulaziz University (project no. 2020/01/1174).

Conflicts of Interest

The authors declare that they have no conflict of interest.

References

  1. Ahmed, O.; Brifcani, A. Gene expression classification based on deep learning. In Proceedings of the 2019 4th Scientific International Conference Najaf (SPICN), Al-Najef, Iraq, 29–30 April 2019; pp. 145–149. [Google Scholar]
  2. Almugren, N.; Alshamlan, H. A survey on hybrid feature selection methods in microarray gene expression data for cancer classification. IEEE Access 2019, 7, 78533–78548. [Google Scholar] [CrossRef]
  3. Maniruzzaman, M.; Rahman, M.J.; Ahammed, B.; Abedin, M.M.; Suri, H.S.; Biswas, M.; El-Baz, A.; Bangeas, P.; Tsoulfas, G.; Suri, J.S. Statistical characterization and classification of colon microarray gene expression data using multiple machine learning paradigms. Comput. Methods Programs Biomed. 2019, 176, 173–193. [Google Scholar] [CrossRef] [PubMed]
  4. Tabares-Soto, R.; Orozco-Arias, S.; Romero-Cano, V.; Bucheli, V.S.; Rodríguez-Sotelo, J.L.; Jiménez-Varón, C.F. A comparative study of machine learning and deep learning algorithms to classify cancer types based on microarray gene expression data. PeerJ Comput. Sci. 2020, 6, e270. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  5. Adiwijaya, W.U.; Lisnawati, E.; Aditsania, A.; Kusumo, D.S. Dimensionality reduction using principal component analysis for cancer detection based on microarray data classification. J. Comput. Sci. 2018, 14, 1521–1530. [Google Scholar] [CrossRef] [Green Version]
  6. Alanni, R.; Hou, J.; Azzawi, H.; Xiang, Y. A novel gene selection algorithm for cancer classification using microarray datasets. BMC Med. Genom. 2019, 12, 10. [Google Scholar] [CrossRef] [PubMed]
  7. Daoud, M.; Mayo, M. A survey of neural network-based cancer prediction models from microarray data. Artif. Intell. Med. 2019, 97, 204–214. [Google Scholar] [CrossRef] [PubMed]
  8. Aydadenta, H.; Adiwijaya, A. A clustering approach for feature selection in microarray data classification using random forest. J. Inf. Process. Syst. 2018, 14, 1167–1175. [Google Scholar]
  9. Cilia, N.D.; De Stefano, C.; Fontanella, F.; Raimondo, S.; Scotto di Freca, A. An experimental comparison of feature-selection and classification methods for microarray datasets. Information 2019, 10, 109. [Google Scholar] [CrossRef] [Green Version]
  10. Alhenawi, E.A.; Al-Sayyed, R.; Hudaib, A.; Mirjalili, S. Feature selection methods on gene expression microarray data for cancer classification: A systematic review. Comput. Biol. Med. 2022, 140, 105051. [Google Scholar] [CrossRef] [PubMed]
  11. Wang, H.; Tan, L.; Niu, B. Feature selection for classification of microarray gene expression cancers using Bacterial Colony Optimization with multi-dimensional population. Swarm Evol. Comput. 2019, 48, 172–181. [Google Scholar] [CrossRef]
  12. Zeebaree, D.Q.; Haron, H.; Abdulazeez, A.M. Gene selection and classification of microarray data using convolutional neural network. In Proceedings of the 2018 International Conference on Advanced Science and Engineering (ICOASE), Duhok, Iraq, 9–11 October 2018; pp. 145–150. [Google Scholar]
  13. Algamal, Z.Y.; Lee, M.H. A two-stage sparse logistic regression for optimal gene selection in high-dimensional microarray data classification. Adv. Data Anal. Classif. 2019, 13, 753–771. [Google Scholar] [CrossRef]
  14. Shukla, A.K.; Singh, P.; Vardhan, M. A two-stage gene selection method for biomarker discovery from microarray data for cancer classification. Chemom. Intell. Lab. Syst. 2018, 183, 47–58. [Google Scholar] [CrossRef]
  15. Panda, M. Elephant search optimization combined with deep neural network for microarray data analysis. J. King Saud Univ. Comput. Inf. Sci. 2020, 32, 940–948. [Google Scholar] [CrossRef]
  16. Sayed, S.; Nassef, M.; Badr, A.; Farag, I. A nested genetic algorithm for feature selection in high-dimensional cancer microarray datasets. Expert Syst. Appl. 2019, 121, 233–243. [Google Scholar] [CrossRef]
  17. Li, Z.; Xie, W.; Liu, T. Efficient feature selection and classification for microarray data. PLoS ONE 2018, 13, e0202167. [Google Scholar] [CrossRef] [PubMed]
  18. Khorami, E.; Mahdi Babaei, F.; Azadeh, A. Optimal diagnosis of COVID-19 based on convolutional neural network and red Fox optimization algorithm. Comput. Intell. Neurosci. 2021, 2021, 4454507. [Google Scholar] [CrossRef] [PubMed]
  19. Kong, W.; Chen, J.; Huang, Z.; Kuang, D. Bidirectional cascaded deep neural networks with a pretrained autoencoder for dielectric metasurfaces. Photonics Res. 2021, 9, 1607–1615. [Google Scholar] [CrossRef]
  20. Talatahari, S.; Azizi, M. Chaos Game Optimization: A novel metaheuristic algorithm. Artif. Intell. Rev. 2021, 54, 917–1004. [Google Scholar] [CrossRef]
  21. Zhu, Z.; Ong, Y.S.; Dash, M. Markov Blanket-Embedded Genetic Algorithm for Gene Selection. Pattern Recognit. 2007, 40, 3236–3248. Available online: http://csse.szu.edu.cn/staff/zhuzx/Datasets.html (accessed on 21 January 2022). [CrossRef]
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
