Deep Learning Approaches for Detection of Breast Adenocarcinoma Causing Carcinogenic Mutations

Genes are composed of DNA, and each gene has a specific sequence. Errors during recombination or replication can produce a permanent change in the nucleotide sequence of DNA, called a mutation, and some mutations can lead to cancer. Breast adenocarcinoma starts in secretory cells and is the most common of all cancers that occur in women. According to a survey within the United States of America, more than 282,000 breast adenocarcinoma patients are registered each year, and most of them are women. Recognition of cancer in its early stages saves many lives. A proposed framework is developed for the early detection of breast adenocarcinoma using an ensemble learning technique with multiple deep learning algorithms, specifically: Long Short-Term Memory (LSTM), Gated Recurrent Units (GRU), and Bi-directional LSTM. There are 99 types of driver genes involved in breast adenocarcinoma. This study uses a dataset of 4127 samples including men and women taken from more than 12 cohorts of cancer detection institutes. The dataset encompasses a total of 6170 mutations that occur in 99 genes. On these gene sequences, different algorithms are applied for feature extraction. Three types of testing techniques, including independent set testing, self-consistency testing, and a 10-fold cross-validation test, are applied to validate and test the learning approaches. Subsequently, multiple deep learning approaches such as LSTM, GRU, and bi-directional LSTM algorithms are applied. Several evaluation metrics are computed for the validation of results, including accuracy, sensitivity, specificity, Matthews correlation coefficient, area under the curve, training loss, precision, recall, F1 score, and Cohen's kappa; the values obtained are 99.57, 99.50, 99.63, 0.99, 1.0, 0.2027, 99.57, 99.57, 99.57, and 99.14, respectively.


Introduction
Adenocarcinoma is a cancer that begins in secretory cells. The most common types of adenocarcinoma include prostate, lung, breast, pancreatic, colorectal, and stomach cancer. Among all, breast cancer is the second most severe cancer in the human body. Breast cancer is the uncontrolled growth of abnormal cells within the breast gland, and it occurs mostly in women. An estimated 0.3 million women are diagnosed with breast cancer each year inside the United States of America. In 2021, an estimated 44,130 deaths (43,600 women and 530 men) occurred because of breast cancer in the United States [1,2]. There are numerous risk factors for breast cancer in women, including aging, a family history of breast cancer, having a child after the age of 35, beginning menopause after the age of 55, high bone density, and so forth [1].
A biopsy is the principal technique used to diagnose breast adenocarcinoma. It is a procedure in which a small tissue sample is examined under a microscope [3]. Artificial intelligence approaches have produced workable results in the discipline of medical sciences, and various AI methods are used in this area for the detection of several illnesses inside the human body. In this study, the authors propose an ensemble learning strategy that can be employed to identify breast cancer at its early stages. Human genes consist of sequences of DNA. Any change in a sequence is known as a mutation, which in some cases leads to cancer. The procedure of mutation is illustrated in Figure 1 [4].
Genes coordinate with each other by having specific sequences within a cell [5]. Mutation is caused by a change in the base sequence of DNA. This change can arise via insertion, deletion, or replication errors in gene bases, which cause DNA damage. Different factors influence DNA damage, including metabolic influences and environmental factors such as radiation, which lead to tens of instances of damage per cell every day [6]. The damage in the DNA molecule alters or eliminates the cell's capability to transcribe the gene.
DNA repair is the procedure by which a cell identifies and corrects damage that occurs in its DNA [7]. This mechanism is constantly active as it responds to damage in the DNA structure. When the regular repair process fails and cellular apoptosis does not occur, the DNA damage may not be repairable. This irreparable injury leads to malignant tumors, or cancer [8,9].
In this study, deep learning approaches such as LSTM, GRU, and bi-directional LSTM are employed to form a classification mechanism that provides excellent results. The proposed learning approach demonstrated good performance as discussed in the result section.
Breast adenocarcinoma is the second major cause of death in women. Bioinformatics plays a crucial role in the field of medical sciences. Computational technologies, deep learning, and machine learning algorithms make the detection and prevention of diseases much easier than before. In this section, some of the techniques used for the detection of breast adenocarcinoma are explained.
The most-used machine learning algorithms developed for breast cancer detection are SVM (Support Vector Machine), LR (Logistic Regression), RF (Random Forest), MLP (Multilayer Perceptron), and KNN (K-Nearest Neighbor). In [10], breast cancer data are classified using k-Nearest Neighbors, Naïve Bayes, and Support Vector Machine, trained and investigated with the WEKA tool. The dataset is taken from the UCI website. In that study, the radial basis function kernel achieves the best accuracy of 96.85% for data classification.
Data mining techniques are used to predict and resolve breast cancer survivability [11]. Simple Logistic Regression, Decision Tree, and RBF network are used in this research, and the results are validated using a 10-fold cross-validation test. For feature extraction, Simple Logistic Regression outperformed all others. The dataset used in this study is taken from a database of the University Medical Centre, Institute of Oncology. Weka is used to train the models. Simple Logistic Regression obtained the highest accuracy of 74.47%. In [12], artificial neural networks, decision trees, and logistic regression are used. The accuracy obtained by logistic regression was 89.2%, ANN achieved an accuracy of 91.2%, and the best accuracy was obtained by a decision tree with 93.6%. In [13], the fast correlation-based filter (FCBF) method is used for the prediction and classification of breast cancer. Five machine learning algorithms are applied, including RF, SVM, KNN, Naive Bayes, and MLP. The highest accuracy, obtained by SVM, is 97.9%. Many other researchers have worked on breast cancer identification, as discussed in [14].

Results
Subsequent paragraphs show the results obtained using different ensemble learning techniques along with various tests.

Self-Consistency Testing
The self-consistency test is the first testing technique used for testing the deep learning algorithms for the identification of breast adenocarcinoma. In the self-consistency test, the complete dataset is used for both training and testing purposes. This test ensures that the algorithm gives its best results when it uses all of its data for training. The test computes the results of the proposed algorithm without experimentally measuring the stability values [15,16]. The proposed study measured accurate prediction of changes in gene sequences in breast adenocarcinoma from the dataset utilizing protocols based on self-consistency.
The results of the proposed ensemble learning model with the self-consistency test are discussed in Table 1. The independent set test results are discussed in Table 2, and the 10-fold cross-validation results are discussed in Table 3. The graph in Figure 2 shows the training history of the proposed model in the self-consistency test.
The accuracy and loss curves of the proposed ensemble learning approach for the individual deep learning algorithms, such as LSTM, GRU, and bi-directional LSTM, in the self-consistency test are shown in Figure 3. The AUC value is 1.0, which is considered an excellent result according to the AUC accuracy classification.

Independent Set Testing
The second testing technique used for the proposed ensemble learning approach is independent set testing. The values used for determining the precision of the model are extracted from the confusion matrix. The independent set test of the proposed model is the basic performance-measuring method. From the dataset, 80% of the values are used for training the algorithm and 20% of the values are used for testing purposes. The results of independent set testing after applying the deep learning algorithms are discussed in Table 2.
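A minimal sketch of this 80/20 hold-out split; the function and variable names are illustrative, not from the paper:

```python
import random

def independent_set_split(samples, labels, train_frac=0.8, seed=42):
    """Shuffle the dataset once, then hold out the tail as an
    independent test set (80% train / 20% test, as described above)."""
    rng = random.Random(seed)
    idx = list(range(len(samples)))
    rng.shuffle(idx)
    cut = int(len(idx) * train_frac)
    train = [(samples[i], labels[i]) for i in idx[:cut]]
    test = [(samples[i], labels[i]) for i in idx[cut:]]
    return train, test

# toy usage: 10 sequences, 8 go to training, 2 to testing
seqs = [f"seq{i}" for i in range(10)]
labs = [i % 2 for i in range(10)]
train, test = independent_set_split(seqs, labs)
print(len(train), len(test))  # 8 2
```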
The ROC curve of the proposed ensemble learning approach for the individual deep learning algorithms is illustrated in Figure 5.
The AUC value is 1.0, which is considered an excellent result according to the AUC accuracy classification.

10-Fold Cross-Validation Test
In the 10-fold cross-validation (FCV) technique, the data are subsampled into 10 equal groups. Each of the 10 partitions is treated in turn as the validation set, the model is trained on the remaining nine, and the generalization performance is then averaged across the 10 folds to make choices about hyper-parameters and architecture [12]. Figure 6 shows the working process of the 10-fold cross-validation technique. Table 3 represents the results of the proposed ensemble learning approach for the individual deep learning algorithms with the 10-fold cross-validation technique.
The ROC curve of the proposed ensemble learning approach for the individual deep learning algorithms, such as LSTM, GRU, and bi-directional LSTM, under the 10-fold cross-validation test is illustrated in Figure 7.
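The fold construction described above can be sketched as follows; the helper name and shuffling seed are assumptions, not the authors' code:

```python
import random

def ten_fold_indices(n_samples, seed=0):
    """Split shuffled sample indices into 10 near-equal folds; each fold
    serves once as the validation set while the other 9 train the model."""
    rng = random.Random(seed)
    idx = list(range(n_samples))
    rng.shuffle(idx)
    folds = [idx[k::10] for k in range(10)]
    for k in range(10):
        val = folds[k]
        train = [i for j, f in enumerate(folds) if j != k for i in f]
        yield train, val

# every one of the 4127 samples is validated exactly once across the folds
seen = []
for train, val in ten_fold_indices(4127):
    seen.extend(val)
assert sorted(seen) == list(range(4127))
```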



Results Comparison
The results of the ensemble learning approach are compared with its individual algorithms, such as LSTM, GRU, and bi-directional LSTM, in Table 4. Multiple metrics are used for comparison, under the independent set test. It is clear from Table 4 that the proposed ensemble learning model improves on the identification accuracy of the individual deep learning techniques such as LSTM, GRU, and bi-directional LSTM. Ensemble learning produces better results compared to the individual deep learning algorithms in Table 4. The obtained accuracy is 99.57%.

Analysis and Discussion
Breast adenocarcinoma is the second main cause of death in women worldwide. There are several biological and computational studies on the identification and detection of breast adenocarcinoma. In past studies, most researchers used small datasets taken from a small number of hospitals or organizations and applied biological or machine learning algorithms to them, achieving lower accuracy with few evaluation metrics.
The proposed ensemble learning approach for the individual deep learning algorithms, including LSTM, GRU, and bi-directional LSTM, used the latest generalized big dataset, taken from 12 different cohorts, for the identification of breast adenocarcinoma. The dataset contains 99 driver genes that cause breast adenocarcinoma, where 4127 samples contain 6170 mutations. In this research, the latest dataset of normal and mutated gene sequences of breast adenocarcinoma is used. A similar study is also presented for other types of mutations [17,18], and some testing techniques are also presented in [19,20]. Three different testing techniques, including the self-consistency test, the independent set test, and the 10-fold cross-validation test, are applied to the dataset, and the accuracies obtained are 97.6%, 99.5%, and 98.2%, respectively. Therefore, it can be concluded from the results obtained using the above-mentioned testing techniques that the proposed models are well suited to achieving high accuracy for cancer prediction. The self-consistency test used the complete dataset for both the training and testing phases. Table 1 shows the results obtained using the ensemble learning approach with the self-consistency test. The independent set test used 80% of the dataset for training and the remaining 20% for testing. Table 2 shows the results of ensemble learning using the independent set test. In the 10-fold cross-validation test, 10 equal folds were created from the whole dataset. The proposed ensemble learning model was trained on 9 folds and tested on one fold, and the process was repeated so that the whole dataset is used for both testing and training. Shuffled data are provided each time for better learning and, lastly, the average is calculated.

Materials and Methods
This study proposed a novel ensemble learning approach for deep learning techniques such as LSTM, GRU, and bi-directional LSTM for the detection of breast adenocarcinoma.

Data Acquisition Framework
The dataset is the most crucial part of this research. This dataset is used for training the models, testing the outcome, and validating the results. Data acquisition is the process of collecting reliable and accurate data for research. Data acquisition includes the process of data collection to conduct the research and defining how the data are collected from a valid source [21].
For the proposed study, normal gene sequences are extracted from asia.ensambl.org [22], and mutation information of each gene related to breast cancer is extracted from intogen.org (accessed on 18 August 2022) [23]. These normal gene sequences and mutation information are extracted through web scraping code. Web scraping is the process of extracting data from different websites available on the World Wide Web [24]. There are more than 2500 types of cancer genomes involved in mutation [25]. Three types of mutations occur in human genes, namely driver mutation, passenger mutation, and not assist. A driver mutation is the type of mutation in cells that causes cancer; it causes abnormal growth of the cells [26]. A mutation that alters the gene sequence but does not cause cancer is known as a passenger mutation [27], whereas a not-assist gene mutation does not contain any information about the mutation and is therefore not added to this study. The data collection is explained step by step in Appendix A.
A tool named Generate Mutated Sequences (GMS) was created in Python to incorporate the mutation information into normal gene sequences and create mutated sequences. Figure 8 shows the data acquisition framework in detail.
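The GMS tool itself is not listed in the paper, but a minimal sketch of its core step, substituting a recorded point mutation into a normal sequence, might look like this (the function name and 1-based coordinate convention are assumptions):

```python
def apply_point_mutation(sequence, position, ref, alt):
    """Replace the reference base at a 1-based position with the
    alternate base, returning the mutated sequence."""
    i = position - 1
    if sequence[i] != ref:
        raise ValueError(f"reference mismatch at {position}: "
                         f"expected {ref}, found {sequence[i]}")
    return sequence[:i] + alt + sequence[i + 1:]

normal = "ATGGCCTAA"
mutated = apply_point_mutation(normal, position=4, ref="G", alt="T")
print(mutated)  # ATGTCCTAA
```

Insertions and deletions would be handled analogously by splicing the sequence rather than substituting a single base.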
From the gene information, mutated gene sequences and normal gene sequences are categorized. For the proposed study 4127 samples are extracted from 99 types of driver genes. The sample dataset is the combination of 12 cohorts of different cancer detection websites which are then combined for this study [23].
The data samples were extracted from every possible combination of age, gender, cancer detection, treatment, and normal person. A total of 6170 mutations are used for training the models, testing the outcomes, and validating the results. The driver genes involved in breast adenocarcinoma that cause cancer are shown in Table 5. The existence of bases in gene sequences related to breast cancer is explained with the help of the frequency histogram in Figure 9. A total of 99 genes are involved in the progression of breast cancer. Each gene is expressed as a series of bases consisting of nucleotides. Therefore, the dataset contains many nucleotides, as expressed with the help of a technique from Natural Language Processing (NLP) known as a word cloud.
The benchmark dataset for the proposed study is denoted by B, which is defined as B = B+ ∪ B−. Here, B+ is considered the normal gene sequences, while B− is considered the mutated gene sequences that cause cancer, and ∪ is the union of both sequences. A balanced dataset is used to provide accurate results [28,29].
The main features of the raw dataset are extracted by feature extraction techniques. Feature extraction is the process of passing data through multiple steps to extract the main features used for model training. It is the most important step in training machine learning algorithms. In feature extraction, the patterns of data are recognized that are further used in the training and testing process performed on data [30,31]. For the proposed study statistical moments are calculated such as Hahn moments, raw moments, and central moments. Other feature extraction techniques are also used including PRIM, RPRIM, AAPIV, and RAAPIV. These feature extraction techniques are applied to mutated gene sequences and normal gene sequences for extracting the main features of data [32]. Figure 10 explains the feature extraction techniques used for the breast cancer dataset.

Hahn Moment Calculation
Hahn moment is used to calculate statistical parameters [42]. Hahn moment is the most important concept in pattern recognition. It calculates the mean and variance in the dataset.
Hahn moments require two-dimensional data [43,44]. Therefore, the genomic sequences are converted into a two-dimensional matrix G of size N × N as in Equation (2).
Here, G defines the gene sequence. The Hahn moments are computed using the values of G.

Here, each element of G is a residue of the genomic sequences. Statistical moments are calculated up to the third order. Hahn moments are orthogonal because they take a square matrix as input. The Hahn polynomial for the proposed study dataset is calculated by Equations (3) and (4), where r and s are positive integers and predefined constants, n is the order of the moment, and B is the size of the data array. In the moment computation, x + y is the order of the moment, a and b are predefined constants, and δxy is an arbitrary element of the square matrix G. For any integer A ∈ [0, B − 1] (B is the provided positive integer), a and b are adjustable parameters used to control the shape of the polynomials. The Pochhammer symbol is (a)k = a · (a + 1) ⋯ (a + k − 1) = Γ(a + k)/Γ(a).
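Equations (3) and (4) are not reproduced here, but the standard Hahn-polynomial construction via its terminating hypergeometric series can be sketched as below; the exact parameter conventions for a, b, and N may differ from the authors':

```python
from math import factorial, prod

def pochhammer(a, k):
    """Rising factorial (a)_k = a (a + 1) ... (a + k - 1); empty product is 1."""
    return prod(a + j for j in range(k))

def hahn_polynomial(n, x, a, b, N):
    """Hahn polynomial Q_n(x; a, b, N) as the terminating hypergeometric
    series 3F2(-n, n + a + b + 1, -x; a + 1, -N; 1)."""
    total = 0.0
    for k in range(n + 1):
        num = pochhammer(-n, k) * pochhammer(n + a + b + 1, k) * pochhammer(-x, k)
        den = pochhammer(a + 1, k) * pochhammer(-N, k) * factorial(k)
        total += num / den
    return total

print(hahn_polynomial(0, 3, 1, 1, 9))  # 1.0
```

The degree-0 polynomial is identically 1, which gives a quick sanity check of the series.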

Raw Moment Calculation
Raw moments are used for statistical imputation. Imputation is the procedure of replacing missing data values in a dataset with the most suitable substitute values in order to preserve the information [45]. The raw moment of the 2D data with order a + b is expressed by Equation (5) [46].
Raw moments are calculated up to order 3. They describe significant information within the sequence, such as R00, R01, R10, R02, R20, R03, and R30.
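A minimal sketch of Equation (5) for a small 2D matrix, assuming 0-based indices (the paper may index from 1):

```python
import numpy as np

def raw_moments(G, max_order=3):
    """Raw moment R_ab = sum over x, y of x^a * y^b * G[x, y],
    for all orders a + b <= max_order."""
    N = G.shape[0]
    x = np.arange(N).reshape(-1, 1)   # row coordinates
    y = np.arange(N).reshape(1, -1)   # column coordinates
    return {(a, b): float((x**a * y**b * G).sum())
            for a in range(max_order + 1)
            for b in range(max_order + 1)
            if a + b <= max_order}

G = np.array([[1.0, 2.0], [3.0, 4.0]])
R = raw_moments(G)
print(R[(0, 0)])  # 10.0 -- the total mass of the matrix
```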

Central Moment Calculation
The central moment is used to extract useful features using the mean and variance. It is the moment of a probability distribution of a random variable about that variable's mean [42]. The general formula for the central moment calculation for the breast adenocarcinoma dataset is represented by Equation (6).
Centroids (r, s) are required to compute the central moments and are visualized as the center of the data. The unique features from the central moments, up to the 3rd order, are labeled as V00, V01, V10, V11, V02, V20, V12, V21, V03, and V30.
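Equation (6) can be sketched as follows; by construction, the first-order central moments about the centroid vanish (up to floating-point error):

```python
import numpy as np

def central_moments(G, max_order=3):
    """Central moment V_ab about the centroid (r, s) of the 2D data."""
    N = G.shape[0]
    x = np.arange(N).reshape(-1, 1)
    y = np.arange(N).reshape(1, -1)
    total = G.sum()
    r = (x * G).sum() / total          # centroid row coordinate
    s = (y * G).sum() / total          # centroid column coordinate
    return {(a, b): float((((x - r)**a) * ((y - s)**b) * G).sum())
            for a in range(max_order + 1)
            for b in range(max_order + 1)
            if a + b <= max_order}

G = np.array([[1.0, 2.0], [3.0, 4.0]])
V = central_moments(G)
print(V[(1, 0)], V[(0, 1)])  # both ~0 by construction
```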

Position Relative Incidence Matrix (PRIM)
PRIM is used for determining the positioning of each gene in the gene sequence of breast cancer. PRIM forms a matrix with dimensions 30 × 30, as shown in Equation (7) [47].
Feature scaling allows every data pattern to participate in the detection of cancer [30]. The indication score of the qth-position nucleotide is determined by R p→q with respect to the occurrence of the pth nucleotide.
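The paper builds a 30 × 30 PRIM over a larger residue alphabet; the 4 × 4 DNA sketch below uses one plausible reading of R p→q (position of each q relative to the first occurrence of p), which is an assumption rather than the authors' exact formula:

```python
import numpy as np

BASES = "ACGT"

def prim(sequence):
    """Simplified 4x4 Position Relative Incidence Matrix: entry [p, q]
    accumulates the position of every occurrence of base q relative to
    the first occurrence of base p."""
    first = {b: sequence.find(b) for b in BASES}
    M = np.zeros((4, 4), dtype=float)
    for pos, q in enumerate(sequence):
        for i, p in enumerate(BASES):
            if first[p] != -1:
                M[i, BASES.index(q)] += pos - first[p]
    return M

M = prim("ACGTAC")
print(M.shape)  # (4, 4)
```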

Reverse Position Relative Incidence Matrix (RPRIM)
The Reverse Position Relative Incidence Matrix (RPRIM) works the same way as PRIM, but on the reversed sequence. Equation (8) elaborates the calculation of RPRIM for the breast cancer dataset.
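Under the same simplified 4 × 4 reading used above for PRIM (an assumption, not the authors' exact formula), RPRIM is the identical construction applied to the reversed sequence:

```python
import numpy as np

BASES = "ACGT"

def rprim(sequence):
    """RPRIM sketch: the PRIM-style position-relative construction
    applied to the reversed sequence."""
    rev = sequence[::-1]
    first = {b: rev.find(b) for b in BASES}
    M = np.zeros((4, 4), dtype=float)
    for pos, q in enumerate(rev):
        for i, p in enumerate(BASES):
            if first[p] != -1:
                M[i, BASES.index(q)] += pos - first[p]
    return M

print(rprim("ACGTAC").shape)  # (4, 4)
```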

Accumulative Absolute Position Incidence Vector (AAPIV)
The frequency matrix gives information about the incidence of genes in the gene sequence. AAPIV gives information related to the different compositions of nucleotides in the gene sequences. The relative positioning of the nucleotides in cancerous gene sequences is observed using AAPIV [46]. The relative gene sequences of breast adenocarcinoma are illustrated with the help of Equation (9).
where λn is obtained from a gene sequence having n total nucleotides and can be calculated using Equation (10). For any ith component, βk is the position of the ith nucleotide.
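A minimal sketch of Equations (9) and (10) over the DNA alphabet, using 1-based positions (an assumption):

```python
def aapiv(sequence, alphabet="ACGT"):
    """AAPIV sketch: for each nucleotide, the accumulative sum of its
    (1-based) positions of occurrence in the sequence."""
    return [sum(i + 1 for i, base in enumerate(sequence) if base == nt)
            for nt in alphabet]

# A occurs at positions 1 and 5, C at 2 and 6, G at 3, T at 4
print(aapiv("ACGTAC"))  # [6, 8, 3, 4]
```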

Reverse Accumulative Absolute Position Incidence Vector (RAAPIV)
RAAPIV works the same way as AAPIV, but in reverse order. The equation for RAAPIV is as follows: λ = {n1, n2, . . . , nm} (11)

Frequency Vector Calculation
A dataset contains thousands of data records with different attributes for each record. A frequency matrix is used to represent the genes that combine to form a gene sequence. The distribution of each gene in the gene sequence of breast adenocarcinoma is used to form a frequency distribution vector, represented by Equation (12).
Here, each element is the frequency of a gene in the breast adenocarcinoma gene sequence. The frequency vector is calculated by Equation (13),
where f1 to fN indicate the frequency of each gene in the gene sequence.
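The frequency vector of Equation (13) is simply the count of each residue in the sequence; a minimal sketch (the fixed alphabet is an assumption):

```python
from collections import Counter

def frequency_vector(sequence, alphabet="ACGT"):
    """Frequency vector: occurrence count f_i of each residue of the
    alphabet in the gene sequence (Equation (13))."""
    counts = Counter(sequence)
    return [counts.get(residue, 0) for residue in alphabet]
```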

Algorithm for Predictive Modeling
For the proposed study, a deep neural network with multiple layers is used for the detection of breast adenocarcinoma. Deep learning has had a huge impact on the recognition, detection, prediction, and diagnosis of different types of cancer, as well as on forecasting, detection systems, and many other complex problems. A deep neural network model consists of multiple layers, including an input layer, an output layer, pooling layers, dense layers, and dropout layers, with fully connected layers at the top. Each layer takes input from the previous layer and processes those input features. The network learns features inside these layers and trains itself using different learning procedures [48].
This study uses three deep learning RNN algorithms: Long Short-Term Memory (LSTM), Gated Recurrent Unit (GRU), and bi-directional LSTM. Each algorithm is evaluated with three methods, a self-consistency test, an independent set test, and a 10-fold cross-validation test, for the identification of breast adenocarcinoma.

Long Short-Term Memory Network (LSTM)
LSTM is the first deep learning algorithm used in this study. LSTM is used to resolve the vanishing gradient problem in neural networks, in which the gradient of the loss function approaches zero and makes the neural network hard to train [49,50]. LSTM addresses both the short-term memory and vanishing gradient problems in RNNs [51] and increases the memory of the RNN model. LSTM is a gated process: all the information in LSTM is read, stored, and written with the help of gates. These gates are of three types, known as the input gate, forget gate, and output gate [52]. Each gate in LSTM is responsible for regulating the flow of information from one gate to another; therefore, different activation functions are utilized in each gate [53,54]. Figure 11 explains the structure of a simple gated cell used in the LSTM technique for the detection of breast cancer.
In Figure 11, x_t is the input at a specific time t and y_t is the output at time t. f_t represents the forget gate, while O_t and i_t represent the output gate and input gate, respectively. Every LSTM cell has three inputs, x_t, A_{t-1}, and B_{t-1}, and two outputs, A_t and B_t. Equations (14)-(19) explain LSTM.
In these equations, x_t is the input, A_{t-1} is the previous cell output, B_{t-1} is the previous cell memory, and B_t is the current cell memory. Here, W and U are the weights for the forget, input, and output gates, respectively.
Figure 11. LSTM cell structure for breast adenocarcinoma.
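To make Equations (14)-(19) concrete, a single LSTM step can be sketched with scalar weights; a real layer uses weight matrices and vector states, so the dict-of-scalars parameterization here is purely illustrative:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_cell(x_t, a_prev, b_prev, W, U, bias):
    """One scalar LSTM step in the paper's notation.

    a_prev is the previous cell output A_{t-1}, b_prev the previous
    cell memory B_{t-1}. W, U, bias hold the weights of the forget (f),
    input (i), output (o), and candidate (c) paths.
    """
    f_t = sigmoid(W["f"] * x_t + U["f"] * a_prev + bias["f"])    # forget gate
    i_t = sigmoid(W["i"] * x_t + U["i"] * a_prev + bias["i"])    # input gate
    o_t = sigmoid(W["o"] * x_t + U["o"] * a_prev + bias["o"])    # output gate
    c_t = math.tanh(W["c"] * x_t + U["c"] * a_prev + bias["c"])  # candidate memory
    b_t = f_t * b_prev + i_t * c_t   # current cell memory B_t
    a_t = o_t * math.tanh(b_t)       # current cell output A_t
    return a_t, b_t
```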
The data enter the embedding layer directly from the input layer. The embedding layer is the first hidden layer of LSTM. It consists of an input dimension, an output dimension, and an input length. The output of the embedding layer is denoted by Equation (20) [55].
In the equation, E_out is the output of the embedding layer, V_i is the parameter between the input layer and the embedding layer, and X_i is the one-hot vector of the input. One-hot vectors are used to differentiate data items from each other. Data from the embedding layer enter the LSTM layer, where they pass through the LSTM gates. The embedding layer converts the input to a fixed length of 64. An LSTM layer of 128 neurons is added. Dropout layers prevent the model from overfitting, and the dense layer connects all the inputs from the previous layer and passes them to the output layer. Two dropout layers are used in this model to overcome overfitting, and one dense layer with 10 neurons is used as a hidden layer. Stochastic Gradient Descent (SGD) is used as the optimizer, sigmoid is used as the activation function, and Sparse Categorical Cross Entropy (SCCE) is used to minimize the loss while training the proposed model.

Gated Recurrent Unit (GRU)
The second deep learning method used in the proposed study is the Gated Recurrent Unit (GRU). GRU uses fewer gates than LSTM and works in a similar way; the results obtained from GRU can be better than those of LSTM due to the smaller number of gates and parameters. GRU uses only two gates, a reset gate and an update gate, in each cell [52]. The reset gate decides how much past information is neglected, and the update gate decides how much past information is to be used. GRU takes less computational time than LSTM [56]. Figure 12 explains the GRU cell structure used in the identification of breast adenocarcinoma. Equations (21)-(24) explain GRU.
In Equations (21) and (22), r_t represents the reset gate and z_t the update gate.
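A single GRU step following Equations (21)-(24) can be sketched the same way with scalar weights; a real layer uses weight matrices and biases, so this parameterization is illustrative:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_cell(x_t, h_prev, W, U):
    """One scalar GRU step (simplified sketch).

    The reset gate r_t decides how much past information to drop, and
    the update gate z_t decides how much past information to carry
    forward into the new hidden state.
    """
    r_t = sigmoid(W["r"] * x_t + U["r"] * h_prev)              # reset gate
    z_t = sigmoid(W["z"] * x_t + U["z"] * h_prev)              # update gate
    h_hat = math.tanh(W["h"] * x_t + U["h"] * (r_t * h_prev))  # candidate state
    h_t = (1.0 - z_t) * h_prev + z_t * h_hat                   # new hidden state
    return h_t
```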

The proposed GRU model has one embedding layer to convert the input into a vector of fixed word length 64. The second layer is a GRU layer with 256 neurons, followed by a simple RNN layer with 128 neurons. Two dropout layers with a rate of 30 percent are added to prevent overfitting, and one dense layer with 10 neurons is added at the end. Stochastic Gradient Descent (SGD) is used as the optimizer, sigmoid is used as the activation function, and Sparse Categorical Cross Entropy (SCCE) is used to minimize the loss while training the proposed model.

Bi-Directional LSTM
Lastly, the deep learning technique used in the proposed study is bi-directional LSTM [57]. A bi-directional LSTM uses two LSTM cells, one in the forward direction and one in the backward direction, connected with a single output.
The proposed model has one embedding layer to convert the input into vectors of fixed word length 64. Two bi-directional LSTM layers with 128 and 64 neurons in each direction are added. Three dropout layers with a rate of 30 percent are added to prevent overfitting. One dense layer with 64 neurons is used as a hidden layer, and one dense layer with 10 neurons is added at the end. Stochastic Gradient Descent (SGD) is used as the optimizer, sigmoid is used as the activation function, and Sparse Categorical Cross Entropy (SCCE) is used to minimize the loss while training the proposed model.
Unlike LSTM and GRU, bi-directional LSTM does not need any past knowledge for prediction; it learns by itself by moving in both the forward and backward directions, which is why the results of bi-directional LSTM are better than those of LSTM and GRU [58].

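The forward/backward combination can be sketched generically: run any recurrent cell left-to-right and right-to-left over the sequence, then pair the two hidden states at each position. The `step` function below is a hypothetical stand-in for the LSTM cell; a cumulative-sum step is used only so the behavior is easy to check:

```python
def bidirectional(sequence, step, h0=0.0):
    """Bi-directional recurrence (sketch): one pass in each direction,
    pairing forward and backward hidden states at every position."""
    def run(seq):
        h, out = h0, []
        for x in seq:
            h = step(x, h)
            out.append(h)
        return out
    fwd = run(sequence)
    bwd = run(sequence[::-1])[::-1]  # align backward outputs to positions
    return list(zip(fwd, bwd))       # (forward, backward) pair per step
```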

Ensemble Learning Models
The ensemble learning model uses a divide-and-conquer approach: it improves the accuracy of individual base learners and then compiles the whole model. Multiple base learners are combined to achieve the best results [59]. Each base learner learns different features from data chunks obtained using the bootstrap technique, generates results, and the results are combined. The data chunks are then fed to the model again; in this way, the patterns hidden in the dataset are learned by the model. Ensemble learning is an adaptable approach and shows better accuracy than simple machine learning algorithms, because the bootstrap technique allows feature replacement and row replacement, so the model learns from all possible data combinations. This also helps overcome overfitting issues. The popular ensemble learning model types are bagging [60], boosting [61], and stacking [62]. The aim of all these models is to obtain good accuracy.
This study improves the performance of individual deep learning models, namely LSTM, GRU, and bi-directional LSTM, with the help of an ensemble learning approach. The processed dataset is divided into three groups: a training set, a validation set, and a test set. The validation set is denoted by V and the test set by T. The training set is given as input to each individual deep learning model (LSTM, GRU, and bi-directional LSTM). The grid search optimization technique is applied to obtain the search ranges and the optimum values for the proposed ensemble learning model parameters. Trained models are generated for each individual deep learning model, named trained model1, trained model2, and trained model3 for LSTM, GRU, and bi-directional LSTM respectively, as shown in Figure 13.
All the trained models are tested on both the validation and testing sets. Lastly, the final improved results are obtained by the ensemble learning model, as shown in the results section.
g_{p,i} = ∑_{n=1}^{N} ω_n f_{n,i} (25)

Weights are assigned to the individual deep learning models to construct the ensemble learning prediction in this equation. Here, ω_n (n = 1, 2, . . . , N) is the weight assigned to each individual deep learning model, f_{n,i} represents the prediction of the nth deep learning model, and i indexes the ith observation.
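Equation (25) is a plain weighted sum of the base learners' predictions; a minimal sketch (the function name is an assumption):

```python
def ensemble_predict(predictions, weights):
    """Weighted ensemble prediction, g_{p,i} = sum_n w_n * f_{n,i}.

    `predictions` holds one list of per-observation predictions per
    base learner; `weights` holds one weight w_n per base learner.
    """
    assert len(predictions) == len(weights)
    n_obs = len(predictions[0])
    return [
        sum(w * model_preds[i] for w, model_preds in zip(weights, predictions))
        for i in range(n_obs)
    ]
```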
For each deep learning technique, these testing techniques are applied over 10 epochs (10 feed-forward and feed-backward passes). In each testing iteration, the model's AUC, precision, F1 score, recall, Cohen's kappa, specificity, sensitivity, Mathew's correlation coefficient, loss, and accuracy are calculated. The following mathematical equations are used to calculate the algorithms' results [63-66]. Equations (26)-(29) give the formulae for sensitivity, specificity, accuracy, and Mathews Correlation Coefficient (MCC), respectively.

Sensitivity = TP/(TP + FN) (26)
Specificity = TN/(TN + FP) (27)
Accuracy = (TP + TN)/(TP + TN + FP + FN) (28)
MCC = (TP × TN − FP × FN)/√((TP + FP)(TP + FN)(TN + FP)(TN + FN)) (29)

In these equations: TN = all the true negative values; TP = all the true positive values from the dataset; FN = false negative values; FP = false positive values. Sensitivity refers to the ability to correctly identify the presence of breast adenocarcinoma, and specificity refers to the ability to correctly identify its absence. TP + FN are all subjects with the given condition, while TN + FP are the subjects without it. TP + FP is the total number of subjects with positive results, and TN + FN is the number of subjects with negative results.
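Equations (26)-(29) can be computed directly from the confusion-matrix counts; a minimal sketch (the function name is an assumption):

```python
import math

def confusion_metrics(tp, tn, fp, fn):
    """Sensitivity, specificity, accuracy, and MCC from the confusion
    counts TP, TN, FP, FN."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return sensitivity, specificity, accuracy, mcc
```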

Conclusions and Future Work
Breast adenocarcinoma is the most common cancer in women. Therefore, an ensemble learning approach built on individual deep learning techniques, namely LSTM, GRU, and bi-directional LSTM, is developed for the early detection of breast adenocarcinoma. Normal gene sequences are obtained from asia.ensembl.org and mutation information is obtained from IntOgen.org. Mutated sequences are generated by incorporating the mutation information into the normal gene sequences, as depicted in Figure 8. Feature extraction techniques are used to obtain useful features from the normal and mutated gene sequences; feature extraction also converts the data to a numeric format ready for training and testing. The multiple feature extraction techniques used in this study can be seen in Figure 10. The proposed ensemble learning approach obtained 99.57% accuracy. Multiple testing techniques, namely the self-consistency test, independent set test, and 10-fold cross-validation test, are applied to check the performance of the model. The ensemble learning approach performs well on the independent set test, as shown in Table 2, and all the results are compared in Table 4. The results for accuracy, AUC, loss, sensitivity, specificity, and Mathew's correlation coefficient on the independent set test, self-consistency test, and 10-fold cross-validation test are shown in Tables 1-3. Therefore, it can be concluded from the results that the proposed ensemble learning approach performs with high accuracy for the identification of breast adenocarcinoma, and that the proposed ensemble learning model using deep learning classifiers can be utilized for other cancer detection tasks.
In the future, this technique can be used for the detection of other types of cancer as well. Furthermore, this technique can be beneficial for detecting the incidence of other life-threatening diseases using genes.
To reproduce the dataset, open "Mutational cancer driver genes", where all 99 genes are listed. Click on the first gene, named "TP53", and then click the table tab under "Observed mutations in tumors"; all mutation information for this single gene is present there. Mutation information related to other genes can be seen by clicking each gene. At the top right of this page, under "Gene details", an "Ensembl id" is given; this is the id of the normal gene sequence related to the selected gene. Clicking this link leads to http://asia.ensembl.org/ (accessed on 10 September 2022), where the normal gene sequences can be found. Click the link, then click "Sequence" in the left panel; the normal sequence will be shown on the main page along with a download link. Repeat this process 99 times, once for each gene, to download the whole breast adenocarcinoma dataset.