Multi-Modal Evolutionary Deep Learning Model for Ovarian Cancer Diagnosis

Ovarian cancer (OC) is a common cause of mortality among women. Deep learning has recently shown superior performance in predicting OC stages and subtypes. However, most state-of-the-art deep learning models employ single-modality data, which may yield poor performance due to insufficient representation of important OC characteristics. Furthermore, these models still lack optimization of the model construction, so they require high computational cost to train and deploy. In this work, a hybrid evolutionary deep learning model using multi-modal data is proposed. The established multi-modal fusion framework amalgamates the gene modality with the histopathological image modality. Based on the different states and forms of each modality, a dedicated deep feature extraction network is set up for each: a predictive antlion-optimized long short-term memory (LSTM) model processes the longitudinal gene data, and a predictive antlion-optimized convolutional neural network (CNN) model processes the histopathology images. The topology of each customized feature network is set automatically by the antlion optimization algorithm to achieve better performance. The outputs of the two improved networks are then fused by weighted linear aggregation, and the fused deep features are finally used to predict the OC stage. A number of assessment indicators were used to compare the proposed model against nine other multi-modal fusion models constructed with distinct evolutionary algorithms, using one benchmark for OC and two benchmarks for breast and lung cancers. The results reveal that the proposed model is more precise and accurate in diagnosing OC and the other cancers.


Introduction
Ovarian cancer (OC) is the fifth most prevalent cause of cancer-related deaths among women. Most cases (75%) occur in post-menopausal patients, with an incidence of 40 per 100,000 per year in patients aged over 50. Early detection of the disease significantly raises the 5-year survival rate, from 3% (Stage IV) to 90% (Stage I) [1]. Histopathology evaluation is the gold standard by which OC is diagnosed and classified into histological types. Interpretation of cellular morphology defines the various OC types and guides treatment planning [2], and is best performed by pathologists expert in ovarian tumors. However, inter-observer variations in grading have been reported. Such variations in histopathologic interpretation cause not only inaccurate prognostic prediction and suboptimal treatment but also loss of quality of life [3].
This sheds light on the urgent need for computational methods that can precisely predict OC. Towards this goal, a number of OC diagnosis models [4][5][6] were developed during the past decade based on single-modal histopathological images, because such images reflect morphological characteristics of the cells that are closely connected to OC aggressiveness. Besides histopathological images, it is also known that gene expression levels and genetic mutations can implicitly influence the development of cancers by accelerating cell division rates and modifying the tumor micro-environment [7]. Therefore, genomic features are important indicators for driving diagnostic practice; they comprise a catalog of gene expression signatures, sequence mutations, focal variations in DNA copy numbers, and methylation alterations [8].
Given the heterogeneity and complexity of cancer survival prediction, and aided by newly developed whole-slide histopathology scanners and high-throughput omics profiling [6] coupled with innovative machine learning (ML) algorithms, recent studies [9][10][11][12] have shown that integrative analyses of a patient's genomic data together with pathology images are highly efficient in cancer assessment. Sun et al. [9] integrated pathology images and genomic data for predicting breast cancer outcome; a multiple kernel learning approach was employed to join the heterogeneous information of the two modalities, achieving an accuracy of 0.8022 and a precision of 0.7273. In [10], a multi-modal multi-task feature selection approach, namely M2DP, was introduced for diagnosing cancers. Features were extracted from both gene expression data and pathology images, then the M2DP model identified diagnosis-related features; for each patient, the selected features were used for diagnosis with AdaBoost. Tested on one benchmark for breast cancer and another for lung cancer, the model showed accuracies of 72.53% and 70.08%, respectively. Zhang et al. [11] introduced a multiple kernel approach for predicting lung carcinomas by amalgamating genomic data with pathological image features, which showed an accuracy of 0.8022. Liu et al. [12] proposed a multi-modal deep learning model for predicting breast cancer subtype; the model extracted the deep-seated features from the gene and image modalities and finally fused the different features using weighted linear aggregation, with a prediction accuracy of 88.07%.
Recent advances in convolutional neural networks (CNNs) and related deep learning models have remarkable implications for medical diagnosis. Kott et al. [13] utilized a deep residual CNN for histopathologic diagnosis of prostate cancer; the model showed 91.5% accuracy at coarse-level classification of image patches into benign and malignant. Ismael et al. [14] proposed a method for automatically classifying brain tumors using residual networks (ResNet50 architecture), with a patient-level accuracy of 0.97. Harangi et al. [15] employed the deep GoogLeNet Inception network for classifying dermoscopy images, reaching an accuracy of 0.677. Celik et al. [16] used a deep approach to detect invasive ductal carcinoma from histopathology images, employing DenseNet-161 and ResNet-50; the DenseNet-161 model realized an F-score of 92.38% with an accuracy of 91.57%. However, many tasks in the medical field depend on long-range dependencies [17]. Recurrent neural network (RNN) models are the leading methods for deeply learning longitudinal data. A variant of the RNN is the long short-term memory (LSTM) network [18], which captures both long-term and short-term dependencies within sequential data. Gao et al. [19] employed a distanced LSTM with time-distanced gates for diagnosing lung cancer using both real computed tomography images and simulated data, realizing an F-score of 0.8905. Guo et al. [17] presented a disease inference approach based on symptom sequence extraction from discharge summaries using a bidirectional LSTM, achieving an F-score of 0.572.
Although deep learning networks have shown promising diagnostic performance, designing an appropriate deep learning model not only demands highly specialized knowledge but is also, to a large extent, a labor- and time-consuming task. Recent studies have also shown that the expressive power of a deep neural network is essential and affects learning. Lu et al. [20] studied how width influences the expressiveness of neural networks by proving a universal approximation theorem for width-bounded ReLU networks. In [21], the authors revealed the power of deeper networks over shallower ones by proving that the total number of neurons needed to approximate natural classes of multivariate polynomials with m variables grows only linearly with m for deep neural networks, but grows exponentially when only a single hidden layer is allowed. Du et al. [22] provided theoretical insight into the generalization ability and optimization landscape of over-parameterized neural networks. Qi et al. [23] showed that, in vector-to-vector regression with deep neural networks, the generalized mean-absolute-error (MAE) loss between predicted and expected feature vectors is upper-bounded by the sum of the approximation, estimation, and optimization errors.
Therefore, an urgent issue in the deep learning domain is optimization of the model construction. According to recent studies [24,25], model construction often affects overall performance and needs to be set automatically. These studies revealed that optimizing the hyperparameters remains a major obstacle in designing deep learning models, including CNNs. In [24], a genetic-algorithm-based approach for constructing the CNN structure was proposed for automatic analysis of medical images. Gao et al. [25] presented a method for optimizing the CNN structure that integrates a binary coding system with gradient-priority particle swarm optimization (PSO) to select the structure. It is therefore useful to employ swarm intelligence optimization techniques to enable networks to automatically tune their hyperparameters and layer connections and to make optimal use of redundant computing resources. The grey wolf optimizer (GWO) [26], antlion optimization (ALO) [27], the crow search (CS) algorithm [28], artificial bee colony (ABC) [29], the differential evolution (DE) algorithm [30], the whale optimization (WO) algorithm [31], the Salp swarm algorithm [32], PSO [33], the bat optimization (BAT) algorithm [34], and the genetic algorithm (GA) [35] are some biologically-inspired algorithms investigated for optimization purposes.
The goal of this work is to propose a hybrid deep learning model based on multi-modal data to precisely predict the OC stage. The method combines gene expression data, copy number variants, and pathological image features while accounting for the heterogeneity of each single-modality dataset, achieving precise prediction of OC stages. The main contributions of this paper are summarized as follows:

•	Unlike the state-of-the-art deep learning models, which employ single-modality data for OC diagnosis and still lack automatic topology construction, a new multi-modal deep learning model is proposed to predict the OC stage. A customized feature extraction network is established to fit each modality's data and fully extract its deep-seated features. Each feature network is hybridized with the ALO optimizer to automatically set its topology, which provides stability in processing the longitudinal genomic data and yields optimal image feature maps that realize better performance. The outputs of the two improved feature networks are finally fused by weighted linear aggregation in order to predict the OC stage.

•	In total, nine multi-modal fusion models were constructed, with and without different optimization algorithms, and compared to the proposed model to test its efficiency in predicting the OC stage.

•	The established multi-modal deep learning model is applicable to predicting other cancer subtypes by integrating features from the gene modality with the pathology image modality.
The rest of this paper is organized as follows: Section 2 presents the preliminaries. Section 3 describes the materials and methods, including the proposed model. Section 4 discusses the experiments and results. The conclusions and future work are listed in the last section.

Convolutional Neural Network Models
Deep learning methods are classified into CNNs, pre-trained unsupervised networks, and basic recurrent neural networks (RNN). CNNs [29,36] encompass a sequence of processing layers with different types. Regular CNNs have convolutional, fully-connected (FC), as well as pooling layers (Figure 1a). The most intensive part in CNNs is the convolutional layers that convolve the 2-D kernels to input feature maps for generating output feature maps. Then, the pooling layers come after the convolutional layers which down-sample output feature map by summarizing its features into patches. The pooling is computed taking the mean of each patch in the feature map, or considering the greatest value in each patch. Eventually, the final FC accomplishes the classification task similar to the conventional artificial neural network (ANN).
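The patch-wise pooling described above can be sketched in a few lines; this is an illustrative NumPy snippet, not code from the paper:

```python
import numpy as np

def pool2d(feature_map, patch=2, mode="max"):
    """Down-sample a 2-D feature map by summarizing non-overlapping patches."""
    h, w = feature_map.shape
    out = feature_map[:h - h % patch, :w - w % patch]
    out = out.reshape(h // patch, patch, w // patch, patch)
    if mode == "max":              # keep the greatest value in each patch
        return out.max(axis=(1, 3))
    return out.mean(axis=(1, 3))   # or take the mean of each patch

fm = np.array([[1., 2., 5., 6.],
               [3., 4., 7., 8.],
               [9., 1., 2., 2.],
               [1., 1., 2., 2.]])
print(pool2d(fm, 2, "max"))   # [[4. 8.] [9. 2.]]
print(pool2d(fm, 2, "mean"))  # [[2.5 6.5] [3. 2.]]
```

Both pooling variants halve each spatial dimension here, which is the usual role of the pooling layers between convolutional blocks.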

Long-Short-Term-Memory Models
The LSTM is an extension of the basic RNN. The network adds time-dependent features relying on a preceding timestamp and operates as a memory cell that remembers data from the preceding timestamp [18,29]. The memory cell c is controlled through a group of gate networks (Figure 1b): a forget gate network f, an input gate network i, and an output gate network o. The forget gate decides which information to forget, that is, which information should be eliminated from the past cell state. The input gate i_t decides which information is added to the memory cell c_t; this determines how much information is updated and which information is updated. The output gate o_t decides which information is used as output.
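As an illustration of these gate interactions, the following is a minimal NumPy sketch of one LSTM timestep; the weight shapes and initialization are hypothetical, chosen only to make the example runnable:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, B):
    """One LSTM timestep: forget (f), input (i), and output (o) gates
    control the memory cell c. W maps [h_prev, x_t] to each gate."""
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W["f"] @ z + B["f"])   # what to erase from the past cell state
    i_t = sigmoid(W["i"] @ z + B["i"])   # what new information to admit
    o_t = sigmoid(W["o"] @ z + B["o"])   # what to expose as output
    g_t = np.tanh(W["g"] @ z + B["g"])   # candidate cell update (input modulation)
    c_t = f_t * c_prev + i_t * g_t       # element-wise gating of the memory cell
    h_t = o_t * np.tanh(c_t)             # hidden state passed to the next timestep
    return h_t, c_t

# Tiny usage example with hidden size 2 and input size 3
rng = np.random.default_rng(0)
W = {k: rng.standard_normal((2, 5)) * 0.1 for k in "fiog"}
B = {k: np.zeros(2) for k in "fiog"}
h, c = lstm_step(rng.standard_normal(3), np.zeros(2), np.zeros(2), W, B)
print(h.shape, c.shape)  # (2,) (2,)
```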

Ant Lion Optimizer
ALO is a recent biologically-inspired algorithm [27,37] that emulates the natural hunting mechanism of antlions.
Operators of the Algorithm. The algorithm supposes a population (Pop) of ants and antlions in a d-dimensional problem space: A = {Antlion_i, Ant_i | i ∈ Pop}. A random variable is set using Equation (1), where Rnd is generated randomly in the [0, 1] interval. Ants' random walks are calculated as the normalized cumulative sum over all iterations using Equation (2), where cs computes the cumulative sum, T denotes the maximum iteration, and t is the current iteration. During the optimization process, the ants' locations are kept in a matrix L_ant; an ant's location is the parameter vector of a candidate solution. The objective function is assessed during optimization, and the matrix L_oa stores the fitness computed for each ant. Likewise, the antlions hide within the search space; their positions are stored in L_antlion, with a corresponding matrix storing their fitness values.
Random Walks for Ants. Ants change their positions according to Equation (2). To keep the random walks within the search space, Equation (5) is used to normalize them, where a_i and b_i denote the lower and upper bounds of the random walk in the i-th dimension, respectively, while c_i^t and d_i^t refer to the lower and upper bounds of the i-th dimension at the t-th iteration.
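A minimal sketch of Equations (2) and (5), assuming the usual ALO formulation in which each step of the walk is ±1 depending on the random number Rnd:

```python
import numpy as np

def random_walk(T, rng):
    """Eq. (2): cumulative sum of +/-1 steps over T iterations (Rnd in [0, 1])."""
    steps = np.where(rng.random(T) > 0.5, 1.0, -1.0)
    return np.concatenate([[0.0], np.cumsum(steps)])

def normalize_walk(X, c_t, d_t):
    """Eq. (5): min-max normalize a walk from its own range [a_i, b_i]
    into the current boundaries [c_i^t, d_i^t] of the i-th dimension."""
    a, b = X.min(), X.max()
    return (X - a) * (d_t - c_t) / (b - a) + c_t

rng = np.random.default_rng(42)
walk = normalize_walk(random_walk(100, rng), c_t=0.0, d_t=1.0)
print(walk.min(), walk.max())  # the walk stays within [0, 1]
```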
Trapping in Antlion's Pits. Ants' random walks are influenced by antlions' traps. Equation (6) randomly places the ants' walks in a hyper-sphere defined by the vectors c and d surrounding a selected antlion, where Antlion_m^t denotes the position of the selected m-th antlion around which the ants are trapped.
Building the Trap. The antlion's hunting ability is modeled using a roulette wheel (RW). The ALO algorithm utilizes the RW operator to elect antlions according to their fitness through the iterations, giving fitter antlions a higher opportunity of catching ants.
Sliding Ants Towards the Antlion. With the foregoing mechanisms, antlions build traps proportional to their fitness and ants move randomly. However, antlions shoot sand outwards from the pit's center as soon as they sense an ant inside the trap, which slides down any trapped ant that tries to escape. To model this behavior, the algorithm adaptively shrinks the radius of the hyper-sphere of the ant's random walk, where c_i and d_i represent the lower and upper bounds of the i-th dimension of the considered problem, I refers to the shrinking ratio, and L denotes a constant parameter that controls the exploitation level.
Catching the Prey and Rebuilding the Pit. At this stage, the objective function is assessed. If an ant attains a better objective value than the selected antlion, the antlion updates its position to the latest position of the hunted ant, improving its opportunity of catching a new one.
Elitism. The best antlion obtained so far is kept as the elite and influences the motions of all ants during the iterations. Consequently, each ant is supposed to walk randomly around an antlion selected by the RW and around the elite simultaneously, where Antlion_RW is the antlion chosen by the RW at the t-th iteration and Antlion_ET denotes the best antlion.
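The RW selection and the elitism averaging can be sketched as follows; the inverse-shift fitness transform used here for minimization is one common convention, not necessarily the paper's exact choice:

```python
import numpy as np

def roulette_wheel(fitness, rng):
    """Select an antlion index with probability proportional to its quality.
    ALO minimizes, so better (smaller) fitness gets a larger selection weight."""
    weights = 1.0 / (fitness - fitness.min() + 1e-9)
    probs = weights / weights.sum()
    return rng.choice(len(fitness), p=probs)

def ant_position(walk_around_rw, walk_around_elite):
    """Elitism: each ant walks around the RW-selected antlion and the elite
    simultaneously; its new position is the average of the two walks."""
    return (walk_around_rw + walk_around_elite) / 2.0

rng = np.random.default_rng(1)
fitness = np.array([0.9, 0.2, 0.5])     # hypothetical antlion fitness values
idx = roulette_wheel(fitness, rng)       # index 1 is the most likely pick
print(ant_position(np.array([0.4]), np.array([0.6])))  # [0.5]
```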

Materials and Methods
This section presents a detailed description of the methods utilized in this work.

Dataset
We used a public dataset for prediction of the OC stage taken from The Cancer Genome Atlas portal, namely TCGA-OV: https://portal.gdc.cancer.gov/ (accessed on 2 March 2021). The dataset includes gene expression data and copy number variant data of 587 OC patients; in addition, each patient has several pathological images. The gene expression data and copy number variant data are one-dimensional and correspond to one modality type, while the pathology images are colored and correspond to another; in this work, we call them the gene modality and the pathology image modality, respectively. The characteristics of the TCGA-OV multi-modal dataset, including the number of patients (samples) representing each OC stage, are detailed in Table 1. The gene expression data comprised about 6426 data indicators, while the copy number variants included about 24,776 data indicators. There were also from one to 10 pathological images available per patient.

System Architecture
In this work, a hybrid bio-inspired deep learning model using multi-modal data is proposed. We construct a multi-modal fusion framework by integrating the gene modality data with the histopathology image modality data of OC patients, as demonstrated in Figure 2. Based on the different states and forms of each modality, a dedicated feature extraction network was established: a predictive antlion-optimized LSTM network model processes the longitudinal gene data, and a predictive antlion-optimized CNN model efficiently extracts abstract features from the pathological images. The topology of both networks is automatically selected by the ALO algorithm. The outputs of the two improved feature extraction networks are then fused by weighted linear aggregation to predict the OC stage.

Data Pre-Processing
The gene modality includes gene expression data and copy number variants. To input the properties of these two data kinds into the network simultaneously, we integrated the data belonging to the same modality, so the newly consolidated data included 31,202 data indicators. To eliminate the chemotaxis and dimensional-disunity problems arising after the integration of the two data types, we adopted the Z-score [12] to pre-process the data. The Z-score is a data standardization method based on computing the average and standard deviation of the original data:

y* = (y − ȳ) / δ	(10)

where y denotes the data item value, ȳ denotes the mean value, and δ denotes the standard deviation. Due to the linear nature of the Z-score, the transformation does not cause "failure" of the data but improves its behavior [12]. Principal component analysis is then applied to reduce data noise and eliminate redundancy; accordingly, the number of items in the gene modality data was reduced from 31,202 to 405.
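A minimal NumPy sketch of this pre-processing pipeline (Z-score followed by an SVD-based principal component projection); the toy matrix below merely stands in for the consolidated 31,202-indicator gene data:

```python
import numpy as np

def z_score(X):
    """Eq. (10): y* = (y - mean) / std, applied per data indicator (column)."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def pca_reduce(X, n_components):
    """Project centered data onto its top principal components via SVD."""
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

# Toy stand-in for the gene-modality matrix (patients x indicators)
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 200))
reduced = pca_reduce(z_score(X), n_components=5)
print(reduced.shape)  # (50, 5)
```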

A Structure Design of Improved LSTM
The gene modality data are large and represented as one-dimensional vectors. The LSTM has proven promising for extracting features from longitudinal data. However, the model accuracy is affected by many architectural factors [38], including the input length (g), the number of hidden-layer units (u), the number of epochs (m), the batch size (b), the initial learning rate (i), etc. For instance, if the value of m is too small, the training will struggle to converge; if it is too large, overfitting will occur. Likewise, if b is too small, the training will struggle to converge, causing underfitting, but if it is too large, the required memory rises significantly. Furthermore, the value of u affects the quality of the fit. With m, u, and b each varied over their respective ranges, a total of 250,000,000 combinations results. This causes a heavy computational issue, so a reliable algorithm has to be utilized to automatically set the hyperparameters and balance computational efficiency against predictive performance. The ALO algorithm has proven competitive in avoiding local optima, exploitation, improved exploration, and convergence [39]; it is superior on many unimodal and multimodal test functions [40]. The reason behind its quick exploitation and convergence is the adaptive boundary-shrinking technique and elitism, while the higher exploration results from the random walks and RW selection mechanisms, which maintain population diversity.
This paper proposes an improved LSTM using the ALO algorithm that can process longitudinal gene data well by selecting a suitable input length for the gene modality and discarding unusual input data, which diminishes prediction errors. The ALO-optimized LSTM network is not restricted to a specific number of LSTM units to sequentially process the gene modality data. Figure 3 shows a flowchart of the model. The optimization procedure is as follows.
Step 1: Data pre-processing. The gene dataset is subdivided into training, validation, and testing sets.
Step 2: Initialization. The algorithm initially sets the positions of a population of n ants and N antlions considering the boundary values of the LSTM hyperparameters. Equation (2) is used to obtain the initial population.
Step 3: The fitness of all ants and antlions is computed using Equation (11), and the elite antlion ET is defined. The best combination of the LSTM hyperparameters α and CNN hyperparameters β on the validation set is sought. We use the cross entropy between the fused prediction prediction_fusion and the Label as the fitness function; thus, the best solution (α, β) is the one minimizing

Min Loss(prediction_fusion, Label)	(11)

Step 4: The roulette wheel (RW) method is used to choose an antlion Antlion_RW^t.
Step 5: A random walk for each ant x ∈ Ant is created and normalized using Equation (5). The position of ant x is then updated as the average of the two random walks (around the RW-selected antlion and around the elite).
Step 6: The fitness values of all ants are then computed by Equation (11), and an antlion is swapped with its related ant if the ant becomes fitter. The elite's position is updated if an antlion has greater fitness than the elite.
Step 7: The previous steps are repeated until reaching maximum iteration T.
Step 8: The best individual from the ALO is assigned as the optimal hyperparameters for the LSTM.
Step 9: The sequential processing of the gene data is implemented using Equation (12):

f_t = σ(W_f · [h_(t−1), x_t] + B_f)
i_t = σ(W_i · [h_(t−1), x_t] + B_i)
o_t = σ(W_o · [h_(t−1), x_t] + B_o)
g_t = tanh(W_g · [h_(t−1), x_t] + B_g)
c_t = f_t ⊗ c_(t−1) + i_t ⊗ g_t
h_t = o_t ⊗ tanh(c_t)	(12)

where f_t is the forget gate's activation vector, i_t refers to the input gate, o_t indicates the output gate, σ(x) denotes the sigmoid function, tanh(x) the hyperbolic tangent function, g_t the input modulation, h_t the hidden LSTM state, x ⊗ y the element-wise multiplication of x and y, t the current timestamp, and W and B the weights and bias of each gate.
Step 10: The validation and training sets are used to train the Improved LSTM.
Step 11: The LSTM is fitted on the testing inputs of the gene data to predict the testing outputs.
The optimal hyperparameters selected by the ALO to tune the LSTM topology are demonstrated in Table 2, together with the search ranges assigned for the input length, the number of hidden-layer units, and the batch size. Additionally, the best input length of the gene modality data was reduced to 90 individuals.

Figure 3. A flowchart of the ALO-optimized LSTM network constructed for feature extraction from the gene modality data.
The pathological images have high-pixel characteristics and large sizes that cannot be directly passed as input to the CNN. In this work, image blocking is first implemented by cutting the full-size image into small-size images. Accordingly, all the pathology images were cropped into sub-batches with a cut size of 1024 × 1024 pixels, each of which was then subdivided from the center point into four small images of the same size of 512 × 512 pixels. The size of these image blocks was adjusted to 224 × 224 pixels to fit feature extraction using the CNN, as demonstrated in Figure 4. The label of each batch was set to the label of the original full-size image.
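The blocking scheme can be sketched as follows, using a hypothetical grayscale slide and a simple nearest-neighbor index map for the 512 → 224 resize (the paper does not specify the interpolation method):

```python
import numpy as np

def block_and_resize(image, cut=1024, sub=512, target=224):
    """Crop a full-size slide into cut x cut batches, split each batch from
    the center into four sub x sub blocks, and resize each block to target."""
    h, w = image.shape[:2]
    idx = np.linspace(0, sub - 1, target).astype(int)  # nearest-neighbor map
    blocks = []
    for y in range(0, h - cut + 1, cut):
        for x in range(0, w - cut + 1, cut):
            batch = image[y:y + cut, x:x + cut]
            for sy in (0, sub):                # four quadrants around the center
                for sx in (0, sub):
                    small = batch[sy:sy + sub, sx:sx + sub]
                    blocks.append(small[np.ix_(idx, idx)])
    return np.stack(blocks)

slide = np.zeros((2048, 2048), dtype=np.uint8)  # hypothetical grayscale slide
patches = block_and_resize(slide)
print(patches.shape)  # (16, 224, 224)
```

Every patch inherits the label of the original full-size image, as described above.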


The Structure Design of Improved CNN
Although studies have shown the robustness of CNN models in various applications, their accuracy is affected by architectural factors [25,29] such as the kernel size (k), stride (s), padding (p), and number of filter channels (f). These hyperparameters affect the sizes of the feature maps at the CNN layers, so the learning time and classification accuracy are significantly influenced. Thus, this paper customizes an improved CNN model for extracting the abstract features of the pathological images based on VGG16 and the ALO algorithm.
As shown in Figure 5, the structure design of the ALO-CNN model comprises the following:

Input layer: It passes the input patches into the deep network for feature extraction. The network automatically adjusts each patch image to a size of 224 × 224 to fit the subsequent feature extraction.
Convolutional layer: In the l-th convolutional layer, the input batch image is subjected to a convolution operation. Supposing h_l denotes the input of the l-th convolutional layer, a kernel of size k_l × k_l, represented by the elite, slides across the input with stride s_l. Let f_l be the number of filter channels and w_i and b_i (i ∈ f_l) indicate the weight and bias of the i-th filter; then the output of the l-th convolutional layer is defined using Equation (13), where σ denotes the activation function used to map the input into a nonlinear space. The output m_i^l represents the feature map resulting from the pathologic image. The size d_l of the feature map m_i^l is computed using Equation (14), based on the elite's parameters:

d_l = (h_l − k_l + 2p_l) / s_l + 1	(14)

where h_l represents the input width, k_l the kernel size, p_l the padding size, and s_l the stride size. Afterwards, the feature maps of the l-th convolutional layer are activated by an activation function to obtain non-linear features. In this model, the ReLU [29] activation was selected by the elite, which clips negative inputs to zero and keeps the positive values, as in Equation (15):

F(x) = max(0, x)	(15)

Pooling layer: The output feature maps are max-pooled using Equation (16), where R_(i,j) denotes the (i, j)-th pooling region, x an element of the convolution output within that region, and P_(i,j) the max-pooled output:

P_(i,j) = max_(x ∈ R_(i,j)) x	(16)
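Equation (14) can be checked with a small helper; the VGG16-style parameter values below are illustrative:

```python
def conv_output_size(h, k, p, s):
    """Eq. (14): spatial size of a conv feature map, given input width h,
    kernel size k, padding p, and stride s."""
    return (h - k + 2 * p) // s + 1

# A VGG16-style 3x3 convolution with padding 1 preserves the spatial size,
# while a 2x2 stride-2 window (max-pooling) halves it.
print(conv_output_size(224, 3, 1, 1))  # 224
print(conv_output_size(224, 2, 0, 2))  # 112
```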

Multimodal Fusion
The improved LSTM model extracts the features from the data of gene modality and predicts the OC stage. Similarly, the improved CNN model extracts deep-level features from the pathology images, and completes the prediction. The main idea behind the multi-modal fusion is to combine the features from different modes to distinguish the same phenomenon [12,41]. The outputs from the two network models designed in this paper are the probability distribution taken after the softmax classification [29]: In the formula, r i indicates the data of column i in the output vector, and c denotes the output vector dimension. After softmax regression, the output from the network represents an N × 5 matrix, in which N is the input samples number, and the column j refers to the likelihood that the sample be in the stage s − 1 of OC. In this study, Stage I, Stage II, Stage III, Stage IV, and Stage Not Available were labeled into class 0, class 1, class 2, class 3, class 4, and class 5, respectively. The multi-modal fusion was conducted in this work using the weighted linear aggregation, as follows: Step 1: The trained LSTM and CNN were, respectively, implemented to validation set, where the prediction result of LSTM is prediction LSTM and the prediction result of CNN is prediction CNN . The results from both models are 77 × 5 matrices, and 77 is the samples number included in the validation set. On the other side, for the optimized CNN model, as each sample involves many pathological image batches, we assigned the output of every sample as the average for all the sub-images of the sample.
Step 2: Fusion of the features from the two modalities was implemented using the following equation, in which the prediction f usion indicates the feature fusion result, α refers to the optimized hyperparameters of LSTM modality represented by the elite, and β denotes the optimized hyperparameters of CNN modality.
Step 3: The best solution of (α, β) combination is transformed to the testing set, using the multi-modal features for predicting OC stage.

Results and Discussion
Training of the model was implemented on a laptop of Intel Core-I7 CPU. Matlab 2016A was used to implement the multi-modal fusion algorithm.

Validation Criteria
For each run of the multi-modal deep learning model, we calculated the following measures on the test data to evaluate the errors given by the classification model. The MAE [23,42] is a metric relevant to L 1 norm . 1 that measures average magnitude of the absolute differences between the predicted vectors S = {x 1 , x 2 , . . . , x N } and the actual observations S * = {y 1 , y 2 , . . . , y N }. Mean-Squared-Error (MSE) [42] is a metric relevant to L 2 norm . 2 that indicates quadratic scoring rule for measuring average magnitude of the predicted vectors S = {x 1 , x 2 , . . . , x N } and actual observations S * = {y 1 , y 2 , . . . , y N }. The Mean-Absolute-Percent-Error (MAPE) and Symmetric-Mean-Absolute-Percentage-Error (SMAPE) [43], are also used which are also measures for deviation error between the predicted values S and actual observations S * , and they reveal the prediction global errors.
So as to highlight the predictive performance of OC using the proposed multi-modal model, we use the precision PRE, recall REC, accuracy ACC, and F1-score for evaluation. The TP denotes the number of OC positive cases predicted correctly, FP indicates misclassification number in the OC positive cases, TN denotes the number of OC negative cases predicted successfully, and FN is misclassification number in the OC negative cases. The F1-score is utilized to make compromise between PRE and REC.
Standard deviation (SD) [40] is a statistical representation for variation in the obtained prediction results found when running the deep learning model for k different runs. The SD is utilized as indicator for the stability and robustness of deep learning algorithm. The Mean denotes the average performance given by the model and o i * indicates the predictive result over the run i.
The statistical test of Wilcoxon rank sum (W-test) [44] was also utilized to assess the performance significance, which is a non-parametric test, assigns ranks for all the scores. The W-test assigns ranks for all the scores deemed as one group, after that it sums ranks of every group. The null assumption of that two-sample test supposes the samples belong to the same population. Thus, if a difference is found in any two rank sums, it only comes from sampling error. This test is a non-parametric version of a t-test for two given independent groups. It, accordingly, tests the null assumption that the data in i and j vectors are samples belonging to continuous distributions of equal medians, versus the alternative assumes they are not.

Experimental Setup
In this work, each multi-modal dataset was split into: training, validation, and test sets with ratio 6:2:2. Table 5 demonstrates the detailed data. The validation set was used to investigate the effect of hyperparameters (α, β) combination on the multi-modal fusion performance. All results were reported using the test set, and the algorithms were performed with 20 independent runs, each run has 5-fold cross validation.

Parameters Setting
The parametric setting of the ALO optimizer and the other nine compared bio-inspired algorithms is shown in Table 6. The parameters of both the constructed LSTM and CNN models, which are automatically set by the ALO algorithm and the compared algorithms, were previously shown in Tables 2 and 3.

First Experiment: Testing Proposed Model Using the OC Multi-Modal Dataset
This experiment analyzes the performance over different models established in this paper. Efficiency of the proposed multi-modal deep learning model is compared to these obtained using the single modalities: improved LSTM-based gene modality and CNNbased pathological image modality. As shown in Tables 7 and 8, the proposed deep learning model by amalgamating heterogeneous features of genomic data and pathological images realizes the lowest MAE, MSE, MAPE, and SMAPE. The obtained error rates are 0.0188, 0.2075, 2.1018, and 3.3056, respectively. The proposed model also shows the lowest SD value (0.044) over the 20 runs, which reveals its stability and robustness. On the other hand, it shows the best predictive performance with regard to precision, recall, accuracy, and F1-score. The given results are 98.76, 98.74, 98.87, and 99.43%, respectively. It is also observed that the gene modality model comes in the second rank after the proposed model in terms of error evaluation rates, predictive performance, and SD value. The potential reason for that optimization is that genomic data can effectively reflect the OC characteristics, so feature learning by using only image level is not enough to realize the high-performance requirements.  Table 8. Predictive performance of the multi-modal and Single modal methods using the TCGA-OV dataset.

Second Experiment: Testing Proposed Model Using Benchmarks for Other Cancers
In this experiment, we evaluate our multi-modal fusion framework on two benchmarks for other cancers taken from the TCGA. The first dataset is Breast Invasive Carcinoma (BRCA) [9,10] that includes histopathology images alongside with genomic profiling reports for 578 cases. Among them, 133 cases are labeled as longer-term survivors and 445 cases are labeled as shorter-term survivors. The second dataset is Lung Squamous Cell Carcinoma, (LUSC) [10,11], which involves 101 cases, 31 patients of them are classified as low-risk survivors, and 70 patients are classified as high-risk survivors. The performance is assessed through various ways. Here, confusion matrix is utilized, which provides valuable information on the predicted and actual labels given by the proposed fusion method. By using that information, the performance can be evaluated from different aspects, as seen in Figure 6 which shows the confusion matrix for BRCA and LUSC datasets versus the TCGA-OV dataset. The diagonal blue-shadowed numbers represent the true positives of cancer class, while the white-shadowed numbers represent the confused false positives. Clearly, we observe that the proposed multi-modal fusion algorithm improved accuracy of each cancer recognition system by significant amount, through increasing the number of true positives and reducing the number of false positives, over each dataset.
Symmetry 2021, 13, 643 17 of 25 0.1254 and 0.1066, and the SD value is 0.047 and 0.029, respectively, which are lower than these of the compared single modalities. Furthermore, for each benchmark, the significance of performance is also assessed using the Wilcoxon test and the results are shown in Table 10. The table demonstrates that across all the datasets, the proposed model has significant difference over all the compared single modalities, which means that it reveals significant enhance over all these models at (0.05) significance level. These results obviously reflect that the integrative analysis for both pathological images and genomic data using the proposed multi-modal fusion model can also efficiently improve the predictive performance of breast and lung cancers, which indicates that the proposed model is applicable to the other cancers.   From Table 9, we can also derive that for the BRCA and LUSC benchmarks, the proposed model realizes more accurate discriminative results than the single gene modal and histopathology image modal. The accuracy is 98.8 and 99.28%, while the F-score is 98.92 and 99.31%, respectively. The obtained MAE rate is 0.0155 and 0.0139, while the MSE is 0.1254 and 0.1066, and the SD value is 0.047 and 0.029, respectively, which are lower than these of the compared single modalities. Furthermore, for each benchmark, the significance of performance is also assessed using the Wilcoxon test and the results are shown in Table 10. The table demonstrates that across all the datasets, the proposed model has significant difference over all the compared single modalities, which means that it reveals significant enhance over all these models at (0.05) significance level. 
These results obviously reflect that the integrative analysis for both pathological images and genomic data using the proposed multi-modal fusion model can also efficiently improve the predictive performance of breast and lung cancers, which indicates that the proposed model is applicable to the other cancers.

Comparative Analysis
Extensive experiments were conducted to assess the performance of proposed multimodal fusion model (with) and without the optimization algorithms, by using the three benchmarks. Accordingly, we evaluate the effect of the topology of LSTM and CNN models selected by different bio-inspired optimization algorithms on the error reduction and predictive performance of the model. The compared state-of-the-art algorithms include the GA [35], PSO [33], DE [30], GWO [26], ABC [29], WO [31], CS algorithm [28], and BAT algorithm [34]. By comparing the prediction ability of the proposed model to these obtained by the other nine models established using the compared bio-inspired algorithms shown in Table 11, we can realize that across the three datasets, the proposed model achieved the highest precision, recall, accuracy, and F-score, with limited error rates and low SD values. In terms of Wilcoxon test, it also has significance difference against the other constructed bio-inspired deep learning models, at significance level of 0.05. Furthermore, models based on CS, GWO, and WO ranked as second, third, and fourth, respectively, with regard to OC stage prediction. It is also noted that the model constructed for the diagnosis using only the two deep feature extractors without bio-inspired optimization has shown lower performance in comparison to the other evolutionary multi-modal deep learning models or even the other evolutionary single modal deep learning models. This reflects the high impact of the model construction on the deep learning performance. 
The convergence speed curves of all the constructed multi-modal fusion models using the TCGA-OV, BRCA, and LUSC datasets are presented in Figure 7, respectively, which demonstrates how the fitness decreases over the iterations and proves that the ALO escapes local minima in the three multi-modal datasets and is capable of finding the optimal combination of LSTM and CNN weights that improves the fusion results in less than 10 iterations. Furthermore, the proposed model achieved optimal prediction accuracy in a shorter computational time across the three datasets. Average CPU time of roughly 120.6, 173.4, and 80.8 s has been taken across the whole dataset, respectively, while the best and worst computed fitness yielded by the model across each dataset were ≈0.08 and ≈0.09, respectively, as demonstrated in Figure 8.  So as to discuss the rationality of CNN constructed in this work for pathology image feature extraction, and investigate the advantages of VGG16 infrastructure optimized with ALO algorithm, we replace the VGG16 architecture with the AlexNet [45] and ResNet34 [45], and do some comparisons. In this context, the ALO algorithm was used to set the topology of each network model. It is clearly noted from Table 12 that the accuracy based upon the ResNet34 is 96.55, 96.81, and 96.97% on TCGA-OV, BRCA, and LUSC, respectively, while the accuracy based upon the AlexNet is 95.95, 95.63, and 95.88%, respectively. These accuracies are lower in comparison to the VGG16. Additionally, the ResNet34 model outperforms the AlexNet model in terms of predictive performance, error rate, and SD value. Furthermore, the CPU time taken by using the proposed model was smaller than the two other models over the three datasets. There is also significant difference at 0.05-level for the VGG16 model against the two other models in terms of Wilcoxon test.

Comparisons to Others
To further verify the efficiency of the proposed multi-modal fusion model for cancer diagnosis from multi-modal data, we compared it to the recent works that used the same multi-modal benchmarks used in this work. Table 13 [6,[9][10][11][12] illustrates the comparison. For instance, Yu et al. [6] presented a single modal approach based on analysis of the whole-slide pathology images from 587 primary serous ovarian adenocarcinoma patients taken from the TCGA. The CNNs were used to classify the pathology images with cancerous cells, predict the pathology grade of the OC patient, and predict platinum-based chemotherapy. The area under the receiver-operating-characteristic curve (AUC) was 0.95 for classification of cancerous regions and 0.80 for classification of tumor grade. The "inverted pyramid" deep neural network and VGG16 were the methods used by Liu et al. [12] to, respectively, extract gene features and pathological image features of breast cancer. The multi-modal fusion was implemented using weighted linear aggregation and simulated annealing algorithm. In summary, the multi-modal fusion model established in this work was superior to the other models with regard to various indicators. There are multiple factors for why the proposed model performs well in diagnosis of OC and other cancers. First, combining heterogeneous features of genomic data and pathology images can consistently achieve robust and accurate diagnosis results than the single modal diagnosis. Second, the traditional ML and CNNs models still lack to the optimization of the model construction in the domain of cancer subtypes. We present a flexible designed ALO-LSTM gene feature extraction network, which does not suppose fixed topology, for example, the input length and number of LSTM units need in the sequential processing of gene data is selected by the ALO. 
The established network was stable in processing the gene features for a short or a long interval of time without causing vanishing gradient. Additionally, the improved ALO-CNN pathological feature extraction network does not suppose fixed size for kernels, filters, strides, and padding values. This helped to provide the most informative features from the two modalities and improved the diagnosis results.

Conclusions
This paper works to overcome the potential insufficient representation of OC characteristics caused by the single modal approaches used previously in the state-of-the-art by proposing a multi-modal deep learning model in order to precisely predict OC stage. The model combines gene modality along with histopathology image modality. So it is composed of two evolutionary deep-feature extraction network models. The first one is a predictive ALO-optimized LSTM network to sequentially process the longitudinal data of gene modality while the second is a predictive ALO-optimized CNN for extracting the abstract features from pathological images. In this context, the ALO optimizer is hybridized with each feature network to automatically set its topology, which helped to diminish the model's errors and increase predictive accuracy of OC stage. Then, the deep features from the two improved networks are fused based upon weighted linear aggregation. The experimental results were conducted using a public multi-modal OC dataset and two benchmarks for other cancers. The results revealed that the proposed multi-modal fusion model by amalgamating heterogeneous features, including genomic and image information, realizes optimal accuracy and lower error rates than those models comprising single modal data. After that, extensive comparisons have been made using each multi-modal dataset for assessing the performance based on various recent bio-inspired optimization algorithms. Among the compared models, the proposed model achieved not only the highest prediction accuracy and lowest classification error but also the highest convergence speed and shortest CPU time over all the multi-modal datasets. Furthermore, the proposed model shows significant difference against all the models constructed in this paper in terms of Wilcoxon test. 
These results are due to the stability and robustness of the ALO-LSTM network in extracting genomic features from the gene modal data for a short or a long interval of time without causing vanishing gradient. In addition to the flexibility of ALO-CNN network which does not consider static values for kernels, padding, and filter channels. This made the network extract the sufficient abstract features from the pathological images and discard the useless ones, in addition to the higher exploitation and convergence speed of the ALO algorithm in providing the best topology for each feature network in a way that optimized the prediction results of OC and other cancers. In the future work, we intend to amalgamate different modalities to deeply predict other diseases including respiratory system disease.