This section describes the experimental stages used to illustrate the study's findings. Broadly, it covers the datasets and parameter descriptions and the algorithm of the proposed FOWFS feature fusion approach. The experimentation was performed on an Intel(R) Core(TM) i5-7200U CPU @ 2.50 GHz (2.71 GHz) with 4.00 GB (3.88 GB usable) RAM and a 64-bit operating system on an x64-based processor, executed on the Google Colab platform.
4.4. Result Analysis and Validation
This section discusses the experimental results of all the proposed deep feature fusion approaches for skin lesion classification on the HAM 10000 and BCN 20000 datasets, along with the evaluation and validation of the feature selection strategies. In the first phase of experimentation, transfer learning was employed to obtain better performance with less computational effort. Three pre-trained CNN models, VGG16, EfficientNet B0, and ResNet50, were used to extract meaningful features from the new images.
Table 3 shows the experimental results of those three pre-trained models for both skin lesion image datasets, reporting the feature acquisition time (in minutes) with respect to the original features. A straightforward comparison was then carried out to validate accuracy with respect to the fused feature sets and the highest-ranked feature sets (features whose weights satisfy the selection criterion) obtained from the three pre-trained models, using Decision Tree (DT), Naïve Bayes (NB), Multi-Layer Perceptron (MLP), and Support Vector Machine (SVM) classifiers, as discussed in
Table 4,
Table 5,
Table 6,
Table 7, respectively. From these tables, it can be seen that, for both datasets, the number of features obtained from the three pre-trained models is initially 2560, which forms the fused feature set; the CFS selects all 2560 features, and since every feature is passed to the classification process, no ranking of features is performed and hence there is no improvement in validation accuracy.
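As a concrete sketch, the plain CFS fusion step amounts to concatenating the three per-model feature blocks (512 + 1024 + 1024 = 2560 dimensions). The snippet below uses random arrays as stand-ins for the real deep features; only the dimensions follow the paper.

```python
import numpy as np

# Hypothetical per-model feature blocks for a batch of 8 images; the
# dimensions follow the paper: VGG16 -> 512, EfficientNet B0 -> 1024,
# ResNet50 -> 1024 (random values stand in for the real deep features).
rng = np.random.default_rng(0)
f_vgg16 = rng.standard_normal((8, 512))
f_effb0 = rng.standard_normal((8, 1024))
f_resnet50 = rng.standard_normal((8, 1024))

# CFS: plain concatenation of the three blocks into one fused set.
fused = np.concatenate([f_vgg16, f_effb0, f_resnet50], axis=1)
print(fused.shape)  # (8, 2560)
```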
From
Table 4, it can be seen that, for the HAM 10000 dataset, AWFS selects the highest-ranked feature set using the weights of VGG16 (with 512 features) and either of the other two pre-trained models (with 1024 features) based on the DT classifier, with an improved accuracy of 94.10%. It can also be seen that MOWFS-AJS and FOWFS-AJS achieve validation accuracies of 94.24% and 94.22%, respectively, with highest-ranked feature sets of 1024 and 914 features. Considering the improvement in accuracy over CFS, it is evident that, with fewer selected features, MOWFS-AJS and FOWFS-AJS achieve 3.14% and 3.12% higher accuracy for the HAM 10000 dataset based on the DT classifier. Similarly, for the BCN 20000 dataset, the improvements of MOWFS-AJS and FOWFS-AJS over CFS were found to be 7.77% and 7.75%, respectively, again with fewer features selected in the ranked fused feature set based on the DT classifier.
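The weighted fusion underlying AWFS can be sketched as scaling each model's feature block by its weight before concatenation. The weights below are illustrative only, not the values learned in the experiments.

```python
import numpy as np

rng = np.random.default_rng(1)
blocks = {  # per-model feature blocks (batch of 4); sizes as in the paper
    "vgg16": rng.standard_normal((4, 512)),
    "effb0": rng.standard_normal((4, 1024)),
    "resnet50": rng.standard_normal((4, 1024)),
}
weights = {"vgg16": 0.9, "effb0": 0.6, "resnet50": 0.4}  # illustrative only

# AWFS-style fusion: scale each model's block by its weight, then concatenate.
weighted = np.concatenate(
    [weights[name] * feats for name, feats in blocks.items()], axis=1
)
assert weighted.shape == (4, 2560)
```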
The performance based on NB classifier from
Table 5 can be detailed as follows. The observed improvements of MOWFS-AJS and FOWFS-AJS over CFS for the HAM 10000 dataset were found to be 3.1% and 3.3%, respectively, with 1024 + 1024 and 1015 features in the ranked feature sets. Similarly, for the BCN 20000 dataset, the recorded improvements of MOWFS-AJS and FOWFS-AJS over CFS were 6.27% and 6.47%. Additionally, FOWFS-AJS selects only 998 features for classification, far fewer than either of the other two strategies.
Table 6 depicts the performance of all proposed feature fusion strategies based on the MLP classifier. From this table, it can be seen that FOWFS-AJS outperforms the rest of the compared methods for both datasets. The observed improvements of MOWFS-AJS and FOWFS-AJS over CFS for the HAM 10000 dataset were found to be 3.3% and 3.58%, respectively, with 512 and 975 features in the ranked feature sets. Similarly, for the BCN 20000 dataset, the recorded improvements of MOWFS-AJS and FOWFS-AJS over CFS are 5.49% and 5.57%, with 512 + 1024 and 929 features selected from the ranked feature sets.
Similarly, the performance based on the SVM classifier for both datasets is recorded in
Table 7. From this table, we can see that the improvements of MOWFS-AJS and FOWFS-AJS over CFS for the HAM 10000 dataset were found to be 3.87% and 5.54%, respectively, with 512 + 1024 and 876 features in the ranked feature sets. For the BCN 20000 dataset, the recorded improvements of MOWFS-AJS and FOWFS-AJS over CFS were 4.65% and 6.32%, with 512 and 899 features selected from the ranked feature sets. From
Table 5 to
Table 7, FOWFS-AJS outperformed the rest of the proposed feature fusion strategies in validation accuracy measured using NB, MLP, and SVM for both skin lesion datasets; the exception is the performance recorded using DT, where MOWFS-AJS shows better results than the other strategies (
Table 4), but even there its advantage over FOWFS-AJS is only 0.02% for both datasets.
The recognition performance of the three CNNs’ pre-trained models and the proposed strategies, namely CFS, AWFS, MOWFS-GA, MOWFS-PSO, MOWFS-AJS, FOWFS-GA, FOWFS-PSO, and FOWFS-AJS, are recorded in
Table 8 and
Table 9 for the HAM 10000 and BCN 20000 datasets, respectively, by measuring accuracy, precision, sensitivity, and F1-score based on all four classification algorithms. From both tables, it is observed that SVM shows the best recognition performance and that FOWFS-AJS shows an improved recognition rate with respect to all the models considered for comparison.
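For reference, the four reported metrics can be computed from the true and predicted labels. A minimal macro-averaged sketch (the function name and sample labels are ours, not the paper's) is:

```python
import numpy as np

def macro_scores(y_true, y_pred, n_classes):
    """Accuracy, macro precision, macro sensitivity (recall), and the
    harmonic mean of the two macro scores as an F1-style summary."""
    precisions, recalls = [], []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    p, r = np.mean(precisions), np.mean(recalls)
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    accuracy = np.mean(y_true == y_pred)
    return accuracy, p, r, f1

# Toy example with three classes (not real experimental labels).
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])
acc, prec, sens, f1 = macro_scores(y_true, y_pred, 3)
```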
Further, a straightforward comparison was made considering the observed validation accuracy of all the proposed feature fusion strategies for the combined or fused feature sets and the feature sets obtained after ranking based on all four classifiers for both of the datasets as given in
Figure 6,
Figure 7,
Figure 8,
Figure 9. The differences in validation accuracy based on DT classifier for HAM 10000 and BCN 20000 datasets are represented in
Figure 6a,b, respectively. These figures show the significant improvement of MOWFS-AJS and FOWFS-AJS over the remaining six strategies; MOWFS-AJS performed best in this case, with improvements of 1.09% (fused feature set) and 2.91% (ranked feature set) for the HAM 10000 dataset and 3.51% and 7.75% for the BCN 20000 dataset. FOWFS-AJS showed better validation accuracy than the rest of the proposed strategies based on the NB, MLP, and SVM classifiers. From
Figure 7a,b, it can be seen that FOWFS-AJS improved over CFS by 1% (fused feature set) and 2% (ranked feature set) for the HAM 10000 dataset, and by 3.24% (fused feature set) and 6.47% (ranked feature set) for the BCN 20000 dataset. Similarly, the accuracy recorded based on the MLP and SVM classifiers can be summarized as improvements of 1.34% (fused feature set) and 3.51% (ranked feature set), and 2% (fused feature set) and 5.54% (ranked feature set), for the HAM 10000 dataset (
Figure 8a and
Figure 9a), and 2.99% (fused feature set) and 5.57% (ranked feature set), and 1.64% (fused feature set) and 6.35% (ranked feature set), for the BCN 20000 dataset, respectively (
Figure 8b and
Figure 9b).
Additionally, area under the receiver operating characteristic curve (AUC-ROC) plots were generated to measure the performance and degree of separability of the three proposed strategies AWFS, MOWFS-AJS, and FOWFS-AJS, describing how well the models distinguish between the classes; these are presented in
Figure 10,
Figure 11,
Figure 12,
Figure 13 for both datasets based on DT, NB, MLP, and SVM classifiers. From
Figure 10a,b, it is observed that FOWFS-AJS showed the best performance, with 90.9% and 91.06% for the HAM 10000 and BCN 20000 datasets, respectively. Similarly, the recorded performance for the three remaining classifiers can be summarized as follows: based on the NB classifier, the best recorded performance of FOWFS-AJS was 92.84% and 93.21% for the HAM 10000 and BCN 20000 datasets, respectively (
Figure 11a,b); based on MLP, FOWFS-AJS showed 93.24% and 93.81% for HAM 10000 and BCN 20000 datasets, respectively (
Figure 12a,b); and similarly, the SVM recorded a performance of FOWFS-AJS as 94.05% and 94.90%, respectively, for HAM 10000 and BCN 20000 datasets (
Figure 13a,b).
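The AUC summarized by such curves can also be computed directly from classifier scores via the Mann–Whitney U statistic. A small binary-case sketch (toy labels and scores, not the paper's data) is:

```python
import numpy as np

def auc_score(y_true, scores):
    """AUC via the Mann–Whitney U statistic: the probability that a random
    positive example is scored above a random negative one (ties count half)."""
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

y = np.array([0, 0, 1, 1])
s = np.array([0.1, 0.4, 0.35, 0.8])
print(auc_score(y, s))  # 0.75
```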
Finally, a computational comparison (in minutes) was made between the feature acquisition time of the proposed deep feature fusion strategies and the mean time taken by the classification algorithms to classify the skin lesion datasets with the updated feature sets, as shown in
Figure 14a,b for the HAM 10000 and BCN 20000 datasets, respectively. From these two figures, it is also evident that the proposed FOWFS-AJS showed comparatively better performance with respect to both feature acquisition and classification time for both datasets.
4.5. Validation through Statistical Test
The experimental results were further analyzed using a non-parametric Friedman statistical test [
51,
52] to determine whether there is a statistically significant difference between the models and strategies experimented with and compared. Here, the test was used to compare all three pre-trained CNN models and the proposed fusion strategies on both the HAM 10000 and BCN 20000 datasets. To analyze the performance of VGG16, EfficientNet B0, ResNet50, CFS, AWFS, MOWFS-GA, MOWFS-PSO, MOWFS-AJS, FOWFS-GA, FOWFS-PSO, and FOWFS-AJS, the test was performed from the perspective of average ranking. The Friedman statistic under the null hypothesis was computed using Equation (8):

χ²_F = (12N / (k(k + 1))) [ Σ_j R_j² − k(k + 1)² / 4 ]        (8)

where χ²_F, N, k, and R_j represent the statistical value, the number of datasets, the number of strategies compared, and the average ranking of the j-th strategy, respectively. The derived statistic F_F = ((N − 1) χ²_F) / (N(k − 1) − χ²_F) is distributed according to the Fisher (F) distribution with k − 1 and (k − 1)(N − 1) degrees of freedom. Under the significance level α = 0.05, with k − 1 = 10 and (k − 1)(N − 1) = 10 degrees of freedom, the critical value is 2.9782, and it is smaller than the observed values of F_F on all evaluation metrics. This means that the null hypothesis is rejected on all evaluation metrics under this test, and the proposed FOWFS-AJS deep feature fusion strategy achieves satisfactory performance over the two datasets and eleven compared models.
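The Friedman statistic and the derived F statistic can be computed directly from the average ranks. In the sketch below, the ranks are illustrative, not the values observed in the experiments; only k = 11 strategies, N = 2 datasets, and the critical value F(10, 10) ≈ 2.9782 at α = 0.05 follow the text.

```python
def friedman_f(avg_ranks, n_datasets):
    """Friedman chi-square (Equation (8)) and the derived Iman–Davenport
    F statistic for k strategies ranked over N datasets."""
    k = len(avg_ranks)
    chi2 = (12 * n_datasets) / (k * (k + 1)) * (
        sum(r * r for r in avg_ranks) - k * (k + 1) ** 2 / 4
    )
    f_stat = (n_datasets - 1) * chi2 / (n_datasets * (k - 1) - chi2)
    return chi2, f_stat

# Illustrative average ranks for k = 11 strategies over N = 2 datasets
# (NOT the ranks observed in the paper); they sum to k(k + 1)/2 = 66.
ranks = [1.5, 2.5, 3, 4, 5, 6, 7, 8, 9, 9.5, 10.5]
chi2, f_stat = friedman_f(ranks, 2)
critical = 2.9782  # F(10, 10) at alpha = 0.05
rejected = f_stat > critical
```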
4.6. Discussions on Key Findings
The key findings of this research are as follows. Transfer learning at the feature level based on the three pre-trained CNNs achieved optimal performance faster than traditional feature selection models, and ensemble learning over the features output by those three networks, used to design the feature fusion models (for example, CFS), proved effective for building a robust classifier for the skin lesion datasets. The experiments showed that a CFS model based on the basic fusion strategy alone does not provide sufficient leverage; therefore, the weighted approach of selecting features and forming feature sets through AWFS was evaluated and shown to perform better than CFS. Beyond plain feature fusion, deciding on the fusion through the AJS optimizer, which exploits the two special cases (active and passive motions) of the algorithm to identify the optimum point, helped to find the best cost. In this study, the two decision-based feature fusion models, the model-based MOWFS-AJS and the feature-based FOWFS-AJS, formed by adaptively choosing optimal weights, showed comparatively good performance. For both strategies, the mean squared error of the opted classification model, MSE = (1/n) Σ_{i=1}^{n} (y_i − ŷ_i)², is taken as the cost, and the weights w₁, w₂, and w₃ assigned to VGG16, EfficientNet B0, and ResNet50 are the decision variables. This process was continued for 50 iterations to obtain optimized weights for all three pre-trained models. The final ensemble of features for the test set under the model-based strategy was thus formed as w₁·F_VGG16, w₂·F_EffB0, w₃·F_ResNet50. The feature-based strategy instead optimizes an adaptively chosen weight for every individual feature of the combined weighted feature set, with a total of 512 + 1024 + 1024 weights.
The process of obtaining optimized weights is then performed and finally returns 512 + 1024 + 1024 optimized weights, one per feature, from which the combined feature set is formed as w₁·f₁, w₂·f₂, …, w₂₅₆₀·f₂₅₆₀. Features whose optimized weights satisfied the selection threshold were considered the best-performing features and retained for the final classification. The performance of the proposed deep feature fusion strategies was evaluated with each classification model, and the proposed optimized strategies were compared with GA and PSO, two widely used meta-heuristic optimization techniques, through accuracy, precision, sensitivity, and F1-score. Finally, the Friedman statistical test was performed to statistically validate the proposed strategies.
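A simplified sketch of the AJS-style weight search is given below. The cost function is a stand-in for the classifier MSE on real data, the target weights are hypothetical, and the motion rules are a compressed rendering of the active/passive scheme described above; all constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)

def cost(w):
    # Stand-in for the paper's cost: the MSE of the opted classifier on
    # features weighted by w (the real objective needs the actual data).
    target = np.array([0.8, 0.5, 0.3])  # hypothetical optimum weights
    return float(np.mean((w - target) ** 2))

pop = rng.random((20, 3))          # 20 jellyfish, one weight per model
best = min(pop, key=cost).copy()
T = 50                             # iterations, as in the paper
for t in range(1, T + 1):
    c_t = abs((1 - t / T) * (2 * rng.random() - 1))   # time control
    for i in range(len(pop)):
        if c_t >= 0.5:             # follow the ocean current toward best
            trial = pop[i] + rng.random(3) * (best - 3 * rng.random() * pop.mean(0))
        elif rng.random() > 1 - c_t:  # passive motion around own position
            trial = pop[i] + 0.1 * (rng.random(3) - 0.5)
        else:                      # active motion relative to jellyfish j
            j = rng.integers(len(pop))
            step = rng.random(3) * (pop[j] - pop[i])
            trial = pop[i] + step if cost(pop[j]) < cost(pop[i]) else pop[i] - step
        trial = np.clip(trial, 0.0, 1.0)
        if cost(trial) < cost(pop[i]):   # greedy replacement
            pop[i] = trial
            if cost(trial) < cost(best):
                best = trial.copy()
```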
The empirical evidence showed that adaptively choosing the weights of the pre-trained networks in an optimized way provides a good starting point for initialization, mitigating the chances of exploding or vanishing gradients. Thus, the performance of FOWFS-AJS with SVM learning leveraged the existing networks for both skin lesion datasets, and the advantage of properly selecting rich and informative features through this feature-based optimized approach was borne out during the experimentation and validation processes.