Medical Image Classifications Using Convolutional Neural Networks: A Survey of Current Methods and Statistical Modeling of the Literature

Abstract: In this review, we compiled convolutional neural network (CNN) methods which have the potential to automate the manual, costly and error-prone processing of medical images. We attempted to provide a thorough survey of improved architectures, popular frameworks, activation functions, ensemble techniques, hyperparameter optimizations, performance metrics, relevant datasets and data preprocessing strategies that can be used to design robust CNN models. We also used machine learning algorithms for the statistical modeling of the current literature to uncover latent topics, method gaps, prevalent themes and potential future advancements. The statistical modeling results indicate a temporal shift in favor of improved CNN designs, such as a shift from the use of a CNN architecture to a CNN-transformer hybrid. The insights from statistical modeling suggest that the surge of CNN practitioners into the medical imaging field, partly driven by the COVID-19 challenge, catalyzed the use of CNN methods for detecting and diagnosing pathological conditions. This phenomenon likely contributed to the sharp increase in the number of publications on the use of CNNs for medical imaging, both during and after the pandemic. Overall, the existing literature has certain gaps in scope with respect to the design and optimization of CNN architectures and methods specifically for medical imaging. Additionally, there is a lack of post hoc explainability of CNN models and slow progress in adopting CNNs for low-resource medical imaging. This review ends with a list of open research questions that have been identified through statistical modeling and recommendations that can potentially help set up more robust, improved and reproducible CNN experiments for medical imaging.


Introduction
Traditionally, medical images are manually annotated by domain experts with special skills, which makes the overall process labor intensive, expensive, slow and error-prone [1]. Automated, faster and more accurate methods are critical for near real-time diagnosis and better patient outcomes. This review focuses on the application of convolutional neural networks for medical image classifications, emphasizing recent improvements in algorithms and approaches. We covered key CNN methodologies as applied in research and clinical settings with respect to medical image localization, detection, preprocessing, segmentations and classifications. Machine learning algorithms were also applied for statistical modeling of the current literature to uncover latent topics, prevalent themes, method gaps and possible future advancements.

Background and Context
Healthcare is a high priority sector due to its importance for wellness, healthspan and lifespan. Higher levels of healthcare and services entail direct, fast and reliable diagnostic approaches, such as using medical imaging. However, the interpretations of medical images by medical experts are quite limited due to the subjectivity of experts and the complexity of the images [1]. In addition, extensive variations exist among experts, partly attributable to human fatigue as a result of the heavy workloads of medical professionals [1].
Following the success of CNNs in image processing in other real-world applications, they are also being explored as a key and robust method for applications in clinical settings [2,3]. In this review, we compiled recently improved components of deep CNN architectures, popular frameworks, activation functions, preprocessing approaches, publicly available datasets, ensemble methods and optimization techniques that are being applied for medical image understanding. Additionally, we used machine learning-based statistical modeling of the current literature to identify patterns, trends, method gaps and future advancements that were not obvious from the individual studies. This review ends with a discussion on methodological challenges and open research issues with regard to applications of CNNs for medical imaging.

Importance of CNN for Medical Image Classification
Imaging techniques are used to capture anomalies or pathological parts of the human body [4]. The captured images must be understood for the diagnosis, prognosis and treatment planning of the anomalies [4]. Analyzing images generated in clinical practice by extracting information in an efficient manner is critical for improved clinical diagnosis. However, the effectiveness of image understanding performed by skilled medical professionals is limited (and the process is slow and error-prone) due to the scarce availability of human experts and the fatigue and rough estimate procedures involved. CNNs are being widely accepted and practiced as effective tools for image understanding due to their ability to learn and extract features automatically.
There is a growing interest among researchers and clinicians in applying CNN methods for segmentation, abnormality detection, disease classification and diagnosis [5–8]. Different variations of CNN methods use different approaches to improve their performance in wide ranges of image classification tasks [9,10]. The robustness and automatability of CNNs, in addition to reports of CNN techniques outperforming human experts, seem to be the driving forces for the enthusiastic surge of their use in medical image understanding [11–13].

Objectives of the Study
This study is designed to organize and present CNN algorithms (including improved architectures), activation functions, popular frameworks, optimization approaches, relevant datasets, data preprocessing techniques and model ensemble methods in one place and make them available for researchers and clinicians who are interested in setting up their own CNN experiments (Figure 1). Another important objective of this study is to use machine learning algorithms for the statistical modeling of the literature, to identify current practices and future trends that are relevant to CNN application for medical image understanding (Figure 1).

What Distinguishes the Current Study from Previously Published Review Papers?
There are excellent reviews focusing on the application of deep learning and CNNs for medical image understanding. Many of these reviews focus on the findings of individual papers (result reviews), on a specific method, on a specific type of medical imaging or on a particular disease. In contrast, this review is a method review that is agnostic to both image type and disease, and it also includes the machine learning-assisted statistical modeling of the current literature on the application of CNNs for medical image understanding. There are other reviews which have surveyed broader areas of the literature. Even so, the coverage of alternative CNN components in previously published review papers is less comprehensive than what is compiled in this study.

Review of CNN Algorithms and Methods
In this review, we compiled and summarized the recent advances in CNN-based medical image understanding and highlighted methodological challenges and opportunities. We began by providing an overview of the key components of CNN architectures, design improvements and activation functions that are used for medical image understanding. Then, we discussed popular frameworks, ensemble techniques, widely used hyperparameters, optimizations and tuning approaches, performance metrics, databases of relevant medical images and input data preprocessing that are essential for developing more robust and transferable models. This study attempted to provide a comprehensive overview of the current state of CNN-based methods as applied for medical image understanding. Additionally, statistical modeling was used to identify some of the open CNN-related method gaps and to suggest potential future advancements.

Basic Architectures of CNNs
CNNs have become the mainstream algorithms for image classification due to their remarkable performance for object detection, action recognition, image classification, segmentation and disease diagnosis [7,14–22]. CNNs have the advantage of being able to distinguish complex shapes of images [23] due to their ability to learn and extract features without the need for prior knowledge or human intervention [24].
Architectures of CNNs are designed to automatically learn and extract features from images through a series of convolutional, pooling and fully connected layers (Figure 2) [10,16]. Each layer in a typical CNN architecture (Figure 2) has a specific purpose:
• The input layer is the first layer, which receives the input image.
• The convolutional layer is the core layer of the CNN architecture, where the convolution operation is performed using a set of learnable kernels or filters to detect edges, corners and textures (to extract features from the input data) [25]. Feature extraction may involve strides and paddings along with kernels (1):
$$W_o = \frac{W_i - F + 2P}{S} + 1 \qquad (1)$$
where $W_o$ is the output size, $W_i$ is the input size, $F$ is the kernel size, $P$ is the padding size and $S$ is the size of the stride.
• The activation function introduces non-linearity to capture complex relationships in the data, and it is applied element-wise to the output of the convolutional layer.
• The pooling layer is applied to reduce the spatial dimensions (width and height) of the feature maps obtained from the convolution layer [16] by performing down-sampling.
• The fully connected layer is used to learn high-level representations by combining features learned from the previous layers.
• The output layer is the last layer, which produces the desired output based on the task at hand.
CNN architectures can have additional components like dropout and normalization layers, depending on the specific application and network design.
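The relationship between input size, kernel size, padding and stride described for the convolutional layer can be checked numerically. The following is a minimal sketch, not taken from any particular framework; the function name `conv_output_size` and the use of floor division (the usual convention when the stride does not divide evenly) are assumptions made for illustration.

```python
def conv_output_size(w_in: int, kernel: int, padding: int = 0, stride: int = 1) -> int:
    """Spatial output size of a convolution: floor((W_i - F + 2P) / S) + 1.

    Hypothetical helper; floor division is an assumed convention for
    strides that do not divide the padded input evenly.
    """
    if stride < 1:
        raise ValueError("stride must be a positive integer")
    return (w_in - kernel + 2 * padding) // stride + 1


# A 3x3 kernel with padding 1 and stride 1 preserves a 224-pixel width.
same = conv_output_size(224, kernel=3, padding=1, stride=1)

# A 7x7 kernel with padding 3 and stride 2 halves the width.
halved = conv_output_size(224, kernel=7, padding=3, stride=2)

print(same, halved)
```

The same expression applies independently to the height of the feature map when square kernels and symmetric padding are used.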

Improvements in Architectural Designs of CNNs
Improved or hybrid structures of CNNs with other algorithms such as transformers (Figure 3a), recurrent neural networks (RNNs), generative adversarial networks (GANs) and shallow methods have shown better performances in medical image segmentations and classifications [26]. Each Swin transformer block consists of a residual connection, a two-layer multilayer perceptron (MLP) with Gaussian error linear units, a LayerNorm layer (LNL) and a multi-head self-attention module (Figure 3b).
The self-attention computed from the query (Q), key (K) and value (V) can be given as (2):
$$\mathrm{Attention}(Q, K, V) = \mathrm{SoftMax}\left(\frac{QK^{T}}{\sqrt{Q_d}} + B\right)V \qquad (2)$$
where $Q_d$ is the query dimension and $B$ is the bias matrix.
High-precision medical image segmentation is a challenging task due to the existence of inherent distortion and magnification in medical images as well as the presence of lesions with similar density to normal tissues. Recently, hybrid structures of transformers with CNNs, along with attention blocks, have been designed and progressively improved to tackle the problem of medical segmentation. Many of the hybrid and/or improved structures are also designed for medical image classifications in addition to segmentation tasks (Table 1).
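The self-attention computation with a query, key, value and additive bias matrix can be illustrated with a small pure-Python sketch. It assumes the standard scaled dot-product form (scores divided by the square root of the query dimension, plus the bias, followed by a row-wise softmax); the function names and the toy inputs are hypothetical and are not taken from any of the cited architectures.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(q, k, v, bias):
    """Scaled dot-product attention with an additive bias matrix.

    q, k, v are lists of token vectors; bias[i][j] is added to the score
    between query token i and key token j before the softmax.
    """
    d = len(q[0])  # query dimension
    outputs, weights = [], []
    for i, q_i in enumerate(q):
        scores = [
            sum(a * b for a, b in zip(q_i, k_j)) / math.sqrt(d) + bias[i][j]
            for j, k_j in enumerate(k)
        ]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of the value vectors.
        outputs.append(
            [sum(w_j * v_j[c] for w_j, v_j in zip(w, v)) for c in range(len(v[0]))]
        )
    return outputs, weights


# Toy example with two tokens; the large negative bias stops token 0
# from attending to token 1.
q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
bias = [[0.0, -1e9], [0.0, 0.0]]
out, attn = self_attention(q, k, v, bias)
```

In a shifted-window design the bias is typically a learned relative position term; here it is fixed by hand only to make the effect visible.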

CNN Design Description Specific Application
U-net [28], U-net++ [29] U-shaped network design or a nested U-net architecture.
Attention U-net [33] Attention gate (AG) model. Automatically learns to focus on structures of varying sizes and shapes.
ResNet [34–36] A deep residual learning network (a shortcut connection model to significantly reduce the difficulty of training very deep CNNs).
Aims to simplify very deep networks by introducing a residual block that sums two input signals.

FC-DenseNet [37,38] Fully convolutional DenseNet developed by the composition of dense blocks and pooling operations in which the up-sampling path was introduced to restore the input resolution.
For semantic image segmentation.
Swin Transformer [41] Hierarchical vision transformer using shifted windows (uses a sliding window to limit self-attention calculations to non-overlapping partial windows).
Serves as a general-purpose backbone for medical image segmentation and classification.
Pretrained framework tailored for self-supervised tasks in 3D medical image analysis.
TransUNet [45] Embeds transformers in the down-sampling process to extract the information in the original image.
To solve a lack of high-level detail.
To tackle the problem of high-precision medical image segmentation by introducing a ResLinear-transformer and convolutional linear attention block to FC-DenseNet.
SETR [47] Segmentation transformer.
A pure transformer (without convolution and resolution reduction) to encode an image as a sequence of patches.
Deformable DETR [48] Fully end-to-end object detector with a simple architecture that combines CNNs with a transformer encoder-decoder architecture.
Mitigates the slow convergence and high complexity issues of DETR.
Medical Transformer [49] Gated axial attention for medical image segmentation.
Operates on the whole image and patches to learn global and local features.
O-Net [27] Framework with deep fusion of CNN and transformer.
For simultaneous segmentation and classification.
TransMed [50] Combines CNN and transformer to efficiently extract low-level features of images. Multi-modal medical image classification.
For medical image segmentation.
For tumor detection and classification of MRI brain images.
For medical image segmentation.
HCT-Net [55] Hybrid CNN-transformer model based on a neural architecture search network.
A neural architecture search network for medical image segmentation.
HybridCTrm [56] Bridging CNNs and transformers. For multimodal image segmentation.
For detection and classification of non-small cell lung cancer types.

3D CNN Multimodal Framework [61] The framework comprises a 3D CNN for each modality, whose predictions are then combined using a late fusion strategy based on Dempster-Shafer theory.
Classification of MRI images with multimodal fusion. This multimodal framework processes all the available MRI data in order to reach a diagnosis.
Feature fusion-based ensemble CNN learning optimization [62] An ensemble CNN framework incorporates optimal feature fusion: multiple CNN models with different architectures are trained on the dataset using fine-tuning and transfer learning techniques.
For the automated detection of pneumonia.
Learning optimizations achieved by iteratively eliminating irrelevant features from the fully connected layer of each CNN model using chi-square and mRMR methods.Optimal feature sets are then concatenated to enhance feature vector diversity for classification.
External validation of a deep learning model [63] This model is based on ResNet-18 to automatically assess the mammographic breast density (for each mammogram), providing a quantitative measure of the breast tissue composition.
For breast density classification.

Weakly Supervised Deep Multiple Instance Learning [64] This is a two-stage framework based on deep multiple instance learning. It requires only global labels (weak supervision).
For diagnosis and detection of breast cancer. This approach provides classification of the whole volume and of each slice and the 3D localization of lesions through heatmaps.

DM-CNN [65] Dynamic multi-scale CNN containing four sub-modules: dynamic multi-scale feature fusion module (DMFF), hierarchical dynamic uncertainty quantifies attention (HDUQ-Attention), multi-scale fusion pooling method (MF Pooling) and multi-objective loss (MO loss). DMFF selects convolution kernels according to the feature maps of each level for information fusion. HDUQ-Attention has a tuning block to adjust the attention weight according to the information of each layer. The Monte Carlo (MC) dropout structure is for quantifying uncertainty. The MF pooling is to speed up the computation and prevent overfitting. The MO loss is for a fast optimization speed and good classification effect.
For medical image classification with uncertainty quantification.
Lightweight multimodality UNeXt and Wave-MLP semantic segmentation network [66] The wave block module in Wave-MLP replaces the Tok-MLP module in UNeXt. The phase term in the wave block can dynamically aggregate tokens to improve segmentation accuracy. An attention gate (AG) module at the skip connection suppresses irrelevant feature representations. Then, the focal Tversky loss is added to handle both binary and multiple classification tasks.
For multi-modality medical image semantic segmentation.
MobileNets [67] Efficient CNN based on a streamlined architecture that uses depth-wise separable convolutions to build light-weight deep neural networks.
For mobile and embedded vision applications. Use cases include object detection, fine-grain classification, face attributes and large-scale geo-localization.
UL-BTD [68] An automated ultra-light brain tumor detection (UL-BTD) system based on ultra-light deep learning architecture (UL-DLA) for deep features, integrated with highly distinctive textural features, extracted by gray level co-occurrence matrix.
For multiclass brain tumor detection. It forms a hybrid feature space for tumor detection using support vector machine, leading to high prediction accuracy and optimum false negatives with limited network size to fit within the average GPU resources of a modern PC system.

Activation Functions Used in CNNs
The use of an optimal activation function along with a robust CNN structure is important for medical image analysis. Having a suitable nonlinear activation function can significantly improve the performance of the network. It is important to note that there is no single or "best" activation function that works universally for all CNN architectures. The choice of activation function should be based on empirical evaluation and specific task requirements (Table 2).
Leaky ReLU | x is the input, and alpha is a small positive constant (determines the slope for negative input values). The slope is fixed to a predefined value. | Addresses the issue of "dead neurons" by allowing small negative values instead of setting them to zero; provides some gradient flow for negative inputs during backpropagation.
Parametric ReLU (PReLU) | Alpha is a parameter that can be learned during the training process (controls the slope for negative input values). | Similar to leaky ReLU (although alpha here is a parameter to be learned and optimized). During training, the alpha parameter is updated through backpropagation, enabling the network to learn the optimal value for each neuron. Adjusting the slope for negative inputs can lead to improved performance and better representation learning.
SWISH-RELU | | The advantage of SWISH-RELU is that it retains the desirable properties of Swish, such as the smoothness and non-monotonic behavior, while also providing a fallback to ReLU for negative inputs. This fallback mitigates the problem of dead neurons and vanishing gradients associated with the standard Swish activation function. SWISH-RELU performs well in CNNs for image classification.
[...] completely vanish for negative inputs.
Gaussian Error Linear Unit (GELU) | x is the input and erf is the error function used to model cumulative distribution. | This is smooth and non-monotonic.
ISRU | alpha = 0.5 | ISRU is used as an alternative to sigmoid or tanh in situations where a more gradual transition from low to high activations is desired. But it is not widely used in deep learning models.
* Graphs of activation functions were generated using ggplot2 package of R or an online graphing method called desmos (https://www.desmos.com, accessed: 14 June 2023).
Many of the activation functions (Table 2) are smooth and non-monotonic, meaning that they do not strictly increase or decrease for all input values.
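Several of the activation functions described in Table 2 can be written in a few lines. The sketch below uses the commonly published definitions of leaky ReLU, PReLU, GELU and ISRU, which are assumptions here because the formula column of the table is not reproduced in the text; the default `alpha` values are illustrative only. SWISH-RELU is omitted because its exact form cannot be recovered from the description.

```python
import math


def leaky_relu(x: float, alpha: float = 0.01) -> float:
    """Leaky ReLU: identity for positive inputs, fixed small slope otherwise."""
    return x if x > 0 else alpha * x


def prelu(x: float, alpha: float) -> float:
    """PReLU: same shape as leaky ReLU, but alpha is a learned parameter.

    Here alpha is simply passed in; in a network it would be updated by
    backpropagation together with the weights.
    """
    return x if x > 0 else alpha * x


def gelu(x: float) -> float:
    """GELU: x times the standard normal CDF, expressed with erf."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def isru(x: float, alpha: float = 0.5) -> float:
    """Inverse square root unit: x / sqrt(1 + alpha * x^2)."""
    return x / math.sqrt(1.0 + alpha * x * x)


# GELU is non-monotonic: it dips below zero for moderately negative inputs
# and then rises back toward zero as the input becomes more negative.
gelu_samples = {x: gelu(x) for x in (-3.0, -1.0, 0.0, 1.0)}
```

Comparing `gelu(-3.0)`, `gelu(-1.0)` and `gelu(0.0)` shows the dip that makes the function non-monotonic, whereas `leaky_relu` and `isru` increase over their whole domain.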

Popular Frameworks
There are a number of CNN frameworks widely used for medical image understanding (Table 3). The popularity of frameworks is ranked based on the number of search hits using Google Scholar, PubMed and IEEE Xplore (Figure 4 and Table 3). Keras, used as a TensorFlow interface, seems to be the most widely used framework across the three search engines/databases (Figure 4).

Ensemble Approaches for CNN Models
Several ensemble designs built on CNNs can be used for medical image analyses. Ensemble techniques aim to improve the robustness and accuracy of CNNs [75–77]. Ensemble methods for CNN models include mixture ensemble of CNNs [78] used for breast tumor classification [79], ensembles of pre-trained CNNs (such as Inception v3) [80] used for epilepsy classification [78], in-network ensembles for obstructive sleep apnea detection [81,82], weighted average ensembles for pneumonia detection [83], a self-ensemble framework [84] used for brain lesion segmentation, orthogonal and attentive ensemble networks [85] used for COVID-19 diagnosis [86,87], 3D CNN ensembles used for pulmonary nodule classification in lung cancer screening [88] and ensembles of REFINED-CNN built under different choices of distance metrics and/or projection schemes used for anti-cancer drug sensitivity prediction [89]. Ensembled designs consisting of deep CNN and recurrent neural network architectures are applied for the end-to-end recognition of arousal from ECG signals [90].
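As an illustration of one of the simpler strategies listed above, a weighted average ensemble combines the class probabilities predicted by several models. The sketch below is a generic example rather than the method of any cited study; the function name, the three sets of probabilities and the weights are all hypothetical.

```python
def weighted_average_ensemble(prob_sets, weights):
    """Combine per-model class probabilities with a weighted average.

    prob_sets: one list of class probabilities per model.
    weights:   one non-negative weight per model (normalized internally).
    """
    if len(prob_sets) != len(weights):
        raise ValueError("need exactly one weight per model")
    total = sum(weights)
    n_classes = len(prob_sets[0])
    return [
        sum(w * probs[c] for w, probs in zip(weights, prob_sets)) / total
        for c in range(n_classes)
    ]


# Hypothetical outputs of three CNNs for one image: [P(normal), P(pneumonia)].
model_probs = [[0.6, 0.4], [0.3, 0.7], [0.2, 0.8]]
# Hypothetical weights, e.g. chosen from each model's validation performance.
model_weights = [0.5, 0.3, 0.2]

ensemble_probs = weighted_average_ensemble(model_probs, model_weights)
predicted_class = max(range(len(ensemble_probs)), key=ensemble_probs.__getitem__)
```

Although the most heavily weighted model favors the first class, the combined prediction follows the two models that agree on the second class.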

Hyperparameters of CNNs Used for Medical Image Analyses
Hyperparameters are settings that affect the performance of neural networks. There are more than 420 hyperparameters reported in the literature for tuning deep neural networks (Supplementary Table S1). Of these, around 30 to 40 hyperparameters are widely reported in relation to the application of CNNs for object recognition and medical image understanding (Table 4).

Hyperparameter * | Description
Number of layers | Specifies the depth or the number of layers in the CNN architecture.
Filter/kernel size | Defines the spatial extent of the filters (convolutional kernels) used to scan the input data.
Pooling type | Determines the downsampling operation applied to reduce the spatial dimensions of the feature maps (e.g., max pooling, average pooling).
Pooling size | Specifies the size of the pooling window used for downsampling.
Stride | Defines the step size at which the filter/kernel moves horizontally or vertically when performing convolutions or pooling.
Padding | Determines whether and how extra border pixels are added to the input data before performing convolutions or pooling.
Learning rate decay | Reduces the learning rate over time to allow for finer adjustments during training.
Weight decay | Adds a penalty term to the loss function to discourage large weights, reducing overfitting.
Data augmentation | Applies random transformations to the training data, such as rotation, flipping or zooming, to increase the diversity of examples and improve generalization.
Transfer learning | Uses pre-trained models on large-scale datasets as a starting point for training on a specific task, saving training time and potentially improving performance.
Early stopping | Stops the training process if the validation loss does not improve over a certain number of epochs, preventing overfitting and saving computational resources.
Learning rate schedule | Specifies how the learning rate is adjusted during training, such as by reducing it after a certain number of epochs or based on a predefined schedule.
Initialization of biases | Determines how the biases of the model's layers are initialized.
Learning rate warm-up | Gradually increases the learning rate at the beginning of training to stabilize the optimization process.
Image normalization | Specifies how the input images are normalized (e.g., mean subtraction, scaling to a certain range).
Network architecture | Defines the overall structure of the CNN model, including the arrangement and types of layers (e.g., VGG, ResNet, Inception).
Number of filters per layer | Determines the depth of the feature maps produced by each convolutional layer.
Dilated convolutions | Allow the network to have a larger receptive field without increasing the number of parameters.
Weight sharing | Shares weights across different parts of the network to reduce the number of parameters and improve generalization.
Learning rate annealing | Gradually decreases the learning rate during training to fine-tune the model's parameters.
Input image size | Specifies the size of the input images to the CNN model.
Number of convolutional layers | Determines the depth or capacity of the CNN; the appropriate number depends on the task, the size and diversity of the dataset and the computational resources.
Number of fully connected layers | Used to map the high-level features to the desired output. The number of neurons or units in each fully connected layer is another hyperparameter.
Momentum | Used in optimization algorithms (e.g., SGD) and can improve the convergence speed and stability of CNN training by accumulating momentum from past gradients.
Inverse (inverted) dropout | Scales the retained activations during training so that no rescaling is needed at test time, which keeps inference simple and fast.
L2 regularization | Penalizes the squared magnitude of the weights to discourage large weights and reduce overfitting (sparse feature representations are more typically obtained with L1 regularization).
* These hyperparameters offer a wide range of options for configuring and optimizing CNN models for various tasks, including medical image classification.The optimal values for these hyperparameters depend on the specific task and dataset being used and can be determined through hyperparameter tuning.
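Two of the hyperparameters in Table 4, learning rate warm-up and learning rate decay, can be combined into a single schedule. The following sketch assumes a linear warm-up followed by step decay; the function name and all default values (`base_lr`, `warmup_epochs`, `decay_rate`, `decay_every`) are hypothetical choices for illustration, not recommended settings.

```python
def learning_rate(epoch: int,
                  base_lr: float = 0.01,
                  warmup_epochs: int = 5,
                  decay_rate: float = 0.5,
                  decay_every: int = 10) -> float:
    """Learning rate for a given epoch (0-indexed).

    Linear warm-up to base_lr over the first warmup_epochs, then the rate
    is multiplied by decay_rate once every decay_every epochs.
    """
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    completed_steps = (epoch - warmup_epochs) // decay_every
    return base_lr * decay_rate ** completed_steps


# Learning rate at a few selected epochs of a 30-epoch run.
schedule = {epoch: learning_rate(epoch) for epoch in (0, 4, 5, 15, 25)}
```

In practice the shape of the schedule (step, exponential, cosine and so on) is itself a tuning choice that depends on the task and dataset.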

Hyperparameter Tuning and Optimization Methods
The tuning of deep learning architectures helps to improve the ease of data encoding, integrative layering, multivariate classification and predictive model performance. In particular, the hyperparameter tuning of CNN models is an important step for training, iterative tuning and benchmarking (to make classifications).
The following important and widely used CNN hyperparameter tuning methods can be used to improve the reproducibility of model outputs or performances:
• Automatic hyperparameter optimization tools (like Amazon's HyperparameterTuner or Google Vizier) [91,92].
• Metaheuristic optimization techniques [103] or SHO metaheuristic optimization for fine-tuning the weights, biases and hyperparameters [104].
• The orthogonal array tuning method [2], adaptive hyperparameter tuning and the covariance matrix adaptation evolution strategy.
• Simulated annealing, the KNN approach, per-parameter regularization and the EVO technique (used to obtain the accurate optimized value in terms of hybridized exploitation and exploration).
Some tuning methods may be more computationally expensive than others, so it is important to consider the trade-offs between accuracy and efficiency when selecting a hyperparameter tuning method.
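To make one of the listed methods concrete, the sketch below applies simulated annealing to a single hyperparameter, the base-10 logarithm of the learning rate. The objective `toy_validation_loss` is a hypothetical stand-in for training and validating a CNN, and the linear cooling schedule, step size and seed are assumptions chosen only to keep the example short and reproducible.

```python
import math
import random


def toy_validation_loss(log_lr: float) -> float:
    """Hypothetical stand-in for a validation loss; minimal at lr = 1e-3."""
    return (log_lr + 3.0) ** 2


def simulated_annealing(loss_fn, start, n_steps=200, step_size=0.5, t0=1.0, seed=0):
    """Minimize loss_fn over one continuous hyperparameter.

    Worse candidates are accepted with probability exp(-delta / temperature),
    which lets the search escape local minima while the temperature is high.
    """
    rng = random.Random(seed)
    current, current_loss = start, loss_fn(start)
    best, best_loss = current, current_loss
    for step in range(n_steps):
        # Linear cooling; the small constant avoids division by zero.
        temperature = t0 * (1.0 - step / n_steps) + 1e-9
        candidate = current + rng.uniform(-step_size, step_size)
        candidate_loss = loss_fn(candidate)
        delta = candidate_loss - current_loss
        if delta < 0 or rng.random() < math.exp(-delta / temperature):
            current, current_loss = candidate, candidate_loss
            if current_loss < best_loss:
                best, best_loss = current, current_loss
    return best, best_loss


start_log_lr = -1.0  # i.e., an initial learning rate of 0.1
best_log_lr, best_loss = simulated_annealing(toy_validation_loss, start_log_lr)
```

With a real model, each call to the loss function would involve a full training run, which is why the cost of a tuning method matters as much as its accuracy.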

Tuning of Parameters
CNN model parameters such as weights and biases can be randomly initialized and iteratively updated (using backpropagation), guided by the markers of model performance (described under model performance). Applying factorizations to the weight matrices in the networks can help to significantly reduce the total number of parameters to be trained. Different gradient-based algorithms, including stochastic gradient descent and exponentiated gradient algorithms, can be used to update parameters [105].

Benchmarking of Model Performances
Model performances can be benchmarked by calculating model convergence, cost, and training set and validation set errors. The quantitative values of the training and validation set errors are evaluated in reference to the base error on datasets that are from the same distribution. If the training set error is high, the model has a high bias (underfitting) toward the training set. To address the high training set error (high model bias or underfitting), it is recommended to use deeper neural nets, longer training and/or a different CNN architecture. On the other hand, if the validation (development) set error is high while the training set error is low, the model has high variance (overfitting) toward the training set. To address the high validation set error (high model variance or overfitting), it is recommended to use more data from the same distribution (e.g., publicly available databases), regularization, a different neural network architecture and/or inverse dropout.
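The reasoning above can be expressed as a small decision rule. This is a minimal sketch under stated assumptions: the function name `diagnose_fit` and the `tolerance` threshold that separates a "high" gap from an acceptable one are hypothetical, and suitable values depend on the task.

```python
def diagnose_fit(train_error: float, val_error: float, base_error: float,
                 tolerance: float = 0.02):
    """Label a model as underfitting and/or overfitting.

    base_error is the reference error on data from the same distribution.
    A training error well above it suggests high bias; a validation error
    well above the training error suggests high variance.
    """
    findings = []
    if train_error - base_error > tolerance:
        findings.append("high bias (underfitting)")
    if val_error - train_error > tolerance:
        findings.append("high variance (overfitting)")
    return findings or ["acceptable fit"]


# Hypothetical error rates for three models against a 1% base error.
underfit = diagnose_fit(train_error=0.15, val_error=0.16, base_error=0.01)
overfit = diagnose_fit(train_error=0.02, val_error=0.12, base_error=0.01)
balanced = diagnose_fit(train_error=0.02, val_error=0.03, base_error=0.01)
```

A model can show both problems at once, in which case both labels are returned and the remedies for bias are usually tried first.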

Performance Metrics Used in Evaluating CNN Models
The performance of classifier models can be evaluated using the diagnostic (confusion) matrix and derivatives of the main diagnostic parameters, such as sensitivity (recall), specificity (true negative rate), F1-score, positive and negative predictive values, accuracy, precision, positive and negative likelihood ratios, diagnostic odds ratio, Matthews correlation coefficient and the area under the receiver operating characteristic curve (on both the validation and test datasets).
Confusion matrix (columns: true class; rows: predicted class):
Predicted \ True | Positive | Negative
Positive | True Positives (TP) | False Positives (FP)
Negative | False Negatives (FN) | True Negatives (TN)
• Classification accuracy is the proportion (usually reported as a percentage) of correctly classified instances out of the total number of instances in the dataset (3).
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (3)$$
Accuracy is used to evaluate the performance of CNNs in image classification tasks [106,107].

• Sensitivity and specificity are measures of the true positive rate and true negative rate, respectively. Sensitivity measures the proportion of correctly identified actual positives (4), and specificity measures the proportion of correctly identified actual negatives (5).
$$\mathrm{Sensitivity} = \frac{TP}{TP + FN} \qquad (4)$$
$$\mathrm{Specificity} = \frac{TN}{TN + FP} \qquad (5)$$
These metrics are commonly used in medical image analysis tasks [108].
• The F1 score is a measure of the balance between precision and recall, which are metrics that evaluate the accuracy and completeness of the model's positive predictions, respectively. That is, the F1 score (6) is the harmonic mean of precision, TP/(TP + FP), and recall, TP/(TP + FN), and is used to evaluate the performance of CNNs in binary classification tasks.
$$F_1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \qquad (6)$$
• Mean squared error (MSE) measures the average squared difference between the predicted and actual values (7). MSE is particularly important for evaluating quantitative or regression tasks.
$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 \qquad (7)$$
where $n$ is the number of observations, $y_i$ is the observed value and $\hat{y}_i$ is the predicted value.
• Receiver operating characteristic (ROC) is a graphical representation of the trade-off between sensitivity and specificity for different classification thresholds. It is used to evaluate the performance of CNNs in binary classification tasks [2,5].
• Area under the curve (AUC) is a summary measure of the ROC curve representing the probability that a randomly chosen negative instance will be ranked lower than a randomly chosen positive instance. It is commonly used to evaluate the overall performance of CNNs in binary classification tasks.
The generalizability of classifier models can be further evaluated on totally independent datasets of similar distribution.
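The metrics listed above can be computed directly from the four confusion-matrix counts. The sketch below uses hypothetical counts and function names; it assumes both classes are present (no zero denominators) and computes the AUC from its rank interpretation (the fraction of positive and negative pairs in which the positive instance receives the higher score, with ties counted as one half).

```python
import math


def confusion_metrics(tp, fp, fn, tn):
    """Derive common diagnostic metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)          # recall, true positive rate
    specificity = tn / (tn + fp)          # true negative rate
    precision = tp / (tp + fp)            # positive predictive value
    npv = tn / (tn + fn)                  # negative predictive value
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "npv": npv,
        "f1": f1,
        "mcc": (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0,
    }


def roc_auc(labels, scores):
    """AUC as the probability that a positive outranks a negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mean_squared_error(y_true, y_pred):
    """Average squared difference between observed and predicted values."""
    return sum((y - yh) ** 2 for y, yh in zip(y_true, y_pred)) / len(y_true)


metrics = confusion_metrics(tp=80, fp=10, fn=20, tn=90)  # hypothetical counts
auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
mse = mean_squared_error([1, 2, 3], [1, 2, 5])
print(metrics, auc, mse)
```

With these counts the accuracy is 0.85, the sensitivity 0.80 and the specificity 0.90, which illustrates why a single accuracy figure can hide an asymmetry between missed positives and false alarms.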

• Data distillation methods: uniform experiment design method, highlighting, background filling, resizing, noise reduction, the Gabor filter model, image defect detection and implicit differentiation;
• Optical flow image processing.
These pre-processing techniques are used to prepare the image dataset for CNN modeling. In addition to pre-processing, some studies also use segmentation via CNN to further analyze the images.
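Two of the listed steps, resizing and noise reduction, can be illustrated on a grayscale image stored as a nested list. This is a sketch under simplifying assumptions: the function names are hypothetical, resizing uses nearest-neighbor sampling, noise reduction uses a 3 x 3 median filter, and intensity scaling to [0, 1] is added as a common companion step rather than one named in the list above.

```python
from statistics import median


def min_max_normalize(image):
    """Scale pixel intensities linearly to the range [0, 1]."""
    flat = [v for row in image for v in row]
    lo, hi = min(flat), max(flat)
    span = (hi - lo) or 1  # avoid division by zero for constant images
    return [[(v - lo) / span for v in row] for row in image]


def median_filter3(image):
    """3 x 3 median filter; windows are truncated at the image border."""
    h, w = len(image), len(image[0])
    out = []
    for r in range(h):
        out_row = []
        for c in range(w):
            window = [image[rr][cc]
                      for rr in range(max(0, r - 1), min(h, r + 2))
                      for cc in range(max(0, c - 1), min(w, c + 2))]
            out_row.append(median(window))
        out.append(out_row)
    return out


def resize_nearest(image, new_h, new_w):
    """Nearest-neighbor resize to new_h x new_w."""
    h, w = len(image), len(image[0])
    return [[image[r * h // new_h][c * w // new_w] for c in range(new_w)]
            for r in range(new_h)]


noisy = [[0, 0, 0], [0, 255, 0], [0, 0, 0]]  # single salt-noise pixel
denoised = median_filter3(noisy)
resized = resize_nearest([[1, 2], [3, 4]], 4, 4)
scaled = min_max_normalize([[0, 5], [10, 5]])
print(denoised, resized, scaled)
```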

Image Datasets Relevant for Medical Themes
The use of CNN techniques in medical image analysis and disease classification necessitates the availability of comprehensive and diverse datasets [111–113]. The success of these techniques relies on the richness and representativeness of the datasets, as they enable the extraction of salient information and features from medical images and records [112,114,115].
The list of salient datasets important for medical themes (Table 5) encompasses a wide array of medical images for diverse pathological conditions. The quality, representativeness and diversity of these datasets make them valuable for practicing and for setting up CNN experiments as well as for CNN-based medical image understanding. Supplementary Tables S3-S5 also show comparisons of the performances of CNN methods applied to other popular public datasets.
Traumatic brain injury: These datasets include images of traumatic brain injury patients, as well as clinical and molecular datasets.

Data Augmentation for Training a Robust CNN Diagnostic Model for Cases with Insufficient Training Data
Data augmentation is a critical component in training robust convolutional neural network (CNN) models when the available training data are insufficient. It involves generating additional data to enhance the training process and to improve model performance and generalization. Augmenting the training dataset involves applying various transformations to the original images to create new variations. These transformations include rotation, flipping, zooming, translation, brightness and contrast adjustment, Gaussian noise (adding a small amount of Gaussian noise to the images to make the model more robust to noise), elastic deformations (introducing distortions to make the model more tolerant to deformations in the input data), color jittering (randomly changing the hue, saturation and brightness of the images to introduce variations in color), random cropping (cropping a random portion of the image, forcing the model to focus on different regions) and shearing (applying shearing transformations to the images to simulate changes in the viewing angle) [116–118]. These processes help the model become more robust by exposing it to different perspectives, orientations and conditions. Studies have demonstrated that data augmentation, when combined with fine-tuning and transfer learning, can significantly enhance model accuracy [119,120]. Additionally, data augmentation can be used to enhance the robustness of CNN models to noise for improved training [121,122] and to mitigate kernel saturation (to increase classification accuracy) [123]. Therefore, data augmentation techniques can be used to develop robust CNN diagnostic models by addressing limited training data, noise and kernel saturation.
Data augmentation can be implemented using libraries such as TensorFlow's ImageDataGenerator or PyTorch's transforms. These libraries provide convenient tools for applying various transformations to the training data on-the-fly during the model training process. When implementing data augmentation, it is essential to strike a balance. For example, too much augmentation may result in the model memorizing augmented images rather than learning useful features. Additionally, it is important to consider the nature of the diagnostic task; for medical imaging, it is advisable to be cautious with certain transformations to avoid introducing unrealistic artifacts.
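To make the transformations concrete without depending on a particular framework, the following sketch implements a few of them on a grayscale image stored as a nested list with intensities in [0, 1]. It is not the TensorFlow or PyTorch API; all function names, probabilities and the noise level are assumptions chosen for illustration.

```python
import random


def hflip(image):
    """Horizontal flip (may be anatomically unrealistic for some modalities)."""
    return [row[::-1] for row in image]


def rotate90(image):
    """Rotate the image 90 degrees clockwise."""
    return [list(col) for col in zip(*image[::-1])]


def add_gaussian_noise(image, sigma, rng):
    """Add zero-mean Gaussian noise and clip the result to [0, 1]."""
    return [[min(1.0, max(0.0, v + rng.gauss(0.0, sigma))) for v in row]
            for row in image]


def random_crop(image, crop_h, crop_w, rng):
    """Crop a random crop_h x crop_w window from the image."""
    top = rng.randint(0, len(image) - crop_h)
    left = rng.randint(0, len(image[0]) - crop_w)
    return [row[left:left + crop_w] for row in image[top:top + crop_h]]


def augment(image, rng):
    """Apply a random combination of mild transformations on-the-fly."""
    out = image
    if rng.random() < 0.5:
        out = hflip(out)
    if rng.random() < 0.5:
        out = rotate90(out)
    return add_gaussian_noise(out, sigma=0.02, rng=rng)


rng = random.Random(0)
img = [[(r * 4 + c) / 15 for c in range(4)] for r in range(4)]
aug = augment(img, rng)
crop = random_crop(img, 2, 3, rng)
print(aug, crop)
```

Keeping the noise level and the set of geometric transformations small reflects the caution recommended above: whether a flip or rotation is plausible depends on the imaging modality and the diagnostic task.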
Enhancing CNN-Based Image Classification for Rare Diseases through Data Augmentation
The scarcity of labeled data for images associated with rare diseases poses a significant challenge for training accurate and robust models. The key challenges are as follows: (i) the limited availability of annotated images not only hinders the training of CNNs but also poses a risk of overfitting, where models may fail to generalize well to new and unseen instances; (ii) the imbalanced class distribution inherent in many rare disease datasets exacerbates the difficulty, as models may struggle to discern minority classes effectively.
By artificially expanding the training dataset, data augmentation enables CNNs to learn invariant features and nuances, ultimately enhancing the model's ability to generalize and hence improving the model's capability to recognize subtle patterns and features indicative of rare diseases.It is advisable to also conduct comparative analyses between the augmented and non-augmented models to assess the efficacy of data augmentation in improving the robustness and generalization of the CNN model for rare disease image classification.
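Class imbalance of the kind described above is often handled alongside augmentation by weighting the loss per class. The sketch below computes inverse-frequency class weights; the formula (number of samples divided by the number of classes times the class count) is one common convention and is an assumption here, as are the label names.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


labels = ["common"] * 90 + ["rare"] * 10  # hypothetical 9:1 imbalance
weights = inverse_frequency_weights(labels)
print(weights)  # the rare class receives the larger weight
```

Such weights can be passed to a weighted loss function so that errors on the minority class contribute more to the gradient; comparing weighted and unweighted models follows the same logic as the augmented versus non-augmented comparison recommended above.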

Machine Learning-Assisted Statistical Modeling of the Literature (Pertaining to CNN Application for Medical Image Understanding)
Scientific progress relies on the efficient assimilation of published knowledge in order to choose the most promising way forward and to minimize reinvention. However, due to the rapidly evolving nature of the research literature, determining the relevance of an individual report, aggregating and synthesizing multiple reports to derive new insights and finding latent knowledge cannot be efficiently carried out manually. Here, we used machine learning-assisted statistical modeling to search, aggregate, analyze and synthesize the literature on the application of CNN for medical imaging to identify latent and relevant information spread across research articles, conference proceedings and book chapters.
The whole process started by gathering a comprehensive corpus of literature on the application of CNN for medical image understanding, including journal articles, conference proceedings and book chapters. The corpus was preprocessed to handle specialized terminology, abbreviations and language patterns prevalent in the medical imaging literature; to remove noise, ensure consistency and standardize the text; and to transform the raw text into a suitable input format. This involved removing irrelevant metadata, handling special characters, standardizing text formatting, tokenizing the text into phrases or words, text normalization, removing stopwords, stemming, lemmatization and spelling corrections, while preserving the contextual information and maintaining the integrity of the text during preprocessing. Feature mining techniques such as entity recognition, keyword extraction, topic modeling and literature summarization were used to identify trends, patterns and associations and to detect relationships between entities within the existing literature. Language model-based statistical modeling was used to generate coherent summaries, identify method gaps, predict future trends and propose potential solutions based on the patterns and relationships identified during the text mining and analysis stages.
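A reduced version of the text preprocessing steps described above (lowercasing, tokenization, stopword removal and a very crude plural-stripping stemmer) might look as follows. The stopword list and the stemming rule are deliberately small and hypothetical; a real pipeline would use a full stopword list and a proper stemmer or lemmatizer.

```python
import re

# Hypothetical, deliberately short stopword list for illustration only.
STOPWORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to",
             "was", "were", "with"}


def naive_stem(token):
    """Strip a trailing plural 's' (a stand-in for real stemming)."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def preprocess(text):
    """Lowercase, tokenize, drop stopwords and stem the remaining tokens."""
    tokens = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOPWORDS]


tokens = preprocess("The CNNs were used for classifications of MRI images.")
print(tokens)
```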

Literature Search Strategy
We used different literature search strategies that combined multiple keywords and search engines with stringent inclusion and exclusion criteria.

Literature search engines:
We tested 26 different search engines/databases and 5 large language model-based AI tools (Supplementary Table S2). Based on the coverage overlap of the tested search engines and specificity metrics, we chose Google Scholar, IEEE Xplore, PubMed and Dimensions as our main search engines to access the literature pertaining to technical resources on the use of CNN for medical imaging.

Inclusion and exclusion criteria:
The literature searches using Google Scholar and PubMed were largely focused on peer-reviewed articles, whereas studies obtained using IEEE Xplore included conference proceedings in addition to peer-reviewed articles. All searches were restricted to CNN methods and approaches, particularly focusing on recent developments and improvements that are useful for medical image understanding.
Non-English materials were filtered out as the first exclusion criterion. Content from retracted sources was excluded using RetractionWatch, a database for checking retracted studies and papers. We also used Search Smart, a tool that allows researchers to compare the capabilities of most of the conventional search tools, including Google Scholar, IEEE Xplore and PubMed, as an additional exclusion criterion.
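The screening logic can be sketched as a simple filter over bibliographic records. The record fields (`title`, `doi`, `language`) and the set of retracted DOIs are hypothetical; in particular, the retracted set is assumed to have been exported beforehand from a retraction database, and no real database API is used here. Duplicate removal is included because the combined search hits are later analyzed after excluding duplicates.

```python
import re


def normalize_title(title):
    """Lowercase and collapse punctuation so near-identical titles match."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def screen_records(records, retracted_dois):
    """Drop non-English, retracted and duplicate records (in that order)."""
    retracted = {d.lower() for d in retracted_dois}
    seen, kept = set(), []
    for rec in records:
        if rec.get("language", "en") != "en":
            continue
        doi = (rec.get("doi") or "").lower()
        if doi in retracted:
            continue
        keys = {normalize_title(rec["title"])}
        if doi:
            keys.add(doi)
        if keys & seen:  # same DOI or same normalized title already kept
            continue
        seen |= keys
        kept.append(rec)
    return kept


records = [  # hypothetical search hits
    {"title": "CNN for chest X-ray classification", "doi": "10.1000/a1", "language": "en"},
    {"title": "CNN for Chest X-ray Classification.", "doi": "", "language": "en"},
    {"title": "Reseaux convolutifs", "doi": "10.1000/b2", "language": "fr"},
    {"title": "A retracted study", "doi": "10.1000/c3", "language": "en"},
    {"title": "Transformers for MRI", "doi": "10.1000/d4", "language": "en"},
]
kept = screen_records(records, retracted_dois={"10.1000/C3"})
print([r["title"] for r in kept])
```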

Statistical Modeling and Visualization
Machine learning algorithms and tools (implemented as open-source software, Python libraries or R packages), such as non-negative matrix factorization, automated content analysis, the Cochrane Crowd platform, Rayyan, VOSviewer, Bibliometrix, litsearchr, revtools, wordcloud, wordcloud2, tm and ggplot2, were used for the statistical modeling of the literature and the visualization of the modeling outputs. These tools were used for topic modeling, word frequency counting, network analyses, knowledge graph construction and visualization to uncover latent topics, prevalent themes, method gaps and potential future directions. The statistical modeling involved multiple steps and multiple functions of each of these packages. For example, we used multiple functions of the "bibliometrix" package, such as "convert2df" to convert the corpus of documents to data frames (as statistical modeling inputs); "biblioAnalysis" for statistical scoring of the data frames; "summary" to see the overall picture of the statistical analysis outputs; "biblioNetwork" to construct networks based on the analysis results; and "plot" for visualization. Similarly, multiple steps were involved with the other packages and tools during the statistical modeling and visualization processes. The detailed steps and scripts for each of these tools and packages can be found in the GitHub repository (the link is provided under the "Data Availability Statement").
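Two of the analyses named above, word frequency counting and network construction, reduce to simple counting over per-document keyword lists. The sketch below is not the implementation used by VOSviewer or bibliometrix; it only illustrates the underlying idea of a keyword co-occurrence network, with hypothetical keyword lists.

```python
from collections import Counter
from itertools import combinations


def keyword_cooccurrence(keyword_lists):
    """Return keyword frequencies and co-occurrence edge weights.

    Each document contributes at most once per keyword and per keyword pair.
    """
    freq, edges = Counter(), Counter()
    for keywords in keyword_lists:
        unique = sorted({k.lower() for k in keywords})
        freq.update(unique)
        edges.update(combinations(unique, 2))  # undirected, sorted pairs
    return freq, edges


docs = [  # hypothetical author keywords of three publications
    ["CNN", "transfer learning", "COVID-19"],
    ["CNN", "COVID-19"],
    ["cnn", "segmentation"],
]
freq, edges = keyword_cooccurrence(docs)
print(freq.most_common(2))
print(edges.most_common(1))
```

The edge weights can then be exported to a network visualization tool, where frequently co-occurring keywords form the clusters that are interpreted as prevalent themes.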

Results from Statistical Modeling
The findings of the statistical modeling, reported in this section, were based on search hits identified using IEEE Xplore, PubMed, Google Scholar and Dimensions. A total of 4609 publications (accessed on 14 June 2023) consisting of 2278 research articles, 938 conference proceedings, 903 book chapters, 470 preprints, 19 edited books and 1 monograph were identified by searching for the keywords "convolutional neural networks" AND "classification" AND ("medical image" OR "medical imaging") in the "Title" AND "abstracts" (which were published from 2006 to 2023). Networks and graphs were visualized using VOSviewer [124,125] and ggplot2 [126]. Bibliometrix (R package) was used to assess publication and citation trends.
IEEE Xplore: The advanced search option of IEEE Xplore, with the search keywords "classification", "medical image" and "convolutional neural network" (in the "abstract"), identified 863 unique hits consisting of 687 conference proceedings, 164 journal articles, 7 early access articles and 4 books. Of the 863, 617 citations were published between 2020 and 2023 (which comprised 484 conference proceedings, 123 journal articles, 7 early access articles and 3 books) (Figures 4 and S1).
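The Boolean keyword query quoted above can be expressed as a predicate over the title and abstract of a record. This is a sketch with plain case-insensitive substring matching; the actual matching rules of the search engines (for example stemming or phrase handling) are not reproduced, and the function name is hypothetical.

```python
def matches_query(title, abstract):
    """"convolutional neural networks" AND "classification"
    AND ("medical image" OR "medical imaging"), in title or abstract."""
    text = f"{title} {abstract}".lower()
    return ("convolutional neural networks" in text
            and "classification" in text
            and ("medical image" in text or "medical imaging" in text))


hit = matches_query(
    "Convolutional neural networks for medical image classification", "")
miss = matches_query(
    "CNN for medical imaging", "We use a convolutional neural network.")
print(hit, miss)
```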
Google Scholar: Using the advanced search option with the keywords occurring only in the title of the article (screenshot shown next), we identified 212 articles. The statistical modeling of the 212 citations was visualized using VOSviewer (Figure 6).
Transfer learning and segmentation, along with attention mechanisms and the incorporation of transformers, seem to be the dominant approaches within the combined search hits obtained from IEEE Xplore, PubMed and Google Scholar (Supplementary Figure S2a,b). Medical classification for the diagnosis of COVID-19, brain pathology and breast cancer seems to dominate the literature with regard to the application of CNNs for medical image analysis (Supplementary Figures S2 and S3). Widely mentioned metrics to assess classification performance include ROC curves, sensitivity and specificity (Supplementary Figure S2). Additionally, the phrases convolutional neural networks, medical image classification, diagnostic imaging, backpropagation, adaptive momentum methods, nonconvex optimization and image interpretation were mentioned more frequently among the 141 references cited within the text of this manuscript (Figures 5-7 and Supplementary Figure S4).
Dimensions: Using the advanced search options and the keywords "convolutional neural networks AND classification AND (medical image OR medical imaging)" searched for in "Title AND abstracts", a total of 4609 hits were found. The search hits include 2278 research articles, 938 proceedings, 903 chapters, 470 preprints, 19 edited books and 1 monograph.
All the search hits obtained using IEEE Xplore, PubMed and Google Scholar were also subsets of the 2278 research articles, 938 proceedings and 903 chapters identified using Dimensions. Keyword frequency and word cloud analyses of the 2278 articles, 938 proceedings and 903 chapters were used to rank the popular diagnostic images used as CNN inputs, the corresponding diseases, and the CNN algorithms and evaluation metrics that are applied to understand such medical images (Figures 8 and 9). The most frequently used medical imaging modalities seem to include X-ray, MRI, radiography, ultrasonography, histopathological staining, CT, tomography and optical endoscopy (Figure 8). These images are used for various medical imaging tasks, such as the detection of COVID-19 [127], lesion detection, image segmentation and image classification, in specialties such as radiology, cardiology and gastroenterology (Figure 8).

Annual distribution of publications identified using search engines
The number of publications per year for the combined search hits showed a steep increase since the start of the COVID-19 pandemic (Figures 10 and S5). This analysis was based on the combined references collected using Google Scholar, PubMed and IEEE Xplore (after excluding duplicates). The number of hits from Google Scholar was small because we focused only on the titles of the articles.
Bibliometrix v4.1.4 (R package) analysis outputs (Figure 11): Bibliometrix supports bibliographic database files from Dimensions, which also include all the publications obtained from Google Scholar, IEEE Xplore and PubMed. A steep increase in publications took place after 2018 (probably propelled further by the wide use of CNNs for COVID-19 diagnosis). On the other hand, the most cited papers were published around 2016 (Figure 12). An analysis was performed on 2273 research articles published between 2006 and 2023. The summary of the analysis showed an annual growth rate of 40.6%, an average article age of 2.01 years, 33.57 average citations per article and 7.056 average citations per year per article (Figure 11a). Publications in the journals "Multimedia Tools and Applications", "IEEE Access", "Diagnostics" and "Applied Sciences" are highly cited (Figure 13).
Using Dimensions as a search tool, with the search criteria "Review AND CNN AND MEDICAL AND (IMAGE OR IMAGING) AND CLASSIFICATION" in the titles and abstracts (accessed on 30 July 2023), 115 articles were identified. One non-English article was excluded. Of the 114 articles, 59 were reviews of the results of studies focusing on a specific disease, 23 were on a single image type or method, 12 were not review papers and 8 review papers were not specific to medical imaging (Figure 14). Of the 12 method review papers considered (Figure 14), 8 were broad and shallow introductory or background method reviews and the remaining 4 method review papers were comparable to this study (Table 6).
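An annual growth rate such as the one reported above is commonly computed as a compound annual growth rate from the first and last yearly publication counts. Whether this matches the exact computation behind the reported 40.6% is an assumption, and the publication years in the example are hypothetical.

```python
from collections import Counter


def publications_per_year(years):
    """Count publications per year, sorted chronologically."""
    return dict(sorted(Counter(years).items()))


def compound_annual_growth_rate(first_count, last_count, n_intervals):
    """Percentage growth per year that turns first_count into last_count."""
    return ((last_count / first_count) ** (1.0 / n_intervals) - 1.0) * 100.0


years = [2019, 2020, 2020, 2021, 2021, 2021, 2021]  # hypothetical records
per_year = publications_per_year(years)
first_year, last_year = min(per_year), max(per_year)
growth = compound_annual_growth_rate(
    per_year[first_year], per_year[last_year], last_year - first_year)
print(per_year, growth)
```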

Summary of statistical modeling results
The majority of articles were published during the COVID-19 pandemic and afterwards (Figures 8-11). It seems that more imaging data became available during COVID-19, making transfer learning and GAN-based data augmentation less prominent than in papers published pre-COVID-19 (Figure 5). The analysis of chest X-rays, histopathology and endoscopic images, diabetic retinopathy and neuroimaging, and brain tumor/neoplasm classification and detection utilizing CNNs were identified as themes during and after COVID-19 (Figures 5-7). Likewise, the use of transformers, image fusion schemes, genetic algorithms and momentum trended more during and after the COVID-19 pandemic (Figures 5-7). Overall, the majority of the medical image classifications were applied for the recognition of pathological conditions and the detection and diagnosis of diseases such as COVID-19 (including pneumonia detection) and lung and breast cancers, as predicted from chest X-ray and CT images.

Highlights of Current Practices
The use of CNNs in medical image understanding has shown significant improvements. CNN models for medical image classification can be trained from scratch, built on off-the-shelf pretrained models (transfer learning) and/or obtained through unsupervised CNN pretraining with supervised fine-tuning [131,132]. Ensembles of different CNN models and combinations of CNN algorithms with transformers, including global spatial attention mechanisms [133], are being explored for the classification of multiple types of pathological images such as X-ray, MRI, CT and histopathological stains.

Implications for Clinical Practice
Medical image understanding using CNN has shown promising results in various medical domains, including disease classification, tumor segmentation, lesion detection, identifying anatomical location [50,134–136] and diagnosing COVID-19 and metastatic cancer with high classification accuracy [137]. Overall, the use of CNNs in medical image classification has significant implications for clinical practice, including improving the accuracy and speed of diagnosis, and for treatment planning. Ongoing research in these areas aims to advance the field and improve the effectiveness, interpretability and applicability of CNN models in clinical settings.

Gaps in the Current State of CNN Application for Medical Image Understanding
CNNs are being applied to a wide range of medical images [138], including X-ray, MRI, CT, optical coherence tomography and histopathology stains. However, there are still several open questions and challenges regarding the current state of CNN application for medical image understanding. Such gaps include:
(i) The problem of interpretability: CNN models are regarded as black boxes, making it challenging to interpret their decisions and understand the reasoning behind their predictions, which can be a significant barrier to the adoption of CNNs in medical imaging.
(ii) The limited availability of annotated medical imaging datasets: the performance of CNNs is highly dependent on the quality and quantity of the training data.
(iii) Robustness to diverse data and pathological variations: medical images can exhibit significant variations due to factors such as different imaging modalities, patient demographics, imaging protocols and disease presentations. Ensuring the robustness and generalizability of CNN models across diverse data distributions and pathological variations is an open question.
(iv) Domain adaptation and transfer learning: finding efficient methods to enhance the generalizability of CNN models across different medical imaging domains is still challenging.
(v) Addressing class imbalance and rare diseases: medical image datasets often suffer from class imbalance, where certain diseases or conditions are underrepresented. Developing techniques to handle class imbalance and effectively train CNN models on rare diseases is an ongoing challenge.
(vi) Uncertainty estimation: determining how to effectively incorporate uncertainty estimation into CNN models and provide an error margin for predictions is an ongoing research challenge.
(vii) Integration with clinical workflow: the successful integration of CNN models into clinical practice requires addressing workflow-related challenges, including seamless integration with existing medical systems, interpretability in a clinical context and adapting CNN models to real-time decision support systems.
(viii) Data privacy and security: medical images contain sensitive patient information, and preserving patient privacy is important. Exploring techniques like federated learning, differential privacy or secure computation to enable CNN training on distributed medical image data while preserving privacy is an open question.

Trends and Future Directions
CNNs are considered a significant technological breakthrough in the field of medical image understanding and are increasingly gaining attention [139,140]. It is evident that CNNs have been successfully employed in medical image recognition, segmentation and classification. Class decomposition and transfer learning, synthetic data augmentation, Bayesian and adaptive hyperparameter optimizations and specialized architectures can be used to improve the robustness and performance of CNN models for the classification of medical images [3,131,132,141–143], including the diagnosis and prognosis of diseases such as COVID-19 [140].
While CNNs have shown promise in medical image understanding, further research is needed to optimize their performance for efficient medical diagnosis and treatment follow-up or evaluation. Some of the future trends and potential improvements of CNN approaches in the field of medical image classification include:
• The development of specialized and efficient CNN architectures (including methods for automatically designing CNN architectures), such as evolving arbitrary CNNs with the goal of better discovering and understanding new effective architectures for robust learning outcomes that are tailored towards learning specific representations.
• Designing methods that can be readily used for the automatic optimization of CNN architectures for personalized medicine.
• Designing new domain-agnostic CNN algorithms that can be used for transfer learning, for reliable learning from small datasets or for online learning, e.g., by combining them with reinforcement learning.
• Exploring new activation functions for efficient and robust learning, including on small datasets (to mitigate the issue of labeled data scarcity).
• Improving the feature extraction efficiency of CNNs using multilinear filters [144] and designing accelerators for CNN inference [145] to improve the speed and accuracy of medical image analysis tasks.
• Designing more efficient 3D CNNs.
• The dynamic selection of misclassified negative samples during training to improve performance and to speed up learning.
• Privacy, data security and the prevention of adversarial data poisoning.
These recommendations can help set up more robust, improved, secure and reliable CNN experiments for medical image understanding.

Limitations of the Review
This review focuses on the methodological aspects of CNNs as applied to medical image understanding, and hence, it does not include a summary of the findings and results of the literature that are not directly related to developments or advancements of CNN methods.
Additionally, the current literature on the application of CNN for medical image classification has some limitations in scope. First, most research efforts focus on adapting existing CNN architectures with or without transfer learning, rather than designing and optimizing CNN architectures and approaches specific to medical image classification. Second, there is a lack of research on the amount of data needed to train deep CNN models to achieve high accuracy. Third, while CNNs have shown competitive performance in medical image analysis tasks, such as disease classification, segmentation and detection, there is a lack of post hoc explainability for CNN-based models [146] as well as a need for more research on the use of CNNs for low-resource medical image analysis [143]. Even though there are advancements with regard to light-weight CNN architectures for economical GPU-based systems [67,68,147], more concerted efforts are still needed to make automated medical image analysis accessible in real time.

Conclusions
The focus of this review is on the methodological aspects of CNN applications for medical image understanding. It is organized to serve as a resource for practitioners by compiling and presenting improved architectures, popular frameworks, activation functions, model ensemble techniques, hyperparameter optimizations, performance metrics, medical theme-relevant datasets and input data preprocessing methods that are important for better CNN designs and for learning robust models.
We also used machine learning (ML) algorithms for the statistical modeling of the literature to uncover latent topics, patterns and prevalent themes, method gaps and potential future directions that were not obvious from individual studies. Our ML-assisted analyses showed that the COVID-19 pandemic probably stimulated the wide use of CNN for clinical image classification and disease diagnosis. The COVID-19 problem probably drove the flow of CNN practitioners to the discipline of medical imaging, apparently creating an atmosphere of collaboration with people in the biomedical field. This may be the reason for the drastic increase in journal articles and conference proceedings pertaining to the application of CNN for medical image recognition, analysis, classification and disease detection and diagnosis.

Supplementary Materials:
The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/make6010033/s1. Figure S1: Results from statistical modeling of 863 citations identified using IEEE Xplore. Network visualized using VOSviewer; Figure S2: Analysis outputs of the combined references (search hits) from IEEE Xplore, PubMed and Google Scholar. Color palettes represent (a) relationships of the corresponding studies and (b) publication years; Figure S3: Systematic analysis of the references cited in this review (and creating a network using VOSviewer with settings: Create a map based on bibliographic data and Co-occurrence counting method for Keywords); Figure S4: Most frequently mentioned methods among cited references: convolutional neural networks, medical image classification, diagnostic imaging, backpropagation, adaptive momentum methods, nonconvex optimization and image interpretation; Figure S5: Distribution of references per year for (a) the three search engines (IEEE Xplore, PubMed, Google Scholar) and unique combinations of the search hits from the three search engines; (b) search hits obtained from Dimensions and the search hits obtained from IEEE Xplore, PubMed and Google Scholar; Table S1: List of hyperparameters; Table S2: List of search engines; Table S3: Representative classical CNN methods and their applications on well-known public datasets, including metrics used to assess their performance; Table S4: Studies comparing the performances of different CNN methods, including metrics used for comparison; Table S5: The efficiency comparison among SOTA CNN methods in various medical image classification tasks (comparison of some widely used CNN architectures in medical image classification).

Figure 1. Application of CNN methods for medical image understanding, and machine learning-assisted statistical modeling of the current literature.

Figure 2. Basic architecture of a convolutional neural network showing the main components and steps involved in medical image classification.

Figure 3. (a) Hybrid structure of convolutional neural networks with a transformer (applied for the diagnosis of brain tumor). (b) Consecutive Swin transformer block. Copyright notice: (a,b) were redrawn from Figures 1 and 2 from Wang et al. [27] (published with open access under the Creative Commons License).

Table 2. Activation functions frequently used in CNN applications for medical image processing. The middle solid line separates the ReLU families and derivatives from other classes of activation functions.
Activation function | Equation | Parameters | Typical value | Remarks
ReLU | ReLU(x) = max(0, x), i.e., ReLU(x) = {0, if x ≤ 0; x, if x > 0} | | | Computationally efficient and helps alleviate the vanishing gradient problem, allowing for faster training and improved network performance.
Leaky ReLU | Leaky ReLU(x) = max(alpha × x, x) = {x, if x > 0; alpha × x, if x ≤ 0} | x is the input, and alpha is a small positive constant (determines the slope for negative input values). | alpha = 0.1 | Addresses the issue of "dead neurons" by allowing small negative values instead of setting them to zero; provides some gradient flow for negative inputs during backpropagation.
Parametric ReLU (PReLU) | PReLU(x) = {x, if x > 0; alpha × x, if x ≤ 0} | Alpha is a parameter that can be learned during the training process (controls the slope for negative input values). | alpha = 0.1 | The alpha parameter is updated through backpropagation, enabling the network to learn the optimal value for each neuron. Adjusting the slope for negative inputs can lead to …
ELU | ELU(x) = {x, if x > 0; alpha × (exp(x) − 1), if x ≤ 0} | Alpha is a hyperparameter (controls the behavior of the function); ELU captures more nuanced information from negative inputs and alleviates the vanishing gradient problem. | alpha = 1.0 | Smooths negative inputs by using an exponential function; the exponential smoothing helps reduce the impact of noisy activations.
SELU | SELU(x) = λ × {x, if x ≥ 0; α(exp(x) − 1), if x < 0}, with predefined values for λ and α; equivalently, SELU(x) = scale × (x if x > 0 else (alpha × exp(x) − alpha)) | x is the input to the activation function, alpha is a hyperparameter that controls the slope for negative inputs and scale is a scaling factor to maintain the mean and variance of the inputs close to 0 and 1, respectively. | | SELU has the property of self-normalization, which can lead to improved performance and stability in deep neural networks. SELU applies a scaled ELU function. The scale factor stabilizes the activations and supports self-normalization: the mean and standard deviation of the outputs are kept at approximately 0 and 1, respectively (helps address the vanishing/exploding gradient problem).
Swish | SWISH(x) = x × sigmoid(beta × x) | Beta is a hyperparameter that controls the behavior of the function. Higher values of beta can lead to more pronounced nonlinearity, while lower values can make it closer to a linear function. | beta = 0.5 | Combines the linearity of the identity function (x) with the non-linearity of the sigmoid function (for positive inputs, it retains the linearity; for negative inputs, the output towards zero is dampened due to the sigmoid function). It performs well in CNNs.
SWISH-RELU | SWISH-RELU(x) = x × sigmoid(beta × x), if x > 0; SWISH-RELU(x) = x, if x ≤ 0 | The advantage of SWISH-RELU is that it retains the desirable properties of Swish, … | beta = 0.1 | The Swish activation function with a ReLU fallback is a Swish and ReLU hybrid. The sigmoid introduces a smooth non-linearity, while the ReLU fallback ensures that the activation does not …
respectively (helps address the vanishing/exploding gradient problem).During testing, SELU behaves as a scaled identity function (inputs are multiplied by the scale factor to preserve the output magnitude).Swish SWISH (x) = x × sigmoid(beta × x) Beta is a hyperparameter that controls the behavior of the function.Higher values of beta can lead to more pronounced nonlinearity, while lower values can make it closer to the identity function.beta = 0.5 Combines the linearity of the identity function (x) with the non-linearity of the sigmoid function (for positive inputs: retains the linearity; for negative inputs, the output towards zero is dampened due to the sigmoid function).It performs well in CNNs.SWISH-RELU SWISH-RELU(x) = x × sigmoid(beta × x) if x > 0 SWISH-RELU(x) = x if x ≤ 0 The advantage of SWISH-RELU is that it retains the desirable properties of Swish, beta = 0.1 The Swish activation function with a ReLU fallback is a Swish and ReLU hybrid.The sigmoid introduces a smooth non-linearity, while the ReLU fallback ensures that the activation does not SELU, by adjusting the mean and variance, takes care of internal normalization.Gradients can be used to adjust the variance (needs a region with a gradient > 1 to increase it).During training, SELU applies a modified ELU function (negative inputs are transformed with a negative slope).The scale factor stabilizes the activations and ensures self-normalization.The mean and standard deviation of the outputs are enforced to be approximately 0 and 1, respectively (helps address the vanishing/exploding gradient problem).During testing, SELU behaves as a scaled identity function (inputs are multiplied by the scale factor to preserve the output magnitude).Swish SWISH (x) = x × sigmoid(beta × x) Beta is a hyperparameter that controls the behavior of the function.Higher values of beta can lead to more pronounced non-linearity, while lower values can make it closer to the identity function.) 
= {x, if x > 0, alpha × (exp(x) − 1), if x ≤ 0} Alpha is a hyperparameter (controls the behavior of the function); ELU captures more nuanced information from negative inputs and alleviate the vanishing for lambda λ or in general SELU(x) = {scale × (x if x > 0 else (alpha × exp(x) − alpha)), if training; scale × x, if testing} x is input to the activation function, alpha is a hyperparameter that controls the slope for negative inputs and scale is a scaling factor to maintain the mean and variance of the inputs close to 0 and 1, respectively.SELU has the property of self-normalization, which can lead to improved performance and stability in deep neural networks.training, SELU applies a modified ELU function (negative inputs are transformed with a negative slope).The scale factor stabilizes the activations and ensures self-normalization.The mean and standard deviation of the outputs are enforced to be approximately 0 and 1, respectively (helps address the vanishing/exploding gradient problem).During testing, SELU behaves as a scaled identity function (inputs are multiplied by the scale factor to preserve the output magnitude).Swish SWISH (x) = x × sigmoid(beta × x) Beta is a hyperparameter that controls the behavior of the function.Higher values of beta can lead to more pronounced nonlinearity, while lower values can make it closer to the identity function.beta = 0.5 Combines the linearity of the identity function (x) with the non-linearity of the sigmoid function (for positive inputs: retains the linearity; for negative inputs, the output towards zero is dampened due to the sigmoid function).It performs well in CNNs.SWISH-RELU SWISH-RELU(x) = x × sigmoid(beta × x) if x > 0 SWISH-RELU(x) = x if x ≤ 0 The advantage of SWISH-RELU is that it retains the desirable properties of Swish, beta = 0.1 The Swish activation function with a ReLU fallback is a Swish and ReLU hybrid.The sigmoid introduces a smooth non-linearity, while the ReLU fallback ensures that the activation 
does not beta = 0.5 Combines the linearity of the identity function (x) with the non-linearity of the sigmoid function (for positive inputs: retains the linearity; for negative inputs, the output towards zero is dampened due to the sigmoid function).It performs well in CNNs.
during testing.This introduces a form of regularization and can help prevent overfitting.Similar to leaky ReLU.A variation of Leaky ReLU that randomly samples the slope from a uniform distribution during training.Exponential Linear Unit (ELU) ELU(x) = {x, if x > 0, alpha × (exp(x) − 1), if x ≤ 0} Alpha is a hyperparameter (controls the behavior of the function); ELU captures more nuanced information from negative inputs and alleviate the vanishing gradient problem.alpha = 1.0 Smooths negative inputs by using an exponential function; the exponential smoothing helps reduce the impact of noisy activations.predefined value for lambda λ or in general SELU(x) = {scale × (x if x > 0 else (alpha × exp(x) − alpha)), if training; scale × x, if testing} x is input to the activation function, alpha is a hyperparameter that controls the slope for negative inputs and scale is a scaling factor to maintain the mean and variance of the inputs close to 0 and 1, respectively.SELU has the property of self-normalization, which can lead to improved performance and stability in deep neural networks.SELU, by adjusting the mean and variance, takes care of internal normalization.Gradients can be used to adjust the variance (needs a region with a gradient > 1 to increase it).During training, SELU applies a modified ELU function (negative inputs are transformed with a negative slope).The scale factor stabilizes the activations and ensures self-normalization.The mean and standard deviation of the outputs are enforced to be approximately 0 and 1, respectively (helps address the vanishing/exploding gradient problem).During testing, SELU behaves as a scaled identity function (inputs are multiplied by the scale factor to preserve the output magnitude).Swish SWISH (x) = x × sigmoid(beta × x) Beta is a hyperparameter that controls the behavior of the function.Higher values of beta can lead to more pronounced nonlinearity, while lower values can make it closer to the identity function.beta = 0.5 
Combines the linearity of the identity function (x) with the non-linearity of the sigmoid function (for positive inputs: retains the linearity; for negative inputs, the output towards zero is dampened due to the sigmoid function).It performs well in CNNs.SWISH-RELU SWISH-RELU(x) = x × sigmoid(beta × x) if x > 0 SWISH-RELU(x) = x if x ≤ 0 The advantage of SWISH-RELU is that it retains the desirable properties of Swish, beta = 0.1 The Swish activation function with a ReLU fallback is a Swish and ReLU hybrid.The sigmoid introduces a smooth non-linearity, while the ReLU fallback ensures that the activation does not beta = 0.1 The Swish activation function with a ReLU fallback is a Swish and ReLU hybrid.The sigmoid introduces a smooth non-linearity, while the ReLU fallback ensures that the activation does not completely vanish for negative inputs.SWISH-RELU performs well in CNNs for image classification.

)))
erf is the error function used to model cumulative distribution.erf = 0.3 GELU has a smooth and non-linear behavior that can help capture complex patterns and gradients; it performs well in NLP and CNNs.It is computationally more expensive than ReLU due to the involvement of erf but improves the performance in certain scenarios.Softmax Softmax(xi) = ) ∑ ) Given an input vector of x = [x1, x2, ..., xn], the Softmax function computes the probability pi for each element xi as: Softmax(xi) = exp(xi)/sum(exp(xj)) for j = 1 to n The highest probability class is selected as the predicted class label.Boundaries vary based on the xi and xj values.Used as the final activation function in the output layer for multi-class classification tasks (takes a vector of real numbers inputs and outputs a vector of probabilities between 0 and 1 that sum up to 1).Enables the network to assign probabilities to each class, indicating the model s confidence for each class prediction.Hyperbolic Tangent (Tanh) Tanh(x) = (exp(x) − exp(-x))/(exp(x) + exp(-x)) = ) ) Non-linear symmetric function around the origin (squeezes the input value into a range between −1 and 1).Useful for tasks that require outputs in the range of −1 to 1 or for modeling symmetric patterns.Suffers from the "vanishing gradient" problem, where the gradient becomes extremely small for inputs with very high absolute values.Sigmoid (logistic) sigmoid(x) = 1/(1 + exp(−x)) A non-linear function that squeezes the input value into a range between 0 and 1. 
Suffers from the "vanishing gradient" problem, where the gradient becomes extremely small for inputs with very high or very low absolute values.Maps any real-valued number to a value between 0 and 1, with values close to 0 representing the lower end of the range and values close to 1 representing the upper end (suitable for binary classification tasks or probabilistic outputs).Softplus Softplus(x) = log(1 + exp(x)) Designed to be a smooth and differentiable approximation of the ReLU function, which is non-differentiable at x = 0. Commonly used in variational Has similar properties to ReLU, where positive inputs are passed through unchanged, while negative inputs are mapped to small positive values.It introduces non-linearity to the network, erf = 0.3 GELU has a smooth and non-linear behavior that can help capture complex patterns and gradients; it performs well in NLP and CNNs.It is computationally more expensive than ReLU due to the involvement of erf but improves the performance in certain scenarios.Softmax Softmax(x i ) = exp(xi) ∑ j exp(xj)Given an input vector of x = [x 1 , x 2 , . 
.., x n ], the Softmax function computes the probability p i for each element x i as: Softmax(x i ) = exp(x i )/sum(exp(x j )) for j = 1 to n The highest probability class is selected as the predicted class label.Boundaries vary based on the x i and x j values.Used as the final activation function in the output layer for multi-class classification tasks (takes a vector of real numbers inputs and outputs a vector of probabilities between 0 and 1 that sum up to 1).Enables the network to assign probabilities to each class, indicating the model's confidence for each class prediction.Hyperbolic Tangent (Tanh)Tanh(x) = (exp(x) − exp(−x))/(exp(x)+ exp(−x)) = (e x −e −x ) (e x +e −x )Non-linear symmetric function around the origin (squeezes the input value into a range between −1 and 1).This fallback mitigates the problem of dead neurons and vanishing gradients associated with the standard Swish activation function.GaussianError Linear Unit (GELU)GELU(x) = 0.5x × (1 + erf(x/sqrt(2)))This is smooth and non-monotonic.x is the input and erf is the error function used to model cumulative distribution.erf = 0.3 GELU has a smooth and non-linear behavior that can help capture complex patterns and gradients; it performs well in NLP and CNNs.It is computationally more expensive than ReLU due to the involvement of erf but improves the performance in certain scenarios.Softmax Softmax(xi) = ) ∑ ) Given an input vector of x = [x1, x2, ..., xn], the Softmax function computes the probability pi for each element xi as: Softmax(xi) = exp(xi)/sum(exp(xj)) for j = 1 to n The highest probability class is selected as the predicted class label.Boundaries vary based on the xi and xj values.Used as the final activation function in the output layer for multi-class classification tasks (takes a vector of real numbers inputs and outputs a vector of probabilities between 0 and 1 that sum up to 1).Enables the network to assign probabilities to each class, indicating the model s confidence for each 
class prediction. ) = (exp(x) − exp(-x))/(exp(x) + exp(-x)) = ) Non-linear symmetric function around the origin (squeezes the input value into a range between −1 and 1).Useful for tasks that require outputs in the range of −1 to 1 or for modeling symmetric patterns.Suffers from the "vanishing gradient" problem, where the gradient becomes extremely small for inputs with very high absolute values.Sigmoid (logistic) sigmoid(x) = 1/(1 + exp(−x)) A non-linear function that squeezes the input value into a range between 0 and 1. Suffers from the "vanishing gradient" problem, where the gradient becomes extremely small for inputs with very high or very low absolute values.Maps any real-valued number to a value between 0 and 1, with values close to 0 representing the lower end of the range and values close to 1 representing the upper end (suitable for binary classification tasks or probabilistic outputs).Softplus Softplus(x) = log(1 + exp(x)) Designed to be a smooth and differentiable approximation of the ReLU function, which is non-differentiable at x = 0. Commonly used in variational Has similar properties to ReLU, where positive inputs are passed through unchanged, while negative inputs are mapped to small positive values.It introduces non-linearity to the network, Useful for tasks that require outputs in the range of −1 to 1 or for modeling symmetric patterns.Suffers from the "vanishing gradient" problem, where the gradient becomes extremely small for inputs with very high absolute values.Sigmoid (logistic) sigmoid(x) = 1/(1 + exp(−x)) A non-linear function that squeezes the input value into a range between 0 and 1. Suffers from the "vanishing gradient" problem, where the gradient becomes extremely small for inputs with very high or very low absolute values.) 
= 0.5x × (1 + erf(x/sqrt(2))) This is smooth and non-monotonic.x is the input and erf is the error function used to model cumulative distribution.erf = 0.3 GELU has a smooth and non-linear behavior that can help capture complex patterns and gradients; it performs well in NLP and CNNs.It is computationally more expensive than ReLU due to the involvement of erf but improves the performance in certain scenarios.Given an input vector of x = [x1, x2, ..., xn], the Softmax function computes the probability pi for each element xi as: Softmax(xi) = exp(xi)/sum(exp(xj)) for j = 1 to n The highest probability class is selected as the predicted class label.Boundaries vary based on the xi and xj values.Used as the final activation function in the output layer for multi-class classification tasks (takes a vector of real numbers inputs and outputs a vector of probabilities between 0 and 1 that sum up to 1).Enables the network to assign probabilities to each class, indicating the model s confidence for each class prediction. ) = (exp(x) − exp(-x))/(exp(x) + exp(-x)) = ) Non-linear symmetric function around the origin (squeezes the input value into a range between −1 and 1).Useful for tasks that require outputs in the range of −1 to 1 or for modeling symmetric patterns.Suffers from the "vanishing gradient" problem, where the gradient becomes extremely small for inputs with very high absolute values.Sigmoid (logistic) sigmoid(x) = 1/(1 + exp(−x)) A non-linear function that squeezes the input value into a range between 0 and 1. 
Suffers from the "vanishing gradient" problem, where the gradient becomes extremely small for inputs with very high or very low absolute values.Maps any real-valued number to a value between 0 and 1, with values close to 0 representing the lower end of the range and values close to 1 representing the upper end (suitable for binary classification tasks or probabilistic outputs).Softplus Softplus(x) = log(1 + exp(x)) Designed to be a smooth and differentiable approximation of the ReLU function, which is non-differentiable at x = 0. Commonly used in variational Has similar properties to ReLU, where positive inputs are passed through unchanged, while negative inputs are mapped to small positive values.It introduces non-linearity to the network, Maps any real-valued number to a value between 0 and 1, with values close to 0 representing the lower end of the range and values close to 1 representing the upper end (suitable for binary classification tasks or probabilistic outputs).Softplus Softplus(x) = log(1 + exp(x)) Designed to be a smooth and differentiable approximation of the ReLU function, which is non-differentiable at x = 0. Commonly used in variational autoencoders (VAEs) and some recurrent neural networks (RNNs).) 
= 0.5x × (1 + erf(x/sqrt(2))) This is smooth and non-monotonic.x is the input and erf is the error function used to model cumulative distribution.erf = 0.3 GELU has a smooth and non-linear behavior that can help capture complex patterns and gradients; it performs well in NLP and CNNs.It is computationally more expensive than ReLU due to the involvement of erf but improves the performance in certain scenarios.Given an input vector of x = [x1, x2, ..., xn], the Softmax function computes the probability pi for each element xi as: Softmax(xi) = exp(xi)/sum(exp(xj)) for j = 1 to n The highest probability class is selected as the predicted class label.Boundaries vary based on the xi and xj values.Used as the final activation function in the output layer for multi-class classification tasks (takes a vector of real numbers inputs and outputs a vector of probabilities between 0 and 1 that sum up to 1).Enables the network to assign probabilities to each class, indicating the model s confidence for each class prediction. ) = (exp(x) − exp(-x))/(exp(x) + exp(-x)) = ) Non-linear symmetric function around the origin (squeezes the input value into a range between −1 and 1).Useful for tasks that require outputs in the range of −1 to 1 or for modeling symmetric patterns.Suffers from the "vanishing gradient" problem, where the gradient becomes extremely small for inputs with very high absolute values.Sigmoid (logistic) sigmoid(x) = 1/(1 + exp(−x)) A non-linear function that squeezes the input value into a range between 0 and 1. 
Suffers from the "vanishing gradient" problem, where the gradient becomes extremely small for inputs with very high or very low absolute values.Maps any real-valued number to a value between 0 and 1, with values close to 0 representing the lower end of the range and values close to 1 representing the upper end (suitable for binary classification tasks or probabilistic outputs).Softplus Softplus(x) = log(1 + exp(x)) Designed to be a smooth and differentiable approximation of the ReLU function, which is non-differentiable at x = 0. Commonly used in variational Has similar properties to ReLU, where positive inputs are passed through unchanged, while negative inputs are mapped to small positive values.It introduces non-linearity to the network, Has similar properties to ReLU, where positive inputs are passed through unchanged, while negative inputs are mapped to small positive values.It introduces non-linearity to the network, allows for the modeling of complex patterns, and provides smoother gradients than ReLU (facilitates better training and convergence).Mish Mish(x) = x × tanh(softplus(x)) Mish does not have a closed-form derivative and is often approximated or numerically computed during backpropagation.It introduces non-linear behavior, captures complex patterns and alleviates the vanishing gradient problem.autoencoders (VAEs) and some recurrent neural networks (RNNs).allows for the modeling of complex patterns, and provides smoother gradients than ReLU (facilitates better training and convergence).Mish Mish(x) = x × tanh(softplus(x)) Mish does not have a closed-form derivative and is often approximated or numerically computed during backpropagation.It introduces non-linear behavior, captures complex patterns and alleviates the vanishing gradient problem.Performs well in image classification and NLP.Mish combines the non-linearity of the softplus function with the smoothness of the hyperbolic tangent function (has a similar shape to the Swish activation function 
but with a gentler slope for negative inputs).Inverse Square ISRU(x, alpha) = x/sqrt(1 + alpha × x 2 ) Alpha is a positive constant that determines the steepness and shape of the ISRU function; a larger alpha value ISRU is used as an alternative to sigmoid or tanh in situations where a more Performs well in image classification and NLP.Mish combines the non-linearity of the softplus function with the smoothness of the hyperbolic tangent function (has a similar shape to the Swish activation function but with a gentler slope for negative inputs).
alpha) = x/sqrt(1 + alpha × x 2 ) Alpha is a positive constant that determines the steepness and shape of the ISRU function; a larger alpha value results in a steeper curve, while a smaller alpha value leads to a more gradual curve.The square root and normalization ensure that the output remains within a reasonable range.alpha = 0.5 ISRU is used as an alternative to sigmoid or tanh in situations where a more gradual transition from low to high activations is desired.But it is not widely used in deep learning models.
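As an illustration of the formulas in Table 2, the following is a minimal scalar sketch in plain Python rather than the implementation of any particular framework. The function names are hypothetical, the default values of alpha and beta follow those listed in the table, and the two SELU constants are taken from the original SELU formulation (an assumption, since the table only states that they are predefined). The sigmoid, softplus and softmax helpers are written in a numerically stable form, which is a design choice of this sketch and not something the table prescribes.

```python
import math

# SELU constants from the original SELU formulation (assumed here; Table 2
# only says the values are predefined).
SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805


def relu(x):
    return max(0.0, x)


def leaky_relu(x, alpha=0.1):
    return x if x > 0 else alpha * x


def elu(x, alpha=1.0):
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


def selu(x):
    # Same function at training and inference time.
    return SELU_SCALE * (x if x > 0 else SELU_ALPHA * (math.exp(x) - 1.0))


def sigmoid(x):
    # Branch on the sign so that exp() never receives a large positive value.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def swish(x, beta=0.5):
    return x * sigmoid(beta * x)


def swish_relu(x, beta=0.1):
    # Hybrid as written in Table 2: Swish for x > 0, pass-through otherwise.
    return x * sigmoid(beta * x) if x > 0 else x


def gelu(x):
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def softplus(x):
    # Equivalent to log(1 + exp(x)) but avoids overflow for large x.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def mish(x):
    return x * math.tanh(softplus(x))


def isru(x, alpha=0.5):
    return x / math.sqrt(1.0 + alpha * x * x)


def softmax(xs):
    # Subtracting the maximum leaves the result unchanged and avoids overflow.
    m = max(xs)
    exps = [math.exp(v - m) for v in xs]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    for v in (-2.0, 0.0, 2.0):
        print(v, relu(v), leaky_relu(v), elu(v), selu(v), swish(v), gelu(v), mish(v))
    print(softmax([1.0, 2.0, 3.0]))
```

In practice, deep learning frameworks provide vectorized and differentiable versions of most of these functions, so a sketch like this is mainly useful for checking the behavior of a formula on a few inputs.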

Figure 4. Overlap among the top 10 most popular CNN frameworks, ranked by the number of search hits in Google Scholar, PubMed and IEEE Xplore.

Figure 5. Results from statistical modeling of 231 (PubMed) citations. X-ray tomography (mainly in relation to pulmonary nodules) and magnetic resonance imaging (Alzheimer's disease and brain neoplasm) seem to be widely analyzed using CNN methods in references indexed by PubMed. Color palettes indicate (a) relationships of studies and (b) publication years. The pandemic does not seem to have significantly changed the patterns of PubMed-indexed publications.

Figure 6. Statistical modeling of the 212 (Google Scholar-indexed) citations. Color palettes represent the year of publication. This result shows that image segmentation (using a hybrid of a transformer and a CNN) and classification for diagnostic purposes (mainly COVID-19) appear to be the dominant tasks, as they are the most frequently mentioned in the titles of the 212 references.

Figure 7. Results from statistical modeling of the 863 (IEEE Xplore) search hits visualized using VOSviewer. This result shows that transfer learning and data augmentation, including the use of GANs, seem to be largely pre-COVID-19 methods. It is probable that the pandemic and other medical conditions during or after COVID-19 led to the generation of sufficient medical images (image datasets) along with improved CNN approaches, making the use of data augmentation and/or transfer learning less frequent.

Figure 8. Pathological conditions, medical images and performance metrics are more frequently mentioned within the 2278 research articles and the 938 proceeding papers.

Figure 9. Ranking of the more frequently used terms and phrases related to methods, diseases, images and metrics that were mentioned in the 2278 research articles, 938 conference proceeding papers and 903 book chapters relevant to the application of CNNs for medical image understanding.

Figure 10. Distribution of publications per year (a) found using IEEE Xplore, PubMed, Google Scholar and unique combinations of the search hits from the three search engines or databases; (b) identified using Dimensions (which also includes the search hits obtained from IEEE Xplore, PubMed and Google Scholar).

Figure 11. Bibliometrix analysis results showing (a) summary information about 2273 peer-reviewed articles on the application of CNNs for medical imaging and (b) the annual scientific production and average total citations per year.

Figure 12. (a) Most of the foundational CNN-related works seem to have been published between 2015 and 2018, which supported the explosive growth of the field during the pandemic (alternatively, prior works on applying CNNs to medical image classification may not have been numerous, so that the papers published from 2015 to 2019 were disproportionately cited by the flood of papers during the pandemic and in the ensuing years). (b) Co-citation sources, i.e., the number of papers published by the different journals. Node sizes are proportional to the number of papers or proceedings published in that particular journal, and colors indicate either years of publication or inter-citation clustering.

Figure 13. Network showing bibliographic coupling of sources (cross-journal citations). Node size indicates the number of citations, and colors indicate inter-journal citations.

Figure 14. Distribution of the 114 articles identified using Dimensions with the keywords "Review AND CNN AND MEDICAL AND (IMAGE OR IMAGING) AND CLASSIFICATION" in the titles and abstracts.

Table 1. Improved or hybrid CNN architectures that are applied for medical image understanding.

Table 3. Frequently used CNN frameworks (the order of this list is arbitrary).
A distributed deep learning library for Apache Spark (fast, distributed and secure AI for big data).
* Additional frameworks implementing graph neural networks (GNNs) are available, such as PyTorch Geometric (PyTorch), TensorFlow GNN (TensorFlow) and jraph (Google JAX). Relevant application domains for GNNs include natural language processing, social networks, citation networks, molecular biology, physics and NP-hard combinatorial optimization problems.

Table 4. The most widely used hyperparameters for convolutional neural networks (CNNs).

Table 5. List of salient image datasets (that are important for medical themes).

Table 6. Comparison of the current study with other method review papers on medical image understanding. x denotes yes, and xx denotes comprehensive coverage of alternative and/or improved CNN components or algorithms.