Ten Fast Transfer Learning Models for Carotid Ultrasound Plaque Tissue Characterization in Augmentation Framework Embedded with Heatmaps for Stroke Risk Stratification

Background and Purpose: Only 1–2% of the internal carotid artery asymptomatic plaques are unstable as a result of >80% stenosis. Thus, unnecessary efforts can be saved if these plaques can be characterized and classified into symptomatic and asymptomatic using non-invasive B-mode ultrasound. Earlier plaque tissue characterization (PTC) methods were machine learning (ML)-based and used hand-crafted features, which yielded lower accuracy and reliability. The proposed study shows the role of transfer learning (TL)-based deep learning models for PTC. Methods: Because pretrained weights were used in the supercomputer framework, we hypothesize that transfer learning (TL) provides improved performance compared with deep learning (DL). We applied 11 kinds of artificial intelligence (AI) models; 10 of them were augmented and optimized using TL approaches, a class of Atheromatic™ 2.0 TL (AtheroPoint™, Roseville, CA, USA) that consisted of (i–ii) Visual Geometry Group-16, 19 (VGG16, 19); (iii) Inception V3 (IV3); (iv–v) DenseNet121, 169; (vi) XceptionNet; (vii) ResNet50; (viii) MobileNet; (ix) AlexNet; (x) SqueezeNet; and one DL-based (xi) SuriNet, derived from UNet. We benchmarked the 11 AI models against our earlier deep convolutional neural network (DCNN) model. Results: The best performing TL was MobileNet, with accuracy and area-under-the-curve (AUC) pairs of 96.10 ± 3% and 0.961 (p < 0.0001), respectively. In DL, DCNN was comparable to SuriNet, with accuracies of 95.66% and 92.7 ± 5.66%, and AUCs of 0.956 (p < 0.0001) and 0.927 (p < 0.0001), respectively. We validated the performance of the AI architectures with established biomarkers such as greyscale median (GSM), fractal dimension (FD), higher-order spectra (HOS), and visual heatmaps. We benchmarked against the previously developed Atheromatic™ 1.0 ML and showed an improvement of 12.9%. Conclusions: TL is a powerful AI tool for PTC into symptomatic and asymptomatic plaques.


Introduction
Stroke is the third leading cause of mortality in the United States of America (USA) [1]. According to World Health Organization (WHO) statistics, cardiovascular disease (CVD) small, we added the augmentation block as part of the pre-processing step. The AI model block helps to determine whether plaques are symptomatic or asymptomatic. This is accomplished by transforming the test plaque image by the trained TL/DL models. In our proposed framework, because there are 11 models, we run each test patient's plaque using 11 (10 TL + 1 DL) different AI models for predicting 11 kinds of labels. We determine the performance of these 11 architectures, followed by the ranking of their performance.
We proposed an optimized TL model for carotid ultrasound-based plaque tissue classification (Atheromatic™ 2.0 TL, AtheroPoint™, Roseville, CA, USA). Because the features using this system are computed using a deep learning paradigm, we hypothesize that the performance of TL is superior and/or comparable to that of DL. Lastly, we have also designed a computer-aided diagnostics (CAD) system for computing heatmaps using an AI-based approach.

Literature Survey
The existing work on carotid plaque characterization using ultrasound with AI techniques is primarily focused on the machine learning paradigm. A handful of the studies are focused on using DL. Our study is the first of its kind that uses the TL paradigm embedded with heatmaps for PTC. The section briefly presents the works on PTC. Detailed tabulation is described in the discussion section.
Seabra et al. [49] used graph cut techniques for the characterization of 3D ultrasound. It allows for the detection and quantification of the vulnerable plaque. The same set of authors in [50] estimated the volume inside the ROI plaque using the Bayesian technique. They compared the proposed method with a gold standard and achieved better results with greyscale median (GSM) < 32. In [51], they characterized the plaque components such as lipids, fibrotic, and calcified using the Rayleigh mixture model (RMM).
Afonso et al. [52] proposed a CAD tool (AtheroRisk™, AtheroPoint, Roseville, CA, USA) to characterize the plaque echogenicity using an activity index and an enhanced activity index (EAI). The authors achieved an area-under-the-curve (AUC) of 64.96%, 73.29%, and 90.57% for the degree of stenosis, activity index, and enhanced activity index, respectively. This AtheroRisk™ CAD system was able to measure the plaque rupture risk. Loizou et al. identified and segmented the carotid plaque in M-mode ultrasound videos (MUVs) using a snake algorithm [53–55]. In [56], the authors studied the variations in texture features such as spatial gray level dependence matrices (SGLD) and gray level difference statistics (GLDS) in the MUV framework and classified them using a support vector machine (SVM) classifier. Doonan et al. [57] studied the relationship between textural and echo density features of carotid plaque by applying a principal component analysis (PCA)-based feature selection technique. The authors showed a moderate coefficient of correlation (r) between these two features, ranging from 0.211 to 0.641. In addition to the above studies, Acharya et al. [58–60], Gastounioti et al. [61], Skandha et al. [62], and Saba et al. [63] also conducted studies in the area of PTC using AI methods. These are discussed in detail in Section 5, labeled benchmarking.

Methodology
This section focuses on patient demographics, ultrasound acquisition, pre-processing, and augmentation protocol. We also described all 11 AI architectures, consisting of ten transfer learning architectures and one deep learning architecture labelled as SuriNet. These are then benchmarked against the deep convolution neural network (DCNN).

Patient Demographics
This cohort consisted of 346 patients with a mean age of 69.9 ± 7.8 years, 61% of them male, with an internal carotid artery (ICA) stenosis of 50% to 99%. The study was approved by the ethical committee of St. Mary's Hospital, Imperial College, London, UK (in 2000). The cohort consisted of 196 symptomatic and 150 asymptomatic patients. All the symptomatic patients had ipsilateral cerebral hemispheric symptoms: amaurosis fugax (AF), transient ischemic attacks (TIAs), or a previous history of stroke. Overall, the symptomatic class contained 38 AF, 70 TIAs, and 88 strokes, totaling 196. All the asymptomatic patients showed no abnormalities during the neurological study. The same cohort was used in our previous studies [29,32,40,58,62–65].

Ultrasound Data Acquisition and Pre-Processing
All the US scans were acquired using an ATL machine (Model: HDI 3000; Make: Advanced Technology Laboratories, Seattle, WA, USA) in Irvine Laboratory for Cardiovascular Investigation and Research, St. Mary's Hospital, UK. This scanner was equipped with a linear broadband width 4–7 MHz (multifrequency) transducer with a 20 pixel/mm resolution. We used proprietary software called "PTAS" developed by Icon soft International Ltd., Greenford, London, UK for normalization and plaque ROI delineation, as used in previous studies [29,32,58,62,64,65]. The medical practitioners delineated the plaque region-of-interest (ROI) using the mouse and trackball; these were then saved in a separate file. Full scans and delineated plaques are shown in Figure 2.

Augmentation
Our cohort was unbalanced, consisting of 196 symptomatic and 150 asymptomatic patients. Therefore, we chose to balance the classes using an augmentation strategy prior to the offline training and online prediction processes. We accomplished this by adding 4 symptomatic and 50 asymptomatic augmented images using random linear transformations such as flipping, rotation by 90 degrees, rotation by 270 degrees, and skew operations. This resulted in a balanced cohort containing 200 images in each class (Aug 1×). Further, the database was enlarged two to six times using the same linear transformations, keeping an equal number of images in each class. Together with the balanced cohort, this resulted in six folds of the augmented cohort. We represent the enlarged folds as Augmented 2× (Aug 2×), Augmented 3× (Aug 3×), Augmented 4× (Aug 4×), Augmented 5× (Aug 5×), and Augmented 6× (Aug 6×). Thus, every fold contained 200 × n images in each class, where n is the augmentation fold.
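The balancing and fold-building steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used in the study: the images are random placeholder arrays, the `skew` helper is a simple circular row shift standing in for a skew operation, and the names `augment_to`, `TRANSFORMS`, and `folds` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def skew(img, max_shift=4):
    """Approximate skew: shift each row by an amount that grows with the row index.
    np.roll wraps pixels around, so this is a circular shear (an assumption)."""
    rows = img.shape[0]
    out = np.empty_like(img)
    for r in range(rows):
        shift = int(round(max_shift * r / max(rows - 1, 1)))
        out[r] = np.roll(img[r], shift)
    return out

TRANSFORMS = [
    np.fliplr,                      # flipping
    lambda im: np.rot90(im, k=1),   # rotation by 90 degrees
    lambda im: np.rot90(im, k=3),   # rotation by 270 degrees
    skew,                           # skew operation
]

def augment_to(images, target):
    """Append randomly transformed copies until the list holds `target` images."""
    out = list(images)
    while len(out) < target:
        src = images[rng.integers(len(images))]
        op = TRANSFORMS[rng.integers(len(TRANSFORMS))]
        out.append(op(src))
    return out

# Placeholder arrays standing in for delineated plaque ROIs (hypothetical data).
symptomatic = [rng.random((32, 32)) for _ in range(196)]
asymptomatic = [rng.random((32, 32)) for _ in range(150)]

balanced_sym = augment_to(symptomatic, 200)    # adds 4 images
balanced_asym = augment_to(asymptomatic, 200)  # adds 50 images

# Fold n (Aug n×) holds 200 * n images in each class.
folds = {n: (augment_to(balanced_sym, 200 * n), augment_to(balanced_asym, 200 * n))
         for n in range(1, 7)}
```

Square placeholder images are used so that the 90-degree rotations keep the image shape unchanged.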

Transfer Learning
The choice of the TL architecture for PTC was motivated by (a) the diversity of the TL models and (b) the depth of the neural network models. Thus, we took two architectures from the VGG group (VGG-16 and VGG-19), two architectures from the DenseNet group (DenseNet121 and DenseNet169), and two architectures from the ResNet group (ResNet50 and ResNet101). Across these models, the network depth extends up to 169 layers, while the selection ensures diversity. Note that some of the architectures, such as MobileNet and XceptionNet, are the most current, state-of-the-art, and popular TL architectures, demonstrating faster optimization (see Figure 3).
Visual Geometry Group (VGG-16) is a popular pre-trained model developed by Simonyan et al. [66] to increase the neural network's depth by adding a number of 3 × 3 convolution filters. The purpose of VGGx is to design a very deep CNN for understanding complex patterns in the input features; it is typically adapted for object recognition in medical imaging and computer vision. The architecture of VGG-16 and VGG-19 is shown in Figure 4, where the input block accepts an image of size 224 × 224. VGG-19 is three layers deeper than VGG-16.

InceptionV3
InceptionV3 (IV3) is version 3 of the Inception architecture and was first developed by Szegedy et al. [69]. This model was developed to reduce the computational cost while keeping the parameter count low. It can handle big data and therefore has high overall efficiency. InceptionV3 achieves an accuracy greater than 78.1% on the ImageNet dataset. The architecture contains several blocks, each made of convolution and max-pooling layers. In the architecture given in Figure 5, DL1 to DL6 represent the depth-wise convolutions, C1 represents the initial convolution block, T1 to T3 represent the transition layers, and D1 to D4 represent the batch normalization blocks. In the InceptionV3 architecture, each block in the top row represents the repeated process of row 2 and row 3, and each block in row 2 represents the repeated process of row 3. Each convolution layer is fused with a 1 × 1 convolution filter with stride 1 and padding 0. First, it increases the feature map (FM) size; then a 3 × 3 convolution layer with stride 1 and padding 1 is added, which reduces the FM depth. The resultant FM and the initial FM are fused together to give each block in row 2.

ResNet
He et al. [70] from Microsoft Research proposed the ResNet architecture for solving the vanishing gradient problem. It is built from residual blocks, which contain skip connections. A skip connection bypasses one or more layers and feeds the block input directly to the block output. Because the gradient can flow through these shortcuts, deeper models can be trained and can learn complex patterns. Unlike other TL models, this model is trained on the CIFAR-10 data set. Figure 6 represents the ResNet architecture. In the architecture, two 3 × 3 convolution layers are paired together. The output of each pair and its input are fused together and fed to the next pair. Here, the number of filters increases from 64 to 512. After the last 3 × 3 convolution layer with 512 filters, a flatten layer vectorizes the 2D features, and the output is predicted using the softmax activation function.
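The pairing of two 3 × 3 convolutions whose output is fused with the block input can be illustrated with a small NumPy sketch. It is a simplified illustration, assuming an identity shortcut with equal input and output channel counts and omitting batch normalization; `conv3x3` and `residual_block` are hypothetical helper names, not code from the study.

```python
import numpy as np

def conv3x3(x, w):
    """'Same'-padded 3x3 convolution. x: (H, W, C_in), w: (3, 3, C_in, C_out)."""
    h, wd, _ = x.shape
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros((h, wd, w.shape[-1]))
    for i in range(3):
        for j in range(3):
            # Each kernel tap is a per-pixel linear map over the channels.
            out += xp[i:i + h, j:j + wd, :] @ w[i, j]
    return out

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """Two paired 3x3 convolutions; the block input is added back to their output."""
    y = relu(conv3x3(x, w1))
    y = conv3x3(y, w2)
    return relu(y + x)  # skip connection (identity shortcut)

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8, 4))
w1 = rng.standard_normal((3, 3, 4, 4)) * 0.1
w2 = rng.standard_normal((3, 3, 4, 4)) * 0.1
y = residual_block(x, w1, w2)
```

With all-zero weights the block reduces to `relu(x)`, which shows that the shortcut lets the input pass through even when the convolution path contributes nothing.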

DenseNet
Huang et al. [48] proposed the DenseNet architecture for solving the vanishing gradient problem in deep neural nets. This model introduced dense blocks. Each dense block contains a pool of convolution layers with 3 × 3 to 1 × 1 filters followed by batch normalization, and every layer uses the "ReLU" activation function. Each of these dense blocks is concatenated with the previous block's output and input using transition blocks. Each transition block contains a convolution and pooling layer with 2 × 2 to 1 × 1 filters and dropout layers. This concatenation of blocks preserves feature propagation. In addition, the authors proposed several architectures (DenseNet-121, 169, 201, and 264) that increase the size of the dense blocks. Figure 7 shows the DenseNet architecture.
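The concatenation inside a dense block can be sketched as below. This is a deliberately reduced illustration: a per-pixel linear map stands in for the convolution layers, batch normalization and dropout are omitted, and `dense_block` and `growth_rate` are names chosen for this sketch.

```python
import numpy as np

def dense_block(x, num_layers, growth_rate, rng):
    """Each layer receives the concatenation of all earlier feature maps."""
    features = x
    for _ in range(num_layers):
        # 1x1-convolution stand-in: per-pixel linear map followed by ReLU.
        w = rng.standard_normal((features.shape[-1], growth_rate)) * 0.1
        new = np.maximum(features @ w, 0.0)
        # Concatenate along the channel axis so earlier features are preserved.
        features = np.concatenate([features, new], axis=-1)
    return features

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8, 16))
out = dense_block(x, num_layers=4, growth_rate=12, rng=rng)
```

The channel count grows by `growth_rate` per layer (16 + 4 × 12 = 64 here), and the original input channels remain unchanged at the front of the output.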

MobileNet
Howard et al. [42] from Google developed the MobileNet architecture. The main inspiration for MobileNet comes from the IV3 network. It aims to solve resource constraint problems such as working on edge devices like NVIDIA Jetson (www.nvidia.com accessed 20 October 2021) or Raspberry Pi (from Raspberry Pi Foundation, Cambridge, UK). This architecture is a small, low-latency, and low-power model. It was the first computer vision model developed for TensorFlow for mobile devices. It contains 28 layers and uses the TFLite library. Figure 8 presents the MobileNet architecture. This model contains bottleneck residual blocks (BRBs), also referred to as inverted residual blocks, which are used for reducing the number of training parameters in the model.

XceptionNet
Chollet et al. [71] from Google proposed modifying IV3 by replacing the inception modules with modified depth-wise separable convolution layers. This architecture contains 36 layers. In comparison with IV3, XceptionNet is lightweight and contains the same number of parameters as IV3. This architecture outperforms InceptionV3 with a top-1 accuracy of 0.790 and a top-5 accuracy of 0.945. Figure 9 represents the architecture of XceptionNet.
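To see why depth-wise separable convolutions are considered lightweight, the weight counts of a standard convolution and its separable counterpart can be compared directly. The sketch below ignores bias terms, and the layer sizes are illustrative values rather than figures taken from the XceptionNet architecture.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out

def separable_conv_params(k, c_in, c_out):
    """Weights in a depth-wise separable convolution (bias ignored)."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes the channels
    return depthwise + pointwise

# Illustrative layer: 3 x 3 kernel, 128 input channels, 256 output channels.
std = standard_conv_params(3, 128, 256)   # 294912
sep = separable_conv_params(3, 128, 256)  # 33920
ratio = std / sep
```

For this layer the separable form needs roughly 8.7 times fewer weights than the standard convolution.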

AlexNet
Krizhevsky et al. [72] proposed AlexNet in 2012 for solving complicated ImageNet challenges. It was the first CNN architecture built for solving complex computer vision problems. This architecture achieves a top-5 error rate of 15.3% and shifted the paradigm of AI entirely. It takes a 256 × 256 image as input and contains five convolution layers followed by max-pooling, with two fully connected layers. The output layer is the softmax layer. The sample architecture is shown in Figure 10.

SqueezeNet
Iandola et al. [73] proposed a model 50× smaller than the AlexNet architecture. Nevertheless, the authors achieved 82.5% top-5 accuracy on ImageNet. This model contains a novel "Fire Module", which consists of a 1 × 1 filtered squeeze convolution layer fed to the "Expand Module", which contains a mix of 1 × 1 and 3 × 3 filters for convolution. The squeeze layer of the Fire Module helps to reduce the number of input channels fed to the 3 × 3 filters. The architecture of SqueezeNet and the Fire Module is shown in Figure 11. In this study, we transferred the trained weights to the SqueezeNet initial layers and fed our cohort at the end layer.
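The idea of transferring trained weights to the initial layers and training only the end layer on the cohort can be sketched in NumPy. This is a toy illustration, not the training code of the study: `W_frozen` is a random matrix standing in for pretrained weights, the data are synthetic, and the end layer is a single logistic unit.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for pretrained initial layers: fixed weights that are never updated.
W_frozen = rng.standard_normal((64, 16))

def features(x):
    """Frozen feature extractor (ReLU of a fixed linear map)."""
    return np.maximum(x @ W_frozen, 0.0)

def train_head(x, y, epochs=200, lr=0.1):
    """Fit only the final logistic layer on top of the frozen features."""
    f = features(x)
    f = f / (np.linalg.norm(f, axis=1, keepdims=True) + 1e-8)  # keep step size stable
    w, b = np.zeros(f.shape[1]), 0.0
    losses = []
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(f @ w + b)))
        losses.append(-np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12)))
        grad = p - y
        w -= lr * f.T @ grad / len(y)   # gradient step on the head weights only
        b -= lr * grad.mean()
    return w, b, losses

# Synthetic two-class data (hypothetical; 1 = symptomatic, 0 = asymptomatic).
x = rng.standard_normal((200, 64))
y = (x[:, 0] > 0).astype(float)
W_before = W_frozen.copy()
w_head, b_head, losses = train_head(x, y)
```

Only `w_head` and `b_head` change during training; the transferred weights in `W_frozen` stay exactly as they were, which is what keeps the number of trainable parameters small.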

Deep Learning Architecture: SuriNet
In our study, we benchmarked the TL architectures against two DL architectures. One is a conventional CNN and the other is the SuriNet architecture. Although the UNet network is very popular for segmentation in medical image analysis, we used a modified UNet architecture called SuriNet for classification purposes. In the proposed SuriNet architecture, we used separable convolution layers to reduce overfitting and the number of parameters required for training. Figure 12 shows the SuriNet architecture. Table 1 gives the detailed number of training parameters for SuriNet.

Experimental Protocol
Our study used 12 AI models (10 TL and 2 DL) with six augmentation folds and 1000 epochs using the K10 cross-validation protocol. This totals ~720,000 (720 K) runs for finding the optimization point of each AI model. The mean accuracy of each model is calculated as follows. If η(m, k) represents the accuracy of an AI model "m" using cross-validation combination "k" out of the total combinations K, then the mean accuracy over all the combinations for the model "m", represented by η(m), is given by Equation (1):
$$\eta(m) = \frac{1}{K}\sum_{k=1}^{K} \eta(m, k) \qquad (1)$$
Note that we considered the K10 protocol in our paradigm, so K = K10 = 10.
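The mean accuracy of Equation (1) and the run count can be checked with a few lines of Python. The per-combination accuracies below are hypothetical placeholders, not results from the study, and `mean_accuracy` is a name chosen for this sketch.

```python
import numpy as np

def mean_accuracy(fold_accuracies):
    """Equation (1): average of eta(m, k) over the K cross-validation combinations."""
    acc = np.asarray(fold_accuracies, dtype=float)
    return acc.mean(), acc.std()

# Hypothetical accuracies (%) of one model over K = 10 combinations.
eta_mk = [95.0, 97.5, 92.5, 100.0, 95.0, 97.5, 95.0, 92.5, 97.5, 97.5]
mean, std = mean_accuracy(eta_mk)

# Run count: 12 models x 6 augmentation folds x 10 combinations x 1000 epochs.
models, aug_folds, cv_combinations, epochs = 12, 6, 10, 1000
total_runs = models * aug_folds * cv_combinations * epochs
```

Reporting the standard deviation alongside the mean matches the "mean ± SD" form in which the accuracies are quoted.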

Performance Analysis and Visualization of SuriNet
The objective of this experiment was to evaluate the performance of SuriNet using Equation (1). In addition, because SuriNet is a DL model that is trained end-to-end on the target labels, we can visualize the intermediate layers' feature maps of symptomatic and asymptomatic plaques. For this visualization of the filters, we took the optimized augmentation fold and, out of its 10 cross-validation combinations, the combination with the best performance.

Results
This section discusses three sets of experiments comparing TL with DL to test the hypothesis. The first experiment is the 3D optimization of the ten TL architectures by varying the augmentation folds. The second experiment is the 3D optimization of the SuriNet architecture by varying the same folds. The third experiment is the benchmarking of the TL architectures against SuriNet and CNN by calculating the AUC.

3D Optimization of TL Architectures and Benchmarking against CNN
In this experiment, we used all the TL architectures to find the optimized TL by varying the augmentation folds. There are 10 TL architectures, 6 augmentation folds, the K10 cross-validation protocol, and 1000 epochs. Each model was trained up to the empirically selected point at which its loss-versus-accuracy curve flattened. Counting all 12 AI models (10 TL and 2 DL), there were 12 × 6 × 10 × 1000 = 720,000 (~720 K) runs to obtain the optimization point. This is a reasonably large number of computations and needs high computation power. Thus, we used the Nvidia DGX V100 supercomputer at Bennett University, Greater Noida. Figure 13 shows the performance of ten AI architectures, and the red arrow indicates the optimization point for each AI model when run over the six augmentation folds. The corresponding values are represented in Table 2. Using Equation (1), we calculate the mean accuracy of the AI models.
Diagnostics 2021, 11, x FOR PEER REVIEW 14 of 32
Figure 13. 3D bar chart representation of the AI model accuracy vs. augmentation folds. The light blue bar represents Aug 1×, the orange bar Aug 2×, the gray bar Aug 3×, the yellow bar Aug 4×, the dark blue bar Aug 5×, and the green bar Aug 6×; the red arrow represents the optimization point of each classifier.
As seen in Figure 13, MobileNet and DenseNet 169 show better accuracy than the other TL architectures, with 96.19% and 95.64% accuracy, respectively. Aug 2× is the optimization point for both models. Table 3 compares the ten types of TL, which include VGG16, VGG19, DenseNet121, XceptionNet, MobileNet, AlexNet, InceptionV3, and SqueezeNet, with seven types of DL, namely CNN5, CNN7, CNN9, CNN11, CNN13, CNN15, and SuriNet. Note that CNN5 to CNN15 were taken from our previous study [62]; the CNN architecture is elaborated in Appendix A.
In the SuriNet architecture, there are 22 layers, while the number of layers in the CNN architectures ranges from 5 to 15. It is important to note that all CNNs except CNN5 have accuracies above 92.27%. The overall mean and standard deviation of the DL accuracies was 90.86 ± 3.15%. The innovation of the current study is the design and development of the TL models, which are benchmarked against DL. In Table 3, the mean and standard deviation of the ten TL accuracies was 89.35 ± 2.54%. Thus, the mean accuracy of the TL systems is comparable to that of the DL systems, differing by about 1.5 percentage points.
MobileNet has the highest accuracy among all the TL systems (96.19%), while CNN11 has the highest accuracy among all the DL systems (95.66%). Further, it is essential to note that the mean accuracy variations are less than or equal to 3% within the limits of good design and operating conditions (typically, regulatory approved systems have variation of less than 5%).

3D Optimization of SuriNet
In this set of experiments, we used the popular UNet architecture model for classification. Figure 12 represents the SuriNet architecture inspired by UNet. We optimized SuriNet by varying the augmentation folds. Here, we also used the K10 CV protocol for training and testing. We chose 1000 epochs empirically. Therefore, the total number of runs for optimizing SuriNet is 60,000 (1 SuriNet × 6 Aug folds × 10 combinations × 1000 epochs). We used the same set of hardware resources as in the previous section for this experiment. Table 2 reports the average accuracy at each augmentation fold. SuriNet is optimized at Aug 5× with an accuracy of 92.77%.

Visualization of the SuriNet
We visualized the intermediate layers of SuriNet to understand the learning ability of the model on carotid ultrasound (CUS) images. Figure 14 represents the mean visualization of the training samples of the symptomatic and asymptomatic classes from all the filters at the end layer before vectorization. The turquoise color represents the learned features, yellow represents the high-level features, and green represents the low-level features.

Performance Evaluation
This section evaluates the number of samples required for the study using standard power analysis. Because we use 12 AI models (10 TL, 2 DL), we also rank the models across all the performance parameters to find the best performing AI model. In addition, we compare the performance of all 12 AI models by the area under the curve (AUC) of the receiver operating characteristic (ROC) curve.

Power Analysis
We used a standardized protocol (power analysis) to determine the number of samples required at a given margin of error. We considered a 95% confidence interval with a 5% margin of error and a data proportion of 0.5. We used Equation (2) below to compute the number of samples.
$$n = \frac{(z^{*})^{2}\, p\,(1-p)}{\mathrm{MoE}^{2}} \qquad (2)$$
Here, n is the number of samples (sample size), z* is the z score (1.96) from the z-table, MoE is the margin of error, and p represents the data proportion. In our study, we had a total of 2400 images. Using the power analysis, the total number of samples required for the study was 384. Thus, the 2400 images used in this study exceed the required sample size by a factor of about 6.25; equivalently, the required size is 84% smaller than the number of samples used.
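The sample-size arithmetic can be checked numerically. The sketch below assumes that Equation (2) is the standard margin-of-error formula n = (z*)² p (1 − p) / MoE², which reproduces the 384 samples reported here; the function name is hypothetical.

```python
def required_sample_size(z_score=1.96, margin_of_error=0.05, proportion=0.5):
    """Sample size for estimating a proportion at a given margin of error.

    Assumption: the standard formula n = z^2 * p * (1 - p) / MoE^2.
    """
    return (z_score ** 2) * proportion * (1.0 - proportion) / (margin_of_error ** 2)


n_required = required_sample_size()   # about 384.16 for the values in the text
n_available = 2400                    # images in the study

print(int(n_required))                        # truncated, as reported in the text
print(n_available / int(n_required))          # how many times the requirement is met
print(1.0 - int(n_required) / n_available)    # fraction by which the requirement is smaller
```

Rounding up instead of truncating would give 385, which does not change the conclusion that the cohort is several times larger than required.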

Ranking of AI Models
After obtaining the absolute values of the 12 AI models' performance metrics, we sorted the AI models in increasing order and then expressed each value as a percentage of the highest possible value of that attribute. We graded each percentage on a five-level scale. If the percentage was greater than 95%, we assigned four marks. If it was greater than 90% and less than 95%, we assigned three marks. If it was greater than 80% and less than 90%, we assigned two marks. If it was greater than 75%, we assigned one mark. Any lower percentage, whether above or below 50%, was assigned zero. The resultant rank table of the AI models is shown in Table 4. We color-coded each AI model on a band from red to green: a model with low performance is shown in red, and a model that performs well is shown in green. Please see Appendix B for the grading scheme.
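A minimal sketch of this grading scheme follows. It assumes the reading in which marks run from four (above 95%) down to zero, and it treats values falling exactly on a threshold as belonging to the lower band, which the text does not specify. The function names and the two-metric example are illustrative; the actual attributes are those of Table 4.

```python
def marks_for(value, best_possible):
    """Grade one metric as a percentage of its highest possible value."""
    pct = 100.0 * value / best_possible
    if pct > 95:
        return 4
    if pct > 90:
        return 3
    if pct > 80:
        return 2
    if pct > 75:
        return 1
    return 0  # everything at or below 75%


def rank_models(metrics, best_possible):
    """Sum the marks per model and sort from best to worst.

    metrics: {model: {attribute: value}}
    best_possible: {attribute: highest possible value}
    """
    totals = {
        model: sum(marks_for(v, best_possible[a]) for a, v in attrs.items())
        for model, attrs in metrics.items()
    }
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Accuracy (%) and AUC values quoted in the text for two of the models.
example = {
    "SuriNet": {"accuracy": 92.77, "auc": 0.927},
    "MobileNet": {"accuracy": 96.19, "auc": 0.961},
}
print(rank_models(example, {"accuracy": 100.0, "auc": 1.0}))
```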

AUC-ROC Analysis
We computed the area under the curve (AUC) for all the proposed AI models and compared the performance with our previous work [62], a CNN model with an accuracy of 95.66% and an AUC of 0.956. Figure 15 represents the ROC comparison of 10 AI methods. Among all the architectures, MobileNet showed the highest AUC, 0.961 (p < 0.0001), and better performance than the CNN [62].
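For readers unfamiliar with the metric, the AUC of a ROC curve equals the probability that a randomly chosen positive (here, symptomatic) sample receives a higher score than a randomly chosen negative one. The sketch below computes it with this pairwise definition; it is not the software used in the study, and the labels and scores are made-up toy values.

```python
def roc_auc(labels, scores):
    """AUC via pairwise comparison of positive and negative scores.

    labels: 1 for the positive class, 0 for the negative class.
    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy example: one symptomatic sample is scored below one asymptomatic sample.
print(roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1]))
```

The quadratic loop is fine for illustration; for large cohorts a rank-based implementation gives the same value more efficiently.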

Scientific Validation versus Clinical Validation
In this section, we discuss the validation of the hypothesis. Scientific validation was carried out by heatmap analysis using the TL-based Grad-CAM technique, and clinical validation was carried out using a correlation analysis of the biomarkers with AI.

Scientific Validation Using Heatmaps
We applied a visualization technique called gradient-weighted class activation mapping (Grad-CAM) to identify the diseased areas in the plaque cut sections using the VGG16 transfer learning architecture. Grad-CAM produces heatmaps based on the weights generated during training. Here, we take the feature maps of the final layer, which indicate the regions most relevant to the target label, and the heatmaps highlight these regions. Figures 16 and 17 represent the heatmaps of the nine patients of the symptomatic and asymptomatic classes. The dark red region represents the diseased region in symptomatic plaque, whereas it represents the higher-calcium area in asymptomatic plaque.
Grad-CAM works on the weights generated during the training phase, and the DL model captures the important regions of the target label. We compared the heatmaps with the original symptomatic and asymptomatic images. We observed that in symptomatic images the heatmaps exhibit a darker region surrounded by grayscale regions, whereas in asymptomatic images DL observes grayscale regions. Figure 17(a1,a2,b1,c1) shows the important regions observed by DL in symptomatic images, and Figure 17(d1,e1,e2,e3,f1,f2,f3) shows the important regions observed in asymptomatic images. This comparison supports our hypothesis that symptomatic plaques are hypoechoic and dark, whereas asymptomatic plaques are bright and hyperechoic.
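The core Grad-CAM computation can be illustrated without a deep learning framework. In the published formulation, each final-layer feature map is weighted by the spatial average of the gradient of the class score with respect to that map, the weighted maps are summed, and a ReLU keeps only the positively contributing regions. The sketch below assumes the feature maps and gradients have already been extracted from a trained network (for example VGG16); the tiny 2 × 2 arrays are hypothetical, and this is not the authors' implementation.

```python
def grad_cam(feature_maps, gradients):
    """Combine K feature maps (each H x W) into one normalized heatmap.

    feature_maps[k][i][j]: activation of channel k at pixel (i, j).
    gradients[k][i][j]: gradient of the class score w.r.t. that activation.
    """
    height = len(feature_maps[0])
    width = len(feature_maps[0][0])
    # Channel weights: global average of each channel's gradient.
    weights = [sum(sum(row) for row in g) / (height * width) for g in gradients]
    # Weighted sum of the feature maps.
    cam = [[0.0] * width for _ in range(height)]
    for w, fmap in zip(weights, feature_maps):
        for i in range(height):
            for j in range(width):
                cam[i][j] += w * fmap[i][j]
    # ReLU keeps only regions that support the target class.
    cam = [[max(0.0, v) for v in row] for row in cam]
    # Scale to [0, 1] so the map can be rendered as a color overlay.
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[v / peak for v in row] for row in cam]
    return cam


maps = [[[1.0, 2.0], [3.0, 4.0]], [[4.0, 3.0], [2.0, 1.0]]]
grads = [[[1.0, 1.0], [1.0, 1.0]], [[-1.0, -1.0], [-1.0, -1.0]]]
print(grad_cam(maps, grads))
```

In practice the low-resolution map is upsampled to the input image size before being overlaid as a heatmap.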

Correlation Analysis
We correlated all the biomarkers for the detection of risk with AI. Table 5 gives the correlation coefficients of all the biomarkers, which we computed using MedCalc. Among all the biomarkers, GSM versus FD shows a better p-value. We also computed the Euclidean distance (ED) between the centers of the two clusters (symptomatic and asymptomatic); Table 6 gives the ED between the two clusters. AI shows constant variation among all the techniques, whereas GSM with FD and higher-order spectra (HOS) shows the maximum distance. Figure 18 represents the correlation of AI (SuriNet), GSM, FD, and HOS. The clusters of symptomatic and asymptomatic plaques are represented in red and violet, respectively. The black dot represents the center of each cluster, and the ellipse on the cluster represents the high-density area. Figure 18 shows examples of (a) strong, (c) moderate, and (f) weak correlation between the biomarkers, alongside panels (b), (d), and (e).
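The distance between cluster centers can be computed as follows. This is a sketch under the assumption that each plaque is represented as a point in a two-biomarker plane (for instance GSM versus FD) and that the cluster center is the arithmetic mean of its points; the coordinates are invented for illustration and are not values from Table 6.

```python
import math


def centroid(points):
    """Arithmetic mean of a list of equal-length coordinate tuples."""
    n = len(points)
    dims = len(points[0])
    return tuple(sum(p[d] for p in points) / n for d in range(dims))


def euclidean_distance(a, b):
    """Straight-line distance between two points of the same dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical (biomarker 1, biomarker 2) pairs for the two classes.
symptomatic = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
asymptomatic = [(4.0, 5.0), (4.0, 5.0)]

print(euclidean_distance(centroid(symptomatic), centroid(asymptomatic)))
```

A larger distance between the two centers indicates that the biomarker pair separates the symptomatic and asymptomatic classes more clearly.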

Discussion
The proposed study is the first of its kind to use ten transfer learning models to classify and characterize symptomatic and asymptomatic carotid plaques. The proposed models, 10 TL and 1 DL (SuriNet), are optimized using augmentation folds with the K10 cross-validation protocol. The proposed MobileNet showed an accuracy of 96.19%, SuriNet achieved a comparatively high accuracy of 92.70%, and our previous study using CNN [62] showed 95.66%. Our overall performance analysis showed that TL performance is superior to that of the DL models.

Benchmarking
In this section, we benchmark the proposed system against the existing techniques [29,58–63,74–84]. Table 7 shows the benchmarking table, which is divided into ML-based and DL-based systems for PTC. The table has columns C1 to C6, where C1 represents the author and the corresponding year, C2 shows the selected features for that study, C3 shows the classifiers used for PTC, C4 displays the dataset size and country, and C5 and C6 give the type of AI model and the accuracy along with the AUC. Rows R1 to R17 represent the existing studies on PTC using CUS, while R18 and R19 represent the proposed studies. In row R1, Christodoulou et al. [76] extracted ten different Laws' texture energy features and fractal dimension features from the CUS and were able to characterize the plaque tissue with a diagnostic yield (DY) of 73.1% using SOM and 68.8% using k-NN. Mougiakakou et al. (2006) [44] (R2, C1) extracted first-order statistics and Laws' texture energy features from 108 US scans. The authors reduced the dimensionality of the extracted features using ANOVA and then fed the resultant features to neural networks with backpropagation and genetic architecture to classify symptomatic versus asymptomatic plaques. The authors achieved an accuracy of 99.18% and 94.48%, respectively. Seabra et al. [74] (R3, C1) extracted echo-morphological and texture features from 146 US scans. They then fused those features with clinical information, which was later used by an AdaBoost classifier to classify symptomatic versus asymptomatic plaques. The authors achieved 99.2% accuracy using leave-one-participant-out (LOPO) cross-validation. Christodoulou et al. [79] (R4, C1) extracted multiple features such as shape features, morphology features, histogram features, and correlogram features from 274 US scans, which were then used by two sets of classifiers, SOM and k-NN. The authors achieved an accuracy of 72.6% and 73.0%, respectively. Acharya et al. [58] (R5, C1) extracted texture-based features from the Cyprus cohort containing 346 carotid ultrasound scans, which were then fed to (a) an SVM classifier with RBF kernel and (b) an AdaBoost classifier. The authors achieved an accuracy of 82.48% and 81.7% with an AUC of 0.82 and 0.81, respectively. Kyriacou et al. [80] (R6, C1) developed a CAD system for predicting the period of stroke using binary logistic regression and SVM, which achieved 77%. Acharya et al. [59] (R7, C1) extracted texture-based features from 346 CUS scans and fed them to an SVM classifier, achieving an accuracy of 83.78%. The same authors in [60] (R8, C1) extracted discrete wavelet transform (DWT) features using the Cyprus cohort of 346 US scans and fed them to an SVM classifier, achieving an accuracy of 83.78%. Gastounioti et al. [61] (R9, C1) extracted Fisher discriminant ratio features from 56 CUS scans and fed them to an SVM classifier, achieving an accuracy of 88.08% with an AUC of 0.90. Molinari et al. [84] (R10, C1) used a data mining approach by taking bidimensional empirical mode decomposition and entropy features from 1173 CUS scans and then used an SVM classifier with RBF kernel for classification. The authors achieved an accuracy of 91.43%.
The second set of studies used DL models for PTC. Skandha et al. [62] (R11, C1) extracted automatic features using an optimized CNN from an augmented cohort of 346 patients. The authors achieved an accuracy of 95.66% and an AUC of 0.956 (p < 0.0001). The authors characterized the symptomatic versus asymptomatic plaques using mean feature strength, higher-order spectrum, and histogram analysis. Saba et al. [63] (R12, C1) used a randomized augmented cohort generated from the CUS of 346 patients with a 13-layer CNN and achieved an accuracy of 89% with an AUC of 0.9 (p < 0.0001).

Comparison of TL Models
TL architectures use pretrained weights to retrain the model for target label prediction. However, the training time of a TL architecture depends on the size of the pretrained weights and the hardware resources. The various TL models discussed in Table 6 each have advantages over the others, as explained in Tables 8 and 9.

Table 9. Similarities and differences between the TL models.

| Architecture | Key Findings | Similarities | Differences |
| --- | --- | --- | --- |
| AlexNet | First deep neural network using convolution. | | |
| VGG | Reducing the number of parameters in convolution and the training time. | | |
| InceptionV3 | Effective object detection for variable-size objects using kernels of different sizes in each layer. | | |
| ResNet | Solving the vanishing gradient problem in deep neural networks using skip (shortcut) connections. | | |
| MobileNet | The first model developed to support TensorFlow on edge devices using lightweight TensorFlow. | | |
| XceptionNet | Fast optimization and reduction of the trainable parameters in IV3 using depth-wise convolution. | | |
| DenseNet | Increasing the feed-forward nature of neural networks using dense layers that concatenate the features from previous layers. | | |
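The parameter reduction attributed to depth-wise convolution in XceptionNet (and likewise exploited by MobileNet) can be made concrete with a parameter count. The sketch below compares a standard convolution with a depthwise separable one, ignoring biases; the layer size (3 × 3 kernel, 64 input channels, 128 output channels) is a hypothetical example rather than a layer taken from either network.

```python
def standard_conv_params(kernel, c_in, c_out):
    """Weights in a standard convolution: one k x k x c_in filter per output channel."""
    return kernel * kernel * c_in * c_out


def depthwise_separable_params(kernel, c_in, c_out):
    """Weights in a depthwise separable convolution.

    Depthwise step: one k x k filter per input channel.
    Pointwise step: a 1 x 1 convolution mixing c_in channels into c_out.
    """
    depthwise = kernel * kernel * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise


standard = standard_conv_params(3, 64, 128)
separable = depthwise_separable_params(3, 64, 128)
print(standard, separable, round(standard / separable, 2))
```

For this example layer the separable form needs roughly one eighth of the weights, which is the kind of saving that makes such architectures faster to train and easier to deploy.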

Advantages of TL Models
The designs of the TL models have similarities and differences. These are explained in Table 9, along with the key findings of every TL model.

GUI Design
AtheroPoint™ developed the Atheromatic™ 2.0 TL system, a computer-aided diagnostic (CAD) system for stroke risk stratification. Figure 19 represents a screenshot of the CAD system. This CAD system provides the plaque risk and the heatmaps generated by Grad-CAM with the help of the TL/DL models. In the CAD system, the heatmap is generated on the test image once the trained model is selected.

Strengths/Weakness/Extensions
We evaluated the optimization point of the TL models against various augmentation folds and compared the performance of the TL models against that of DL models such as SuriNet and CNN. The TL model showed an improvement in symptomatic versus asymptomatic plaque classification accuracy. Furthermore, our Atheromatic™ 2.0 TL system predicts the risk of plaque and vulnerability using the color heatmaps on test scans.
Even though the power analysis suggests that we have enough samples for training, the main limitation of this study is the moderate cohort size. Another limitation is the limited availability of hardware resources such as supercomputers, especially in developing countries.
Our study used manual delineation of the ICA data sets. In the future, there could be a need to design an automated ICA segmentation system [85]. Another possibility would be to improve the CNN with an improved DCNN model in which the rectified linear unit (ReLU) activation function is modified to be differentiable at zero [38]. There are dense networks such as DenseNet121, DenseNet169, and DenseNet201 that could be tried and compared [39]. Further, one can combine hybrid deep learning models for PTC [86]. Finally, the proposed AI models can be extended to a big data framework by including other risk factors.

Figure 19. GUI screenshot of the Atheromatic™ 2.0 TL system with three example cases (a–c).

Conclusions
The proposed study is the first of its kind to characterize and classify carotid plaque using an optimized transfer learning approach and SuriNet (a class of Atheromatic™ 2.0 TL). Eleven AI models were implemented, and the best AUC was 0.961 (p < 0.0001) from MobileNet and 0.927 (p < 0.0001) from SuriNet. We validated the performance using grayscale median, fractal dimension, higher-order spectra, and spatial heatmaps. TL showed performance comparable to deep learning. The Atheromatic™ 2.0 TL model showed a performance improvement of 12.9% over the previous machine learning-based paradigm, Atheromatic™ 1.0 ML (AtheroPoint, Roseville, CA, USA). The system was validated with a widely accepted dataset.

Informed Consent Statement: Informed consent was obtained from all subjects involved in the study.

Conflicts of Interest:
Dr. Jasjit Suri is with AtheroPoint™, specializing in cardiovascular and stroke imaging. The rest of the authors declare no conflict of interest.

The global architecture of the deep convolutional neural network (DCNN) is shown in Figure A1. It is composed of four convolution layers followed by an average pooling layer, thus a total of nine layers. These are followed by a flatten layer that converts the 2D feature map to a 1D feature map. This is followed by two hidden dense layers consisting of 128 nodes. The final output is the softmax layer, which has two nodes representing the symptomatic class and the asymptomatic class. We chose the ReLU activation function for all the n − 1 layers, as ReLU helps in fast convergence to the solution compared with the sigmoid or tanh activation functions [87]. Equation (A1) gives the categorical cross-entropy loss function used in the experimentation for all the models.

$$L = -\sum_{i} y_i \log(a_i) \qquad (A1)$$
where $y_i$ is the class label for the input and $a_i$ is the predicted probability of the class being $y_i$.
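As a numerical illustration of the loss referred to as Equation (A1), the sketch below evaluates the standard categorical cross-entropy for a two-node softmax output. It assumes that form of the loss, with a one-hot class label y and predicted probabilities a; the function names and the small clipping constant `eps` are choices made for this example only.

```python
import math


def softmax(logits):
    """Convert raw output scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """Standard categorical cross-entropy: -sum_i y_i * log(a_i).

    y_true: one-hot label, e.g. [1, 0] for symptomatic.
    y_pred: predicted class probabilities.
    eps: lower clip on probabilities to avoid log(0) (an assumption of this sketch).
    """
    return -sum(y * math.log(max(a, eps)) for y, a in zip(y_true, y_pred))


probabilities = softmax([0.0, 0.0])   # an undecided two-class output
print(probabilities)
print(categorical_cross_entropy([1, 0], probabilities))
```

An undecided prediction of 0.5 for each class gives a loss of ln 2, and the loss approaches zero as the probability assigned to the true class approaches 1.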

Appendix A.2. 3-D Optimization of Deep Convolutional Neural Network Architecture
As the best performance of the DCNN model depends on the number of layers and the hyperparameters tuned [63], we considered several configurations of DCNN consisting of different combinations of convolution, average pooling, and dense layers. This required 3D optimization over accuracy, DCNN layers, and augmentation folds. Table A1 shows the six types of DCNN.

Table A1. Six types of DCNN models consisting of different combinations of convolution, average pooling, and dense layers. The total number of layers is shown as the number "X" at the end of DCNN in column 1.