Endocrine Tumor Classification via Machine-Learning-Based Elastography: A Systematic Scoping Review

Simple Summary: The incidence of endocrine cancers (e.g., thyroid, pancreatic, and adrenal) has been increasing, and these cancers carry a high premature mortality rate. Traditional medical imaging methods (e.g., MRI and CT) may not be sufficient for accurate cancer screening. Elastography complements conventional imaging modalities by identifying abnormal tissue stiffness in tumors, and machine learning techniques can further improve its accuracy and reliability. This review focuses on the applications and performance of machine-learning-based elastography in classifying endocrine tumors.

Abstract: Elastography complements traditional medical imaging modalities by mapping tissue stiffness to identify tumors in the endocrine system, and machine learning models can further improve diagnostic accuracy and reliability. Our objective in this review was to summarize the applications and performance of machine-learning-based elastography for the classification of endocrine tumors. Two authors independently searched electronic databases, including PubMed, Scopus, Web of Science, IEEE Xplore, CINAHL, and EMBASE. Eleven (n = 11) articles were eligible for the review, of which eight (n = 8) focused on thyroid tumors and three (n = 3) considered pancreatic tumors. All thyroid studies used shear-wave ultrasound elastography, whereas all pancreas studies applied strain elastography with endoscopy. Traditional machine learning approaches or deep feature extractors were used to extract predetermined features, followed by classifiers. The deep learning approaches applied included the convolutional neural network (CNN) and the multilayer perceptron (MLP). Some researchers considered mixed or sequential training of B-mode and elastographic ultrasound data, or fused data from different image segmentation techniques, in their machine learning models. All reviewed methods achieved an accuracy of ≥80%, but only three were ≥90% accurate.
The most accurate thyroid classification (94.70%) was achieved by a CNN with sequential training; the most accurate pancreas classification (98.26%) was achieved using a CNN–long short-term memory (LSTM) model integrating elastography with B-mode and Doppler images.

less prone to observer variability [40][41][42]. Machine learning (and deep learning) models play an important role in computer-aided diagnosis. Using mathematical and statistical tools, machine learning models extract and segment relevant features, interpret the output, and formulate a predictive model by correlating the data with the patients' diagnoses [43]. For example, machine learning was applied to contrast-enhanced CT to distinguish large adrenocortical carcinomas from other cortical lesions [44]. However, in addition to requiring a large dataset, some models may not have sufficient power to produce satisfactory performance in the image segmentation of a specific modality [45]. With the advancement of machine learning techniques, especially deep learning models, we anticipate that these techniques will also be applied in elastography for endocrine tumor classification [46].
The aim of this study was to provide a contemporary and comprehensive literature review on the application of machine-learning-based elastography to classify endocrine tumors, including thyroid, pancreas, adrenal, and pituitary tumors.

Search Strategy
In our systematic literature search, we followed the guidance of the Preferred Reporting Items for Systematic Review and Meta-Analysis Protocols Extension for Scoping Reviews (PRISMA-ScR) [47]. The search was conducted on PubMed (title/abstract, journal articles, English), SCOPUS (title/abstract/keywords), Web of Science (topic field, articles, English), IEEE Xplore (title/abstract/indexing terms), CINAHL via EBSCOhost (title/abstract/keywords), and EMBASE via OVID (title/abstract/author keywords, English).
Two authors (Y.-J.M. and L.-W.Z.) conducted independent searches in August 2022. The first author (Y.-J.M.) screened abstracts and full texts, which were checked by another author (L.-W.Z.). Any disagreement was resolved by seeking consensus with the corresponding authors.
The search was limited to original journal research articles published in English. The inclusion criteria were (1) application of elastography (in any modality) to classify endocrine tumors; (2) use of a deep learning or machine learning technique involving image segmentation, feature extraction, and classification; (3) classification of benign versus malignant tumors; (4) studies conducted on human subjects or existing human subject data; and (5) reporting of at least one classification-related performance measure. Studies were excluded if they (1) provided insufficient details on the machine learning model; (2) were modeled or evaluated using purely simulated data; or (3) classified metastasis.

Screening and Data Extraction
The PRISMA flowchart shown in Figure 1 illustrates the search and screening process for this systematic review. The review context included the basic information (Table 1), the configuration of the elastography system, image preprocessing and segmentation (Table 2), feature extraction and classification (Table 3), evaluation metrics, and performance (Table 4).

Search Results
As shown in Figure 1, the initial search yielded 90 articles. After the exclusion of duplicates, 56 articles remained. A preliminary screening of the titles and abstracts led to the removal of 44 articles for the following reasons: article type, n = 24; not related to the endocrine system, n = 3; no elastography, n = 9; no machine learning, n = 5; unrelated to tumor classification, n = 3. One article on the segmentation of metastatic lesions was excluded during full-text screening. In the end, 11 articles were eligible for data synthesis [48][49][50][51][52][53][54][55][56][57][58].

Basic Information and Dataset
The 11 articles involved a total of 5612 participants, with sample sizes ranging from 65 to 2032 and patient ages ranging from 15 to 90 years, as shown in Table 1. All except two studies were published in or after 2018. Although we covered thyroid, pancreas, adrenal, and pituitary tumors in the literature search, only studies on the pancreas and thyroid were found, accounting for three (n = 3) and eight (n = 8) studies, respectively. One study further classified tumors into pseudotumoral pancreatitis, neuroendocrine tumor, and ductal adenocarcinoma [54].
Women were generally more prevalent in thyroid research, whereas men were more prevalent in pancreas studies. Most of the studies (8/11) confirmed the diagnosis (i.e., ground truth) by biopsy, whereas the others did not specify how they determined malignancy. Furthermore, the lesion size of the tumors was not available in five articles, which can be an important factor in image processing and classification.
As shown in Table 2, all reviewed studies used ultrasound elastography; none applied magnetic resonance elastography. Of the 11 articles, shear-wave ultrasound elastography was used in six, all of which targeted thyroid tumors. In contrast, four articles used strain elastography, all of which targeted pancreatic tumors. Two studies did not provide sufficient information on the type of ultrasound elastography, and four articles did not provide the name/brand of the system.

Data Processing and Segmentation
As shown in Table 2, for data processing, one study [53] mentioned the application of a median filter for denoising, whereas two studies highlighted the process of contrast enhancement of the acquired images [54,58]. The remaining studies did not address any image processing or conditioning (excluding segmentation).
For data segmentation, delineating the region of interest (ROI) is one of the essential steps in image processing: it focuses attention on the clinically relevant regions and prevents irrelevant image areas from degrading the efficiency and accuracy of model training. Segmentation was often performed by radiologists manually contouring the tumor boundary with the assistance of software [42,53,54,59]. Alternatively, Pereira et al. [49] applied a threshold-based method to segment the regions with elastographic stress higher than 70% of the maximum stress, although this threshold level was not justified. Qin et al. [50] pre-extracted the ROI using a color transformation technique before the manual work by the radiologists.
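As an illustration, the threshold-based rule of Pereira et al. can be sketched in a few lines of NumPy; the 4 × 4 stress map and the function name below are ours, not from the reviewed study, and the 70% cutoff follows their description:

```python
import numpy as np

def threshold_roi(stress_map: np.ndarray, fraction: float = 0.7) -> np.ndarray:
    """Return a boolean ROI mask of pixels whose elastographic stress
    exceeds `fraction` of the maximum stress in the image."""
    cutoff = fraction * stress_map.max()
    return stress_map > cutoff

# Toy 4x4 "stress map": only values above 70% of the maximum (here, 7.0) survive.
stress = np.array([[1.0, 2.0, 3.0, 4.0],
                   [5.0, 8.0, 9.0, 6.0],
                   [4.0, 8.5, 10.0, 5.0],
                   [1.0, 2.0, 3.0, 2.0]])
mask = threshold_roi(stress)
```

In practice the mask would be cleaned with morphological operations before feature extraction, but the core rule is a single comparison.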
Elastographic images can also be segmented by overlaying segmented B-mode ultrasound images. Hu et al. [48] trained a real-time semantic segmentation model, PP-LiteSeg [60], on B-mode ultrasound images for segmentation. They then copied the segmented outline from the B-mode images to the elastographic images with different offsets. The accuracy of the PP-LiteSeg model was verified by radiologists using the Dice similarity coefficient, Cohen's kappa, and the 95% symmetric Hausdorff distance.
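Two of these agreement metrics can be computed directly from a pair of boolean masks. A minimal NumPy sketch of the Dice similarity coefficient and Cohen's kappa (the toy masks are illustrative, and the 95% Hausdorff distance is omitted for brevity):

```python
import numpy as np

def dice(a: np.ndarray, b: np.ndarray) -> float:
    """Dice similarity coefficient between two boolean masks."""
    a, b = a.astype(bool), b.astype(bool)
    inter = np.logical_and(a, b).sum()
    return 2.0 * inter / (a.sum() + b.sum())

def cohens_kappa(a: np.ndarray, b: np.ndarray) -> float:
    """Cohen's kappa for pixel-wise agreement of two boolean masks."""
    a, b = a.astype(bool).ravel(), b.astype(bool).ravel()
    po = np.mean(a == b)                     # observed agreement
    p_fg = a.mean() * b.mean()               # chance agreement on foreground
    p_bg = (1 - a.mean()) * (1 - b.mean())   # chance agreement on background
    pe = p_fg + p_bg
    return (po - pe) / (1 - pe)

pred = np.array([[1, 1, 0, 0], [1, 1, 0, 0]])  # model mask
ref  = np.array([[1, 1, 0, 0], [1, 0, 0, 0]])  # radiologist mask
```

Kappa corrects the raw pixel agreement for chance, which matters when the tumor occupies only a small fraction of the image.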
Data augmentation was implemented by a few authors. Traditional data augmentation involves random transformation, flipping, and scaling [48,50,54]; Hu et al. [48] also considered augmentation of the brightness, contrast, and saturation of the images. In addition, some researchers considered integrating different segmentation methods or combining lower- and higher-dimensional semantic features as a form of data augmentation [48,50].
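A minimal sketch of such a traditional augmentation pipeline (random flips plus brightness/contrast jitter), assuming grayscale images normalized to [0, 1]; the probabilities and jitter ranges are illustrative choices, not parameters from the reviewed studies:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(img: np.ndarray) -> np.ndarray:
    """Randomly flip an image and jitter its brightness and contrast."""
    out = img.copy()
    if rng.random() < 0.5:            # horizontal flip
        out = out[:, ::-1]
    if rng.random() < 0.5:            # vertical flip
        out = out[::-1, :]
    gain = rng.uniform(0.8, 1.2)      # contrast jitter
    bias = rng.uniform(-0.1, 0.1)     # brightness jitter
    return np.clip(gain * out + bias, 0.0, 1.0)

img = rng.random((64, 64))
batch = np.stack([augment(img) for _ in range(8)])  # 8 augmented copies
```

Each call yields a different variant of the same image, multiplying the effective training set size without new acquisitions.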

Feature Extraction and Data Fusion
Using predetermined statistical features is one of the common strategies applied in feature extraction. In addition, some researchers extracted features from B-mode ultrasound [49,53,54,55,57] and Doppler ultrasound [54] for tumor classification, as shown in Table 3.
The statistical features of elastographic images included the mean, standard deviation, range, and highest stress value [49,57]. Pereira et al. [49] also considered the number of pixels with a stress level greater than 80 kPa, though without justification. In addition, they applied the circular Hough transform to obtain further features, including the largest radius detected, the largest value of the accumulator array, and the radius corresponding to the largest value in the accumulator array [49]. Zhou et al. [58] extracted features based on the gray-level co-occurrence matrix and gray-level run-length matrix (GLCOM-GLRLM), as well as the multiple subgraph co-occurrence matrix based on multilevel wavelets (MSCOM). GLCOM-GLRLM represented the length of the highest highlight run continuously distributed in the image, whereas MSCOM was used to mark image areas with stripe-like textures [58].
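The first-order statistical features above reduce an elastogram to a short feature vector. A NumPy sketch, where the count of pixels above 80 kPa follows Pereira et al. [49] and the toy stress values are ours:

```python
import numpy as np

def elastogram_features(stress_kpa: np.ndarray) -> dict:
    """First-order statistical features of an elastographic stress map (kPa)."""
    return {
        "mean": float(stress_kpa.mean()),
        "std": float(stress_kpa.std()),
        "range": float(stress_kpa.max() - stress_kpa.min()),
        "max": float(stress_kpa.max()),
        # Pereira et al.'s count of high-stress pixels (cutoff not justified there).
        "n_above_80kpa": int((stress_kpa > 80.0).sum()),
    }

stress = np.array([[10.0, 90.0],
                   [100.0, 40.0]])
feats = elastogram_features(stress)
```

Texture descriptors such as GLCOM-GLRLM would be appended to this vector in the same way before classification.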
Radiomics features, which differ from statistical features in that they are generally ordinal or categorical data classified by radiologists, were also considered in these studies. For example, researchers [56] identified the shape and smoothness of a nodule, the nature of the calcification, and the vascularity (on the four-grade Alder classification scheme). The radiomics feature extraction process was also automated using IFoundry software (Intelligence Foundry 1.2, GE Healthcare), which considered 6940 radiomics features in six classes [53]. Similarly, Sun et al. [53] automatically extracted features using the Python package Pyradiomics [61].
Zhao et al. [57] applied machine-learning-assisted feature extraction to filter the predetermined statistical features based on their levels of importance (i.e., a feature reduction process) using the random forest algorithm. Additionally, Sun et al. [53] used the VGGNet-16 model [62] as a deep feature extractor on elastographic images; notably, the team adopted a predetermined feature extraction approach on the B-mode images. Lastly, in six studies, researchers used a deep learning approach [48][49][50][51][52][54] in which feature extraction and classification were nested and streamlined in an end-to-end manner.
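Importance-based feature reduction of the kind used by Zhao et al. can be sketched with scikit-learn's random forest, assuming tabular features are already extracted; the synthetic data and the mean-importance cutoff below are illustrative choices, not the study's protocol:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic dataset: 200 samples, 10 features; only feature 0 is informative.
X = rng.normal(size=(200, 10))
y = (X[:, 0] > 0).astype(int)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
importances = forest.feature_importances_

# Keep only features whose importance exceeds the mean importance.
keep = np.flatnonzero(importances > importances.mean())
X_reduced = X[:, keep]
```

The reduced matrix `X_reduced` then feeds the downstream classifier, trimming noisy features that would otherwise dilute the model.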

Classification and Modeling
For traditional machine learning studies with separate feature extraction and classification processes, a broad spectrum of classifiers or statistical models was explored, such as logistic regression, decision trees, and naïve Bayes (Table 3). Notably, some researchers adopted a traditional machine learning approach (with separate feature extraction and classification processes) but used deep learning models as either deep feature extractors or classifiers, for example, using a convolutional neural network (CNN) as a deep feature extractor connected to a k-nearest neighbor (KNN) or extreme gradient boosting (XGBoost) classifier.
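A hybrid pipeline of this kind, deep features followed by a conventional classifier, can be sketched as follows; here the "deep features" are stand-in Gaussian clusters rather than real CNN embeddings, and the KNN is a plain NumPy implementation:

```python
import numpy as np

def knn_predict(train_feats, train_labels, query_feats, k=3):
    """Classify each query by majority vote of its k nearest training features."""
    preds = []
    for q in query_feats:
        dists = np.linalg.norm(train_feats - q, axis=1)
        nearest = train_labels[np.argsort(dists)[:k]]
        preds.append(int(np.bincount(nearest).argmax()))
    return np.array(preds)

# Stand-in for CNN embeddings: two well-separated clusters of "deep features".
rng = np.random.default_rng(1)
benign = rng.normal(loc=0.0, scale=0.3, size=(20, 8))     # class 0
malignant = rng.normal(loc=3.0, scale=0.3, size=(20, 8))  # class 1
feats = np.vstack([benign, malignant])
labels = np.array([0] * 20 + [1] * 20)

# Two queries near each cluster center.
queries = np.vstack([np.zeros((1, 8)), np.full((1, 8), 3.0)])
pred = knn_predict(feats, labels, queries, k=3)
```

In the reviewed studies, `feats` would come from the penultimate layer of a trained CNN, with KNN or XGBoost replacing the network's own classification head.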
Among the deep learning approaches, the CNN and the multilayer perceptron (MLP) were the typical methods considered. The CNN receives image data as input, whereas the MLP takes the flattened hue histogram matrix of the elastographic images [51,52]. Hu et al. [48] constructed a series of CNN models using data from different segmentation settings. They applied stochastic gradient descent with a momentum of 0.9 and a weight decay of 1 × 10⁻⁴, with cross-entropy as the loss function. The models were trained with a batch size of 128 and a learning rate of 0.01. Deep learning models were often pretrained on large public datasets, among which the ImageNet database [63] was commonly used. Pretraining relieves the sample size demand in the actual training and can speed up convergence, especially during the early training stages [64].
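The optimizer settings reported by Hu et al. translate into a simple update rule. A NumPy sketch of SGD with momentum and L2 weight decay, applied to a toy quadratic loss (the loss is our illustration; the hyperparameter values are theirs):

```python
import numpy as np

# Hyperparameters as reported: learning rate 0.01, momentum 0.9, weight decay 1e-4.
LR, MOMENTUM, WEIGHT_DECAY = 0.01, 0.9, 1e-4

def sgd_step(w, grad, velocity):
    """One SGD update with momentum; weight decay adds an L2 penalty term
    to the gradient before the momentum update."""
    grad = grad + WEIGHT_DECAY * w
    velocity = MOMENTUM * velocity - LR * grad
    return w + velocity, velocity

# Minimise the toy loss L(w) = 0.5 * w^2, whose gradient is simply w.
w, v = np.array([5.0]), np.zeros(1)
for _ in range(200):
    w, v = sgd_step(w, grad=w, velocity=v)
```

The momentum term accumulates past gradients, which smooths the trajectory; the weight decay term shrinks the parameters toward zero as a regularizer.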
Some compelling model architectures are worth discussing, particularly data fusion techniques in machine learning models. Sun et al. [53] separately trained machine learning models on elastography and B-mode data and joined the two models with an uncertainty decision-theory-based voting system consisting of pessimistic, optimistic, and compromise approaches [65]. Pereira et al. [49] averaged the class probabilities of the B-mode- and elastography-data-trained models, which resembled the compromise approach of the voting system. Moreover, they applied a grid search on the weighted cross-entropy loss to determine the dropout probability and learning rate.
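The compromise-style fusion of Pereira et al., averaging the class probabilities of the two models, amounts to a few lines of NumPy; the per-sample probability tables below are hypothetical:

```python
import numpy as np

def fuse_by_averaging(prob_bmode: np.ndarray, prob_elasto: np.ndarray) -> np.ndarray:
    """Average per-class probabilities from two models and take the argmax."""
    fused = (prob_bmode + prob_elasto) / 2.0
    return fused.argmax(axis=1)

# Hypothetical class probabilities (benign, malignant) from two modality models.
p_bmode  = np.array([[0.70, 0.30],
                     [0.40, 0.60]])
p_elasto = np.array([[0.60, 0.40],
                     [0.20, 0.80]])
labels = fuse_by_averaging(p_bmode, p_elasto)  # 0 = benign, 1 = malignant
```

A pessimistic or optimistic variant would instead take the element-wise minimum or maximum of the two probability tables before the argmax.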
Udriștoiu et al. [54] constructed a CNN and LSTM model using sequential ultrasound B-mode, elastographic, and Doppler images trained for 50 epochs; the branches were then merged by a concatenation layer. Qin et al. [50] investigated the differences among fusion methods, including mixed training fine-tuning, fusion followed by feature re-extraction, and feature extraction followed by refusion. In addition, they compared fully connected layers, spatial pyramid pooling, and global average pooling for the classification layers [50].
To evaluate the model, in six studies, researchers divided the data into training and testing sets, whereas in one study, an external testing set was also used to improve generalizability [57]. In three studies, the authors adopted a cross-validation approach; in one study, both cross-validation and data-slicing approaches were implemented. In two studies, the validation method was not addressed.

Classification Performance
In the majority of the reviewed articles, the authors explored and compared the classification performance of different models or model architectures; Table 4 presents the best-performing model (or the model featured in the abstract) for each study. Accuracy and the area under the receiver operating characteristic curve (AUC) were the primary outcomes and were presented in all but one article. The AUC evaluates model performance across different thresholds for a binary classifier and represents the discriminatory power of a predictive model to distinguish between events and nonevents. Of the 10 articles reporting the accuracy measure, all methods attained an accuracy higher than 80%, but only three exceeded 90% [50,54,57]. Methods for the pancreas appeared to be more accurate than those for the thyroid: the accuracy of methods in thyroid studies ranged from 83% to 94.7%, whereas that in pancreas studies ranged from 84.27% to 98.26%. Additionally, all methods but one obtained a discriminatory power (via AUC) of more than 90%.
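The AUC can be computed without choosing a threshold via its rank-statistic (Mann-Whitney) formulation: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. A small NumPy sketch with toy labels and scores:

```python
import numpy as np

def auc_score(y_true: np.ndarray, y_score: np.ndarray) -> float:
    """AUC as the probability that a random positive outranks a random negative
    (Mann-Whitney U formulation; ties are counted as half a win)."""
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

# Toy example: 2 benign (0) and 2 malignant (1) cases with model scores.
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])
auc = auc_score(y_true, y_score)
```

This pairwise form makes explicit why AUC, unlike accuracy, is insensitive to the decision threshold and to moderate class imbalance.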
Qin et al. [50] and Udriștoiu et al. [54] created the two models with the highest accuracy and discriminatory power; both models accounted for data fusion inside the deep learning model training process.

Discussion
In this contemporary review, we explored the applications of machine learning models of elastography for identifying tumors in the endocrine system. However, we found only ultrasound elastography in this review. Magnetic resonance elastography is available to facilitate the diagnosis of thyroid and pancreatic cancer but might not yet be ready to incorporate into computer-aided diagnosis, such as machine learning models [66][67][68]. In addition, only thyroid and pancreas tumors were captured by our review, which were assessed in B-mode with shear-wave ultrasound elastography and endoscopic ultrasound strain elastography, respectively. The difference is due to organ location: the thyroid is superficially located, whereas the pancreas is deep-seated and requires endoscopic access. The use of elastography was not reported for other endocrine organs, such as the adrenal gland and pituitary, because they are not accessible or are beyond the detection depth of the elastography probe [69,70].
Traditional machine learning and deep learning models were common approaches in computer-aided diagnosis; in several studies, researchers combined different approaches to innovatively create unique model architectures, especially via data fusion. Ultrasound elastography often comes with B-mode ultrasound images carrying spatial information. Hu et al. [48] segmented elastography images by overlaying segmented B-mode images and compiling images with different segmentation approaches in a machine learning model. We also found different feature extraction strategies for B-mode and elastography images, with a mixture of predetermined statistical or radiomics features and deep feature extractors. Pereira et al. [49] and Sun et al. [53] developed separate machine learning models for B-mode and elastography images and estimated the outcomes by averaging the probability outputs of the models. Moreover, Qin et al. [50] and Udriștoiu et al. [54] adopted a data fusion approach by integrating the data using mixed/sequential training and a concatenation layer, respectively, which yielded superior classification performance compared with the other methods in this review.
Reporting quality is an important attribute of publications, and studies of machine learning models are no exception [71]. Some reviewed articles did not present adequate details on the participants and protocols, which hinders the replication and interpretation of findings. Two of the eleven studies did not specify the ground truth reference for the diagnosis. Four studies did not present the demographic information of the patients. Five studies did not report the size of the tumor, which may affect the accuracy of segmentation. Furthermore, two studies reported neither the training-testing data division nor cross-validation in the model performance evaluation.

The methodological quality of machine learning studies was also of particular concern, especially for those with small sample sizes [72]. Some journals may focus on the innovation of the modeling or architecture and may impose less stringent requirements on small-dataset studies [73,74]. In this review, we found studies with dataset sizes of 60 to 70 subjects over two to three classes, which would be deemed insufficient. Data augmentation, transfer learning, and cross-validation are acceptable measures to accommodate the limitations of sample size and to handle over-fitting and convergence problems [50,73,75,76].

Additionally, imbalanced dataset classification is one of the pervasive challenges in machine learning. All studies in this review suffered from imbalanced class sizes, which might distort the validity of the performance evaluation. Only one study accounted for the imbalanced class sizes, using a bootstrapping approach [44].

Hyperparameters are parameters that must be configured before the model learning process, as opposed to the parameters learned during training [77]; the performance metrics of a model may be overly dependent on the tuning of its hyperparameters [78].
The number of trees and nodes in random forest classifiers, the number of neighbors (k) in KNN models, and the number of layers in MLP models are typical hyperparameters. For deep learning, grid and random searches were common approaches to selecting the optimal combination of multiple hyperparameters, with random search generally requiring less time and computation than an exhaustive grid [79]. We decided not to conduct a thematic and qualitative analysis of the selection of hyperparameters and optimization strategies (e.g., loss functions and cross-entropy), which deserves a standalone in-depth review in an engineering paper. However, three studies did not address their hyperparameter tuning strategies, and other studies mentioned their optimization strategies without the confirmed values of the hyperparameters, or vice versa.

This review has some limitations. Due to language bias, some relevant research published in languages other than English could have been missed. Moreover, we only included journal articles indexed in the mentioned electronic databases, which we considered to be of higher quality but which might constitute selection bias. In addition, we did not conduct a formal methodological quality assessment of the eligible articles because the focus of the studies was heterogeneous, which would have affected their efforts and focus in the direction of reporting. For example, some studies were based more on clinical applications, whereas others targeted innovations in system development. Furthermore, our search results included articles with terms related to machine learning models; however, some boundaries between machine learning, advanced signal processing, and statistical techniques are ambiguous. Some studies may have been missed or their eligibility may have been difficult to determine, such as those using logistic regression [80].
Machine-learning-based ultrasound elastography is a recent technological advancement in the field, as most of the articles were published after 2018. In addition to statistical models, progress can be observed in the direction of deep learning models, mixed and sequential training, etc. Image processing and denoising play an important role in subsequent medical image analysis [81] but were less discussed in the reviewed studies. Machine learning or deep learning can also be applied to image denoising, segmentation, and augmentation. For example, generative adversarial networks (GANs) have proven effective in semantic segmentation and generative image modeling for medical imaging [82,83], whereas linear combinations of datasets can be applied for data augmentation [84]. Additionally, we anticipate that integrating 3D B-mode ultrasound and 3D elastography will be a future trend for improving visualization and, thus, decision-making, as well as providing a complete profile of feature information.
Several core challenges face this field. Our review showed that the application of machine learning model technology remains at an initial stage. Although most reviewed articles are contemporary, cutting-edge models were not used; researchers are still using non-deep-learning approaches, and features were mainly predetermined and relied on manual harvesting. Moreover, due to the size of the probe, its penetration power, and the constraints of shear-wave generation, elastography has not been applied to organs such as the adrenal and pituitary glands, as demonstrated in our review. Existing modalities also tend to approach their physical limits of resolution. Combining measurements with other physical properties may enhance our understanding of the features of tumors.

Conclusions
In this review, we summarized the applications and protocols of machine learning models on elastography to identify tumors in the endocrine system. Shear-wave ultrasound elastography has been applied to assess thyroid tumors, whereas strain elastography with endoscopy has been used for diagnosing pancreatic tumors. All machine learning approaches achieved an accuracy of ≥80%, and three studies reported an accuracy of ≥90%.

Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.

Data Availability Statement:
The data can be shared upon request.

Conflicts of Interest:
The authors declare no conflict of interest.