Breast Tumour Classification Using Ultrasound Elastography with Machine Learning: A Systematic Scoping Review

Simple Summary

Breast cancer is one of the most common cancers among women globally. Early and accurate screening of breast tumours can improve survival. Ultrasound elastography is a non-invasive and non-ionizing imaging approach to characterize lesions for breast cancer screening, while machine learning techniques could improve the accuracy and reliability of computer-aided diagnosis. This review focuses on the state-of-the-art development and application of machine learning models in breast tumour classification.

Abstract

Ultrasound elastography can quantify the stiffness distribution of tissue lesions and complements conventional B-mode ultrasound for breast cancer screening. Recently, the development of computer-aided diagnosis has improved the reliability of the system, whilst the inception of machine learning, such as deep learning, has further extended its power by facilitating automated segmentation and tumour classification. The objective of this review was to summarize the applications of machine learning models in ultrasound elastography systems for breast tumour classification. Review databases included PubMed, Web of Science, CINAHL, and EMBASE. Thirteen (n = 13) articles were eligible for review. Shear-wave elastography was investigated in six articles, whereas seven studies focused on strain elastography (five freehand and two acoustic radiation force). A traditional computer vision workflow was common in strain elastography, with separate image segmentation, feature extraction, and classifier functions using different algorithm-based methods, neural networks, or support vector machines (SVM). Shear-wave elastography studies often adopted the deep learning model, the convolutional neural network (CNN), which integrates these functional tasks. All of the reviewed articles achieved a sensitivity ≥80%, while only half of them attained an acceptable specificity ≥95%. Deep learning models did not necessarily perform better than the traditional computer vision workflow. Nevertheless, there were inconsistencies and insufficiencies in reporting and calculation, such as the testing dataset, cross-validation, and methods to avoid overfitting. Most of the studies did not report the loss or hyperparameters. Future studies may consider using a deep network with an attention layer to locate the targeted object automatically, and online training to facilitate efficient re-training on sequential data.


Introduction
Breast cancer is a leading cause of death among women, with the second-highest mortality rate among cancers affecting women [1][2][3]. Breast cancer has surpassed liver cancer to become the fourth most commonly diagnosed cancer, with new cases increasing from 0.3 million in 2015 to 0.42 million in 2020 [4]. It also ranks highest in cancer incidence [4]. One in every four cancer cases in females is breast cancer, while breast cancer accounts for one in six cancer deaths [5]. The financial burden of breast cancer is enormous. Women with breast cancer spend USD 13,000 more on healthcare annually than those without breast cancer. In the United States, the cost of breast cancer screening exceeded USD 1 billion annually in 2006 [6] but was believed to be cost-effective in improving health benefits and reducing deaths [7]. Accurate screening and early diagnosis could lead to early and effective prevention, and could explain why developed countries have higher survival rates than developing countries [1,3,8].
While breast self-examination using manual palpation is promoted, the clinical mammogram remains the primary modality for asymptomatic breast cancer screening, with proven clinical benefit in reducing the mortality rate [9,10]. However, the ionizing radiation of mammograms may add carcinogenic risks and has been blamed for frequent overdiagnosis [11,12]. In addition, breast magnetic resonance imaging (MRI) is used to diagnose primary malignancy and perform preoperative evaluations with high accuracy [13,14]. However, both mammograms and breast MRI are confined to the hospital setting and may not be suitable for large-cohort screening because of their high cost and complicated operation [15]. This is of particular concern for developing countries with limited healthcare resources but higher breast cancer mortality [3,8,16].
Real-time B-mode ultrasound has emerged as an alternative imaging technique, despite the fact that small tumours can be challenging to identify and may be occluded by the sternum and ribs [17]. In addition, speckle noise and low contrast in B-mode images may impede the observation of features that identify potential abnormalities. Integrated with another ultrasound imaging approach, ultrasound elastography can measure and quantify the stiffness distribution or differences of soft tissue for tumour detection, under the premise that malignant breast lesions exhibit higher shear elasticity [18]. Ultrasound elastography was pioneered by Ophir et al. [19] in 1991. This elasticity imaging technique complements conventional B-mode imaging by superimposing stiffness measures onto the spatial information. Radiologists can conduct the assessment or diagnosis based on the Breast Imaging Reporting and Data System (BI-RADS) protocol [20]. With the development of the extended combined autocorrelation method for lesion tracking, real-time freehand strain elastography demonstrated good diagnostic performance in differentiating benign and malignant lesions [21]. Later, real-time shear wave elastography was invented in an attempt to remedy the problems of manual palpation [22], while some researchers further advanced the technique by incorporating colour Doppler into shear wave imaging to improve the visualization of the shear wave wavefront [23]. Nowadays, ultrasound imaging with elastography has improved the sensitivity of small breast tumour detection [24], demonstrated high specificity for breast cancer diagnosis, and become one of the prior examinations before invasive breast biopsy [25].
There are still limitations of integrated B-mode ultrasound and elastography in breast tumour detection. The operation of ultrasound is highly dependent on the physician's experience [26]. Measurement errors due to inter- and intra-observer variability in probe placement/orientation and annotation are undeniable [26,27]. Moreover, it can be difficult to distinguish the lesion boundary between normal and tumour tissue, and that between benign and malignant lesions. The accuracy of the malignancy scoring system could be jeopardized by necrosis and liquefaction in malignant lesions, or mechanization and calcification in benign lesions [28,29].
In light of these system weaknesses, computer-aided diagnosis (CAD) has been developed to improve the reliability of the system, facilitated by the identification of critical image features by medical experts. Machine learning approaches, such as deep learning, can improve the objectivity and reliability of feature identification and annotation, further extending the strength of CAD by enabling automated segmentation and staging of breast tumours [30,31]. To this end, the objective of this study was to review the methods and accuracy performance of state-of-the-art machine learning techniques used in ultrasound elastography for breast tumour classification, and to shed light on the improvement of CAD for early and accurate screening of breast cancer.

Search Strategy
A systematic literature search was performed to review diagnostic studies involving breast cancer screening or breast tumour detection using ultrasound elastography and machine learning techniques. The literature search was conducted according to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses Extension for Scoping Reviews (PRISMA-ScR) guidelines [32]. The search was performed on databases including PubMed (title/abstract, journal articles, English), Web of Science (topic field, articles, English), CINAHL via EBSCOhost (default field), and EMBASE via OVID (topic field, English). Two authors (Y.-J.M. and D.W.-C.W.) conducted independent searches in November 2021. The first author (Y.-J.M.) screened the abstracts and full texts, which were checked by the corresponding author (D.W.-C.W.). Any disagreement was resolved by seeking consensus with the other corresponding author (J.C.-W.C.).
The search was limited to original journal research articles in English. The inclusion criteria were: (1) screening by both B-mode ultrasound and ultrasound elastography; (2) a machine learning technique in image segmentation, feature extraction, or classification; (3) a diagnostic/screening accuracy test to classify benign and malignant breast tumours; (4) a test involving and evaluated on human subject data; (5) at least one accuracy performance measure. Studies were excluded if they: (1) targeted axillary lymph node breast cancer; (2) used non-machine-learning techniques in all three of the aforementioned aspects; (3) had insufficient details on the machine learning model; (4) involved an additional modality other than B-mode ultrasound and elastography; (5) were modelled or evaluated on simulation data.

Screening and Data Extraction
The search and screening process for the systematic review is shown in Figure 1. There was no disagreement among the authors in the selection of studies for the review. The review covered basic information on subjects and datasets (Table 1); the configuration of the ultrasound system, image pre-processing, and segmentation (Table 2); feature extraction, fusion, reduction, and classification (Table 3); and evaluation metrics and performance (Table 4).

Basic Information and Dataset
The 13 articles involved a total of 1988 participants with a dataset of 3216 tumour images (1708 benign and 1508 malignant), as shown in Table 1. The sample size for patients ranged from 80 to 363, while all studies had at least 100 image samples. It should be noted that articles from the same research team were likely to share the same set of participants or source data, based on the demographic information; for example, the articles from the research teams of Sasikala et al. [37,38], Wu et al. [39,40], and Zhang et al. [42,43]. There was also a mismatch between the sample size of patients and dataset images, which, as justified by a few articles, could be due to multiple lesions from the same patient. Based on the available data, the age range was from 16 to 97. Most articles (11 out of 13) indicated that the diagnosis (reference standard, or ground truth) of a benign or malignant lesion was made by biopsy or histopathology. Among them, three articles noted that biopsy tests were conducted only for those screened positive by ultrasound or other modes of examination. The lesion size information of seven articles was not available and could be an influencing factor for classification performance. Equal numbers of studies collected data retrospectively and prospectively (n = 6 each), while one study did not present the respective details [33]. Four studies specified the proportion of data used for model training and independent testing, which was approximately 75% to 80% for model training [34,35,44,45]. Two of them involved an additional dataset for external testing [44,45], and one dataset was sourced from a hospital different from that of the model training dataset [44]. Five studies neither addressed the division of the model training and testing datasets nor described a cross-validation, while two studies used cross-validation [42,43]. Cross-validation directs different proportions of data for training and testing on different iterations [46]. For example, a 5-fold cross-validation splits the dataset into 5 folds of equal size; four folds are used to train the model and one fold is used for testing, and the process is repeated for each fold. Similarly, leave-one-out cross-validation picks one sample for testing and repeats the process until all samples are exhausted. Essentially, the performance evaluation is computed as the average performance over the iterations. In total, nearly half (n = 6) of the studies applied cross-validation, as shown in Table 3.
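For illustration, the following is a minimal Python sketch of 5-fold cross-validation with scikit-learn. This is our own example, not drawn from any reviewed study; a synthetic matrix stands in for extracted image features, and the SVM classifier and scoring choice are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for an extracted feature matrix (0 = benign, 1 = malignant)
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# 5-fold cross-validation: each fold serves once as the held-out split,
# and performance is reported as the average over the five iterations
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(SVC(), X, y, cv=cv, scoring="accuracy")
print(f"mean accuracy across folds: {scores.mean():.3f}")
```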
To "enlarge" the sample size for model training, the data augmentation technique is often used in the field of machine learning to facilitate convergence and robustness. As shown in Table 3, five studies implemented the data augmentation procedure [34,35,44,45]. The classic data augmentation technique involves image flipping, random rotation, and rescaling.

Ultrasound Elastography
Out of the 13 articles, six applied shear wave elastography (SWE), while the other seven involved strain elastography (SE) driven by freehand (FH) externally applied force (n = 5) or acoustic radiation force (ARF) (n = 2), as shown in Table 2. SE estimates the elastic modulus as the ratio of a known force over a compression area to the ultrasound-measured depth change (strain) of the soft tissue [47]. The system targets lesions near the surface, at about 5 cm depth [48], and has the advantage of convenient real-time strain visualization [48]. However, the externally applied compression is conducted freehand, so the data collection quality may depend on the operator's experience and be subject to interobserver variability [49]. The semiquantitative compensation of this problem by B-mode ultrasound may hinder estimation of the exact elasticity values [50,51]. Some researchers have also attempted to generate three-dimensional elastography from SE images [52]. ARF-based SE remedies this problem with a controlled pushing pulse that induces tissue displacement, followed by an ultrasound pulse to capture the position and displacement of the tissue; it is more effective than freehand SE in measuring deeper tissues [48].
SWE induces a shear wave and measures its propagation speed (c), which depends on the density (ρ) and elastic modulus (E) of the tissue: E = 3ρc² [48,53]. The strength of SWE lies in its reproducibility and the mapping of tissue elasticity onto the morphological information of the B-mode ultrasound, which improves the specificity of B-mode ultrasound without losing sensitivity [54,55], despite a higher cost. Stiffer, nonhomogeneous masses are more likely to be malignant [54]. Therefore, examining the peritumoral region could be more important than examining the lesion region itself [56].
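As a worked example of this relationship (assuming a soft tissue density of about 1000 kg/m³, a common approximation rather than a value reported by the reviewed studies):

```python
def youngs_modulus_kpa(c_m_per_s: float, rho_kg_per_m3: float = 1000.0) -> float:
    """Young's modulus from shear wave speed, E = 3 * rho * c**2, assuming a
    linear, isotropic, incompressible medium; the default density of
    1000 kg/m^3 is an assumed soft-tissue approximation."""
    e_pascal = 3.0 * rho_kg_per_m3 * c_m_per_s ** 2
    return e_pascal / 1000.0  # convert Pa to kPa

# e.g. c = 3 m/s -> E = 3 * 1000 * 9 = 27,000 Pa = 27 kPa
print(youngs_modulus_kpa(3.0))  # 27.0
```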

Image Pre-Processing and Segmentation
Image pre-processing techniques could involve cropping, resampling, denoising, conversion, and image separation, although some studies only briefly described them within their routine procedures. Among the studies, Misra et al. [35] compared the model performance with and without image cropping. Zhang et al. [42,43] and Zhou et al. [45] isolated and extracted the pure shear wave elastography data for analysis by a technique (image separation) that subtracted the B-mode grayscale image from the composite colour image and then calibrated the elasticity modulus [57,58]. Wu and colleagues attempted two different pre-processing techniques (Harris corner operation and fractional order operation) in two publications [39,40]. The fractional order operation adopted a multiscale image approach to enhance the higher-frequency components of the images (i.e., edge information) [59], while the Harris corner operation implemented the filter through convolution with a structured tensor [60].
For image segmentation, there could be manual segmentation, algorithm-based segmentation, deep learning models (bypassing image segmentation), or a mixture of these methods. Moon et al. [36] conducted manual segmentation of the region-of-interest (ROI) by radiologists without any pre-processing technique. Two papers involved manual segmentation after different pre-processing techniques [39,40]. Another article implemented manual segmentation and algorithm-based segmentation together [41]. Level sets and fuzzy level sets were algorithm-based methods that used a threshold or fuzzy-threshold level segmentation and were applied in five articles.
Sometimes, image pre-processing and segmentation procedures were indistinguishable because some pre-processing techniques were essential steps to facilitate or reduce the burden of segmentation, such as image cropping and contouring. Anisotropic diffusion filtering with sticks, speckle reducing anisotropic diffusion (SRAD), Gabor-based anisotropic diffusion (GAD), and active contours were common processes to remove speckle noise using an edge-sensitive technique computed as a function of the local gradient or entropy magnitude [61], while they were also regarded as image segmentation procedures.
Additionally, Zhang et al. [43] merged the GAD with reaction diffusion (RD) based level set segmentation. A notable contribution of Yu et al. [41] was a series of pre-processing steps, including k-means clustering, active contour, and the dyadic wavelet transform. The dyadic wavelet transform initialized the image into an energy field that could achieve a sufficient signal-to-noise ratio to drive the active contour, after which the region was smoothed by GAD and refined by k-means clustering [41].
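To make the edge-sensitive principle concrete, below is a minimal sketch of classic Perona-Malik anisotropic diffusion, the basic scheme on which SRAD/GAD variants build. This is our own illustration, not a method from the reviewed studies; the parameter values and the wrap-around boundary handling via np.roll are simplifying assumptions.

```python
import numpy as np

def perona_malik(img, n_iter=20, kappa=30.0, gamma=0.1):
    """Classic Perona-Malik anisotropic diffusion: the conduction coefficient
    decays with the local gradient magnitude, so smoothing is strong in
    homogeneous (speckled) regions and weak across edges."""
    out = img.astype(np.float64).copy()
    for _ in range(n_iter):
        # finite differences toward the four neighbours
        dN = np.roll(out, -1, axis=0) - out
        dS = np.roll(out, 1, axis=0) - out
        dE = np.roll(out, -1, axis=1) - out
        dW = np.roll(out, 1, axis=1) - out
        # exponential edge-stopping function (small conduction across edges)
        cN, cS = np.exp(-(dN / kappa) ** 2), np.exp(-(dS / kappa) ** 2)
        cE, cW = np.exp(-(dE / kappa) ** 2), np.exp(-(dW / kappa) ** 2)
        out += gamma * (cN * dN + cS * dS + cE * dE + cW * dW)
    return out
```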
For the evaluation of image segmentation, some studies applied and evaluated the performance of manual segmentation [41,44]. Based on spatial overlap, the dice similarity coefficient was used to evaluate the intra- and inter-rater reproducibility of segmentation [62], in addition to accuracy performance measures [43]. In contrast, some studies applied algorithm-based segmentation evaluated against manual segmentation as the reference [41,43]. Chen et al. [33] considered the detected edges of the segmented images acceptable based on empirical verification by experienced radiologists. Distance-based measures, such as the mean absolute distance, were used for evaluation in two articles [33,41].

Feature Extraction, Fusion, and Reduction
Generally, feature extraction and classification in the studies were based on one of two approaches, or a mixture of the two. The first approach was a deep learning workflow that embedded all relevant functions (image segmentation, feature extraction/reduction, classification) into the machine learning or deep learning model [63], particularly the CNN. The second approach was to configure the feature extraction and classifier separately, also known as the traditional computer vision workflow [63].
For feature extraction, three studies pre-determined the features to be used for classification [33,36,41], as shown in Table 3. Feature extraction techniques were generally based on the image representation, such as pixel, intensity, or grey level. They included the local binary pattern (LBP) [37,38], local ternary pattern (LTP) [37], grey level co-occurrence matrix (GLCM) [38], grey level difference method (GLDM) [38], Laws' texture energy measures [38], point-wise gated Boltzmann machine (PGBM) with restricted Boltzmann machine (RBM) [42], contourlet-based texture feature extraction [43], Harris corner convolution [39], and fractional order convolution [40]. On the one hand, a unique point of the contourlet-based texture feature extraction was that it integrated the tumour elasticity in the spatial-frequency domain with the morphological features for better classification [43]. On the other hand, PGBM utilized a gating mechanism with a stochastic switch unit to estimate whether a feature pattern occurred [42]. In addition, if the extracted features were radiomic parameters, least absolute shrinkage and selection operator (LASSO) regression could be applied to weigh the selected features for reduction [44].
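As a hedged illustration of two of these texture descriptors (LBP and GLCM), the following sketch uses scikit-image; the neighbourhood, offset, and histogram settings are illustrative assumptions, not the configurations used in the reviewed articles.

```python
import numpy as np
from skimage.feature import local_binary_pattern, graycomatrix, graycoprops

def texture_features(roi_uint8):
    """Concatenate an LBP histogram with two GLCM statistics for one ROI."""
    # Uniform LBP with 8 neighbours at radius 1 (values span 0..9)
    lbp = local_binary_pattern(roi_uint8, P=8, R=1, method="uniform")
    hist, _ = np.histogram(lbp, bins=np.arange(11), density=True)
    # GLCM at a single offset/angle; symmetric and normalized
    glcm = graycomatrix(roi_uint8, distances=[1], angles=[0],
                        levels=256, symmetric=True, normed=True)
    contrast = graycoprops(glcm, "contrast")[0, 0]
    homogeneity = graycoprops(glcm, "homogeneity")[0, 0]
    return np.concatenate([hist, [contrast, homogeneity]])

# Demo on a synthetic 8-bit ROI
rng = np.random.default_rng(0)
roi = rng.integers(0, 256, size=(64, 64), dtype=np.uint8)
print(texture_features(roi).shape)  # (12,)
```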
Feature fusion could be implemented using serial fusion, parallel fusion, or particle swarm optimization (PSO). Instead of feature fusion, Wu and colleagues [39,40] applied the PSO model to improve model learning only, whilst Sasikala et al. [38] used an optimum path forest (OPF) to optimize the performance of PSO. The number of extracted features could be large, as many as 286, as demonstrated by Zhang et al. [42]. Feature reduction could be achieved by principal component analysis (PCA) [37], canonical correlation analysis (CCA) [37], a deep polynomial network (DPN) [43], or multiple kernel learning (MKL) [43]. The advantage of the novel DPN was that it weights and identifies high-level features over multiple output layers, which enables effective learning from small samples [43].
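For example, a minimal PCA-based reduction of a 286-dimensional feature matrix (the dimensionality reported by Zhang et al. [42]; the synthetic data and the 95% variance threshold are illustrative assumptions) could look like this in scikit-learn:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
features = rng.normal(size=(120, 286))  # e.g. 286 extracted features, as in [42]

# Keep the leading components that together explain 95% of the variance
pca = PCA(n_components=0.95)
reduced = pca.fit_transform(features)
print(features.shape, "->", reduced.shape)
```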

Classification
The support vector machine (SVM) was often used as the binary classifier with previously confirmed extracted features (n = 6), as shown in Table 3. The SVM was recognized as the most robust and accurate classifier before the advent of deep learning [64]. It classifies the data by a hyperplane in a space whose dimensionality is on the order of the number of features. Other classifiers included the random decision forest [39], multilayer perceptron neural network (MPNN) [36], Bayesian classification [36], and generalized regression neural network (GRNN) [39,40].
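A minimal sketch of such an SVM classifier with scikit-learn is shown below (synthetic features; the RBF kernel and C value are illustrative assumptions, not settings from the reviewed studies):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for extracted features (0 = benign, 1 = malignant)
X, y = make_classification(n_samples=200, n_features=30, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# RBF-kernel SVM: a maximum-margin hyperplane in the kernel-induced space
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
clf.fit(X_train, y_train)
print(f"held-out accuracy: {clf.score(X_test, y_test):.3f}")
```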

Deep Learning
As mentioned in Section 4.3, the deep learning model, particularly the CNN in this review, embedded all relevant functions (image segmentation, feature extraction/reduction, classification) and minimized manual procedures or decision-making. The basic principle of the CNN is to train a kernel (or filter) to recognize specific image features (convolution layer) [63]. The model then computes the level of feature overlap between the kernel and the input image (known as the receptive field), followed by a pooling layer for higher-level features and a fully connected layer to flatten the data into a feature vector [65]. The output layer of the model computes the probability of the output class through a dense network and a regression function [66]. Fujioka et al. [34] and Misra et al. [35] embedded all relevant functions using a deep learning model, the CNN. Before training the CNN, the authors pre-trained the model (transfer learning) on ImageNet (https://www.image-net.org, accessed on 20 December 2021), a free image database organized according to the WordNet hierarchy [67], which has been recognized as the most commonly used dataset [68,69].
The transfer learning process trains the model on an existing large dataset before learning the specific scenario. Nevertheless, Fujioka et al. [34] and Misra et al. [35] took different approaches in using CNNs. Fujioka et al. [34] attempted and compared a pool of different CNN models, including Xception [70], InceptionV3 [71], InceptionResNetV2 [72], DenseNet121 [73], DenseNet161 [74], and NASNetMobile [73]. In contrast, Misra et al. [35] selected two CNN models (AlexNet [75] and ResNet [76]) and integrated the models and ultrasound modalities (i.e., B-mode and SWE) by ensemble learning. On the other hand, Zhang et al. [44] and Zhou et al. [45] configured the feature extraction and classifier separately, despite the application of a CNN. A basic introduction to the different models is available in another scoping review [68].
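As a hedged sketch of this transfer learning recipe in PyTorch (ResNet-18 is chosen here only for illustration, and torchvision ≥ 0.13 is assumed; the reviewed studies used the architectures listed above with their own training configurations):

```python
import torch
import torch.nn as nn
from torchvision import models

# Backbone pre-trained on ImageNet; the final layer is replaced for
# binary benign/malignant classification.
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
for param in model.parameters():
    param.requires_grad = False                # freeze the pre-trained backbone
model.fc = nn.Linear(model.fc.in_features, 2)  # new trainable classification head

logits = model(torch.randn(1, 3, 224, 224))    # dummy forward pass
print(logits.shape)                            # torch.Size([1, 2])
```

Whether to freeze the backbone or fine-tune all layers is a design choice; freezing trades flexibility for lower data and compute demands.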

Evaluation Metrics
The evaluation metrics used in the articles were the same as the diagnostic metrics used in epidemiology, as shown in Figure 2. Sensitivity (or true positive rate) indicates the proportion of samples with the condition that receive a positive test result, while specificity (or true negative rate) indicates the proportion of samples without the condition that receive a negative test result. The positive predictive value (PPV) is the probability of having the condition given a positive test result, while the negative predictive value (NPV) is the probability of not having the condition given a negative test result. Accuracy is the fraction of correct test results over the total number of tests. However, this measure fails to account for the ratio between positive and negative cases and is thus not recommended for the highly imbalanced class problems that commonly appear in health science [77]. Recall and precision are two essential evaluation parameters in data science, equivalent to sensitivity and PPV, respectively. The different nomenclature is due to the concept of "relevance" in information retrieval: recall indicates the fraction of relevant instances that are retrieved, while precision is the fraction of retrieved instances that are relevant. The combination of recall and precision establishes several evaluation metrics. The F1-score is the harmonic mean of recall and precision; the balanced classification rate (BCR) is the geometric mean (G-mean) of recall and precision, used to avoid overfitting the negative class and underfitting the positive class [38]. The Matthews correlation coefficient (MCC) was proposed by Brian Matthews in 1975 [78] and is believed to be the most informative single metric for the evaluation of binary classifiers [79]. It quantifies the association between the ground truth and the prediction (test value) and is equivalent to the Phi coefficient in Pearson's chi-squared statistics.
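All of the metrics above derive from the 2x2 confusion matrix. The sketch below computes them for an invented confusion matrix (illustrative numbers only), with BCR defined as the geometric mean of recall and precision, following [38]:

```python
import numpy as np

def binary_metrics(tp, fp, tn, fn):
    """Diagnostic metrics from a 2x2 confusion matrix."""
    sensitivity = tp / (tp + fn)          # recall / true positive rate
    specificity = tn / (tn + fp)          # true negative rate
    ppv = tp / (tp + fp)                  # precision
    npv = tn / (tn + fn)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * sensitivity * ppv / (sensitivity + ppv)  # harmonic mean
    bcr = np.sqrt(sensitivity * ppv)                  # geometric mean, per [38]
    mcc = (tp * tn - fp * fn) / np.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return dict(sensitivity=sensitivity, specificity=specificity, ppv=ppv,
                npv=npv, accuracy=accuracy, f1=f1, bcr=bcr, mcc=mcc)

# Invented example: 90 true positives, 5 false positives, 95 true negatives,
# 10 false negatives
print(binary_metrics(tp=90, fp=5, tn=95, fn=10))
```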
The receiver-operating characteristic (ROC) curve is a standard tool to present the true positive rate as a function of the false positive rate over the continuum of all classification cut-off values. The area under the ROC curve (AUC) represents the probability of the classifier correctly ranking a pair of randomly drawn positive and negative instances [80]. It serves as an overall indicator of discrimination capability, whilst Youden's index (YI) evaluates the ability to avoid misclassification [35,37-40].
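A short sketch of the AUC and Youden's index with scikit-learn (toy scores for illustration); YI = sensitivity + specificity − 1 = TPR − FPR, and the cut-off maximizing it is often reported:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Toy ground truth and classifier scores, for illustration only
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.9, 0.6, 0.2])

auc = roc_auc_score(y_true, y_score)
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# Youden's index J = TPR - FPR; its maximum marks the optimal cut-off
j = tpr - fpr
best_threshold = thresholds[np.argmax(j)]
print(f"AUC = {auc:.3f}, Youden-optimal threshold = {best_threshold:.2f}")
```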
In biostatistics and epidemiology, a prediction or test is considered reliable with sensitivity ≥ 80%, specificity ≥ 95%, and PPV ≥ 95% [81,82]. As a rule of thumb, AUC ≥ 0.85 and 0.75 ≤ AUC < 0.85 are considered convincing and partially convincing performance, respectively [83]. For machine learning or deep learning, we believe that an accuracy or F-score ≥ 90% is acceptable, while ≥ 95% is good, on the premise that human labellers (ground truth) achieve 99% accuracy and the best model network achieves 95% accuracy on ImageNet [84].

Test Performance
The evaluation of models and systems in the articles often came with a comparison over different stages and aspects, which could be generally categorized into image pre-processing [34], image segmentation [37,39,40,42,43], feature extraction/reduction [37-45], and classifier/classifier settings [35,36,44,45]. Some studies compared multiple factors and levels. For example, Sasikala et al. [37] compared the performance between combinations of different feature extraction (LBP vs. LTP), feature fusion (serial vs. parallel), and reduction (PCA vs. CCA) techniques; Zhang et al. [42] compared the performance between combinations of different image segmentation (level set vs. PGBM vs. PGBM with RBM), feature reduction (PCA vs. t-test vs. no reduction), and classifier (ELM vs. KNN vs. SVM) settings. Table 4 highlights the results of either the proposed model or the best-performing model in each article. Nearly all articles applied sensitivity/recall and specificity as the primary outcomes. Five studies used the F1-score to evaluate the model. Of the 10 articles with available accuracy measures, the models of seven achieved an accuracy ≥ 90%. All models in the articles had a sensitivity ≥ 80%, while only half attained an acceptable specificity (i.e., ≥ 95%). Interestingly, cases classified wrongly by the model were also misdiagnosed by radiologists [34]. All models with a reported AUC (n = 6) demonstrated convincing classification performance. Deep learning models [34,35,44,45] did not necessarily perform better than the traditional computer vision approach.
Zhang et al. [44] reported a "perfect" test or model with 100% sensitivity and specificity and an AUC of 1.0. It should be noted that the evaluation metrics could be affected by overfitting, when the model fits too closely to the training dataset. Cross-validation is one way to prevent overfitting [85], yet some studies did not address how they handled overfitting or did not mention which dataset they used to calculate the evaluation metrics [33,34,36,37]. Moreover, the definition or calculation of the evaluation metrics could differ, such as cross-validation with different proportions [38-43] or testing datasets with different sample sizes [44,45]. Their findings may therefore not be comparable, although some research targeted minimizing manual operation rather than superior accuracy [41].

Remarks
Reporting quality is an essential component in the quality assessment of articles, including investigations of machine learning [86]. More than half of the articles (9/13) clearly indicated the reference standard of the diagnosis (ground truth); nonetheless, a few (2/9) stated that the diagnostic test was only conducted for those screened positive, which could be misleading if the screening test had a low specificity. Out of the 13 articles, three specified neither the training and testing dataset derivation nor cross-validation. One study applied an external testing set to improve generalizability [44]. Additionally, a few studies did not describe the demographic data (4/12) or lesion size (6/12), while two studies provided the details for the subgroups of the training and testing sets [34,44], and two studies for the subgroups of benign and malignant lesions [39,40]. Four studies included information relating to the loss function or hyperparameters, though not all studies were applicable to those parameters. This information reflects how the training behaviour of the model is controlled and has a significant impact on model performance [87].
It should be noted that there were blatant examples of terminological confusion regarding the training, testing, and validation datasets, while some studies were guilty of model peeking (i.e., the testing dataset was not completely separated from model training) [88]. The testing dataset should always be held out for the assessment of the performance of the final tuned model only [89,90]. The training dataset is used for model learning, basically via fitting the parameters of the classifier [89,90]. The validation dataset is used to optimize model training by fine-tuning the hyperparameters and may serve as an intermediate evaluation. In the case of cross-validation (a resampling approach), the training, validation, and testing datasets are nested without a fixed data split [91]; this has been recommended for small sample sizes (e.g., <100), though it remains controversial. Furthermore, Yusuf et al. [86] briefly noted that the nomenclature differs among communities: the validation set in the medical research community is equivalent to the testing set in the field of machine learning [86].
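A minimal sketch of a leakage-free split is shown below (our illustration; the 60/20/20 proportions are an assumption, not a recommendation from the reviewed articles):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# Hold out the test set first, so it never influences fitting or tuning,
# then carve a validation set out of the remainder (60/20/20 overall)
X_tmp, X_test, y_tmp, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.25, stratify=y_tmp, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 180 60 60
```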
Segmentation-based methods could lead to the loss of peri-tumour and surrounding tissue information. The strain ratio between the surrounding tissue and the lesion is an important feature for classification and cannot be calculated when the information of the surrounding tissue is unknown. Moreover, inputting images without segmentation into a deep network demands more computational resources and may lead to non-convergence or poor accuracy. Therefore, cropping an ROI of reasonable size to encompass the lesion and surrounding tissue is necessary. In fact, ultrasound has more difficulty preserving peri-tumour tissue because of limitations in image contrast, spatial resolution, and speckle noise. Pre-processing techniques, in particular smoothing, could overcome these limitations and are important to both automatic and manual segmentation. Nevertheless, the speckle information is a collection of echogenicity that reflects three-dimensional spatial information of the surrounding tissue, even though the image is two-dimensional. Speckle thus contains morphological information of the surrounding tissues and has been used to estimate the motion of the ultrasound probe, such as speckle decorrelation for three-dimensional reconstruction [92]. Moreover, the speckle "noise" could be extracted by the deep learning network as an important feature, while a smoothing filter may weaken the irregular edge features. Thus, it is controversial to completely smooth the image in the pre-processing stage.
We anticipate an evolution of feature extraction techniques in deep learning, such that raw images are input instead of smoothed and segmented images. It should also be noted that image compression, such as JPEG, may degrade the image quality and details [35]. A fuzzy level set method was used to accommodate the ambiguity and inhomogeneity of the image, which could be superior to the existing level set method [37,38]. We believe that the deep learning network could be more adaptive to noise during the image segmentation process.
In general, our review found that ultrasound elastography with machine learning proceeded either by traditional computer vision (traditional machine learning) or by the deep learning approach. Traditional computer vision handles the different functions of the workflow separately with different methods, such as manual or algorithm-based segmentation, and ends with a classifier, while the deep learning model, in particular the CNN, integrates all the tasks [63]. Deep learning models are generally regarded as more reliable and better performing than traditional algorithm-based methods or computer vision workflows, albeit more time-consuming. Instead of being programmed with hand-crafted features, deep learning models adopt an end-to-end learning approach, trained on a class-annotated dataset to establish the most descriptive and salient features from the images [63]. For traditional computer vision, an expert in biomedical science, imaging, and computing is required to determine and justify the features to be extracted and the feature extraction methods, which could be a trial-and-error process requiring extensive time for fine-tuning and would be problematic in cases involving a plethora of features [63]. In addition, algorithms are more domain-specific, whereas models can always be retrained on another dataset.
Traditional computer vision techniques are not without benefits. They are more computationally efficient and do not necessarily perform worse than deep learning models, as demonstrated in our review. Deep learning models have very demanding computational requirements and need big datasets, but lack explainability. The most common dataset, ImageNet, consists of over 1.5 million images across thousands of object categories [93], normally leveraged by the models through transfer learning. The lack of a large dataset may yield overfitting issues or reduce external validity, which is often overlooked [94]. The full transparency of algorithm-based methods is also superior to the inscrutable black-box model for obtaining physical meaning from the features and better insights into potential problems with the solutions, which could be imperative for clinicians [95]. Learning models would be confined not only to "garbage in, garbage out" [96], but also to "garbage learnt".
There were some limitations to this review. First of all, the review was confined to journal articles written in English, which may lead to selection bias. In fact, much research in the computing fields is published as full conference papers; nevertheless, extensive effort would be needed to screen conference materials for peer-reviewed full papers with sufficient context and quality. Secondly, we did not conduct a quantitative synthesis or meta-analysis of the diagnostic/screening performance in this review, even though the studies had common evaluation metrics. There was high heterogeneity among the studies in the methods and datasets used to generate the evaluation metrics, such as cross-validation, external validation, or loss functions. Moreover, a number of studies did not account for overfitting in their models, which could overestimate the accuracy performance. A meta-analysis would likely mislead readers in comparisons between systems and models. Furthermore, we confined the elastography review to strain and shear wave elastography, although the incorporation of ultrasound Doppler has received attention and requires the development of specific machine learning techniques [23].
The attention layer [97] is increasingly applied in deep networks such as U-Net [98] to improve segmentation performance. It mimics the human cognitive attention function of focusing on a particular object. A deep learning network with an attention layer could guide the model to focus on a particular object in the image during the learning process. That approach could replace the segmentation process and improve the effectiveness of the learning and the relevance of the extracted features. Currently, all input data are processed and prepared before training; if new data arrive, the model needs to be retrained on the full dataset. An online training method could be adopted, such that the model can be re-learnt and updated with sequential future data without retraining on the whole dataset [99].
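As a hedged sketch of the online (incremental) training idea using scikit-learn's partial_fit interface (synthetic batches; this is one possible realization of incremental learning, not the specific method of [99]):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
X1, y1 = rng.normal(size=(100, 20)), rng.integers(0, 2, 100)  # initial batch
X2, y2 = rng.normal(size=(30, 20)), rng.integers(0, 2, 30)    # data arriving later

clf = SGDClassifier()
# First call must declare all classes so later batches need not contain both
clf.partial_fit(X1, y1, classes=np.array([0, 1]))
# Incremental update on newly arriving data, without retraining from scratch
clf.partial_fit(X2, y2)
print(clf.predict(X2[:5]))
```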