Ensemble Learning of Multiple Models Using Deep Learning for Multiclass Classification of Ultrasound Images of Hepatic Masses

Ultrasound (US) is often used to diagnose liver masses. Ensemble learning has recently become common in image classification, but its detailed methods are not fully optimized. The purpose of this study was to investigate and compare the usefulness of several ensemble learning and ensemble pruning techniques that combine multiple trained convolutional neural network (CNN) models for the image classification of liver masses in US images. The dataset of US images was classified into four categories: benign liver tumor (BLT, 6320 images), liver cyst (LCY, 2320 images), metastatic liver cancer (MLC, 9720 images), and primary liver cancer (PLC, 7840 images). In this study, 250 test images were randomly selected for each class, for a total of 1000 images, and the remaining images were used as the training set. Sixteen different CNNs were used for training and testing the ultrasound images. The ensemble learning methods used were soft voting (SV), weighted average voting (WAV), weighted hard voting (WHV), and stacking (ST). All four types of ensemble learning (SV, ST, WAV, and WHV) showed higher accuracy than a single CNN, and all four also showed significantly higher deep learning (DL) performance than ResNeXt101 alone. For the image classification of liver masses using US images, ensemble learning improved the performance of DL over a single CNN.


Introduction
Hepatocellular carcinoma (HCC) is the most common primary liver cancer, the sixth most common cancer worldwide, and the second leading cause of cancer death, resulting in almost 800,000 deaths annually worldwide [1][2][3]. Survival of HCC is mainly influenced by the stage of disease at the time of diagnosis [4]. Ultrasonography (USG) is a simple, noninvasive tool with no risk of radiation exposure. USG is a safe, portable, relatively inexpensive, and easily accessible imaging modality, making it a useful diagnostic and monitoring tool in medicine [5,6]. Conventional USG is usually the first choice for the surveillance of HCC [4,7,8]. The reported sensitivity of US alone for HCC diagnosis ranges from 60 to 90%, with excellent specificity of over 90% [9,10]. Ultrasonography is often used to screen the liver, especially for HCC, due to its low cost, accessibility, and lack of X-ray exposure [11].
There are published studies of artificial intelligence (AI) in USG, covering the thyroid, breast, abdomen, and pelvic area, as well as obstetric, gynecological, cardiac, and vascular applications. AI-based image classification of liver masses has been studied in computed tomography (CT) [12,13], and AI-based image classification of colorectal cancer liver metastases has been studied in magnetic resonance imaging (MRI) [14]. Focal liver lesions (FLLs) are described as abnormal portions of the liver that are primarily derived from hepatocytes, biliary epithelium, and mesenchymal tissue [15]. Ultrasonography is the preferred method for the initial evaluation of FLLs.

Dataset
Images in each of the four classes were collected from 227 BLT cases, 200 LCY cases, 210 MLC cases, and 191 PLC cases. To check for similarity among test images of lesions collected from the same case, the test images for each class were hashed and the Hamming distance was measured to ensure that there were no similar images. Two images were judged not similar if the bit agreement between their hash values was less than 80% [50]. The training images consisted of 6970 BLT images, 2310 LCY images, 9470 MLC images, and 7590 PLC images. Pixel information in US grayscale images is given by numbers ranging from 0 to 255, with 0 being black and the pixels becoming brighter (white) as the value approaches 255. All US image data were standardized by converting the arrays to 64-bit floating-point numbers and dividing by 255, so that each pixel value was a number in the range 0 to 1. The study was approved by the Institutional Ethical Committee.
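As an illustration of the similarity check and the pixel standardization described above, the following is a minimal sketch assuming the third-party imagehash and Pillow libraries (the paper does not name its hashing implementation); the randomly generated images stand in for US test images.

```python
# Minimal sketch: perceptual hashing with an 80% bit-agreement threshold [50],
# plus the 64-bit float / divide-by-255 standardization described above.
import numpy as np
import imagehash
from PIL import Image

rng = np.random.default_rng(0)
img_a = Image.fromarray(rng.integers(0, 256, (256, 256), dtype=np.uint8))
img_b = Image.fromarray(rng.integers(0, 256, (256, 256), dtype=np.uint8))

def bit_agreement(im1, im2, hash_size=16):
    """Fraction of matching bits between the perceptual hashes of two images."""
    h1 = imagehash.phash(im1, hash_size=hash_size)
    h2 = imagehash.phash(im2, hash_size=hash_size)
    # imagehash defines '-' between hashes as the Hamming distance.
    return 1.0 - (h1 - h2) / float(hash_size * hash_size)

# Two images are judged "not similar" when bit agreement is below 80%.
print("not similar" if bit_agreement(img_a, img_b) < 0.80 else "similar")

# Standardization: convert to float64 and divide by 255 so pixels lie in [0, 1].
arr = np.asarray(img_a, dtype=np.float64) / 255.0
```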

Training of 16 CNNs
Sixteen different CNN models, fine-tuned either by replacing only the fully connected layer or by replacing part of the convolutional layers together with the fully connected layer, were used for training and testing the ultrasound images (Figure 2). EfficientNetB0-B6 [51] were pre-trained on ImageNet with Noisy Student Training and then fine-tuned, and the other nine models were pre-trained on ImageNet and then fine-tuned. Noisy Student Training extends the ideas of self-training and distillation by using equal-or-larger student models and adding noise to the student during learning. It first trains an EfficientNet model on the labeled ImageNet images, which is used as a teacher to generate pseudo-labels for 300 million unlabeled images. Next, a larger EfficientNet is trained as a student model on the combination of labeled and pseudo-labeled images. This process is iterated by making the student the new teacher. During student training, noise such as dropout, stochastic depth, and data augmentation with RandAugment is injected into the student so that it generalizes better than the teacher [52]. Sixteen training runs were performed using the different models and the aforementioned training images. The resolution of all input images was set to 256 × 256 in order to compare the models and ensure reproducibility of the training results. The batch size was set to 32, the number of epochs to 50, and the random number seed was fixed to an arbitrary value. Width shift, height shift, and horizontal flip were used for data augmentation, with the width and height shift fraction set to 0.1. K-fold cross-validation (k = 10) was used to validate the predictive performance of the machine learning models (Figure 2). The model fitting callback was the early stopping function of Keras, which stops training before overfitting occurs; training was stopped if the val_loss value did not improve for 10 epochs. We also reduced the learning rate by a factor of 0.1 if there was no improvement for 3 epochs, using ReduceLROnPlateau in Keras.
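The following is a minimal sketch of this training configuration in Keras. The model choice (EfficientNetB0 with standard ImageNet weights, since Keras does not ship Noisy Student weights directly) and the directory layout are illustrative placeholders for the paper's 16-model pipeline.

```python
# Minimal sketch: fine-tuning one CNN with the augmentation and callbacks
# described above (shift 0.1, horizontal flip, EarlyStopping patience 10,
# ReduceLROnPlateau factor 0.1 / patience 3, batch size 32, 50 epochs).
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.preprocessing.image import ImageDataGenerator

tf.random.set_seed(42)  # fixed random seed for reproducibility, as in the paper

base = tf.keras.applications.EfficientNetB0(
    include_top=False, weights="imagenet", input_shape=(256, 256, 3))
model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(4, activation="softmax"),  # the four FLL classes
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])

datagen = ImageDataGenerator(width_shift_range=0.1, height_shift_range=0.1,
                             horizontal_flip=True)
train_gen = datagen.flow_from_directory("train/", target_size=(256, 256),
                                        batch_size=32, class_mode="categorical")

callbacks = [
    tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=10),
    tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss",
                                         factor=0.1, patience=3),
]
# model.fit(train_gen, epochs=50, validation_data=val_gen, callbacks=callbacks)
```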

Classification Metrics of Test Results for 16 CNNs
After training with the training set images, the 16 CNN models were tested with the 1000 test set images, and the four-class probabilities and predicted classes of all 1000 images were output (Figure 2). Performance metrics for multiclass classification are obtained by first computing true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) from the confusion matrix, and then using these values to calculate precision, sensitivity (recall), specificity, and f1 score. The calculation of TP, TN, FP, and FN in a multiclass setting is as follows. TP: the actual value and the predicted value are the same; FN: the false-negative count for a class is the sum of the values in the corresponding row except the TP value; FP: the false-positive count for a class is the sum of the values in the corresponding column except the TP value; TN: the true-negative count for a class is the sum of the values in all rows and columns except those of the class being evaluated.
Using Table 1 as a calculation example, with the 16 cells of the four-class confusion matrix numbered cell1 to cell16 row by row and BLT as the first class, the TP, TN, FP, and FN for the class BLT are calculated as follows:

TP_BLT = cell1
FN_BLT = cell2 + cell3 + cell4
FP_BLT = cell5 + cell9 + cell13
TN_BLT = cell6 + cell7 + cell8 + cell10 + cell11 + cell12 + cell14 + cell15 + cell16

TP, TN, FP, and FN are calculated in the same way for the remaining three classes. In multiclass classification, one class is selected as positive and the other classes as negative, and the evaluation values are calculated. There are two types of averages: the micro-average and the macro-average. When the number of samples per class differs greatly, the macro-average may not reflect the actual accuracy, and the micro-average may be selected. In this study, the number of samples per class is the same, so the macro-average is used.
The precision, sensitivity, specificity, f1 score, and accuracy for the class BLT are given by the following standard equations [53,54]:

Precision_BLT = TP_BLT/(TP_BLT + FP_BLT)
Sensitivity_BLT = TP_BLT/(TP_BLT + FN_BLT)
Specificity_BLT = TN_BLT/(TN_BLT + FP_BLT)
F1_BLT = 2 × Precision_BLT × Sensitivity_BLT/(Precision_BLT + Sensitivity_BLT)
Accuracy_BLT = (TP_BLT + TN_BLT)/(TP_BLT + TN_BLT + FP_BLT + FN_BLT)

These values can be calculated for each class, and the macro-average is the average of the per-class values. The macro-average of precision (Precision_Macro) in this study is calculated as:

Precision_Macro = (Precision_BLT + Precision_LCY + Precision_MLC + Precision_PLC)/4
Macro averages for other evaluation metrics are also calculated using the same formula as above. The 95% confidence intervals for precision, sensitivity, specificity, and accuracy were all calculated using the Newcombe method [55].
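To make these definitions concrete, the following minimal sketch computes the per-class counts and a macro-averaged metric with scikit-learn and NumPy; the y_true and y_pred arrays are hypothetical placeholders.

```python
# Minimal sketch: per-class TP, FP, FN, TN from a 4x4 confusion matrix
# (rows = actual, columns = predicted), matching the definitions above.
import numpy as np
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(0)
y_true = rng.integers(0, 4, size=1000)  # placeholder ground-truth labels
y_pred = rng.integers(0, 4, size=1000)  # placeholder predicted labels

cm = confusion_matrix(y_true, y_pred)   # shape (4, 4)
TP = np.diag(cm)                        # diagonal cells
FN = cm.sum(axis=1) - TP                # rest of each row
FP = cm.sum(axis=0) - TP                # rest of each column
TN = cm.sum() - TP - FN - FP            # everything else

precision = TP / (TP + FP)
sensitivity = TP / (TP + FN)
specificity = TN / (TN + FP)
print("macro precision:", precision.mean())  # macro-average over the 4 classes
```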
The receiver operating characteristic (ROC) curve and the ROC area under the curve (AUC) score are not immediately applicable to a multiclass classifier. The One-vs-Rest approach, in which each class in turn is compared against all the other classes, allows ROC curves to be drawn for multiclass classifiers. ROC curves and AUCs were computed for each of the four classes for all 16 CNNs, and the macro-average of the AUCs was then computed for each model [56,57]. Finally, the 16 CNN models were ranked by the macro-averaged accuracy of their test results, sorted from highest to lowest.
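scikit-learn's roc_auc_score implements exactly this One-vs-Rest macro averaging; a minimal sketch with placeholder labels and probabilities follows.

```python
# Minimal sketch: One-vs-Rest macro-averaged ROC-AUC for a 4-class problem.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 4, size=1000)           # placeholder labels
probs = rng.dirichlet(np.ones(4), size=1000)     # placeholder class probabilities

# Each class is treated as positive against the rest; the per-class AUCs are averaged.
macro_auc = roc_auc_score(y_true, probs, multi_class="ovr", average="macro")
print("macro ROC-AUC:", macro_auc)
```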

Correlation Matrix for Each of the 16 CNN Test Results
A quantitative measure of an ensemble's effectiveness is the Pearson's correlation coefficient (CORR) between classifiers. The CORRs between the estimated labels of the 16 CNN classifiers were calculated.
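A minimal sketch of this pairwise computation follows; the preds array is a hypothetical placeholder for the 16 classifiers' predicted labels on the 1000 test images.

```python
# Minimal sketch: pairwise Pearson correlation between classifiers' labels.
import numpy as np

rng = np.random.default_rng(0)
preds = rng.integers(0, 4, size=(16, 1000)).astype(float)  # placeholder labels

corr = np.corrcoef(preds)                  # 16 x 16 Pearson correlation matrix
pairs = corr[np.triu_indices(16, k=1)]     # the 120 unique classifier pairs
print("pairs below 0.95:", np.sum(pairs < 0.95), "of", pairs.size)
```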

Voting Ensemble including WHV, SV, and WAV
Hard voting is simple majority-vote image classification in which each classifier has one vote. With hard voting, when searching for the optimal number of classifiers in the ranking-based ensemble pruning described in the next section, using an even number of classifiers may produce a tie among the four classes, making the class determination impossible. To avoid this situation and improve the performance of ensemble learning, WHV, also known as weighted majority voting, assigns a different weight to each classifier [61]. WHV is computed by associating a weight w_j with classifier C_j:

ŷ = argmax_i Σ_j w_j χ_A(C_j(x) = i),

where χ_A is the characteristic function [C_j(x) = i ∈ A], and A is the set of unique class labels. SV likewise has a weighted variant, WAV, which aims to improve performance by assigning an optimized weight to each classifier [48]. In SV, the class label is predicted from the classifiers' predicted probabilities p:

ŷ = argmax_i Σ_j w_j p_ij,

where w_j is the weight assigned to the j-th classifier and p_ij is its predicted probability for class i.
In simple unweighted SV, every weight w_j is set to 1. WAV, like WHV, assigns a different weight to each classifier, and both methods require the weights to be optimized. The difference between WHV and WAV is illustrated with two examples in Figure 3. Several methods exist for weight optimization, such as grid search, random search, and Bayesian optimization [62,63]. In this study, we optimized the weights for both WHV and WAV using Bayesian optimization with our own Python code.
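A minimal sketch of the WHV decision rule above follows; the label array and weights are hypothetical placeholders (five classifiers shown for brevity).

```python
# Minimal sketch: weighted hard voting (weighted majority voting).
import numpy as np

rng = np.random.default_rng(0)
labels = rng.integers(0, 4, size=(5, 1000))   # each classifier's predicted class
w = np.array([0.3, 0.25, 0.2, 0.15, 0.1])     # one weight per classifier

# Sum the weights of the classifiers voting for each class, then take argmax.
votes = np.zeros((1000, 4))
for j in range(len(w)):
    votes[np.arange(1000), labels[j]] += w[j]
y_pred = votes.argmax(axis=1)
```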

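The paper's own weight-optimization code is not shown; as one possible realization, the following sketch tunes WAV weights by Bayesian optimization with Optuna (which the study uses elsewhere). The probs and y_true arrays are hypothetical placeholders for the per-model predicted probabilities and ground-truth labels.

```python
# Minimal sketch: WAV weight tuning via Optuna's TPE (Bayesian) sampler.
import numpy as np
import optuna
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
n_models, n_images, n_classes = 7, 1000, 4
probs = rng.dirichlet(np.ones(n_classes), size=(n_models, n_images))  # placeholder
y_true = rng.integers(0, n_classes, size=n_images)                    # placeholder

def objective(trial):
    # One weight per classifier, normalized to a convex combination.
    w = np.array([trial.suggest_float(f"w{j}", 0.0, 1.0) for j in range(n_models)])
    w = w / (w.sum() + 1e-12)
    avg = np.tensordot(w, probs, axes=1)   # weighted average of probabilities
    y_pred = avg.argmax(axis=1)            # soft-voting class decision
    return accuracy_score(y_true, y_pred)

study = optuna.create_study(direction="maximize",
                            sampler=optuna.samplers.TPESampler(seed=42))
study.optimize(objective, n_trials=100)
print(study.best_params)
```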

Stacking Ensemble
In ST, the 16 CNN models are first trained together as base learners to obtain predictions. The individual predictions are then treated as the training data for an additional layer called the meta-learner (Figure 4). Some studies use extreme gradient boosting (XGBoost), a representative gradient boosting method, as the meta-learner for ST [64,65]. In this study, we employed the ST method using the light gradient boosting machine (LightGBM) version 3.2.1 (https://lightgbm.readthedocs.io/ accessed on 20 November 2022) as the meta-learner, which is a newer gradient boosting method than XGBoost and is expected to have higher performance [66,67] (Figure 4). For LightGBM, k-fold cross-validation with k = 5 was used. The number of gradient boosting iterations (num_boost_round) was set to 1000, and early stopping (early_stopping) was set to 50, so that training was terminated if the score on the evaluation data did not improve for 50 consecutive rounds, even if the specified number of iterations had not been reached. The multiclass log loss (multi_logloss) was used as the metric for the LightGBM evaluation function. The other LightGBM parameters were optimized with Optuna version 2.10.0 (https://www.preferred.jp/ja/projects/optuna/ accessed on 20 November 2022), an open-source framework for automating hyperparameter optimization that uses a form of Bayesian optimization. We also fixed the random number seeds in the program to ensure reproducibility when using Optuna with LightGBM.
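A minimal sketch of the meta-learner step follows, using the stated LightGBM settings (num_boost_round 1000, early stopping 50, multi_logloss). The train_preds and test_preds arrays are hypothetical placeholders for the base CNNs' stacked class probabilities.

```python
# Minimal sketch: LightGBM as the stacking meta-learner.
import lightgbm as lgb
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
train_preds = rng.random((4000, 16 * 4))    # placeholder base-learner outputs
y_train = rng.integers(0, 4, size=4000)     # placeholder labels
test_preds = rng.random((1000, 16 * 4))     # placeholder test-set outputs

X_tr, X_val, y_tr, y_val = train_test_split(train_preds, y_train,
                                            test_size=0.2, random_state=42)
params = {
    "objective": "multiclass",
    "num_class": 4,
    "metric": "multi_logloss",  # evaluation metric used in the paper
    "seed": 42,                 # fixed seed for reproducibility
}
model = lgb.train(
    params,
    lgb.Dataset(X_tr, label=y_tr),
    num_boost_round=1000,
    valid_sets=[lgb.Dataset(X_val, label=y_val)],
    callbacks=[lgb.early_stopping(stopping_rounds=50)],
)
meta_pred = model.predict(test_preds).argmax(axis=1)  # final stacked prediction
```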

Ensemble Pruning and Evaluation Metrics
As in previous papers on ensemble learning using CNNs for medical images [35,37,47], we employed a ranking method, a type of ensemble pruning, in which classifiers are ordered by the accuracy obtained from the stand-alone test results of each of the 16 CNNs. To find the optimal number of selected classifiers for pruning, we first performed ensemble learning using all 16 classifiers. Pruning then proceeded by removing classifier members one by one, starting with the lowest-accuracy member of the top 16. Finally, ensemble learning was performed using only top1 and top2, for a total of 15 ensemble configurations (see the sketch below). By running this ensemble pruning for each of SV, WAV, WHV, and ST, 60 ensemble learning models were created (Figure 5). As with the testing of the 16 CNNs, the 60 ensemble learning models were tested using the 1000 B-mode FLL US images of the test set, and TP, TN, FP, FN, precision, sensitivity, specificity, f1 score, and accuracy were calculated as the evaluation metrics. Since WHV is a kind of majority voting and outputs only predicted classes, ROC curves cannot usually be drawn for it. Therefore, ROC-AUC values were calculated for SV, WAV, and ST only.
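A minimal sketch of this ranking-based pruning loop follows, shown for unweighted SV; probs_by_model and y_true are hypothetical placeholders for the 16 models' test-set probabilities and the ground truth.

```python
# Minimal sketch: rank models by stand-alone accuracy, then evaluate the
# top-k soft-voting ensemble for k = 16 down to 2 (15 configurations).
import numpy as np
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
probs_by_model = rng.dirichlet(np.ones(4), size=(16, 1000))  # placeholder
y_true = rng.integers(0, 4, size=1000)                       # placeholder

# Rank models from highest to lowest stand-alone accuracy.
solo_acc = [accuracy_score(y_true, p.argmax(axis=1)) for p in probs_by_model]
ranking = np.argsort(solo_acc)[::-1]

results = {}
for k in range(16, 1, -1):                             # top16 ... top2
    member_probs = probs_by_model[ranking[:k]]
    y_pred = member_probs.mean(axis=0).argmax(axis=1)  # unweighted soft voting
    results[f"SV{k}"] = accuracy_score(y_true, y_pred)
print(results)
```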

Statistical Examination
McNemar's test is used when the performance of two classifiers needs to be compared [68,69]. The test focuses primarily on the differences between the two classifiers, i.e., the cases predicted in different ways. An example of McNemar's test comparing two classifiers (classifier 1 and classifier 2) is as follows. First, the correct classes of the test set and the predicted classes of classifier 1 and classifier 2 are tabulated. These are compared to count the number of cases correctly classified by both classifiers, the number correctly classified by classifier 1 but not by classifier 2, the number correctly classified by classifier 2 but not by classifier 1, and the number misclassified by both classifiers. These counts are used to create a contingency table (Table 2). Finally, since McNemar's test is essentially a form of the Chi-square test for paired data, the Chi-square statistic is calculated using the following formula:
χ² = (n01 − n10)²/(n01 + n10)

The null hypothesis is that the two classifiers disagree by the same amount. If the null hypothesis is rejected, there is evidence that the classifiers disagree in different ways, i.e., that the disagreement is skewed [68,69]. Given a chosen significance level, the p-value calculated by the test can be interpreted as follows (alpha: significance level, H0: null hypothesis): p > alpha: fail to reject H0, no difference between classifier 1 and classifier 2; p ≤ alpha: reject H0, significant difference between classifier 1 and classifier 2.
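As an illustration, the following minimal sketch runs McNemar's test on two classifiers' predictions with statsmodels; the prediction arrays are simulated placeholders.

```python
# Minimal sketch: McNemar's test from two classifiers' predicted labels.
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

rng = np.random.default_rng(0)
y_true = rng.integers(0, 4, size=1000)
pred1 = np.where(rng.random(1000) < 0.70, y_true, rng.integers(0, 4, size=1000))
pred2 = np.where(rng.random(1000) < 0.75, y_true, rng.integers(0, 4, size=1000))

c1, c2 = pred1 == y_true, pred2 == y_true
# 2x2 contingency table: [[both correct, only 1 correct],
#                         [only 2 correct, both wrong]]
table = [[np.sum(c1 & c2), np.sum(c1 & ~c2)],
         [np.sum(~c1 & c2), np.sum(~c1 & ~c2)]]
result = mcnemar(table, exact=False, correction=False)  # chi-square form above
print(f"chi2 = {result.statistic:.3f}, p = {result.pvalue:.4f}")
```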
Based on this method, the following statistical analyses were conducted.

Comparison of 16 Different CNNs
The McNemar test of statistical significance was performed between the Top 1 model with the highest accuracy value and each of the other CNN models ranked 2nd to 16th among the 16 single CNN models.

Comparison of Stand-Alone CNN and Ensemble Models
The method with the highest accuracy value among the SV, ST, WAV, and WHV ensemble learning models was selected. The McNemar test was then performed between the test results of all 15 members of the ensemble pruning from top2 to top16 of the best selected ensemble learning method and the test results of the Top1 CNN model with the highest accuracy value among the 16 CNNs.

Evaluation of Test Results for 16 CNN Models
Four-class confusion matrices for the 16 CNNs are shown in Figure 6. True positive, false positive, false negative, and true negative values for the 4-class classification by the 16 CNNs are shown in Table 3, and the evaluation metrics for each of the 16 CNNs are shown in Tables 4 and 5. Among the 16 CNNs, ResNeXt101 (0.719) had the highest accuracy. With the exception of EfficientNetB0 (0.694), which had the 6th highest accuracy, the CNNs with the 4th through 9th highest accuracy were ResNet-based, and the CNNs from the 11th to the lowest accuracy were EfficientNet-based. The CNN with the highest f1 score was ResNeXt101, consistent with the accuracy ranking. Comparing precision, sensitivity, and specificity, all 16 CNNs showed the highest values for specificity, followed by precision and then sensitivity. Comparing the test results of ResNeXt101, which ranked first in accuracy, with the other 15 CNN models, there was no significant difference between ResNeXt101 and the models ranked 2nd to 9th, whereas there was a significant difference (p < 0.05) between ResNeXt101 and each of the CNNs ranked 10th through 16th (Table 5). The CORRs between the estimated labels of the 16 CNN classifiers totaled 120 pairs. Of these, 8 were in the 0.3 range and 112 were 0.4 or higher; all 120 CORRs were below 0.95 (Table 6).

Evaluation of Test Results for Ensemble Learning and Ensemble Pruning
A comparison of the accuracy values for ensemble learning is shown in Figure 6. For SV, top16 (SV16) had the highest accuracy at 0.776. For ST, top9 (ST9) had the highest accuracy at 0.776; referring to Table 5, ST9 denotes ST using a total of nine classifiers, starting from top1: ResNeXt101, Xception, InceptionResNetV2, SeResNeXt50, ResNeXt50, EfficientNetB0, SeResNeXt101, ResNet101, and ResNet50. For WHV, top13 (WHV13) had the highest accuracy at 0.779. For WAV, top7 (WAV7) had the highest accuracy at 0.783, the highest accuracy value among all methods.
Four-class confusion matrices of the models with the highest accuracy after pruning in each of the four types of ensemble learning are shown in Figure 7, and their TP, FP, FN, and TN values for the 4-class classification are shown in Table 7. The respective metrics for SV16, ST9, WAV7, and WHV13 are shown in Table 8. Along with its accuracy, the precision (0.790), sensitivity (0.783), specificity (0.928), f1-score (0.786), and macro ROC-AUC (0.935) of WAV7 were the highest values among all methods calculated in this study. Referring to Figure 6 and Table 8, WAV was chosen as the method with the best test performance among the four ensemble methods. Table 9 shows the statistical analysis of the test results for all WAV members from WAV2 to WAV16 against ResNeXt101, the top1 model in the accuracy ranking of the single CNNs. For WAV6 to WAV16, the test results were significantly better than those of ResNeXt101, a single CNN (p < 0.01). Table 10 shows the image classification results for the 4 types of liver masses for ResNeXt101, the model with the highest accuracy among the 16 CNNs, and WAV7, the model with the highest accuracy among the ensemble learning models. All indices (precision, sensitivity, specificity, f1-score, ROC-AUC) for both ResNeXt101 and WAV7 were highest for LCY, followed by BLT and PLC, and lowest for MLC. Comparing precision, sensitivity, and specificity, only the LCY class of WAV7 reached 100% precision and specificity; for all other classes, both ResNeXt101 and WAV7 showed the highest values for specificity, followed by precision and then sensitivity.

Discussion
In this study, we used accuracy, a metric often used to compare multiple models, to rank the test results of the 16 different CNN models [54]. When comparing the accuracy of multiple training models, the random seed needs to be fixed to ensure reproducibility of the test results. For this reason, we fixed the random seed at multiple locations in the program as appropriate. However, since the main purpose of this paper is the comparison of multiple models and the usefulness of ensemble learning, fixing the random seed does not necessarily mean that the training models output their best possible test results; it should be taken into account that each of the reported evaluation metrics tends to be somewhat lower as a result. EfficientNet is newer than the ResNet- and Inception-based CNN models, and EfficientNet pre-trained on ImageNet with Noisy Student Training is generally expected to perform better than without it [52]. For this reason, we employed ImageNet with Noisy Student Training for the pre-training of EfficientNet. However, even so, EfficientNet did not achieve better accuracy than the ResNet- and Inception-based CNN models, except for EfficientNetB0. In the future, the usefulness of EfficientNet should be examined by increasing or decreasing the number of images.
The type of model that achieves high accuracy may also depend on the type and resolution of the target medical images. Therefore, simple comparison with other papers on the usefulness of ensemble learning, which study subjects different from ours, is of limited value. In this study, 16 model types were widely selected from the Inception, ResNet, and EfficientNet families. A quantitative measure of the effectiveness of an ensemble is the Pearson's CORR between classifiers, and ensemble learning is expected to be effective when the Pearson's CORR is 0.95 or less [70]. The 16 CNNs used in this study all had CORRs of 0.95 or less, suggesting that ensemble learning can help improve prediction performance.
There was no significant difference between ResNeXt101, the CNN with the highest accuracy, and the CNNs ranked 2nd through 9th in accuracy. Therefore, ResNeXt101 was adopted as the representative single CNN for comparison of test results with ensemble learning. In the comparison of SV, ST, WAV, and WHV, WAV showed the best accuracy. As for ST, even though a gradient boosting method was used as the meta-model, it did not perform as well as SV and WAV, consistent with a previous paper on ensemble learning using medical images [36].
Ensemble pruning was also performed, and the test results were compared between WAV, the best performing of the four ensemble learning methods, and ResNeXt101, a single CNN. The results showed that WAV significantly outperformed ResNeXt101 down to WAV6, suggesting that pruning can reduce the number of classifiers that are members of the ensemble. However, further research is needed on pruning methods and parameters.
In this study, the average computation time with SV was 4.26 s, while WAV took 13 min. Advances in hardware continue to increase computation speed, making shorter processing times possible. The results of this study also show that ensemble learning clearly improves accuracy compared to single learning models, and as the trend toward faster computation continues, ensemble learning is expected to become more popular.
Although very rare, cases of PLC associated with BLT have been reported, and US images have been included in these case reports [71,72]. It is believed that general DL, including that in this study, cannot correctly diagnose such rare cases, which is a limitation of this study. Further studies are needed on computer-assisted diagnosis of such rare diseases. However, if the DL model predicts BLT on one scan and PLC on another scan of the same lesion, this paradoxical prediction may hint to radiologists to consider the possibility of a rare case.

Conclusions
In multiclass classification of FLLs using US B-mode images, ensemble learning was shown to improve the accuracy and deep learning performance over a single CNN. Future research should explore ensemble pruning for medical images other than US, such as X-ray, CT, and MRI, especially to determine how many classifier members are useful in ensemble learning with ranking methods.

Informed Consent Statement:
This study was approved by the institutional review board, and written informed consent was waived because of the retrospective design.

Data Availability Statement:
The models are available upon request via this GitHub link: https://github.com/imedix2021/ame_liver_ensemble accessed on 20 November 2022. The original hepatic US dataset used in this study was selected from US images of hepatic mass lesions from the national project "the construction of big database of US digital image: toward the diagnostic aid with artificial intelligence", part of the "ICT infrastructure establishment and implementation of artificial intelligence for clinical and medical research" fund (https://www.amed.go.jp/en/program/list/14/02/002.html accessed on 20 November 2022), an effort by the government of Japan to develop, in a fully integrated manner, ICT infrastructure that utilizes the results of medical data analysis. The contact information for this database is rinshoict@amed.go.jp.

Conflicts of Interest:
The authors declare no conflict of interest.
