Deep Features from Pretrained Networks Do Not Outperform Hand-Crafted Features in Radiomics

In radiomics, utilizing features extracted from pretrained deep networks could result in models with a higher predictive performance than those relying on hand-crafted features. This study compared the predictive performance of models trained with either deep features, hand-crafted features, or a combination of these features in terms of the area under the receiver-operating characteristic curve (AUC) and other metrics. We trained models on ten radiological datasets using five feature selection methods and three classifiers. Our results indicate that models based on deep features did not show an improved AUC compared to those utilizing hand-crafted features (deep: AUC 0.775, hand-crafted: AUC 0.789; p = 0.28). Including morphological features alongside deep features led to overall improvements in prediction performance for all models (+0.02 gain in AUC; p < 0.001); however, the best model did not benefit from this (+0.003 gain in AUC; p = 0.57). Using all hand-crafted features in addition to the deep features resulted in a further overall improvement (+0.034 in AUC; p < 0.001), but only a minor improvement could be observed for the best model (deep: AUC 0.798, hand-crafted: AUC 0.789; p = 0.92). Furthermore, our results show that models based on deep features extracted from networks pretrained on medical data have no advantage in predictive performance over models relying on features extracted from networks pretrained on ImageNet data. Our study contributes a benchmarking analysis of models trained on hand-crafted and deep features from pretrained networks across multiple datasets. It also provides a comprehensive understanding of their applicability and limitations in radiomics. Our study shows, in conclusion, that models based on features extracted from pretrained deep networks do not outperform models trained on hand-crafted ones.


Introduction
Radiomics has emerged as a promising image analysis technique, providing insights into the characterization and quantification of radiological imaging, especially supporting diagnostic and prognostic tasks [1][2][3].Essentially, radiomics involves the application of a classical machine learning pipeline to process radiological data [4,5].A central step is generating features from radiological imaging, since these are designed to capture the content in a comprehensive fashion.For this task, hand-crafted features based on morphology, intensity, and texture have been developed, stemming back from ideas in the field of image analysis in the 1970s and 80s [6,7].Even though these features have proven successful, they might be suboptimal, since they were designed mainly for other domains.Thus, they might not fully capture information in radiological imaging.
Recently, deep learning models have been applied successfully in many areas [8][9][10].A key advantage is that they can automatically learn relevant features from imaging data, which may be especially beneficial compared to hand-crafted features, leading to models with improved prediction and increased utility.Yet, a fundamental problem in applying these methods lies in the small sample sizes typically involved in radiomic datasets.Deep learning methods usually possess a large number of trainable parameters that are indispensable in learning features independently; however, at the same time, a large number of data are needed to train these parameters effectively.If the sample sizes are small, as in radiomics, they might not perform well.
A simple alternative is to use pretrained deep networks as feature extractors.In this approach, networks are first trained on other large datasets, such as everyday images or medical images that are not directly related to the existing data.Ideally, because of the larger amount of data, the network will learn features that represent the data more abstractly at a higher level.The hope is that these features are more general in nature and will work well on other datasets, too.In radiomics, these features could thus form a viable alternative to manually generated features and, in the best case, lead to better models.
However, a large number of pretrained models exist.These differ mainly in the dataset and the methods used during training.For example, many models are trained on the ImageNet dataset containing around 14 million images [11], but recently, networks pretrained on medical data were also introduced.In addition, training can be conducted in a supervised fashion and via self-supervision, which has the benefit of being an unsupervised method, e.g., no labels are needed.It is unclear which pretrained models should be used as feature extractors in radiomics, and whether they can provide an advantage over handcrafted features.Similarly, the question is whether the two feature classes could complement each other, leading to even higher predictive models.
This study, therefore, utilized multiple pretrained deep neural networks and tested their predictive performance across multiple radiomic datasets.Our study aimed to answer which pretrained networks work better, and whether models fused using deep and handcrafted features together can further improve performance.The major contribution of this research is conducting a comparative analysis of models trained on hand-crafted, deep features extracted from pretrained deep networks or a combination of these.A novelty is the utilization of a broad set of deep models comprising models trained on ImageNet (like DeiT III and Convnext-v2) and medical data (like MedicalNet).While the main metric is the area under the receiver-operating characteristic curve (AUC), we also employed other metrics for this comparison (like the Matthew correlation coefficient and F-score).This study is one of the first to compare these models over several datasets, allowing for a more robust assessment of their performance compared to studies using only a single dataset.It contributes to a more comprehensive understanding of the applicability and limitations of these approaches, and can guide future research and applications.
The remaining paper is organized as follows: Section 2 provides an overview of related works, while Section 3 describes the materials and methods used in this study.Section 4 presents the analysis results, and Section 5 discusses the results, providing concluding remarks and directions for possible future work.

Related Works
Studies comparing models using hand-crafted features and deep features have been conducted previously (Table 1), but nearly all these studies only compared them on single datasets.Since such comparisons cannot be readily generalized to other datasets, the results in the literature are accordingly mixed.While some studies have reported a decreased or no improvement at all (e.g., [12][13][14][15]), others reported improved performance when using deep features (e.g., [16][17][18][19][20]).The reasons for these differences are manifold and hard to identify.One reason could be the different validation schemes utilized.For example, Fu et al. employed an (uncommon) 4-fold CV [21], while Yan et al. used an internal cohort that was acquired at the same site [12].In contrast, Feng et al. used an external cohort for their validation [19].Yet, no clear pattern can be seen; for example, Feng et al. observed a gain of +0.10 in AUC, while Hou et al. also used an external cohort, but did not see any gain in using deep features [15].Furthermore, even though many hand-crafted features were defined by the image biomarker standardization initiative [22] and thus should be comparable across different studies, some libraries for extracting hand-crafted features do not adhere to this standard, and can lead to differences in predictive performance between the studies.The same is true for the preprocessing steps; these can have a large impact on the performance, but there is no standardization.For example, normalization of the features using different methods like z-Score or min-max can influence the resulting model [23], but they differ between the studies.A study comparing the predictive performance of multiple pretrained deep networks across multiple radiomics datasets still needs to be conducted.

Materials and Methods
For this retrospective study, several publicly available datasets were used; corresponding approvals from the ethical review boards have been granted.Ethical approval for this study was waived by the local ethics committee (Ethik-Kommission, Medizinische Fakultät der Universität Duisburg-Essen, Essen, Germany).All methods and procedures were performed following the relevant guidelines and regulations, the CheckList for EvaluAtion of Radiomics research (CLEAR) [27], and in accordance with the Declaration of Helsinki.

Study Design
The study follows the standard radiomics pipeline very closely [3,28] and consists of preprocessing, feature extraction, selection, and classification (Figure 1).Briefly, the datasets were first preprocessed via resampling to a homogenous spacing.The volumes were then used to extract hand-crafted and deep features.The hand-crafted features were extracted directly from the volumes.For the deep features, first, slices were extracted.The slices were then processed using pretrained deep models, and the features were extracted from a layer of the networks.These steps resulted in multiple feature sets, which were then used for model building, consisting mainly of a feature selection and classification step.A 10-fold stratified cross-validation was utilized to determine which methods performed best.During cross-validation, the training data were normalized, and a feature selection algorithm was applied.On the resulting features, a classifier was trained.This classifier was then used to obtain predictions on the validation folds.These predictions were then pooled across the folds, and several metrics were computed.
classifier was then used to obtain predictions on the validation folds.These predictions were then pooled across the folds, and several metrics were computed.

Datasets
Ten radiomic datasets were collected from the Workflow for Optimal Radiomics Classification (WORC) [29] and the cancer imaging archive (TCIA) [30] (Table 2).All datasets were binary, i.e., the outcome is either 0 or 1.Each dataset consisted of a scan (e.g., a CT or an MR sequence) and a segmentation of the volume of interest (VOI) (Figure 2).All samples were included, except for a few wherein either the slice thickness was comparatively too large (i.e., larger than twice the median slice thickness) or a technical error occurred while computing the hand-crafted feature vectors.All scans were resampled using bicubic interpolation to a homogenous spacing of 1 × 1 × 1 mm 3 .The segmentations were resampled to the same spacing, but using the nearest neighbor method to avoid partial volume effects.

Datasets
Ten radiomic datasets were collected from the Workflow for Optimal Radiomics Classification (WORC) [29] and the cancer imaging archive (TCIA) [30] (Table 2).All datasets were binary, i.e., the outcome is either 0 or 1.Each dataset consisted of a scan (e.g., a CT or an MR sequence) and a segmentation of the volume of interest (VOI) (Figure 2).All samples were included, except for a few wherein either the slice thickness was comparatively too large (i.e., larger than twice the median slice thickness) or a technical error occurred while computing the hand-crafted feature vectors.All scans were resampled using bicubic interpolation to a homogenous spacing of 1 × 1 × 1 mm 3 .The segmentations were resampled to the same spacing, but using the nearest neighbor method to avoid partial volume effects.Class balance denotes the ratio of the major class to the minor class.DCE: Dynamic contrast enhanced; TCIA: The cancer imaging archive; WORC: Workflow for optimal radiomics classification.Class balance denotes the ratio of the major class to the minor class.DCE: Dynamic contrast enhanced; TCIA: The cancer imaging archive; WORC: Workflow for optimal radiomics classification.

Preprocessing
For each dataset, slices were extracted from the bounding box around the segmentation.Slices in which the segmentation occupied less than 50 pixels (around 7 mm 2 ) were excluded, since the region-of-interest (ROI) would be too small to extract meaningful features.The same intensity normalization was applied per slice, regardless of whether a CT or MR dataset was processed.For the networks trained on ImageNet data, slices were normalized with respect to the mean and standard deviation of the ImageNet dataset.Slices for the RadImageNet were normalized by linearly scaling the intensities into the range [−1, 1].For MedicalNet, the intensities of the volumes were z-Score normalized.In addition, the volume of interest was resized to 112 × 112 × 56.Slices were not further rescaled during preprocessing.
For the hand-crafted features and 3-D networks, the scan was cropped using a bounding box around the segmentation.

Preprocessing
For each dataset, slices were extracted from the bounding box around the segmentation.Slices in which the segmentation occupied less than 50 pixels (around 7 mm 2 ) were excluded, since the region-of-interest (ROI) would be too small to extract meaningful features.The same intensity normalization was applied per slice, regardless of whether a CT or MR dataset was processed.For the networks trained on ImageNet data, slices were normalized with respect to the mean and standard deviation of the ImageNet dataset.Slices for the RadImageNet were normalized by linearly scaling the intensities into the range [−1, 1].For MedicalNet, the intensities of the volumes were z-Score normalized.In addition, the volume of interest was resized to 112 × 112 × 56.Slices were not further rescaled during preprocessing.
For the hand-crafted features and 3-D networks, the scan was cropped using a bounding box around the segmentation.

Feature Extraction
After preprocessing, hand-crafted and deep features were extracted from either the slices or the scans.
For the hand-crafted features, 2-D and 3-D features were extracted from the scans using PyRadiomics v3.1 [35].For this, each scan was first processed via multiple preprocessing filters in addition to the unprocessed scans: Laplacian-of-Gaussian (with sigma 1.0.2.0.3.0.4.0.5.0), wavelet, square root, logarithm, square, exponential and gradient filter.Three types of features were extracted: morphological, intensity, and texture features.All feature classes were extracted that were available in PyRadiomics, namely, shape, firstorder, glcm (with the exception of SumAverage), glrlm, glszm, gldm and ngtdm.The computation of the features differed for CT and MRI.For MRI data, the intensities were normalized and scaled by 100.For CT data, no normalization was applied.For extraction, in both cases, the image values were quantized using a bin width of 25.Altogether, 1470 hand-crafted features were computed for each scan.
In addition to 3-D features, 2-D hand-crafted features were computed using the force2D option in PyRadiomics.This option ensured that all texture and intensity features were computed slice by slice.Altogether, five feature sets comprising hand-crafted features were computed.First, all features were extracted from the 3-D data (called "Hand-crafted, 3-D, All").Therefore, these features can be regarded as the standard, and are used as the baseline for all comparisons.In addition, since morphological features are known to be important, a feature set comprising only morphological features was also generated ("Hand-crafted, 3-D, Morph").Finally, a third feature set was created with all features except morphological ones ("Hand-crafted, 3-D, No morph").In addition, two 2-D feature sets were generated: One comprising all 2-D features with the 3-D morphological features ("Hand-crafted, 2-D, Morph"), and one without ("Hand-crafted, 2-D, No morph").
Extraction of the deep features from the pretrained neural networks was performed as follows [36]: Slices were first resized to 224 × 224 pixels and converted to 3-channel RGB images by adding the segmentation and a linear combination of the scan and segmentation as extra channels.Features were then extracted from the last convolutional layer using a global average pooling if necessary.In order to obtain a single feature vector for each patient, the features from all slices of a patient were merged using max pooling.
For each network, the corresponding features were extracted by max-pooling all slices.Additional feature sets were generated by combining the features with either morphological features only ("+Morph") or with all hand-crafted features ("+All").Since 20 pretrained networks were selected, this resulted in 60 feature sets involving deep and combined features.

Training
Each feature set was processed by a standard radiomics pipeline based on machine learning.First, the data were split into ten folds, ensuring the splits were the same across all feature sets to avoid any kind of bias from different subsamples.Then, for the training folds, the data was normalized using z-Scores.A feature selection was applied to remove redundant and irrelevant features.Five different methods were employed for this task [51]: ANOVA, Bhattacharyya, LASSO (with regularization parameter C = 1), random forest (with 100 trees), and t-Score.Because these methods do not select but only score the relevance of each feature, varying numbers of highest-scoring features were selected, ranging from 1, 2, 4, . .., 64, 128.On the resulting feature set, one of three different classifiers was applied: logistic regression (with C chosen among 2 −8 , 2 −7 , . .., 2 7 , 2 8 ), neural networks (with three layers of size chosen among 16, 64, and 256), and random forests (with either 125, 250, or 500 trees).

Performance Metrics
Multiple metrics were used for comparison of the models.The primary metric used was the area under the receiver-operating characteristic curve (AUC), since this metric is rather insensitive to class imbalances in the data, and can be regarded as the de facto standard metric and is used in many radiomic studies.In addition, other metrics that are of interest to judge the quality of the model were employed: Accuracy, which measures the proportion of correctly predicted instances; recall, which quantifies a model's ability to correctly identify positive instances; and precision, which calculates the proportion of true positive predictions among all positive predictions.In addition, the F-score, which is a harmonic mean of recall and precision, was measured [52].Finally, we computed the Matthew correlation coefficient (MCC), which can be understood as a measurement that produces high values only if the model predicts both classes well [53].

Evaluation
For each model, its predictions on the left-out validation fold were computed.The model's performance was then evaluated by pooling all predictions and computing the area under the curve of the receiving operator characteristic (AUC).The best-performing model for each feature set was then selected as the final model.These models were then studied with regard to the following two aspects: first, the performance gain of models using deep features compared to those using hand-crafted features; second, the improvement in performance resulting from adding morphological or all hand-crafted features to deep features.The hand-crafted model, which uses all features extracted from the 3D volume, was taken as a reference.
Statistical significance between the results was computed using a Wilcoxon signedrank test.A p-value below 0.05 was considered to be statistically significant.

Results
Overall, 65 feature sets were computed and analyzed.A graphical display of the AUC curves of the best models can be seen in Figure 3.

Hand-Crafted Features
Models that used all hand-crafted features with or without morphological ones performed very similarly overall, with the exception of the model based on morphological features alone (Table 3).The two models with 2-D features performed slightly better than those with 3-D features.The model that only employed morphological features performed worse, especially on the C4KC-KiTS dataset.However, it performed well on I-SPY1 and Lipo.

Results
Overall, 65 feature sets were computed and analyzed.A graphical display of the AUC curves of the best models can be seen in Figure 3.  due to the morphological features, closely followed by SimSiam with ResNet-50 backbone, which gained 0.011 in AUC.
Models using deep features performed slightly worse than those using standard hand-crafted features regarding AUC.Only the ConvNeXt V2 model achieved essentially the same performance as the hand-crafted radiomic model (Table 3, and Table S3 in the Supplementary Materials).Other models performed worse, and the loss in AUC was larger than 0.022.However, a larger difference could be seen in MCC for all models (Figure 4).These observations were also true for the networks pretrained on medical data (MedicalNet, RadImageNet), which performed inferior to the standard hand-crafted model (loss in AUC greater than 0.05).

Deep Features Fused with Morphological Features
Models using morphological features in addition to the deep features performed slightly better; the difference in AUC was statistically significant (+0.02 in AUC; p < 0.001) (Tables 3 and S4 in the Supplementary Materials).Other metrics were also slightly better, although again, a performance drop in MCC could still be clearly seen (Figure 5).The bestperforming model was still ConvNeXt V2, which virtually showed no improvement due to the morphological features, closely followed by SimSiam with ResNet-50 backbone, which gained 0.011 in AUC.

Deep Models Fused with All Hand-Crafted Features
Adding all hand-crafted features to the deep features improved the predictive performance further (Table 3 and S5 in the Supplementary Materials), and an overall difference of 0.034 in AUC over the models using only morphological features could be seen (p < 0.001).Compared to using only deep features, adding the hand-crafted features led to a higher improvement of 0.054 in AUC (p < 0.001).Yet, the previously best-performing model, ConvNeXt V2, showed only small improvements (+0.009 in AUC).In contrast, Sim-Siam showed a gain of +0.021 and performed slightly better than the standard, handcrafted 3-D model (with a gain of +0.009).Regarding the other performance metrics, there

Deep Models Fused with All Hand-Crafted Features
Adding all hand-crafted features to the deep features improved the predictive performance further (Table 3 and Table S5 in the Supplementary Materials), and an overall difference of 0.034 in AUC over the models using only morphological features could be seen (p < 0.001).Compared to using only deep features, adding the hand-crafted features led to a higher improvement of 0.054 in AUC (p < 0.001).Yet, the previously best-performing model, ConvNeXt V2, showed only small improvements (+0.009 in AUC).In contrast, SimSiam showed a gain of +0.021 and performed slightly better than the standard, hand-crafted 3-D model (with a gain of +0.009).Regarding the other performance metrics, there was virtually no difference; this was also true for the MCC (Figure 6).While for several models, the gains in AUC compared to the standard hand-crafted model were minor but positive, the majority of the models still did not reach its level of performance.

Discussion
In this study, we investigated whether models using features extracted from pretrained deep neural networks can outperform models based on hand-crafted features.We utilized a large set of pretrained deep networks and compared them to ten radiomic datasets.
Our results indicate that, on average, the models using deep features do not improve over those using standard hand-crafted features.The overall best deep feature set, Con-vNeXt V2, performed on average virtually the same; however, many other feature sets performed worse.It was also true for the features extracted from networks pretrained on radiological data and networks trained with better techniques (like self-supervised training), which can produce state-of-the-art results on ImageNet.
Adding morphological features did lead to a slight increase in overall performance, yet the best-performing model, the ConvNeXt V2, did not benefit.This observation starkly contrasts the intuition that morphological features, such as volume or sphericity, can improve the performance of 2-D networks, since these do not directly see the global shape.There could be two reasons why no such increase was observed: The networks can extract some morphological information about the lesions, even though only slices were processed.This might happen if the network extracts the segmentation size on each slice as a feature.On the other hand, it is also possible that for the datasets used in this study, the morphological information is not critical for the prediction, even though a recent study showed that volume is important in head and neck tumors [54].
Fusing all hand-crafted features with the deep features did lead to a slight perfor-

Discussion
In this study, we investigated whether models using features extracted from pretrained deep neural networks can outperform models based on hand-crafted features.We utilized a large set of pretrained deep networks and compared them to ten radiomic datasets.
Our results indicate that, on average, the models using deep features do not improve over those using standard hand-crafted features.The overall best deep feature set, Con-vNeXt V2, performed on average virtually the same; however, many other feature sets performed worse.It was also true for the features extracted from networks pretrained on radiological data and networks trained with better techniques (like self-supervised training), which can produce state-of-the-art results on ImageNet.
Adding morphological features did lead to a slight increase in overall performance, yet the best-performing model, the ConvNeXt V2, did not benefit.This observation starkly contrasts the intuition that morphological features, such as volume or sphericity, can improve the performance of 2-D networks, since these do not directly see the global shape.There could be two reasons why no such increase was observed: The networks can extract some morphological information about the lesions, even though only slices were processed.This might happen if the network extracts the segmentation size on each slice as a feature.On the other hand, it is also possible that for the datasets used in this study, the morphological information is not critical for the prediction, even though a recent study showed that volume is important in head and neck tumors [54].
Fusing all hand-crafted features with the deep features did lead to a slight performance boost.Even though this is reasonable, there are two reasons why this is somewhat surprising.First, adding all hand-crafted features will have doubled the number of features in many cases.Given the small sample size and the curse of dimensionality [55], more features could easily have led to diminished performance.Second, one could have expected that deep and hand-crafted features are complementary in nature.Thus, fusing both should have led to much higher-performing models.However, this was not observed.A reason for this could be that even though hand-crafted features could be sub-optimal, they have been shown in many cases to be highly discriminative.Deep features from pretrained networks might not be able to improve much over that.In addition, many outcomes might depend on information simply unavailable in the radiological data.Thus, regardless of whether hand-crafted or deep features are used, it cannot be expected that higher performance can be obtained.
Our results may also be interpreted in light of the 'no free lunch'-theorem [56], which mainly states that no single classifier consistently outperforms other classifiers.Indeed, our results indicate that deep features can yield a higher gain over hand-crafted features for specific datasets.In other words, to obtain the highest predictive performance, the features must be further tuned to the dataset by some kind of training.Yet, there are two problems to consider: First, training involves many hyperparameters (e.g., the learning rate and schedule, the head architecture, augmentations, and regularization).Tuning them for a small dataset is a problem of its own, and can easily lead to overfitting when not handled properly.Second, a key advantage of using pretrained networks only as feature extractors is that the extracted features will be more reproducible [57].In contrast, any training of a deep network, including finetuning, will inevitably destroy this property since the features will depend strongly on the dataset.In other words, training will decrease reproducibility substantially.Nonetheless, if predictive performance is the goal, it seems that using features extracted from pretrained deep networks is not very helpful, and some training should be conducted.
Although some studies exist that compare hand-crafted and deep features, these studies cannot be readily compared; therefore, we conducted a comparison analysis, where we can ensure comparability, e.g., that the preprocessing steps and cross-validation are the same for all feature sets.Our results also reflect the differences seen in the literature: If we consider a specific dataset and model (as is the case in nearly all of the above-cited studies), we can see either a large or nearly no difference between the models using handcrafted and deep features.For example, on I-SPY1, the model using hand-crafted features obtained an AUC of 0.671, while the DeiT III model using all features obtained one of 0.788, indicating that the deep model is clearly better.Yet, had we used the ConvNeXt V2 model, we would have seen no difference in AUC (0.661 vs. 0.671).Therefore, we believe that a comparison across many datasets is key to understanding the performance of the models using pretrained deep features.However, as shown, we believe that for specific datasets and with further optimizations, the AUC of the deep models could be improved more [36].Therefore, our observation underlines that deep features are not the panacea, but may be helpful for certain datasets [58].
Our study has certain limitations.First, we only used the pretrained deep networks as feature extractors.As discussed, finetuning them could lead to higher performance, albeit for a price.Second, nearly all networks we considered were two-dimensional.Such networks cannot directly utilize the spatial coherence present in the data.Simply adding morphological features did not help in this regard.Third, we only considered a single deep-learning pipeline.Tuning the extraction parameters for each dataset could increase the overall performance [36].Furthermore, we used a simple cross-validation scheme, which might be prone to overfitting.However, all feature sets would be affected similarly, and it is reasonable to expect that the relative performances would more or less stay the same.In addition, it was shown that the amount of overfit for the classifiers we employed can be expected to be rather low [29].Finally, we considered only binary problems, since these are the most common in the clinical context.Results might largely be different if multi-class problems are considered [59].

Future Work and Conclusions
Our study showed that the hope that deep features from pretrained networks are more predictive than hand-crafted features is unfounded.However, since we did not train and fine-tune the networks, their features were not specific to the dataset at hand; this could be a reason for the low predictive performance we observed.Therefore, training them could lead to models with higher predictive performance.This should be tested in a future study.Furthermore, we could not observe that networks pretrained on medical data outperformed those pretrained on ImageNet data in our study, which is unintuitive.The reasons for this are unclear, and a more thorough study on this should be performed, since a deeper understanding of this could be used to obtain more robust and predictive models.Our study did not analyze how far the features from different networks agree in terms of correlation.However, such an analysis could provide insights into why the fusion of hand-crafted and deep features did not yield a significant improvement.Finally, we did not test whether fusing features from different networks could be more helpful.This should also be tested in a future study.
In conclusion, our study showed that models based on features extracted from pretrained deep networks did not perform better on average.Fusing hand-crafted and deep features can yield some minor improvements, but will be specific to the dataset.

Supplementary Materials:
The following supporting information can be downloaded at: https:// www.mdpi.com/article/10.3390/diagnostics13203266/s1,Table S1: List of pretrained deep networks used; Table S2: AUC of the models using hand-crafted features.Table S3.AUC of the models using deep features.Table S4.AUC of the models using deep features fused with morphological features.Table S5.AUC of the models using deep features fused with hand-crafted features were added.
Funding: This research received no external funding.
Institutional Review Board Statement: Ethical review and approval were waived for this study due to its retrospective nature.
Informed Consent Statement: Not applicable.

Figure 1 .
Figure 1.Overall flowchart of the study.For each dataset and pretrained neural network, a 10-fold stratified cross-validation is employed.

Figure 1 .
Figure 1.Overall flowchart of the study.For each dataset and pretrained neural network, a 10-fold stratified cross-validation is employed.

Figure 2 .
Figure 2. Example slices for each dataset.Images were normalized to the range between 0 and 255, regardless of whether they were CT or MR sequences.The segmentation is depicted in red.The HN image was cropped for visualization purposes.

Figure 2 .
Figure 2. Example slices for each dataset.Images were normalized to the range between 0 and 255, regardless of whether they were CT or MR sequences.The segmentation is depicted in red.The HN image was cropped for visualization purposes.

Figure 3 .
Figure 3. AUC curves for the standard model, "Hand-crafted, 3-D, All" (in black), the overall bestperforming model using deep features (ConvNeXt V2, in yellow), using deep and morphological

Figure 4 .
Figure 4. Graphical display of the performance metrics for all models using deep features.The red bars depict the performance of the standard model, "Hand-crafted, 3-D, All".

Figure 4 .
Figure 4. Graphical display of the performance metrics for all models using deep features.The red bars depict the performance of the standard model, "Hand-crafted, 3-D, All".Diagnostics 2023, 13, x FOR PEER REVIEW 11 of 16

Figure 5 .
Figure 5. Graphical display of the performance metrics for all models using deep and morphological features.The red bars depict the performance of the standard model, "Hand-crafted, 3-D, All".

Figure 5 .
Figure 5. Graphical display of the performance metrics for all models using deep and morphological features.The red bars depict the performance of the standard model, "Hand-crafted, 3-D, All".

Diagnostics 2023 , 16 Figure 6 .
Figure 6.Graphical display of the performance metrics for all models using deep and all handcrafted features.The red bars depict the performance of the standard model, "Hand-crafted, 3-D, All".

Figure 6 .
Figure 6.Graphical display of the performance metrics for all models using deep and all hand-crafted features.The red bars depict the performance of the standard model, "Hand-crafted, 3-D, All".

Table 1 .
A list of studies between 2019 and 2022 that compare models based on hand-crafted features and deep features from pretrained networks.

Table 2 .
Datasets used for benchmarking.

Table 2 .
Datasets used for benchmarking.