Impact of Wavelet Kernels on Predictive Capability of Radiomic Features: A Case Study on COVID-19 Chest X-ray Images

Radiomic analysis allows for the detection of imaging biomarkers supporting decision-making processes in clinical environments, from diagnosis to prognosis. Frequently, the original set of radiomic features is augmented with high-level features, such as those derived from wavelet transforms. However, several wavelet families (so-called kernels) can generate different multi-resolution representations of the original image, and which of them produces the most salient images is not yet clear. In this study, an in-depth analysis is performed by comparing different wavelet kernels and by evaluating their impact on the predictive capabilities of radiomic models. A dataset composed of 1589 chest X-ray images was used for COVID-19 prognosis prediction as a case study. Random forest, support vector machine, and XGBoost were trained (on a subset of 1103 images) after a rigorous feature selection strategy to build up the predictive models. Next, to evaluate the models' generalization capability on unseen data, a test phase was performed (on a subset of 486 images). The experimental findings showed that the Bior1.5, Coif1, Haar, and Sym2 kernels guarantee better, and mutually similar, performance for all three machine learning models considered. Support vector machine and random forest showed comparable performance, and both were better than XGBoost. Additionally, random forest proved to be the most stable model, ensuring an appropriate balance between sensitivity and specificity.


Introduction
Quantitative imaging biomarkers, i.e., radiomic features, can be used to extract information complementary to the visual approach of the radiologist [1]. Radiomics has been exploited in different scenarios to support the decision making of clinicians at different stages of the care process, from diagnosis to prognosis. In more detail, radiomic signatures can be used for the diagnosis of several pathologies, to predict response to therapy, and to categorize clinical outcomes in general. Although several best practices [2] and standardization initiatives [3] have been proposed, full reproducibility of the radiomic process is still lacking. The imaging modality used, the acquisition protocol, the feature extraction settings, the preprocessing, and the machine learning (ML) workflow are variables undermining the reproducibility of the radiomic process. Several researchers have tried to evaluate the robustness and the predictive capabilities of radiomic features, depending on specific external parameters (e.g., segmentation method, quantization level, and preprocessing steps) [4][5][6]. This already complex scenario becomes even more complicated when the classical radiomic feature set is extended by also considering high-level features, such as those derived from wavelet transforms, Laplacian of Gaussian (LoG) filters, and intensity transformations (e.g., logarithm, exponential, gradient, etc.). Wavelet-derived features showed strong predictive capabilities in several contexts: tumor-type prediction of early-stage lung nodules in CT [7], neoadjuvant chemotherapy treatment prediction for breast cancer in MRI [8], low-dose rate radiotherapy treatment response prediction of gastric carcinoma in CT [9], liver cirrhosis detection [10], glioblastoma multiforme differentiation from brain metastases in MRI [11], and grading of COVID-19 pulmonary lesions in CT [12]. The wavelet-derived features are calculated on the image decompositions (four for 2D images, e.g., X-ray, mammography, and ultrasound, and eight for 3D volumes, e.g., CT and MRI), which provide multi-resolution images. This leads to a substantial increase in the number of features that, if not properly managed, can cause predictive systems to incur the curse of dimensionality [13]. In addition, several families of wavelet transforms exist, some improved for noise reduction, others for image compression; in general, however, all provide a multi-resolution representation of the initial image. Considering that the predictive ability of wavelet-derived features can exceed that of the original ones, analyzing and comparing the behavior of wavelet kernels is worthwhile to provide recommendations for their use.
In radiomics, wavelets are often used without paying attention to the type of kernel involved: the commonly followed approach is to use the default kernel for feature extraction, without evaluating a priori which is more suitable for the specific clinical scenario. Few researchers have approached this problem. In [14], wavelet kernels were compared to evaluate the role of CT radiomic features for lung cancer prediction; in [15], a similar approach was used for the diagnosis of colorectal cancer patients in contrast-enhanced CT. Both studies used CT imaging, a modality that can provide images with higher resolution and more defined details, compared with CXR projective imaging.
In this study, an in-depth analysis is performed by comparing different wavelet kernels and by evaluating their impact on the predictive capabilities of radiomic models. As a case study, three machine learning radiomic models were implemented to predict the prognosis of COVID-19 patients from chest X-ray (CXR) images. The Biorthogonal, Coiflets, Daubechies, Discrete Meyer, Haar, Reverse Biorthogonal, and Symlets wavelet families were considered to quantify and compare the predictive performance of the radiomic features. The radiomic features were extracted from the decomposed images, preprocessed, selected, and then used to train several ML models, including random forest (RF), XGBoost (XGB), and support vector machine (SVM). The models and the different wavelets were compared to identify the most predictive kernel.
The remainder of this paper is structured as follows: Section 2 describes the dataset used, the extraction and preprocessing of radiomic features, and the model training. Section 3 provides the results obtained for the wavelet kernels and trained models. Section 4 discusses the experimental findings. Finally, Section 5 outlines the study conclusions.

Dataset Characteristics and Lung Delineation
The dataset used is composed of 1589 CXR images of COVID-19 patients, labeled as 'SEVERE' or 'MILD' disease, according to [16]. This dataset is split into 1103 and 486 patients used for the training and test phases, respectively. This partition was established by the organizing committee of the COVID CXR Hackathon competition [17], who made these datasets available. In more detail, the training dataset (1103 samples) includes 568 severe and 535 mild cases; the test dataset (486 samples) includes 180 severe and 306 mild cases. From a visual evaluation, it was possible to note that the multicentric dataset (collected by six different hospitals) is heterogeneous in terms of image size, quality, gray-level distribution, and origin (some are native digital images, whereas others were obtained by scanning traditional X-ray films). In particular, regarding the size: in the training dataset, the most frequent size is 2336 × 2836 pixels (35.7%); the other images have variable sizes (1396–4280 × 1676–4280 pixels). In the test set, 88.27% of the images are 2336 × 2836 pixels; the others have variable sizes (1648–3027 × 1440–3000 pixels). According to [16], in our study, all the images were resized to 1024 × 1024 pixels. The CXR images were stored in .PNG format with a 96 DPI resolution (pixel size 0.265 × 0.265 mm), and no metadata related to acquisition details are available.
To extract the radiomic features, the regions of interest (ROIs) containing the lungs have to be identified. To this end, a custom MATLAB®-coded tool was implemented to delineate the lung ROIs: in particular, it provides a semi-automatic delineation modality for identifying the maximum elliptical ROI contained within the lung. This choice was motivated by the need to focus attention on the central region of the lung, excluding peripheral zones. Figure 1 shows two segmentation examples. Due to the excessive heterogeneity of the CXR dataset, automatic lung segmentation approaches did not provide satisfactory results. For this reason, it was decided to implement a tool that can easily support clinicians during the image annotation procedure. Specifically, the implemented semi-automatic algorithm automatically identifies the bounding box containing the lung and determines the maximum elliptical ROI contained within it. The clinician can decide whether to accept the result, if satisfactory, or to modify it, changing the orientation and size of the ellipse, to obtain a better fit between the elliptical ROI and the lung area. Because this is a supervised semi-automatic approach (each segmentation is directly validated), and considering the experience of the clinicians who supported this study, no further evaluation step was performed. The resulting ad hoc computer-assisted tool simply and intuitively guides clinicians through the steps of lung segmentation, from CXR image selection to segmentation mask storage.

Wavelet Transform
The widespread use of wavelet transforms, employed in several applications concerning signal and image processing, is due to their ability to capture information in both the frequency and time domains. In this study, the discrete wavelet transform (DWT) was applied to the CXR images. Each image passed through high-pass h_ψ and low-pass h_φ filtering operations, decomposing it into high-frequency (detail) and low-frequency (approximation) components. The decomposition of the image into subimages at different resolutions allows for multi-resolution analysis [18,19]. For this reason, the DWT has found numerous applications in image processing, notably denoising [20] and compression [21,22]. The DWT is computed through two functions, the scaling function and the wavelet function [23]. For a 2D signal f(x, y) of size M × N (such as the CXR images considered in this study), the DWT is defined as:

W_φ(j₀, m, n) = (1/√(MN)) Σ_{x=0}^{M−1} Σ_{y=0}^{N−1} f(x, y) φ_{j₀,m,n}(x, y)  (1)

W_ψ^i(j, m, n) = (1/√(MN)) Σ_{x=0}^{M−1} Σ_{y=0}^{N−1} f(x, y) ψ^i_{j,m,n}(x, y),  i ∈ {H, V, D}  (2)

φ_{j,m,n}(x, y) = 2^{j/2} φ(2^j x − m, 2^j y − n)  (3)

ψ^i_{j,m,n}(x, y) = 2^{j/2} ψ^i(2^j x − m, 2^j y − n)  (4)

where:
• Equation (1) gives the approximation coefficients W_φ at the starting scale j₀, obtained through the scaling function φ;
• Equation (2) gives the detail coefficients W_ψ^i at scale j ≥ j₀ along the horizontal (H), vertical (V), and diagonal (D) directions, obtained through the wavelet functions ψ^i (the scaled and translated basis functions are defined in Equations (3) and (4)).
Finally, the combined application of the scaling function (φ) and the wavelet function (ψ) yields four image decompositions (LL, LH, HL, and HH) for 2D transforms, as indicated in Equations (5)–(8). The same formulas can be adapted and extended to the 3D case, to consider volumetric imaging (e.g., MRI, CT, etc.). The DWT thus depends on the low-pass h_φ and high-pass h_ψ kernels chosen for the decomposition.
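The four decompositions above can be sketched with a minimal NumPy implementation of a single-level 2D Haar transform (a sketch only: the study relies on PyRadiomics' wavelet filtering, this code assumes even image dimensions, and the LL/LH/HL/HH labeling follows the column-then-row filtering order used here, which may differ from other libraries' subband naming conventions):

```python
import numpy as np

def haar_dwt2(image):
    """Single-level 2D Haar DWT returning the LL, LH, HL, and HH
    decompositions. Assumes even image dimensions."""
    img = np.asarray(image, dtype=float)
    # Haar low-pass (pairwise average) and high-pass (pairwise
    # difference) along the columns, downsampled by 2.
    lo = (img[:, 0::2] + img[:, 1::2]) / np.sqrt(2)
    hi = (img[:, 0::2] - img[:, 1::2]) / np.sqrt(2)
    # Same filtering along the rows of each intermediate subimage.
    LL = (lo[0::2, :] + lo[1::2, :]) / np.sqrt(2)  # approximation
    HL = (lo[0::2, :] - lo[1::2, :]) / np.sqrt(2)  # detail
    LH = (hi[0::2, :] + hi[1::2, :]) / np.sqrt(2)  # detail
    HH = (hi[0::2, :] - hi[1::2, :]) / np.sqrt(2)  # detail
    return LL, LH, HL, HH
```

Because the Haar kernel is orthonormal, the total energy (sum of squared pixel values) of the four subimages equals that of the original image, which makes for a convenient sanity check.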
Each wavelet family consists of several kernels; those used here were experimentally selected so that the decomposed images remain qualitatively and visually similar to the original image. Table 1 summarizes the chosen kernels and the respective number of coefficients. Except for Dmey, kernels with a number of coefficients less than or equal to 10 were chosen.

Radiomic Features Extraction
The radiomic features were extracted from the CXR images using PyRadiomics [39], applying the segmentation masks obtained through the semi-automated tool described in Section 2.1. The extracted radiomic features belong to six categories [45].
Moreover, each ROI was filtered considering all the wavelet transform families discussed above. Specifically, for each image and for each wavelet family, the four decompositions (LL, LH, HL, and HH) were calculated, and the features were then extracted, obtaining a total of 93 × 4 = 372 features per wavelet family. Finally, the original radiomic features (without wavelet filtering) were also extracted, to compare the wavelet-derived vs. original predictive capabilities.
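To illustrate how the decompositions multiply the feature count, the sketch below computes a handful of illustrative first-order statistics (a tiny, hypothetical stand-in for the 93 PyRadiomics features used in the study) on each of the four decompositions and prefixes the names in the PyRadiomics style:

```python
import numpy as np

def first_order_features(roi):
    """A few illustrative first-order statistics; a hypothetical
    stand-in for the 93 radiomic features extracted in the study."""
    roi = np.asarray(roi, dtype=float)
    return {
        "mean": float(roi.mean()),
        "std": float(roi.std()),
        "energy": float((roi ** 2).sum()),
        "range": float(roi.max() - roi.min()),
    }

def wavelet_feature_vector(decompositions):
    """Concatenate the features over the LL, LH, HL, and HH
    decompositions, prefixing names as PyRadiomics does
    (e.g., 'wavelet-LL_mean')."""
    feats = {}
    for name, sub in decompositions.items():
        for key, value in first_order_features(sub).items():
            feats[f"wavelet-{name}_{key}"] = value
    return feats
```

With 4 stand-in features per subimage, the vector has 4 × 4 = 16 entries; with the study's 93 features, the same scheme yields 93 × 4 = 372 per wavelet family.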

Radiomic Features Preprocessing
To obtain a subset of non-redundant features with relevant information content, preprocessing and selection were performed with the following steps [2]:
• Near-zero variance analysis: aimed at removing features with low information content. This operation considered a variance cutoff of 0.01: features with a variance less than or equal to this threshold were discarded;
• Correlation analysis: aimed at removing highly correlated features, by means of the Spearman correlation for pairwise feature comparison. For each set of N correlated features, N − 1 were removed. Specifically, the correlation matrix was first calculated and then analyzed according to the following decomposition priority: LH, HL, HH, and LL. As values larger than 0.80 are commonly used for the Spearman correlation [46][47][48][49], a threshold of 0.85 was chosen;
• Statistical analysis: the Mann-Whitney U test was used to assess the difference between the mild and severe distributions, computing the p-value for each feature selected in the previous step. The p-value threshold was set to 0.05.
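The three steps above can be sketched as follows (a simplified version on a plain feature matrix: the decomposition-priority ordering of the correlation step is omitted for brevity, and the sketch assumes more than two features survive the variance filter, so that scipy returns a full correlation matrix):

```python
import numpy as np
from scipy.stats import mannwhitneyu, spearmanr

def preprocess_features(X, y, var_cutoff=0.01, corr_thr=0.85, p_thr=0.05):
    """Sketch of the preprocessing on X (n_samples x n_features) with
    binary labels y (0 = mild, 1 = severe). Returns retained indices."""
    X = np.asarray(X, dtype=float)

    # 1) Near-zero variance: discard features with variance <= cutoff.
    kept = np.where(X.var(axis=0) > var_cutoff)[0]

    # 2) Correlation analysis: for each highly correlated pair
    #    (|Spearman rho| > threshold), drop one of the two features.
    rho = np.abs(spearmanr(X[:, kept]).correlation)
    drop = set()
    for i in range(len(kept)):
        for j in range(i + 1, len(kept)):
            if i not in drop and j not in drop and rho[i, j] > corr_thr:
                drop.add(j)
    kept = [k for idx, k in enumerate(kept) if idx not in drop]

    # 3) Mann-Whitney U test: keep features whose mild/severe
    #    distributions differ significantly.
    return [k for k in kept
            if mannwhitneyu(X[y == 0, k], X[y == 1, k],
                            alternative="two-sided").pvalue < p_thr]
```

Near-constant features fall at step 1, redundant duplicates at step 2, and non-discriminative features at step 3.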

Features Selection and Model Training
To select the most discriminating features, the sequential feature selector (SFS) [50] algorithm was used. The SFS was set in forward mode, which includes one feature at each step, and in floating mode, which allows conditional exclusions after each inclusion, so that more feature subset combinations are considered. This allowed for the selection of the best radiomic signature for each model considered (i.e., RF, SVM, and XGB). The SFS algorithm was applied considering a 10-fold stratified cross validation (CV) and using all the features selected in the preprocessing step.
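A minimal greedy forward-selection loop conveys the idea (a sketch only: the study uses the floating variant of [50] with 10-fold CV accuracy as the criterion, whereas here a hypothetical class-separation score stands in for the CV accuracy, and the conditional-exclusion step is omitted):

```python
import numpy as np

def separation_score(X_sub, y):
    """Hypothetical selection criterion: distance between the two
    class-mean vectors over the pooled standard deviation."""
    m0 = X_sub[y == 0].mean(axis=0)
    m1 = X_sub[y == 1].mean(axis=0)
    return float(np.linalg.norm(m0 - m1) / (X_sub.std() + 1e-9))

def forward_select(X, y, score_fn, max_feats=None):
    """Greedy sequential forward selection: at each step, add the
    feature that most improves score_fn; stop when none improves it."""
    n_feats = X.shape[1]
    selected, best_score = [], -np.inf
    while len(selected) < (max_feats or n_feats):
        candidates = [f for f in range(n_feats) if f not in selected]
        step_score, step_feat = max(
            (score_fn(X[:, selected + [f]], y), f) for f in candidates
        )
        if step_score <= best_score:
            break  # no remaining feature improves the current subset
        selected.append(step_feat)
        best_score = step_score
    return selected, best_score
```

The floating variant would add a backward pass after each inclusion, re-testing whether dropping an earlier feature improves the criterion.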
The experiments were conducted in the Python 3.7 environment, using the scikit-learn 1.0.2 and xgboost 1.2.1 libraries for model training. In particular:
• RF was trained using the bootstrap technique with 100 estimators and the Gini criterion;
• SVM was trained setting the regularization parameter C = 1.0, with the radial basis function as kernel, the coefficient γ = 1/(n_features × σ), the shrinking method [51], and probability estimates enabled to allow the AUROC computation. In addition, for SVM, the features were standardized before training;
• XGB was trained using 100 estimators, a maximum depth of 6, and 'gain' as the importance type. In addition, the binary logistic loss function was used to model the binary classification problem, considering a learning rate of η = 0.3.
Accuracy, sensitivity, specificity, and AUROC were calculated for all three ML models and all the wavelet kernels considered. In addition, for a precise estimation of the trained models, performance in the training phase was calculated considering a 10-fold stratified CV repeated 20 times on the 1103-sample dataset. Subsequently, to evaluate the models' generalization capability on unseen data, a test phase was performed on the 486-sample dataset.
Figure 2 shows the selected radiomic features after each preprocessing step (upper), considering each wavelet decomposition (lower). After the near-zero variance analysis, the Dmey, Bior1.5, and Haar kernels had the most features selected, while, at the end of preprocessing, the number of features was comparable for each kernel used. As expected, a marked overlap between the selected features belonging to the LL decomposition was observed (Table 2). This finding results from the nature of the LL decomposition, which is essentially derived from a resizing. For the other decompositions (LH, HL, and HH), this overlap decreased, because the kernel corrupts the image in different ways among the various wavelet families.
For each kernel, the complete list of radiomic features remaining after the preprocessing phase is reported in Table 2.

Features Selection
The radiomic features remaining at the end of the preprocessing phase were the input of the SFS wrapper method, which was used to select the most discriminating radiomic signature for the SVM, RF, and XGB models. The number of features maximizing accuracy was selected; when different radiomic signatures achieved equal accuracy, the one with the lower standard deviation was chosen. The feature selection stage retains only the discriminating features: in fact, considering all available features (at the end of preprocessing) can lead to performance degradation. Table 3 summarizes the number of features for each wavelet kernel and ML model used for training. On average, a signature consisting of 12-15 features was selected. Additionally, overall, the decision-tree-based models (RF and XGB) performed best with more features than SVM. The details of the features selected for each wavelet kernel and for each ML model are reported in Appendix A (Tables A1-A3).

Predictive Model Results
Tables 4-7 show the accuracy, sensitivity, specificity, and AUROC obtained in the experimental trials, both in the training and testing phases. The metrics reported for training are expressed as mean ± standard deviation, as they are averaged over the 20 repetitions of the 10-fold stratified CV.
To verify whether the values obtained by the CV repetitions are statistically different, the ANOVA test was used: for each model (i.e., XGB, SVM, and RF), the accuracy values obtained for each fold were compared, considering the kernels used as groups, obtaining p-values << 0.05.

Table 6. Specificity values obtained in the training (mean ± standard deviation) and testing phases for each wavelet kernel, with the three ML models considered. Boldface values highlight the three best results.

Focusing on wavelet kernels, Db3, Dmey, and Rbio1.5 proved to be the worst for all three ML models. This assessment resulted from the high imbalance between sensitivity and specificity, suggesting overfitted models (high sensitivity vs. low specificity or vice versa).
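The reported accuracy, sensitivity, and specificity follow the usual confusion-matrix definitions; a minimal sketch, with 'SEVERE' encoded as 1 (the positive class; AUROC additionally requires the predicted probabilities):

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """Accuracy, sensitivity, and specificity from binary predictions,
    taking 1 ('SEVERE') as the positive class."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
    }
```

A model with high sensitivity but low specificity (or vice versa), as observed for Db3, Dmey, and Rbio1.5, predicts one class disproportionately often.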
The AUROC is the most widely used index of global diagnostic accuracy, since higher values correspond to a better selective ability of the biomarkers [52]. Excluding Db3, Dmey, and Rbio1.5 for their unbalanced performance, overlapping AUROC values were obtained in the test phase for the Bior1.5, Coif1, Haar, and Sym2 kernels.
Focusing on ML models, XGB was the least accurate, compared with SVM and RF, which showed comparable accuracy on the test set. Moreover, RF, given the smaller gap between training and testing performance, showed strong generalization ability. In summary, the wavelet-derived features were more predictive than the original ones; this result was confirmed for all three ML models employed. Figure 3 shows the confusion matrices obtained by the three ML classifiers considering Haar as the wavelet kernel.

Table 7. AUROC values obtained in the training (mean ± standard deviation) and testing phases for each wavelet kernel, with the three ML models considered. Boldface values highlight the three best results.

Discussion
Radiomic features can support radiologists by providing a quantitative viewpoint that is complementary to the human visual perspective. Although many researchers have used classical radiomic features (i.e., FO, GLCM, GLRLM, GLSZM, and GLDM), many others have achieved increased predictive power by exploiting high-level features computed from filtered images by means of wavelet transforms, LoG filters, and intensity transformations. In particular, wavelet-derived features have demonstrated their predictive capability in several contexts [7][8][9][11,12]. Despite their widespread use in the literature, comparing the different wavelet kernels is difficult because few studies [14,15] have focused on the problem. Consequently, which wavelet kernel has the stronger discriminatory power is unclear.
In this study, a CXR dataset was used to evaluate the impact of wavelet kernels on radiomic model performance for predicting COVID-19 prognosis. Despite their projective nature and lower level of detail compared with volumetric CT series, CXR images are crucial to sustainably (i.e., cheaply and quickly) support healthcare systems. In this context, the ability of wavelets to provide multi-resolution imaging can be exploited to improve the predictive capabilities of radiographic imaging. The Bior1.5, Coif1, Haar, and Sym2 kernels were the best in terms of predictivity for all three ML models used (i.e., XGB, SVM, and RF). Conversely, Db3, Dmey, and Rbio1.5 showed a serious imbalance between sensitivity and specificity, suggesting overfitted models. Finally, RF showed the strongest generalization ability, demonstrating the least performance degradation between the training and testing phases. Figure 4 shows an example of the Haar transform in the four decompositions (LL, LH, HL, and HH). LL represents the rescaled image, which retains features approximating the original image. The other decompositions (LH, HL, and HH), however, show completely different characteristics, especially in the lung regions. The results obtained encourage the use of higher-level features, such as those obtained by wavelet transforms. The models trained with wavelet-derived features outperformed the models trained with the original features, confirming that the ability to provide a multi-resolution image representation improves the prediction performance of the ML models. The findings of this study, applied to the prognosis of COVID-19, are a starting point for other pathologies analyzed by radiographic imaging. However, further analysis is required when volumetric imaging (e.g., CT and MRI) is used, in which the wavelet transform involves three spatial components.

Figure 4. Four decompositions (LL, LH, HL, and HH) calculated by means of wavelets. In the example, the Haar kernel was used.


Conclusions
Wavelet transforms represent a powerful tool for extracting biomarkers in radiomics. This study showed how the correct choice of the wavelet kernel used for the filtering and extraction of radiomic features improves the classification process with respect to the use of the original features. At present, the debate between shallow learning and deep learning is still open, especially in clinical scenarios, where the number of available samples is often not sufficient to train and use deep architectures. In this context, traditional ML techniques are experiencing a second wave, owing to the increased possibility of obtaining interpretable features and explainable models. In addition, traditional ML does not require a large amount of data to train the model.
Wavelet-derived features must be used wisely. In fact, a reduced dataset limits the number of features that can be extracted and used without incurring the 'curse of dimensionality' [13]. This phenomenon shows that, with a fixed number of training samples, the average (expected) predictive power of a classifier first improves and then degrades as the number of used features increases. The degradation is due to the model overfitting high-dimensional and redundant data, which improves performance on the training data but reduces generalization to unseen test data. Although this study mainly focuses on evaluating how the choice of wavelet kernel impacts predictivity, the interpretability of radiomic features must also be studied. In clinical contexts, explainable models are crucial, so that the models can be clinically validated and compared with the medical literature. Explainability improves the usability and acceptability of artificial intelligence (AI) models, as it allows the users to be involved in the debugging and model building processes [53]. In many intensive decision-based tasks, the interpretability of an AI-based system may emerge as an indispensable feature [54]. However, radiomic features have the disadvantage of being low-level and less abstract than deep features, and thus less informative, which can degrade model performance. This phenomenon can be mitigated by considering higher-level radiomic features, such as wavelet-derived ones, where interpretability is sacrificed for the benefit of performance. While correlating the quantitative value with a clinical meaning is easy for the original radiomic features, for wavelet-derived features this task becomes more complicated, because physicians do not use wavelet transforms in their regular activities. Nevertheless, their use achieves a trade-off between performance and interpretability.
Funding: This work was partially supported by the University of Palermo, Italy. Grant EUROSTART, CUP B79J21038330001, project TRUSTAI4NCDI.
Institutional Review Board Statement: Ethical review and approval were waived due to the retrospective nature of the study.

Informed Consent Statement: Not applicable.
Data Availability Statement: Not applicable.

Conflicts of Interest: The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: