Pilot Study for the Assessment of the Best Radiomic Features for Bosniak Cyst Classification Using Phantom and Radiologist Inter-Observer Selection

Since Bosniak cyst classification is highly reader-dependent, automated tools based on radiomics could help in the diagnosis of the lesion. This study is an initial step in the search for radiomic features that may be good classifiers of benign–malignant Bosniak cysts in machine learning models. A CCR phantom was scanned on five CT scanners. Registration was performed with ARIA software, while Quibim Precision was used for feature extraction. R software was used for the statistical analysis. Robust radiomic features were chosen based on repeatability and reproducibility criteria. An excellent inter-observer correlation between radiologists during lesion segmentation was also required. The ability of the selected features to classify cysts as benign or malignant was then assessed. From the phantom study, 25.3% of the features were robust. For the study of inter-observer agreement (intraclass correlation coefficient, ICC) in the segmentation of cystic masses, 82 subjects were prospectively selected, and 48.4% of the features showed excellent concordance. Comparing both datasets, 12 features were established as repeatable, reproducible, and useful for the classification of Bosniak cysts and could serve as initial candidates for the elaboration of a classification model. With those features, the Linear Discriminant Analysis model classified the Bosniak cysts in terms of benignity or malignancy with 88.2% accuracy.


Introduction
Radiomics, a discipline that has aroused huge interest in personalized medicine in recent years, allows the extraction of information from medical images for further analysis. These mineable data are studied in combination with patient information for clinical decision support [1,2]. Texture analysis allows the quantification of a region of interest by calculating the distribution of voxel gray levels and their relationships, reflecting underlying physiological processes and offering information about the lesion under study.
Different radiomic models in combination with other clinical parameters have been reported for the characterization of renal diseases [3][4][5][6]. The widespread use of medical imaging tests in recent decades led to a significant increase in the incidental detection of renal masses. Most kidney tumors are diagnosed with radiological tests performed for other causes. Hence, imaging plays a key role in the diagnosis and management of these patients. In 1986, Dr. Bosniak established a first classification of cystic renal masses based on CT [7]. This classification, modified in 1996, is now widely accepted and establishes five categories of renal cystic masses, considering types IIF, III and IV as complex cystic lesions [8]. A subclassification of III types into IIIs or IIIn has been proposed, considering that approximately half of the patients with III cysts undergo unnecessary surgeries [9,10].
In the era of personalized management and active surveillance, the differentiation of benign from malignant renal cystic masses is of the utmost importance to establish appropriate patient management [11]. Artificial intelligence tools offer the possibility to develop a useful clinical decision support system based on radiomic features for the characterization of complex cystic renal masses in terms of benignity/malignancy.
Since the diagnosis and management of Bosniak cysts are complex due to the heterogeneous pattern they present in CT images, there is a need to develop diagnostic support tools based on the search for radiomic biomarkers. The use of radiomics in cystic lesions has not been extensively studied to date. In recent years, several papers have supported the hypothesis that quantitative biomarkers could be useful in the classification and prognosis of renal cysts, not only to categorize them but also to provide a percentage of potential aggressiveness [12]. Minisk et al. [13] used six first-order features to classify cysts as benign (I and II) or potentially malignant (considered from IIF onwards) from a single slice of the CT scan. The different machine learning models used demonstrated high specificity and low sensitivity. Dana et al. [14] also analyzed the usefulness of machine learning models using radiomic variables to distinguish between benign and malignant renal cysts regardless of their Bosniak class and using pathological or long-term follow-up criteria. Recent works [15,16] address the same problem in both the arterial and nephrographic phases of the CT scan.
Efforts should be made to standardize radiomic studies and to link radiomics features with clinical endpoints so that the results could be validated and implemented in the clinical routine [17]. In this sense, multicenter studies with phantoms are the initial step to study feature robustness in terms of repeatability and reproducibility for dimensionality reduction and feature selection for the subsequent training of a classification model [18]. Moreover, since small variations in segmentations could lead to significant changes in the value of the features, the inter-observer correlations should be calculated when lesions are marked by different radiologists.
A differential aspect of this work compared to other publications on the topic is that it includes a hybrid feature selection approach, using a phantom across the different scanners to analyze the stability and robustness of the variables and transferring these results to clinical cases of Bosniak cysts. This allows a more detailed understanding of the behavior of the different radiomic features under different conditions (protocol, scanner, etc.) before transferring the problem to Bosniak lesions. Ursprung et al. [17] indicate that one of the initial parameters to take into account in the Radiomics Quality Score (RQS) for radiomic work on renal masses is a phantom study to assess feature validity. According to Azadikhan et al. [19], none of the 87 papers included in their meta-analysis presented a phantom study. This work is focused not only on developing a machine learning model that works as a predictor but also on understanding the role of the features that are used in this pilot study.
This study is a preliminary analysis of the most suitable radiomic variables for the elaboration of classification models of malignancy or benignity of Bosniak cysts. For this purpose, a hybrid approach is performed that reduces the dimensionality based on both the robustness of the variables across several scanners and the correlation in the segmentation of the lesions. Initially, a phantom test-retest, intra-CT and inter-CT feature analysis is presented to select the most robust radiomic features across different machines using a texture phantom on five CT scanners. Then, from the anatomical CT images of the Bosniak cysts, the radiomic variables with the highest inter-observer correlation are selected. Finally, the matching variables between the phantom results and the anatomical images are selected, and their benign/malignant classification capacity is evaluated through machine learning models. The results of this study will help to perform further CT radiomic studies for Bosniak cysts.
Diagnostics 2023, 13, 1384

Phantom Study
A CCR (Credence Cartridge Radiomics) phantom composed of ten inserts of different materials, such as the one employed by Mackin et al. [20], was selected. Complete images of the phantom were obtained with 5 CT scanners from several vendors and models: one Somatom X.cite (Siemens Healthineers, Erlangen, Germany), two LightSpeed VCT (General Electric Healthcare, Chicago, IL, USA), one Somatom Drive (Siemens Healthineers, Erlangen, Germany) and an Ingenuity (Philips, Amsterdam, The Netherlands). Eight different protocols were applied in each CT scanner. Table 1 summarizes the series acquisition parameters. For simplification purposes, the eight protocols are only detailed for the first scanner. In all protocols, the reconstruction kernel was the default or standard one, which has different names depending on the manufacturer and machine. The pitch factor was always set to 1, the image matrix was 512 × 512 and the mA value was fixed without any modulation system. FBP corresponds to filtered back projection. The protocols were selected according to previous literature [21]. However, given the intrinsic characteristics of each scanner, some parameters such as FOV, slice thickness or reconstruction algorithm could not be set to exactly the same values, although an attempt was made to maximize their similarity in all cases.
Ten regions of interest (ROI) were registered and segmented using ARIA 15.1 software from Varian (Siemens Healthineers, Germany), delineating a cylindrical ROI with a volume of 116 cm³ in each insert of the phantom. This method ensured that the same areas were analyzed in the radiomic study across the different protocols and machines.
The Quibim Precision 2.8 platform (Quibim S.L., Valencia, Spain), which is CE-marked and IBSI-compliant, was used for the extraction of the radiomic features. Image intensity was normalized and voxels were isotropically resampled to 1 × 1 × 1 mm³. The software internally removes the outliers and sets the distance to neighbor to 1. The software automatically extracted 105 radiomic features, 14 of them related to shape. Shape-related features were not analyzed in this paper, since cylindrical ROI shapes were imposed during registration. Therefore, 91 radiomic features were calculated for each series (see Supplementary Material for the complete list of the radiomic features).
Repeatability and reproducibility assays were performed with the CCR phantom to assess radiomic feature robustness. For test-retest, protocol 1 of each scanner was acquired twice within a five-minute interval without phantom repositioning. The intraclass correlation coefficient (ICC) in its two-way, agreement form [22] was calculated for each pair of acquisitions. Moreover, the within-subject coefficient of variation (wCV) was also computed [23]. Finally, the coefficient of variation (CV), defined as the ratio between the standard deviation and the mean value, was obtained for each scanner and insert material. R software was employed for the statistical analysis. A feature was considered repeatable when it fulfilled ICC ≥ 0.9 and wCV ≤ 1% in at least 4 of the 5 CT scanners. Threshold conditions in this work were selected to achieve approximately the same number of radiomic features in each group of analysis.
Reproducibility was divided into intra-CT and inter-CT experiments. In intra-CT studies, the eight protocols were compared pairwise for each scanner. The Concordance Correlation Coefficient (CCC) was calculated for each pair of measurements. Intra-CT reproducibility criteria were CCC ≥ 0.9 and wCV ≤ 10% in 5 or more of the comparisons. From those, only the features that were reproducible in four or more CT scanners were selected.
Inter-CT reproducibility studies compared the features extracted from different scanners using the same acquisition protocol. In this case, CCC and wCV values were calculated. Features with CCC ≥ 0.9 were considered reproducible.
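The agreement statistics named above can be illustrated with a minimal standard-library Python sketch. It is not the R code used in the study. The wCV definition below (within-subject standard deviation of paired acquisitions divided by the grand mean) is one common convention and is an assumption here, since the text only cites [23] for it. The CCC follows Lin's usual formula. The function names, and the idea of passing precomputed per-scanner ICC values to `is_repeatable`, are hypothetical.

```python
from math import sqrt
from statistics import fmean


def within_subject_cv(test, retest):
    """wCV for paired measurements (assumed definition: within-subject SD / grand mean)."""
    n = len(test)
    # For two repeats, the within-subject variance of one pair is (a - b)^2 / 2.
    within_var = sum((a - b) ** 2 / 2 for a, b in zip(test, retest)) / n
    grand_mean = fmean(list(test) + list(retest))
    return sqrt(within_var) / abs(grand_mean)


def concordance_cc(x, y):
    """Lin's concordance correlation coefficient using population (1/n) moments."""
    n = len(x)
    mx, my = fmean(x), fmean(y)
    var_x = sum((a - mx) ** 2 for a in x) / n
    var_y = sum((b - my) ** 2 for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    return 2 * cov / (var_x + var_y + (mx - my) ** 2)


def is_repeatable(icc_by_scanner, wcv_by_scanner, min_scanners=4):
    """Repeatability rule from the text: ICC >= 0.9 and wCV <= 1% in at least 4 of 5 scanners."""
    passed = sum(
        1
        for icc, wcv in zip(icc_by_scanner, wcv_by_scanner)
        if icc >= 0.9 and wcv <= 0.01
    )
    return passed >= min_scanners
```

In this sketch the inputs to `within_subject_cv` and `concordance_cc` would be the values of one feature over the ten inserts in two acquisitions. A constant offset between two acquisitions lowers the CCC even when the values are perfectly correlated, which is why it is suited to comparing protocols or scanners.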
Finally, features that fulfilled the three criteria of test-retest, intra, and inter-CT were considered robust. A sketch of the experiment is depicted in Figure 1.
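The final robustness filter is a set intersection of the three criteria, which can be sketched as follows. The feature names are hypothetical placeholders and are not taken from the study's feature list.

```python
def robust_features(repeatable, intra_ct, inter_ct):
    """Return the features that pass the test-retest, intra-CT and inter-CT filters."""
    return sorted(set(repeatable) & set(intra_ct) & set(inter_ct))


# Hypothetical feature names, for illustration only.
repeatable = {"firstorder_mean", "firstorder_entropy", "glcm_contrast"}
intra_ct = {"firstorder_mean", "firstorder_entropy", "glrlm_run_entropy"}
inter_ct = {"firstorder_entropy", "firstorder_mean", "ngtdm_busyness"}

robust = robust_features(repeatable, intra_ct, inter_ct)
print(robust)  # ['firstorder_entropy', 'firstorder_mean']
```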

Inter-Observer Correlation Coefficient of Radiomic Features in Bosniak Lesions
Subjects were enrolled from June 2019, and 82 cases were selected (mean age 67.86 ± 12.46 years, 51 men). Of these, 41 cysts were diagnosed by expert radiologists in their clinical routine as IIF type, 16 were diagnosed as III type and 26 were diagnosed as IV, following the updated classification proposed in 2019. Inclusion criteria were patients with cystic renal masses who underwent a CT scan to characterize these masses and who underwent surgery due to the characteristics of the mass, patients with cystic renal masses who underwent biopsy for pathological diagnosis, or patients meeting active surveillance criteria. Exclusion criteria were patients who did not undergo complete multiphase CT due to technical error, patients with renal insufficiency, patients allergic to iodine contrast, or patients whose cystic renal masses were not primary. Bosniak IIF cysts that had not changed over the course of two years were considered benign. All Bosniak III and IV cysts included had anatomical pathology confirmation of their benignity or malignancy. The database contains 49 Bosniak cysts considered benign and 33 considered malignant. The demographic information for each group and the anatomopathological results for malignant cysts are summarized in Table 2.
Figure 1. Flowchart of repeatability and reproducibility analysis of phantom assays and its comparison with Bosniak cyst feature statistical analysis.

In Table 2, qualitative variables are presented as frequency and percentage. For the quantitative variables, the Kolmogorov-Smirnov test was performed to determine whether they followed a normal (parametric) distribution. Age, a parametric variable, is presented as mean and standard deviation. Cyst size, which follows a non-parametric distribution, is presented as median and limit values. For the analysis of statistical significance (considering 0.05 as threshold) in the case of qualitative variables, the chi-square test was used.
To compare the means of age according to the benign-malignant group, Student's t-test was performed. Lastly, to compare the maximum size of the cysts between both groups, as it was a non-parametric variable, the Mann-Whitney U test was performed. No statistically significant differences were found between the two groups. Note that, as reported by Terada et al. [24], men are more likely to suffer from renal cysts than women.
The intravenous contrast phase was selected in each case, as it is the one employed by the clinicians for the classification. Representing a clinical routine sample, series from these subjects were acquired with the same CT scanners employed for the phantom study. This ongoing prospective study received the approval of the Local Ethics Committee, and all participants signed the informed consent.
To assess the reliability of the radiomic features, three radiologists with different degrees of expertise (resident, junior, and senior) performed the volumetric segmentation of the lesions, and radiomic features were extracted using the same software as in the phantom study (Quibim Precision 2.8 platform). The ICC among the three radiologists was calculated for each feature, and only features with ICC ≥ 0.90 were selected.
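As an illustration of the inter-observer filter, the following standard-library Python sketch computes a single-rater, absolute-agreement ICC from a two-way ANOVA decomposition, often written ICC(A,1) or ICC(2,1). That this is the exact ICC form used for the radiologists is an assumption; the text specifies the two-way agreement form only for the phantom test-retest analysis. The function name and data layout are hypothetical.

```python
from statistics import fmean


def icc_agreement(ratings):
    """ICC(A,1): two-way model, absolute agreement, single rater.

    `ratings` is a list of rows, one per lesion; each row holds the value
    of one radiomic feature obtained from each rater's segmentation.
    """
    n, k = len(ratings), len(ratings[0])
    grand = fmean(v for row in ratings for v in row)
    row_means = [fmean(row) for row in ratings]
    col_means = [fmean(row[j] for row in ratings) for j in range(k)]

    # Sums of squares for lesions (rows), raters (columns) and residual error.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for row in ratings for v in row)
    ss_err = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


def select_reliable(feature_ratings, threshold=0.90):
    """Keep the features whose ICC across raters reaches the threshold."""
    return [name for name, r in feature_ratings.items() if icc_agreement(r) >= threshold]
```

Because this form penalizes systematic offsets between raters, a feature whose values are consistently shifted for one radiologist can fall below the 0.90 threshold even if the rank order of lesions is preserved.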

Classification Capacity of the Selected Features for Bosniak Cyst Prediction
The selected radiomic features were those resulting from the conducted phantom study and from applying the ICC ≥ 0.90 criterion to the cyst segmentation, obtaining a total of 12 out of 91 features. For comparison purposes with phantom results, shape features were not considered in this study.
To test the ability of the 12 features to distinguish between the benignity and malignancy of cysts included in the database, data were split into a train group (65/82) and a test group (17/82), and different machine learning models were trained. The data were standardized by subtracting the mean and scaling to unit variance. The pipeline was composed of the following models and their corresponding hyperparameters: Linear Discriminant Analysis (LDA), Support Vector Machine (SVM) and Gaussian Naïve Bayes (GNB). To optimize the parameters of the different models, a grid search space was defined, using the f1-score as the scoring parameter. The number of cross-validation folds was set to 10. The performance of each model was evaluated, and the best performing hyperparameter combination was selected.
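The preprocessing and cross-validation steps can be sketched in standard-library Python as below. This is an illustration, not the study's pipeline: the software library used is not named in the text, fitting the scaler on the training rows only is an assumption, and the sketch omits shuffling, stratification and the classifiers themselves. All function names are hypothetical.

```python
from math import sqrt
from statistics import fmean


def fit_scaler(rows):
    """Per-feature mean and standard deviation estimated from the training rows."""
    cols = list(zip(*rows))
    means = [fmean(c) for c in cols]
    # Population SD; constant columns get 1.0 to avoid division by zero.
    stds = [sqrt(fmean((v - m) ** 2 for v in c)) or 1.0 for c, m in zip(cols, means)]
    return means, stds


def transform(rows, means, stds):
    """Subtract the mean and scale to unit variance, feature by feature."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def kfold_indices(n_samples, n_folds=10):
    """Yield (train_idx, val_idx) pairs for contiguous folds of near-equal size."""
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        val = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, val
        start += size
```

With 65 training cases and 10 folds, this split gives five validation folds of 7 cases and five of 6. A grid search would then score each hyperparameter combination by its mean f1-score over the folds.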

Phantom Repeatability
Test-retest was performed on CT series acquired using protocol 1 of each scanner. After rigid registration, segmentation and feature extraction, the radiomic features were obtained. ICC and wCV were calculated for each pair of acquisitions. Results are plotted in Figure 2 and summarized in Table 3.
The ICC value for most of the features was excellent in the five scanners. However, when wCV was calculated, differences among the machines were observed. Scanners 1 and 5 were the most repeatable machines, since all radiomic features presented an excellent ICC, and more than 95% of them had a wCV less than 10%. On the contrary, scanners 3 and 4 were the least repeatable ones. Not all radiomic features reached an excellent ICC in these machines, and the number of features with wCV above 1% was higher for these scanners compared to the others.

Table 3. Intraclass correlation coefficient (ICC) and within-subject coefficient of variation (wCV) values classified for the 91 radiomic features computed. Repeatability analysis was performed over the five CT scanners using protocol 1 of each machine. Test and retest acquisitions were obtained consecutively and without repositioning.

Figure 2. ICC and wCV histograms for the test-retest analysis of the radiomic features for the five CT scanners using protocol 1. Panels: test-retest ICC and test-retest wCV; x-axis: scanners 1 to 5; y-axis: percentage (0 to 100%).

Regarding the radiomic variables, features with excellent ICC and wCV ≤ 1% in at least four of the five scanners were considered repeatable. This resulted in a set of 39 repeatable features.
For the test-retest experiments, the feature CV for each material and scanner was also calculated. Detailed figures can be found in the Supplementary Material. Regarding the materials, wood was the most repeatable one through different scanners, and polyurethane the least repeatable one.

Phantom Intra-CT Reproducibility
Reproducibility analysis was repeated on the five CT scanners to identify the most reproducible machines and to select the most reproducible features when working with different protocols. Protocols 2 and 3 slightly modified the voltage and tube current values, respectively. Protocol 4 involved a sharp increase in tube current. In protocols 5 and 6, the slice thickness was halved and doubled, respectively. In protocol 7, the field of view was changed and, finally, in the last protocol, the iterative reconstruction was disabled. CCC and wCV values were calculated, and the results are plotted in Figures 3 and 4. Tables related to these data are available in the Supplementary Material.
In general, a smooth variation of current and voltage conditions (protocol 1 vs. 3 and 1 vs. 2, respectively) had no significant impact on the reproducibility of radiomic features. However, when this change was severe (protocol 1 vs. 4), the number of features with high CCC and low wCV diminished. Slice thickness modification (protocol 1 vs. 5 and 6) was also a source of variation, with higher reproducibility when the thickness was reduced than when it was increased. The use or lack of iterative reconstruction (protocol 1 vs. 8) also modified the number of reproducible features in a similar way to the tube parameters. The modification of the field of view (FOV) (protocol 1 vs. 7) led to the lowest reproducibility of features for all the scanners.
Reproducible features were chosen following the criteria described in the methodology, and the results are presented in Table 4. The Siemens Somatom X.cite was the most reproducible scanner regarding parameter modifications (44 of 91 reproducible features) and the GE LightSpeed VCT-1 scanner was the least reproducible, with only 24 reproducible variables. The final dataset of intra-CT reproducible features, with data from the 5 scanners, was composed of 35 out of 91 radiomic features.

Table 4. Intra and inter-CT reproducibility results of the five different scanners regarding the 91 radiomic features analyzed. In all inter-scanner comparisons, protocol 1 of each CT was employed.

In comparison with the test-retest study, the wCV condition was relaxed from ≤1% to ≤10%, obtaining a final dataset with approximately the same number of features in each section, which allows a final cross-comparison of the results. This relaxation of the constraints is logical, since more feature variation is expected when CT parameters are modified than when the same image is acquired under the same conditions.

Phantom Inter-CT Reproducibility
The radiomic feature reproducibility between different scanners was analyzed by calculating the CCC value of the radiomic variables between different scan acquisitions using protocol 1. Results are summarized in Table 4 and depicted in Figure 5. Figure 5 shows that the most reproducible comparison is between the GE LightSpeed VCT-1 and Philips Ingenuity machines (Scanner 2 vs. 5), where 95.6% of the features presented an excellent CCC. Comparison between scanners of the same model and vendor (GE LightSpeed VCT) resulted in 78.1% reproducible features. Interestingly, the comparison between both Siemens scanners was the least reproducible, with only 47.2% of the features showing an excellent CCC. However, although the vendor was the same for both scanners, the model of the machine was different.
For the selection of the inter-CT variables, 41 out of the 91 features (45.05%) were considered reproducible when working between different CT scanners.
wCV was also computed in this comparison. The results are shown in the Supplementary Material. Considering that in this situation the phantom was not only repositioned but also moved to another machine, the wCV variability of the features was very high in all cases and was not statistically suitable for feature filtering. Nevertheless, the inter-CT wCV values showed a tendency similar to that of the CCC. In particular, the comparison between the two GE LightSpeed VCT scanners and the comparison between GE LightSpeed VCT-1 and Philips Ingenuity were the most reproducible.

Inter-Observer Correlation Coefficient of Radiomic Features in Bosniak Lesions
From the segmentation by three radiologists with different degrees of experience with Bosniak lesions, it was found that 44/91 variables showed an ICC greater than or equal to 0.90, which was considered excellent. Next, we compared this dataset of features with the one obtained in the previous section, where the most robust features were collected across the different CT scanners. This resulted in 12 matching features that can be considered candidates for the development of machine learning models for Bosniak cyst classification. Table 5 shows the two lists of variables and, in bold, those that coincide in both datasets.

Table 6 shows the results obtained using the variables previously selected with the criteria of the phantom and ICC study. The main classification metrics (accuracy, precision, recall, AUC and f1-score) were calculated. From the numerical data of Table 6, it can be concluded that the model with the best performance is the Linear Discriminant Analysis (LDA), with an accuracy of 88.2%.

To evaluate the performance of the classification models, the Receiver Operating Characteristic (ROC) curve was calculated. Figure 6 represents the ROC curve from the training and test datasets for each of the models included. This curve shows the true positive rate (TPR), represented on the y-axis, against the false positive rate (FPR), represented on the x-axis. Additionally, the area under the curve (AUC) is shown. Figure 6 shows that the most efficient model is the Linear Discriminant Analysis, with an AUC = 0.90. Therefore, it can be concluded that the chosen features show good distinguishing ability between the malignancy and benignity classes in this model.
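For reference, the metrics reported above can be computed as in the following standard-library Python sketch. It assumes binary labels with 1 for malignant and 0 for benign, which is a convention chosen here and not stated in the text. The AUC is computed through its rank interpretation (the probability that a randomly chosen positive case scores higher than a randomly chosen negative one), which equals the area under the ROC curve. Function names are hypothetical.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and f1-score for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def roc_auc(y_true, scores):
    """AUC as the fraction of positive-negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

As a plausibility check, an accuracy of 88.2% on a test group of 17 cases is what 15 correct predictions out of 17 would give (15/17 ≈ 0.882), although the confusion matrix itself is not reported in this text.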

Discussion
In this work, an exhaustive reproducibility and repeatability analysis of feature stability in textural phantoms was presented, obtaining the most robust radiomic features for five CT scanners from different vendors. Overall, 25.3% (23/91) of the studied features were classified as repeatable and reproducible. Moreover, the features that showed an excellent correlation among the radiologists who segmented the Bosniak lesions (44/91) were identified. Features that fulfill both conditions, namely stability and robustness across CT acquisitions and good inter-observer correlation, were selected, obtaining a total of 12. Finally, the benign/malignant predictive capacity of the selected features for cystic renal masses was evaluated, finding that they could be good classifiers for this type of lesion. This is an initial study that aims to serve as a basis for the development of future classification models. In particular, entropy-related features have already been shown to be useful in differentiating renal pathologies [17].
Phantom studies have several advantages in comparison with anatomic images, as there is no subject variability [21], and errors in image acquisition due to patient movement are also eliminated. Repositioning in repeatability experiments was avoided to minimize sources of variability [25]. Furthermore, the same feature extraction platform was employed in phantom and cyst experiments, ensuring no possible discordance between feature values related to software design. Studies have demonstrated that the feature calculation platform has a direct impact on radiomics variability, since different types of software return different feature values depending on code design, especially on the utilization of different preprocessing methodologies [26,27].
In the intra-CT reproducibility analysis, it was found that image-related parameters, such as FOV, had a significant impact on feature variability, as has been previously described [28,29], in particular those related to reconstruction more than the acquisition ones [18]. All series were resampled to isotropic voxels before feature calculation, as this has been shown to be a useful pre-processing technique for feature stability across different protocols [30][31][32], but it seemed clearly ineffective when the FOV was modified. In this work, the modification of the field-of-view value had a direct impact on the robustness of radiomic features, as occurred in other studies [33]. In addition, several CT scanners were used in this work, as multicenter studies increase the statistical power and the reliability of the results, supporting reproducibility across different hospitals [34].
This hybrid approach in the selection of the classifiers, combining the phantom and patient data, allows us to reduce the dimensionality of the model and maintains the features that are robust through the scanners and the segmentation of the lesions of the pathology. Regarding cyst delimitation, one advantage of this work is that a three-dimensional volumetric segmentation of the region of interest was performed. Other works selected only one representative slice of the series for the feature extraction with the subsequent limitations of this approach [5].
The recent version of the Bosniak classification included new updates for CT and MRI complex renal cyst diagnosis [35]. Some of these considerations are highly observer-dependent, such as septa and nodule measurement. Image markers such as radiomic features could be helpful in the classification of these complex cysts and could be employed for the development of machine learning models for these purposes. In this work, 12 radiomic features were found to be good candidates for the elaboration of classification models that could predict the benignity or malignancy of the lesion [36]. In particular, the LDA model is a good tool for separating groups, as in this case. Some sets of features linked to matrices such as NGTDM were discarded as they were neither robust nor informative. Some studies in the literature already support the importance of first-order features as well as entropy characteristics from high-order textures [18], as was found in this study for Bosniak cysts.
Although this is a pilot study with a limited number of cases, it is an initial step to determine which radiomic features may be good candidates for the elaboration of a model that supports the clinical decision on whether the cyst is benign or malignant. As indicated by Krishna and Schieda [12], one of the limitations of the Bosniak 2019 radiological classification is that its final objective is to classify the type of cyst (I-II-IIF-III-IV) and not its malignancy or benignity. Radiomic studies in this direction could provide, through imaging, not only the classification but also a prediction of the aggressiveness of the lesion. This would have a direct impact on patient management and allow better stratification of lesions that are candidates for surgery. A greater number of cases and external validation would allow these results to be brought closer to the clinical routine at a future stage.
This study presents some limitations. Shape features were not analyzed, as cylindrical segmentation was imposed in the phantom acquisitions and there exists inherent feature variability due to the extraction software [37]. In Bosniak cyst analysis, these features were not considered since no comparison with phantom results was possible. Nevertheless, they could be informative for diagnostic purposes, as has been demonstrated in other anatomical areas [38]. Filtering conditions for variable selection were chosen in order to obtain a similar number of robust features in each subset, allowing a final comparison among test-retest, intra-CT and inter-CT stable features. On the other hand, if the filtering conditions were loosened, the number of selected variables would be excessive for the further training of a benignity/malignancy diagnosis tool, and a high-dimensionality and redundancy problem would occur [39]. Bosniak cases were not equally distributed across the five CT scanners, as they were a representation of clinical routine, where most renal cysts are discovered incidentally during imaging [8]. Nevertheless, benign and malignant cysts accounted for 59.8% and 40.2%, respectively, of the whole dataset. Moreover, the number of subjects analyzed in this work is limited. A higher number of enrolled participants would be ideal for the next steps in this research. Nevertheless, this study is presented as a starting point for further investigation in Bosniak classification using a radiomics model. Further work related to shape features should be carried out, since some of those features could be useful in the elaboration of an improved model.
There are several initiatives related to the standardization of the radiomics pipeline, such as IBSI or EIBALL [40,41]. This study aimed to serve as the starting point for the development of a diagnostic tool based on artificial intelligence that follows the recommended steps to obtain a proper radiomics quality score [17,42]. In this sense, imaging protocols were made public, multiple volumetric segmentations of the lesion were performed, studies with phantoms were carried out to assess the robustness of features, and a dimensionality reduction was proposed when working with complex renal cysts. Overall, 12 features were determined as candidates for classification model training, with a special interest in first-order and entropy ones.