Radiomics, i.e., the extraction of various texture features from radiologic images, is an emerging and rapidly evolving technique. The aim is to detect subtle changes in imaging data imperceptible to the human eye [1].
After image acquisition, preprocessing, and segmentation of a lesion or a tumor, different subgroups of radiomic features can be extracted: shape features that describe the shape and geometry [2], first-order features that provide information on global characteristics of the gray level intensity distribution [4] without considering spatial relationships [3], as well as second- and higher-order features, which are derived using complex functions to describe the spatial arrangement of voxel intensity values [2].
Explorative analysis and modeling of these data attempt to correlate radiomic features with prediction targets, such as clinical endpoints and genomic features [6]. Especially for numerous malignant entities and solid tumors, e.g., brain tumors [7], head and neck cancer [8], renal tumors [9], or prostate cancer [10], correlations between radiomic features, histopathology, and outcome have been shown recently. Although there is a growing body of data on the application of radiomics as “quantitative imaging biomarkers” [11], the reliability of the data is not yet fully assured [12]. However, reproducibility is an essential property of a quantitative biomarker [13].
Radiomic feature extraction from medical images requires segmentation of the volume of interest. Variability in the segmentation process can already bias radiomic features [14]. Besides the segmentation process, voxel size in computed tomography (CT) impacts a substantial number of radiomic features [17]. Additionally, inter-scanner and inter-vendor variability of numerous radiomic features has been reported for CT [18] and MR imaging [19]. Overall, published data suggest that all steps prior to a radiomics analysis can affect feature values, including image acquisition, preprocessing, reconstruction algorithms, and applied software [6], increasing the demand for a standardization of radiomics studies [11]. Additionally, Berenguer et al. suspected CT-based radiomics of being fundamentally influenced by noise [24], which Lu et al. were recently able to disprove for individual features [25]. As further improvements, Van Timmeren et al. suggested test-retest strategies to select robust radiomic features [26], and Kalpathy-Cramer et al. recommended training on phantoms to counteract variations due to different segmentation [16].
Additionally, it has already been shown that the size of the segmented volume influences radiomic features: several first-order features (energy, total energy, root mean squared) are confounded by volume, because, in generalized terms, the pixels’ gray levels in a region of interest (ROI) are summed, i.e., a ROI with more pixels leads to a higher feature value and vice versa [2]. Additionally, the first-order feature variance is reported to be influenced by ROI size [27]. Therefore, these features cannot reliably distinguish between different pathologies unless they are derived from identically sized ROIs. For example, in a study investigating radiation-induced lung disease in CT scans, Choi et al. found that only 16 of 27 texture features were robust across different tumor sizes [28]. Roy et al. found 16 radiomic features dependent on tumor size in breast cancer lesions and suggested normalization for volume dependency to be used for the confounded features [29]. Traverso et al. investigated volume-confounding in 841 radiomic features derived from lung and head and neck tumors and found nearly 30% strongly correlated with tumor volume [30]. Thus, the question arose of which features remain stable when the ROI size varies.
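The volume dependence described above can be illustrated with a minimal Python sketch. It assumes the PyRadiomics definitions of energy (sum of squared intensities), total energy (energy scaled by the voxel volume), and RMS (square root of the mean squared intensity) with a voxel array shift of zero. The simulated intensities, the noise level, and the voxel counts are hypothetical and are not taken from this study.

```python
import math
import random


def first_order(values, voxel_volume_mm3=1.0):
    """Compute a few first-order features for a flat list of ROI intensities.

    Assumes a voxel array shift of zero; voxel_volume_mm3 is a hypothetical value.
    """
    n = len(values)
    energy = sum(v * v for v in values)  # sum of squares: grows with voxel count
    return {
        "mean": sum(values) / n,                    # normalized by voxel count
        "rms": math.sqrt(energy / n),               # normalized by voxel count
        "energy": energy,                           # not normalized
        "total_energy": voxel_volume_mm3 * energy,  # energy scaled by voxel volume
    }


def simulate_roi(n_voxels, rng, level=100.0, noise_sd=5.0):
    """Homogeneous material: constant signal level plus Gaussian noise (illustrative)."""
    return [rng.gauss(level, noise_sd) for _ in range(n_voxels)]


rng = random.Random(0)
small = first_order(simulate_roi(50, rng))   # small ROI, hypothetical voxel count
large = first_order(simulate_roi(400, rng))  # large ROI of the same material

for name in ("mean", "rms", "energy", "total_energy"):
    print(f"{name:>12}: small={small[name]:.1f}  large={large[name]:.1f}")
```

In this toy setting, mean and RMS stay close to the signal level for both ROIs, whereas energy and total energy scale roughly with the number of voxels even though the simulated material is identical.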
Therefore, this study aimed to identify radiomic features in CT and MR images that remain stable when extracted from ROIs of variable size, using a homogenous phantom. In this way, we intended to observe solely the effects of the different ROI sizes on the features, as the phantom’s structure remains identical throughout.
3.1.1. T1w MR Images
Of the 18 first-order features, RMAD, entropy, range, uniformity, energy, and total energy showed a significant difference for all pairs of ROI sizes in both ROImm and ROIpx. In contrast, no significant differences were observed for mean, median, RMS, 10th percentile, and skewness. The remaining first-order features were different in at least one combination of ROI sizes (mm or pixel).
Of the 24 GLCM features, ten features were significantly different in each possible combination. Seven out of fourteen GLDM features were different in all possible combinations; the others showed differences for at least one pair. Nine of sixteen GLRLM features were different for all possible pairs, while all of the features showed differences for at least one pair. Of the 16 GLSZM features, nine features were different in all combinations, with only one feature (small area emphasis) showing no differences in any combination. Three of five NGTDM features were different in all combinations, while all of the features showed differences for at least one pair.
In total, in T1w images, out of 93 analyzed features, 44 were different in every pairing and 43 in at least one pairing; only 6 features did not show differences in any combination.
For the total of 558 ROI pairs (4,8 and 4,16 and 8,16 in mm or px: six combinations for 93 features), we observed 221 significant differences in ROImm and 185 in ROIpx.
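The count of 558 comparisons follows directly from the study design and can be reproduced with a short Python sketch. The variable names are illustrative; the ROI sizes, units, and feature count are those stated in the text.

```python
from itertools import combinations

roi_sizes = (4, 8, 16)   # ROI diameters used in the study
units = ("mm", "px")     # ROIs were defined in millimeters and in pixels
n_features = 93          # number of analyzed radiomic features

size_pairs = list(combinations(roi_sizes, 2))        # all pairs of ROI sizes
comparisons_per_feature = len(size_pairs) * len(units)
comparisons_per_unit = len(size_pairs) * n_features  # comparisons within mm or within px
total_comparisons = comparisons_per_feature * n_features

print(size_pairs)               # [(4, 8), (4, 16), (8, 16)]
print(comparisons_per_feature)  # 6
print(comparisons_per_unit)     # 279
print(total_comparisons)        # 558
```

Each unit therefore contributes 279 of the 558 comparisons, which is the denominator against which the reported counts of significant differences per unit can be read.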
3.1.2. T2w TIRM Images
Of the first-order features, uniformity, RMAD, MAD, IQR, variance, entropy, range, total energy, and energy were significantly different for all pairs of ROI sizes. Mean, median, RMS, 10th percentile, 90th percentile, and skewness showed no differences for any combination. The remaining first-order parameters were significantly different in at least one combination.
A total of 13 GLCM, 9 GLDM, 11 GLRLM, 9 GLSZM, and 2 NGTDM features were different in all combinations, while all of the features showed differences for at least one pair.
In summary, in the T2w TIRM images, a significant difference occurred in 53 of 93 features in all combinations and in 34 features in at least one pair. Only 6 features (all of them first-order features) showed no significant differences in any combination.
There were 221 significant differences in ROImm and 212 in ROIpx.
3.1.3. CT Images
Compared to the MR images, only a few first-order features were significantly different between ROI sizes: range, energy, and total energy showed significant differences in all possible ROI combinations; 10th percentile, variance, MAD, and minimum showed differences in at least one combination. The features mean, median, RMS, entropy, uniformity, skewness, 90th percentile, RMAD, and IQR did not show differences in any pair.
Of 24 GLCM features, 13 features were different in at least one combination. The remaining 11 GLCM features showed no differences. A total of 6 GLDM and 5 GLSZM features showed significant differences in all possible combinations, and one feature showed significant differences for at least one pair. Ten GLRLM and two NGTDM features were significantly different in all combinations, while all of the features showed differences for at least one pair.
In total, 21 of 93 CT-derived features did not show significant differences in any pair. Twenty-six features were significantly different in all combinations, and 44 features in at least one combination.
We found 128 significant differences for ROImm and 137 for ROIpx.
Exemplary boxplots of the features mean, median, RMS, entropy, and uniformity for T1w, T2w TIRM, and CT images are shown in the corresponding figure.
OCCCs4–16 showed excellent agreement for the features mean, median, and RMS extracted from T1w and T2w TIRM MR images for ROIsmm and ROIspx.
The features 90th percentile and 10th percentile showed excellent agreement for T1w ROIspx but not for ROIsmm. In T2w TIRM MR images, the 10th percentile showed excellent agreement also only for ROIspx.
For the OCCCs8,16, agreement was consistent for MR images, except that the 10th percentile in T2w TIRM no longer showed excellent agreement in either ROIsmm or ROIspx, despite a high agreement of 0.88 in ROIspx.
None of the first-order features derived from CT showed excellent agreement based on OCCCs4–16 and OCCCs8,16. Median showed the best agreement with 0.8 in ROIsmm.
Considering second-order and higher-order features, none of the features, either extracted from CT or from MR images, showed excellent agreement.
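To make the agreement measure more concrete, the following Python sketch computes an overall concordance correlation coefficient as a weighted average of pairwise Lin concordance coefficients, which is one common formulation. Whether the software used in this study applied exactly this estimator (including the use of sample rather than population variances) is an assumption, and the example values are hypothetical.

```python
from itertools import combinations
from statistics import fmean


def _cov(x, y):
    """Sample covariance of two equally long series (variance when x is y)."""
    mx, my = fmean(x), fmean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)


def ccc(x, y):
    """Lin's concordance correlation coefficient for two paired series."""
    return 2 * _cov(x, y) / (_cov(x, x) + _cov(y, y) + (fmean(x) - fmean(y)) ** 2)


def occc(series):
    """Overall CCC across several paired series.

    Sums the pairwise numerators and denominators of Lin's CCC, which is
    equivalent to averaging the pairwise CCCs with weights equal to their
    denominators.
    """
    num = 0.0
    den = 0.0
    for x, y in combinations(series, 2):
        num += 2 * _cov(x, y)
        den += _cov(x, x) + _cov(y, y) + (fmean(x) - fmean(y)) ** 2
    return num / den


# Hypothetical feature values from four acquisitions for three ROI sizes
roi_4 = [101.2, 99.8, 100.5, 102.1]
roi_8 = [101.0, 99.9, 100.7, 101.8]
roi_16 = [101.1, 100.0, 100.6, 101.9]

print(f"OCCC 4-16: {occc([roi_4, roi_8, roi_16]):.3f}")
print(f"OCCC 8,16: {occc([roi_8, roi_16]):.3f}")
```

With only two series, the overall coefficient reduces to Lin's pairwise coefficient. Both a systematic offset between ROI sizes and a loss of correlation lower the value, so a feature can show no significant difference in a rank test and still fail to reach excellent agreement.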
Correlation plots for the first-order feature mean for ROI sizes 8 and 16 pixels are shown in the corresponding figure. 2D correlation plots of all included features for ROI sizes 8 and 16 mm and px are provided in the Supplementary Materials (see Figure S4). Figure 6 shows Bland–Altman plots for the first-order feature RMS. Bland–Altman plots of all included features are provided in the Supplementary Materials (see Figure S5).
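For readers less familiar with Bland–Altman analysis, the quantities usually drawn in such a plot (the mean difference and the 95% limits of agreement) can be sketched in Python as follows. The function name and the example values are hypothetical, and the conventional factor of 1.96 is assumed rather than taken from this study.

```python
from statistics import fmean, stdev


def bland_altman(x, y):
    """Return the bias and the 95% limits of agreement for paired measurements."""
    diffs = [a - b for a, b in zip(x, y)]
    bias = fmean(diffs)         # mean difference between the two ROI sizes
    spread = 1.96 * stdev(diffs)  # assumes approximately normal differences
    return bias, (bias - spread, bias + spread)


# Hypothetical RMS values from five acquisitions for two ROI sizes
rms_roi_8 = [100.4, 101.1, 99.7, 100.9, 100.2]
rms_roi_16 = [100.6, 100.9, 99.9, 100.7, 100.3]

bias, (lower, upper) = bland_altman(rms_roi_8, rms_roi_16)
print(f"bias={bias:.3f}, limits of agreement=({lower:.3f}, {upper:.3f})")
```

A bias near zero with narrow limits indicates that the two ROI sizes yield interchangeable feature values in absolute terms.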
3.3. Intra- and Interrater Agreement
Intra- and interrater agreement was calculated for first-order features to rule out reader dependency of results.
Except for skewness and kurtosis, both intra- and interrater agreement were excellent, demonstrating that the obtained results are not attributable to the individual reader. Agreement was moderate for skewness and poor for kurtosis. This may be attributable to intrinsic properties of these parameters, which are known to be prone to outliers [12].
3.4. Summary of the Results
Compared to the CT-derived features, more MR-derived features were significantly different between ROI sizes in the MWU-test. Most of the few MR-derived features without significant differences (mean, median, RMS, 10th percentile, skewness, and in T2w TIRM images additionally 90th percentile) showed excellent OCCCs.
For CT, in total fewer features were significantly different between ROI sizes, especially considering the first-order and the GLCM features. However, none of the CT-derived OCCCs showed excellent agreement.
For the MR images, more features from ROIs drawn in millimeters showed significant differences than from ROIs drawn in pixels. In CT images, slightly more features from ROIs drawn in pixels were significantly different.
Of all features extracted from our homogenous phantom, the first-order parameters mean, median, and RMS proved robust to a ROI size variation of 4–16 mm and pixels in MR images. Thus, a lesion could vary in size between 4 and 16 mm or pixels without altering these three radiomic features. Agreement in absolute numbers, however, was better when only the two largest ROIs were analyzed.
Considering the Mann–Whitney U-test results, it is interesting that differences between the ROI sizes were significant for a substantial number of features. When transferring this to clinical studies, a feature could be classified as helpful in differentiating a disease entity or condition, even though it may only indicate a systematic difference in lesion size. Our observations on the homogenous phantom showed more MR than CT-derived features with a significant difference between ROI sizes.
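The risk described above, that a feature appears discriminative although it only reflects ROI size, can be illustrated with a small simulation in Python. The sketch implements a two-sided Mann–Whitney U-test with the normal approximation and without continuity or tie correction, which is a simplification and not necessarily the implementation used in this study. The simulated signal level, noise, and voxel counts are hypothetical; ten ROIs per group loosely mirror the ten acquisitions per sequence.

```python
import math
import random


def mann_whitney_u(a, b):
    """Two-sided Mann-Whitney U-test via the normal approximation.

    No continuity or tie correction is applied (illustrative simplification).
    Returns the U statistic of sample a and the approximate p-value.
    """
    n1, n2 = len(a), len(b)
    # U counts the pairs in which a value from a exceeds a value from b (ties count 0.5)
    u = sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in a for y in b)
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u
    return u, math.erfc(abs(z) / math.sqrt(2))


def roi_features(n_voxels, rng):
    """Mean and energy of a simulated homogeneous ROI (constant signal plus noise)."""
    values = [rng.gauss(100.0, 5.0) for _ in range(n_voxels)]
    return {"mean": sum(values) / n_voxels, "energy": sum(v * v for v in values)}


rng = random.Random(1)
small = [roi_features(50, rng) for _ in range(10)]   # ten small ROIs
large = [roi_features(400, rng) for _ in range(10)]  # ten large ROIs, same material

results = {}
for name in ("mean", "energy"):
    results[name] = mann_whitney_u([f[name] for f in small], [f[name] for f in large])
    print(f"{name}: U={results[name][0]:.1f}, p={results[name][1]:.4f}")
```

In this simulation, energy separates the two groups completely although both consist of the same material, so the test flags a difference that is purely a consequence of ROI size.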
We intentionally chose a phantom without an internal structure to acquire images that remain identical for all ROI sizes. We decided to analyze three different spherical ROI sizes in our study to mimic three lesions of the same homogenous composition, but with different volumes. Although a 4 mm ROI is relatively small, it is not entirely unusual in clinical routine (e.g., small pulmonary nodules). Still, it is more likely to encounter larger lesions of clinical relevance, corresponding to ROIs with diameters of 8 to 16 pixels or mm. Nevertheless, we can deduce from our results that the features we consider stable provide congruent information from 8 to 16 mm/px and from 4 to 16 mm/px, respectively.
Our results for RMS, a measure of the magnitude of intensity values [2], as a robust feature are rather unexpected, since the developers of PyRadiomics themselves refer to RMS as a volume-confounded parameter [2]. Yet, the results confirm a lack of reproducibility across different ROI sizes for energy and total energy, consistent with the developers’ statement. Our stable parameters in T1w and T2w TIRM images, mean and median, were already reported as stable in lung CTs by Choi et al. [28]; however, in our study these parameters did not show excellent OCCCs when derived from CT images.
None of the second- or higher-order features extracted from MR images of our homogenous phantom achieved excellent agreement in the OCCCs. The parameters identified as volume-confounded in our study were also reported as unstable in the in vivo MRI study by Roy et al., who investigated stability across different tumor volumes in breast cancer patients with T1w and T2w MR sequences [29]. Therefore, these features do not seem reliable for use in MRI-based texture analysis from differently sized ROIs, and studies based on MR-derived second- and higher-order features should be scrutinized.
Unlike Baessler et al. [12], who reported TIRM (FLAIR) images to be most robust in reproducing radiomic features in fruits, we observed no crucial differences between T2w TIRM and T1w images, with T1w even yielding slightly better results in our homogenous phantom.
Moreover, we found the reproducibility of MRI-derived 90th and 10th percentiles dependent on whether we measured ROI size in pixels or millimeters, showing excellent agreement only for ROIspx. In contrast, mean, median, and RMS were robust to ROI size irrespective of whether we used pixels or millimeters. By comparison, there are more pixels included in ROIsmm than in the respective ROIspx. This may increase the number of outliers in the ROIsmm, shifting the percentiles slightly, which may be enough to reduce stability. Percentiles are known to be strongly influenced by single-pixel outliers [12]; however, this also applies to mean, which proved to be less susceptible in our study.
Our results for CT-derived features are not surprising, since several studies have shown that many CT texture features lack reproducibility, even under constant examination conditions [24]. In our homogenous phantom, none of the CT-derived features had an excellent OCCC8,16. At the same time, we must also highlight that fewer CT- than MR-derived features showed significant differences. Therefore, CT-derived radiomics seem to be volume confounded in our setting, but not distorted enough to produce significant differences.
One reason for the high number of features prone to ROI size variation could be that most of the radiomic features were initially developed for non-medical applications and planar images, while typically three-dimensional lesions are investigated in radiological imaging [17].
Our study has some limitations. One is that only one scanner per modality was used to acquire the images used for the analysis. Thus, as already outlined in the introduction, results may be different for other reconstruction algorithms, manufacturers, and settings, especially for MRI [12]. Taking these issues into account was beyond the scope of this study. Nevertheless, we have aimed for reproducible settings with examination parameters taken from the clinical routine. Furthermore, the smallest ROIs in this study (especially ROIpx) comprise a relatively low number of pixels, which may render the results prone to outliers. We tried to compensate for that by considering multiple acquisitions (10 acquisitions per CT/MR sequence) and comparing values under the exclusion of the smallest ROI by applying the OCCCs8,16. In this context, the consideration of only two readers for the estimation of the interobserver variability should also be mentioned. More readers would lead to an even more reliable assessment.
Apart from that, the comparability of T1w and T2w TIRM MR images is limited because the slightly different slice thicknesses lead to different voxel depths and hence differences in spatial resolution in this direction. Additionally, we used different PyRadiomics settings for the extraction from CT and MR images. However, the use of identical parameters without consideration of modality-specific characteristics would again have been associated with limitations.
It may also be seen as a drawback that intensity inhomogeneities in the MR images of our phantom are already visible to the naked eye and may influence radiomic features. However, we believe that similar effects are likely to be encountered in clinical images as well. And although they may not be obvious, there are probably minor inhomogeneities in CT images as well due to repositioning and rotating the phantom, since the wall of the cup is unlikely to be absolutely uniform.
Moreover, it can be considered a limitation that our phantom has no internal structure and hence may not be applicable for texture analysis. In addition, images from a homogenous phantom likely reflect mainly image noise. However, clinical images are not expected to be entirely free of similar effects and homogenous structures are not generally excluded from texture analysis. Nevertheless, it should be kept in mind that the results obtained from our phantom may not be directly translatable to clinical routine.
Despite the already known myriad of factors influencing radiomic features, our results underline that the ROI size is another factor to be considered in radiomics studies. In our study, more MR- than CT-derived features were stable across ROI sizes and less susceptible to whether ROI size was measured in millimeters or pixels. In contrast, fewer CT- than MR-derived features were significantly different between the ROI sizes.
In many studies, lesions were marked with ROIs, but the lesions and consequently the ROIs had different sizes. Considering our results, however, it has to be validated whether the ROI size is a pivotal influencing factor in radiomics, for example, by sorting lesions by volume and voxel size and comparing heterogeneities of the radiomic features or by normalizing the features by voxel count or volume [17]. Thus, before applying radiomics in clinical routine, volume as a confounding factor needs to be investigated further.