Robustness of Radiomics in Pre-Surgical Computer Tomography of Non-Small-Cell Lung Cancer

Background: Radiomic features are increasingly used in CT of NSCLC. However, their robustness with respect to segmentation variability has not yet been demonstrated. The aim of this study was to assess radiomic features agreement across three kinds of segmentation. Methods: We retrospectively included 48 patients suffering from NSCLC who underwent pre-surgery CT. Two expert radiologists in consensus manually delineated three 3D-ROIs on each patient. To assess robustness for each feature, the intra-class correlation coefficient (ICC) across segmentations was evaluated. The ‘sensitivity’ of ICC upon some parameters affecting features computation (such as bin-width for first-order features and pixel-distances for second-order features) was also evaluated. Moreover, an assessment with respect to interpolator and isotropic resolution was also performed. Results: Our results indicate that ‘shape’ features tend to have excellent agreement (ICC > 0.9) across segmentations; moreover, they have approximately zero sensitivity to other parameters. ‘First-order’ features are in general sensitive to parameters variation; however, a few of them showed excellent agreement and low sensitivity (below 0.1) with respect to bin-width and pixel-distance. Similarly, a few second-order features showed excellent agreement and low sensitivity. Conclusions: Our results suggest that a limited number of radiomic features can achieve a high level of reproducibility in CT of NSCLC.

In this study, we evaluated the robustness of radiomic features across segmentations and evaluated the influence of other parameters (specifically, bin-width, pixel-distance in

ROIs Delineation
Two expert radiologists in consensus (MB and RM) manually delineated three segmentations on each patient. The software used was ITK-snap [35,36]. The first segmentation was free-hand and was delineated as accurately as possible. The second segmentation was a free-hand rough one. The third was delineated using a rough polygonal. The second and third segmentations were drawn including the lesion and some tissues around it.

Features Extraction
Radiomic features might support radiologists in quantifying medical image characteristics. However, given the difficulties of standardisation in this field, recently [4] the Imaging Biomarker Standardisation Initiative (IBSI) has assessed reproducibility. Therefore, we used radiomic features that have been approved by the IBSI committee.
Specifically, Pyradiomics [22] is an open-source python library allowing the calculation of radiomic features compatible and approved within the IBSI. In this study, we used Pyradiomics version 3.0.1.
We computed the following categories of radiomic features: shape 3D, first-order and second-order (glcm, glrlm, glszm, ngtdm and gldm). Each feature has been computed on original and wavelet (coif1) images. See [4,22] for a detailed account for features formulas and interpretation. A total of 851 features were computed.
In order to assess the dependence on important parameters used for features computation, we computed features using several values of bin-width and pixel-distance. Bin-width is the gray-level bin-width of the histogram and it can affect mainly features which require histogram computation: gray-level co-occurrence matrix (GLCM) and gray-level run length matrix. We repeated the feature computation for bin-width = 10, 20 and 40. Pixel-distance specifies the distances between the centre voxel and the neighbourhood, for which angles should be generated; it affects mainly the gray-level co-occurrence matrix (GLCM) and neighbouring gray-tone difference matrix (NTGDM). We repeated the feature computation using distance = 1, 2 and 4 pixels.
Moreover, we assessed robustness also with respect to isotropic resolution and type of interpolator. In particular, we used nearest-neighbour, linear and spline interpolators as the default implemented in pyradiomics. Resolutions were fixed at 1, 2 and 2.5 mm.

Robustness Evaluation
In order to quantify 'robustness' of a feature across segmentations, we used the intra-class correlation coefficient (ICC). The ICC is described in [37]. In particular, per each feature, we computed the lower bound of the 95% confidence interval for ICC. As a reference value for robustness, in line with other studies [3,4] we used ICC = 0.9, which indicates 'excellent' agreement. The computation of ICC and subsequent statistical analysis was performed in R version 3.6.3 [38] using package 'irr' version 0.84.1.
As regards to the 'sensitivity' to other parameters (bin-width, pixel-distances, type of interpolator and isotropic resolution), it was assessed per each parameter within each feature using the range of variation of ICC having other fixed parameters. In lack of previous studies, we arbitrarily considered a variation below 0.1 as low sensitivity. In Figure 1 there is a graphical summary of our analysis.
for which angles should be generated; it affects mainly the gray-level co-occurrence matrix (GLCM) and neighbouring gray-tone difference matrix (NTGDM). We repeated the feature computation using distance = 1, 2 and 4 pixels.
Moreover, we assessed robustness also with respect to isotropic resolution and type of interpolator. In particular, we used nearest-neighbour, linear and spline interpolators as the default implemented in pyradiomics. Resolutions were fixed at 1, 2 and 2.5 mm.

Robustness Evaluation
In order to quantify 'robustness' of a feature across segmentations, we used the intraclass correlation coefficient (ICC). The ICC is described in [37]. In particular, per each feature, we computed the lower bound of the 95% confidence interval for ICC. As a reference value for robustness, in line with other studies [3,4] we used ICC = 0.9, which indicates 'excellent' agreement. The computation of ICC and subsequent statistical analysis was performed in R version 3.6.3 [38] using package 'irr' version 0.84.1.
As regards to the 'sensitivity' to other parameters (bin-width, pixel-distances, type of interpolator and isotropic resolution), it was assessed per each parameter within each feature using the range of variation of ICC having other fixed parameters. In lack of previous studies, we arbitrarily considered a variation below 0.1 as low sensitivity. In Figure 1 there is a graphical summary of our analysis.
Lastly, the correlation between each feature reproducibility and lesion size (shape feature Voxel Volume) was analysed. To do this kind of analysis, it was not possible to use ICC (because patient information are flattened); we instead used the coefficient of variation (CV) which is the ratio between a measure of dispersion (standard deviation) of the distribution of one feature over a patient and the average value of the same feature over the same patient. The correlation between CV and tumour size is then computed.

Results
Patient demographic data and lesion type information are presented in Table 2. Figure 1 reports some illustrative examples of ROIs delineated on five randomly chosen patients. Flowchart of the method used in this study to evaluate robustness of radiomic features is showned in Scheme 1.  Lastly, the correlation between each feature reproducibility and lesion size (shape feature Voxel Volume) was analysed. To do this kind of analysis, it was not possible to use ICC (because patient information are flattened); we instead used the coefficient of variation (CV) which is the ratio between a measure of dispersion (standard deviation) of the distribution of one feature over a patient and the average value of the same feature over the same patient. The correlation between CV and tumour size is then computed.

Results
Patient demographic data and lesion type information are presented in Table 2. Figure 1 reports some illustrative examples of ROIs delineated on five randomly chosen patients. Flowchart of the method used in this study to evaluate robustness of radiomic features is shown in Scheme 1. Flowchart of the method used in this study to evaluate robustness of radiomic features. Three 3D ROIs have been delineated on each subject included in the study. Radiomic features from the original image and from the wavelet transformed have been computed using different values of four parameters: bin-width, pixel-distance, interpolator and isotropic resolution. ICC has been used for robustness evaluation and variability of ICC across a parameter was used as an index of 'sensitivity' of the feature to the parameter.
Figures 2-14 report the main results of our study. In particular, for each feature, we reported the ICC value with respect to the three segmentations. We grouped robustness per category of features. In particular, we considered 'shape', 'first-order', 'second-orderoriginal' and 'second-order-wavelet'. As an example, we comment in detail on Figure 2; we report robustness (ICC) evaluated on shape features. These features (shape) seem to have generally high robustness. It is not less than 0.6 and many of them have robustness higher than 0.9; only three of them have an ICC below 0.9. This kind of feature does not depend on bin-width, distance, resolution or interpolator.   In particular, for each feature, we reported the ICC value with respect to the three segmentations. We grouped robustness per category of features. In particular, we considered 'shape', 'first-order', 'second-orderoriginal' and 'second-order-wavelet'. As an example, we comment in detail on Figure 2; we report robustness (ICC) evaluated on shape features. These features (shape) seem to have generally high robustness. It is not less than 0.6 and many of them have robustness higher than 0.9; only three of them have an ICC below 0.9. This kind of feature does not depend on bin-width, distance, resolution or interpolator.  In particular, for each feature, we reported the ICC value with respect to the three segmentations. We grouped robustness per category of features. In particular, we considered 'shape', 'first-order', 'second-orderoriginal' and 'second-order-wavelet'. As an example, we comment in detail on Figure 2; we report robustness (ICC) evaluated on shape features. These features (shape) seem to have generally high robustness. It is not less than 0.6 and many of them have robustness higher than 0.9; only three of them have an ICC below 0.9. This kind of feature does not depend on bin-width, distance, resolution or interpolator. contours. They do not depend on bin-width, distance, resolution or interpolator. With the exception of Flatness, the other features have a robustness greater than 0.95.                  1, 2, 4). Because they are second-order, they are dependent on distance (dst) but they do not depend much on resolution or interpolator. However, the wavelet features are dependent on the direction used for wavelet computation (each circle on a line is a different gradient direction).
The robustness (ICC) is above 0.9 only for few of them. They are dependent on bin-width (binw: 10, 20, 40) and on (dst: 1, 2, 4). Because they are second-order, they are dependent on distance (dst) but they do not depend much on resolution or interpolator. However, the wavelet features are dependent on the direction used for wavelet computation (each circle on a line is a different gradient direction). We can observe similar behaviours for other features (Figures 3-14). However, in some cases, the ICC can be well below 0.9 indicating poor robustness of that feature. Moreover, in most wavelet-based and second-order features, there is a dependence on bin-width, distance, resolution or interpolator.
As regards to sensitivity of the ICC of features to other parameters such as bin-width, pixel-distance, isotropic resolution and interpolator, we report in Figure 15 the sensitivity per each feature. The first 107 features were computed from the original image; the other 744 were computed from wavelet images. The index in the figure refers to a feature list given as a separate file. We note that 'shape' features (index from 94 to 107) have very small (approximately zero) sensitivity. We can observe similar behaviours for other features (Figures 3-14). However, in some cases, the ICC can be well below 0.9 indicating poor robustness of that feature. Moreover, in most wavelet-based and second-order features, there is a dependence on bin-width, distance, resolution or interpolator.
As regards to sensitivity of the ICC of features to other parameters such as bin-width, pixel-distance, isotropic resolution and interpolator, we report in Figure 15 the sensitivity per each feature. The first 107 features were computed from the original image; the other 744 were computed from wavelet images. The index in the figure refers to a feature list given as a separate file. We note that 'shape' features (index from 94 to 107) have very small (approximately zero) sensitivity. The robustness (ICC) is above 0.9 only for few of them. They are dependent on bin-width (binw: 10, 20, 40) and on (dst: 1, 2, 4). Because they are second-order, they are dependent on distance (dst) but they do not depend much on resolution or interpolator. However, the wavelet features are dependent on the direction used for wavelet computation (each circle on a line is a different gradient direction).   Figure 16 reports the correlation between each feature CV and lesion size. The Spearman method was used. Red circles represent correlations with 'shape' features; only a weak correlation exists. Figure 16. Correlation between CV for each feature (on x-axis) and lesion size (shape feature Voxel Volume). Spearman method was used. Red circles indicate correlation for 'shape' features; only a weak correlation exists.

Discussion
The aim of this study was to evaluate the robustness of radiomic features with respect to ROI segmentation in the case of pre-surgical CT of NSCLC. We quantified robustness as 'agreement' among three segmentations using the intra-class correlation coefficient (ICC). We chose ICC = 0.9 as a reference value indicating an 'excellent' agreement [37].  Figure 16. Correlation between CV for each feature (on x-axis) and lesion size (shape feature Voxel Volume). Spearman method was used. Red circles indicate correlation for 'shape' features; only a weak correlation exists.

Discussion
The aim of this study was to evaluate the robustness of radiomic features with respect to ROI segmentation in the case of pre-surgical CT of NSCLC. We quantified robustness as 'agreement' among three segmentations using the intra-class correlation coefficient (ICC). We chose ICC = 0.9 as a reference value indicating an 'excellent' agreement [37]. The sensitivity of ICC upon bin-width, distance, resolution or interpolator was also evaluated. We assessed the following feature categories: shape (14 features), first-order (18), gray-level co-occurrence matrix (24), GLDM (14), GLRLM (16), GLSZM (16) and NGTDM (5): a total of 107 features. Moreover, the same features were assessed on wavelet images (coif1) using High/Low filtering in x-y-z directions for a total of 851 features.
Our results showed that some features were robust having high ICC; however, a number of features were sensitive to segmentation, having low ICC. Moreover, a dependence on bin-width, distance, resolution or interpolator has been observed for many features.
Most importantly, we observed that shape features showed high robustness (ICC > 0.90). This is in line with intuition, as shape features (e.g., VoxelVolume, maximum diameter or surface area) should not vary too much across segmentations. It should be noticed, however, that surface-volume ratio and sphericity can reach very low ICC values (down to 0.75).
Another important result of our study is that we observed a low number of features robust with respect to segmentations and to bin-width, distance, resolution or interpolator. In fact, our results showed that a large number of first-and second-order features have an ICC lower than 0.9. For these features the agreement ICC, across segmentations and other source of variability, is less than excellent.
Lastly, the correlation analysis showed that 'shape' features CV is only weakly negatively correlated with lesion size: as tumour size increases, CV weakly decreases. This might be explained by the fact that the larger the tumour size, the larger the segmentation variability by manual delineation. Moreover, Figure 16 shows that there is a large variability among feature CV vs lesion size.
Assessment of radiomics robustness is an important task for guaranteeing reproducibility of radiomic analysis across segmentations; it has been also addressed in other papers [3,4]. A very important paper in this field is [4] in which a large number of radiomics features was evaluated for reproducibility across 25 teams. However, they used only one CT of a patient suffering from lung cancer. Subsequently, in [3], the authors evaluated radiomics reproducibility of NSCLC across different CT equipment and reconstruction algorithms. In this last paper, the authors analysed 103 patients. However, the authors did not explicitly address the dependence on segmentations and feature parameters.
Commentary is in order regarding the applicability of our results in clinical settings. The fact that a feature might be suitable for clinical application should be assessed, not only looking for high reproducibility; in fact, a certain variability of a feature might be tolerated given that the variability in the patients is larger. Further, in ultimate analysis, features are used as an input to classifiers; therefore, also the specific combination of classifier plus feature must be optimised. Specifically, our study addressed the problem of features selection only from the point of view of high reproducibility but did not address the points just mentioned.
One important limitation of this study is the reduced sample size (48 patients). In order to gain insight into the reproducibility of radiomics features, large databases should be analysed.

Conclusions
Radiomics analysis has the potential for improving diagnosis in many pathologies and in particular in NSCLC. However, the reproducibility of features with respect to variability in 3D ROI segmentations and other sources of variability in feature computation (such as bin-width, pixel-distance, interpolator and isotropic resolution) must be assessed before introduction into the clinical practice. In our study, we observed that 'excellent' agreement (ICC > 0.9) across different sources of variability can be achieved by a relatively small number of features: in particular, 'shape' features, some second-order features and some wavelet-based features.
Author Contributions: M.P.B. and R.M., expert radiologists, performed manual segmentation. Quality assessment was provided by M.P.B. and data analysis were provided from M.S. and S.M. Possible disagreements were resolved after a panel discussion with two other authors familiar with the project (V.N. and A.R.). All articles were identified by two reviewers (R.G. and R.F.) with an experience of at least 5 years in the radiomic field. All authors have read and agreed to the published version of the manuscript.

Funding:
The research leading to these results has received funding from Ministero dell'Universita e della Ricerca (MIUR) under PRIN 2021 (2020YL3FB3_002) "REASONING: foRmal mEthods for com-putAtional analySis for diagnOsis and progNosis in imaging" (University of Campania "L. Vanvitelli").