Impact of Interobserver Variability in Manual Segmentation of Non-Small Cell Lung Cancer (NSCLC) Applying Low-Rank Radiomic Representation on Computed Tomography

Simple Summary: Discovery of predictive and prognostic radiomic features in cancer is currently of great interest to the radiologic and oncologic community. Tumor phenotypic and prognostic information can be obtained by extracting features from tumor segmentations, and it is typically imaging analysts, physician trainees, and attending physicians who provide these labeled datasets for analysis. The potential impact of level and type of specialty training on interobserver variability in manual segmentation of NSCLC was examined. Although there was some variability in segmentation between readers, the subsequently extracted radiomic features were overall well correlated. High-fidelity radiomic analysis relies on accurate feature extraction from imaging to produce robust prognostic and predictive radiomic NSCLC biomarkers. This study concludes that this goal can be achieved using segmenters with different levels of training and clinical experience.
Abstract: This study examines interobserver variability with respect to specialty training in manual segmentation of non-small cell lung cancer (NSCLC). Four readers performed segmentations: a data scientist (BY), a medical student (LS), a radiology trainee (MH), and a specialty-trained radiologist (SK), for a total of 293 patients from two publicly available databases. Sørensen–Dice (SD) coefficients and low-rank Pearson correlation coefficients (CC) of 429 radiomic features were calculated to assess interobserver variability. Cox proportional hazard (CPH) models and Kaplan-Meier (KM) curves of overall survival (OS) prediction for each dataset were also generated. SD and CC for segmentations demonstrated high similarity, yielding on average across both databases SD: 0.79 and CC: 0.92 (BY-SK), SD: 0.81 and CC: 0.83 (LS-SK), and SD: 0.84 and CC: 0.91 (MH-SK). The maximal CPH model of OS, which combined radiomic and clinical variables (sex, stage/morphological status, and histology), yielded c-statistics of 0.7 (95% CI) and 0.69 (95% CI) for the two datasets. KM curves also showed significant discrimination between high- and low-risk patients (p-value < 0.005). These results suggest that readers' level of training and clinical experience may not significantly influence the ability to extract accurate radiomic features for NSCLC on CT. This potentially allows flexibility in the training required to produce robust prognostic imaging biomarkers for potential clinical translation.


Introduction
Lung cancer is the leading cause of cancer-related death in the United States [1]. Non-small cell lung cancer (NSCLC) represents the majority of primary lung cancers and carries a poor prognosis and low overall survival [2]. Computed tomography (CT) is a routinely used diagnostic imaging tool in clinical management in oncology due to the ability of CT to noninvasively provide anatomic information for detection, staging, and therapy response assessment. Over the past decade it has become evident that quantitative features are embedded in conventional medical imaging data, not appreciable to the human eye [3]. These radiomics features are a reflection of tissue architecture, heterogeneity, and pericellular environment and can be harnessed to construct tissue signatures that correlate with clinically relevant biomarkers, including tumor histologic subtype, mutational status, degree of infiltration with tumor infiltrating lymphocytes, as well as therapeutic endpoints such as overall survival [4][5][6][7][8][9]. These imaging "phenotypes" provide valuable data that may enhance personalization of medical care in oncology [10].
It is well known that the repeatability and reproducibility of radiomic features on CT are sensitive to image details such as acquisition settings, processing, reconstruction algorithm, and the specific software used for radiomic feature extraction [5,7,9,11-17]. Furthermore, certain radiomic features are more sensitive to these variations than others: first-order features, specifically entropy, are consistently reported as being very stable, while texture features such as coarseness and contrast are the least reproducible [18].
Discovery of predictive and prognostic radiomic features in cancer is currently of great interest to the radiologic community; however, there is no reliable fully automated means of segmenting lung cancer. Tumor delineation and contouring are often performed by scientists with a range of training in anatomical imaging including imaging analysts, students, physician trainees, and attending physicians using either manual or semi-automated techniques. In addition to being time consuming, 3-dimensional manual and semi-automated contouring are subject to interobserver variability. This variability has been shown to be particularly challenging with segmented lesions when associated with ground glass components and postobstructive atelectasis [4]. In order to generate high fidelity phenotypic radiomic signatures, tumor segmentations must be reproducible across different readers [17]. Performing quality segmentations is an important task. Although the ability to anticipate tumor histology, mutational status, and therapeutic consequences are all ultimate goals of radiomics, interobserver variability between readers should be thoroughly investigated before subsequent feature analysis is tested, given that these segmentations form the basis of the analyses.
To our knowledge, no study has examined how both the level and type of specialty training in manual or semi-automated segmentation affect the subsequent extraction of radiomic features. Thus, our purpose in this study is to examine how the level of specialty training impacts interobserver variability in manual segmentation and radiomic feature extraction of NSCLC on CT.

Materials and Methods
The proposed approach presents a comparative assessment of interobserver variability in segmenting NSCLC tumors on chest CTs and its effect on subsequent extraction of radiomic features and survival analysis (see Figure 1 for study schema).

Figure 1.
Workflow of the approach. The NSCLC tumor is segmented from the original CT images by four segmenters (n = 4) with different backgrounds, yielding radiomics features and tumor masks as inputs. Next, PCA categorizes features based on their maximum variance in radiomics. For every group, three principal components of feature sets are selected and used for correlative analysis and prediction of survival.

Patient Population and Study Data
This was a single-center study with segmentations performed at our institution between July 2018 and December 2019. The CT images included in this study had slice thicknesses between 1 and 5 mm, and both contrast-enhanced and non-contrast studies were included. No pre-processing of the CT images was performed. Two publicly available datasets containing CT images from patients with NSCLC were analyzed. The NSCLC-Radiomics-Genomics-Lung3 (also known as Harvard) dataset (Table 1) [11,19,20] contains pre-treatment CT images from 89 patients with NSCLC, and the NSCLC-Radiogenomics (also known as Stanford) dataset [20][21][22] contains pre-treatment CT images from 211 patients with NSCLC. Both are publicly available from the National Institutes of Health (NIH) through The Cancer Imaging Archive (TCIA) [20][21][22][23]. Patients without available imaging in the online dataset were excluded.

Four readers with different levels of training performed manual segmentations on images in Neuroimaging Informatics Technology Initiative (NIfTI) format: a data scientist (BY) with no formal medical experience, a medical student (LS), a radiology trainee (MH) with 5 years of clinical radiology experience, and a specialty-trained thoracic radiologist (SK) with 18 years of experience. The data scientist (BY) used the snake (region-growing) tool of ITK-SNAP: he manually selected the tumor region in the CT images, adjusted the contrast, placed initial bubbles, let them grow to a substantial size, and then used the brush tool to manually clean up areas that fell short of or exceeded the tumor boundaries. The reader with the most experience (SK) was defined as the reference standard (RS) used for benchmarking. Prior to performing segmentations, each reader segmented NSCLC tumors in a training set of 10 cases from a different source (institutional PACS), supervised by the specialty-trained radiologist (SK), and received feedback on segmentation methods. After completing the training set, each observer segmented the tumors in the complete dataset of CT exams. The tumors were labeled in 3D on standard lung windows using ITK-SNAP (version 3.6.0) [24] by each reader. Segmentations were performed only once per patient per reader, with breaks between segmentations taken at the discretion of the reader. A total of 429 radiomic features were extracted within the tumor volume of each image using the Pyradiomics library (v2.2.0). The features were analyzed through low-rank representations obtained by principal component analysis, selecting the first principal component (PC) corresponding to the maximum variance in the radiomics. The radiomic analyses were carried out in the Python programming language (3.6.8), while the survival analyses were conducted in R (4.0.1).
Correlation between the extracted features and agreement between 3D segmentations were analyzed using the Pearson correlation coefficient and the Sørensen–Dice coefficient [25], respectively. The Dice coefficient measures the variability of the segmented regions, and the low-rank correlation shows the corresponding effect on radiomics by calculating the correlation along the directions of maximum variance. In other words, the correlation among the first three PCs represents the correlation of the entire radiomic feature set (all 429 features). Appendix A provides additional information regarding principal component analysis (PCA). The proposed approach uses machine learning to reduce the radiomic dimensionality and predict survival with PCA and Cox regression models, which highlights the value of integrating unsupervised and supervised models.
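The two agreement measures named above can be illustrated with a minimal, dependency-free Python sketch. This is not the code used in the study: the function names `dice_coefficient` and `pearson_correlation` are hypothetical, masks are assumed to be flattened sequences of 0/1 voxel labels, and treating two empty masks as perfect agreement is a convention chosen here for illustration.

```python
from math import sqrt


def dice_coefficient(mask_a, mask_b):
    """Sørensen–Dice overlap of two binary masks given as flat 0/1 sequences."""
    if len(mask_a) != len(mask_b):
        raise ValueError("masks must contain the same number of voxels")
    size_a = sum(1 for v in mask_a if v)
    size_b = sum(1 for v in mask_b if v)
    overlap = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    if size_a + size_b == 0:
        return 1.0  # assumption: two empty masks count as perfect agreement
    return 2.0 * overlap / (size_a + size_b)


def pearson_correlation(x, y):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)
```

In this sketch, `dice_coefficient` would be applied to a reader's mask and the reference standard's mask for the same patient, and `pearson_correlation` to the corresponding low-rank radiomic signatures of the two readers across patients.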
Cox regression modeling was performed for each dataset, incorporating radiomic phenotypes together with clinical and demographic data (i.e., sex, stage status, and histology). Kaplan-Meier curves of overall survival were generated for each dataset to determine whether the contributing radiomic signatures were able to stratify high- and low-risk patients.
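As an illustration of the risk stratification step, the sketch below splits patients at the median risk score and computes a Kaplan-Meier survival estimate for a group. The survival analyses in this study were conducted in R, so this Python code is only a hedged sketch and not the authors' implementation. The names `split_by_median_risk` and `kaplan_meier` are hypothetical, and the risk scores are assumed to come from an already fitted Cox model.

```python
from statistics import median


def split_by_median_risk(risk_scores):
    """Label each patient 'high' if the risk score exceeds the cohort median, else 'low'."""
    cutoff = median(risk_scores)
    return ["high" if score > cutoff else "low" for score in risk_scores]


def kaplan_meier(times, events):
    """Kaplan-Meier estimate as a list of (time, survival probability) pairs.

    times  : follow-up time of each patient
    events : 1 if death was observed at that time, 0 if the patient was censored
    """
    if len(times) != len(events):
        raise ValueError("times and events must have the same length")
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        deaths = sum(1 for ti, e in zip(times, events) if ti == t and e)
        leaving = sum(1 for ti in times if ti == t)  # deaths plus censored at t
        if deaths:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve
```

A typical use would compute one curve for the patients labeled "high" and one for those labeled "low", then compare the two curves; the statistical test used for that comparison is not specified in this sketch.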

Patient Population
A total of 89 patients were in the NSCLC-Radiomics-Genomics-Lung3 dataset, 3 of whom did not have available data and were excluded from the study. There were 42 patients with adenocarcinoma, 32 patients with squamous cell carcinoma, and 12 patients with another type of NSCLC. Thirty-nine patients had stage I disease, 26 patients had stage II disease, 10 patients had stage III disease, and 11 patients had an unknown stage. Of the NSCLC-Radiogenomics data in the NIH-TCIA dataset, 4 patients were excluded from the study for a total of 207 patients included. Of the included tumors in the Harvard dataset, all were solid, and of the included tumors in the Stanford dataset, 134 were solid, 68 were subsolid, and 5 were unknown.
The total number of patients included in the study is described in Figure 2. Clinical information and demographics of patients are provided in Tables 1 and 2.


Figure 2.
Number of patients included in the study. Two publicly available datasets were analyzed in the study, the NSCLC-Radiomics-Genomics-Lung3 (Harvard) dataset and the NSCLC-Radiogenomics (Stanford) dataset. The Harvard and Stanford datasets contain 89 and 211 patients, respectively. A total of 3 patients were excluded from the Harvard dataset and 4 patients were excluded from the Stanford dataset due to lack of available data. Tumor types consisted of adenocarcinoma (Adeno), squamous cell carcinoma (SCC), and other types of NSCLC. A total of 293 patients were segmented as part of the study.

Analysis of Interobserver Variability on Radiomic Feature Extraction
From the 429 radiomic features initially extracted from the tumors on CT images, the feature set was reduced to 3 radiomic signatures (the first three PCs) for each segmenter (Figure 1). The low-rank radiomic signatures were significantly correlated among the segmenters, with correlation coefficients greater than 0.7 in all cases (Table 3). We conducted an in-depth correlation analysis for individual radiomic features and report the results by radiomic category (Supplementary Materials Table S8). We also report the radiomic features that showed lesser stability among the segmenters in this study (Supplementary Materials Table S9).
Cox regression modeling of overall survival for the NSCLC-Radiomics-Genomics-Lung3 (Harvard) and NSCLC-Radiogenomics (Stanford) datasets yielded c-statistics of 0.64 (95% CI) and 0.6 (95% CI), respectively, for the model including only the clinical (sex, smoking status, and histology) and demographic covariates; these increased to 0.7 (95% CI) and 0.69 (95% CI), respectively, when radiomic signatures were added. Adding clinical and demographic data to this model yielded an increase in c-statistic, although with slightly increased variability: 0.05-0.02 and 0.01-0.02 for the NSCLC-Radiomics-Genomics-Lung3 and NSCLC-Radiogenomics datasets, respectively (Table 4). Additional Cox regression analysis data are presented in the Supplementary Materials. Kaplan-Meier curves of survival prediction for each dataset showed significant discrimination between high- and low-risk patients using extracted radiomic signatures (p < 0.01) and are presented in Figure 5. The median risk score was used as the criterion for separating high- and low-risk groups. The hazard ratio for each covariate in the maximal model is fully reported in Supplementary Materials Table S11.
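For readers less familiar with the c-statistic reported here, the following Python sketch computes a concordance index in the style of Harrell's C from follow-up times, event indicators, and model risk scores. It is an illustrative assumption rather than the estimator used in the study, which relied on R for the survival analysis. The function name `concordance_index` is hypothetical, and pairs with tied follow-up times are simply skipped for brevity.

```python
def concordance_index(times, events, risk_scores):
    """Fraction of comparable patient pairs ranked correctly by the risk score.

    A pair (i, j) is comparable when patient i had an observed event and a
    strictly shorter follow-up time than patient j. The pair is concordant
    when patient i also has the higher risk score; tied scores count as half.
    """
    if not (len(times) == len(events) == len(risk_scores)):
        raise ValueError("all inputs must have the same length")
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored patient cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs in the data")
    return concordant / comparable
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which gives context for c-statistics in the range of 0.6 to 0.7.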

Figure 5.
Kaplan-Meier curves for multivariate models of overall survival using low-rank radiomics show significant differences between high- and low-risk patients for each segmenter and NSCLC dataset using median risk score in the model.

Discussion
CT imaging is the workhorse of oncology staging and treatment response assessment. However, we now know that conventional imaging has embedded "radiomic" features that are not appreciable by the eye but contain information on tumor heterogeneity, reflect the underlying tumor structure, and can be harnessed to generate prognostic and predictive biomarkers. In addition, the morphologic qualitative descriptors used in conventional reporting of radiologic assessments of tumors on CT, such as "spiculated", "heterogeneous", and "necrotic", while clinically useful, are subject to inter- and intraobserver variability [10] due to their subjective nature; radiomic signatures may allow for a more quantitative and precise description of tumors, potentially enhancing the clinical value of these interpretations.
In addition to providing a more quantitative approach to conventional morphologic descriptors, radiomics offers the potential to reveal aspects of tumor phenotype not discernable by the human eye, providing another layer of valuable information that can be extracted from conventional imaging for clinical management. Several studies have described the significance of these additional imaging features and radiomics in cancer imaging [26][27][28][29][30][31][32][33][34][35][36][37] and have hypothesized that tumor genetic and cellular characteristics and phenotypes can be represented with medical imaging [38][39][40]. For example, studies by Ganeshan et al. [41][42][43] reported an association of extracted NSCLC CT tumor features with patient survival, tumor stage, metabolism, angiogenesis, and hypoxia. The importance of imaging in treatment planning and outcomes was demonstrated by El Naqa et al. [44] for head and neck and cervical cancers, and Vaidya et al. [45] for lung cancer. Huang et al. [4] concluded that EGFR mutation status can be determined using quantitative imaging from extracted tumor phenotypes in NSCLC. Similarly, Bardia et al. [46] found that combining radiomic phenotypes, clinical variables, and circulating tumor DNA (ctDNA), enhanced prediction of EGFR-targeted therapy outcomes for NSCLC.
However, while the use of extracted radiomic features from conventional imaging poses exciting possibilities for precision medicine, there are challenges to clinical translation that must be overcome before the use of these novel techniques can become a reality in routine practice. There is variability introduced in the acquisition of imaging, for example the use of different imaging protocols, reconstruction algorithms, and scanner types. In addition, variability is introduced through choice of imaging processing techniques, such as choice of segmentation and feature extraction software, and degree of skill of the reader performing 3D segmentation. Variability is a particular concern with manual segmentations [47], and several studies have reported significant inter-clinician variation in contouring of tumors in radiation treatment planning, including head and neck, lung, prostate, and esophageal cancers [48][49][50][51][52]. In this study, we did find some variability between segmentations performed by the data scientist (BY), the medical student (LS), the radiology trainee (MH), and the most experienced reader, reference standard (SK). However, the SD coefficients suggest an overall moderate to high degree of spatial agreement of the segmentations and good overlap of tumor segmentations between readers.
Interobserver variability between readers in this study may have been introduced by several factors. One factor is differentiating between the boundaries of tumor and adjacent post-obstructive atelectasis [53,54] or pneumonia, a known problem with tumor delineation. In non-contrast CT examinations, it may also be difficult to distinguish tumor from adjacent vascular structures that course in and adjacent to lung cancer, especially if the tumor abuts the hilum or mediastinum. Some lung cancers also demonstrated both a solid and a ground glass component, which can introduce variability in the choice of where to draw the boundary around faint ground glass components. Huang et al. [4] found that trained radiologists tended to focus on the solid component of a tumor as opposed to the ground glass component, whereas junior radiologists tended to include more of the ground glass component in their segmentations. Including more of the ground glass component would increase overall tumor volume and affect the spectrum of radiomic features extracted, and is thus a potential source of variation. Window width and level settings on CT may also influence segmentations and gross tumor volumes [54][55][56][57]. ITK-SNAP allows the reader to choose the window width and level settings manually or to use an automatic window width/level selection. While some of our readers manually adjusted the window width/level according to their own preference and ability to differentiate tumor from adjacent structures, other readers used the automated window width/level setting chosen by the software.
Radiomic features used in this study follow imaging features defined by the Imaging Biomarker Standardization Initiative (IBSI). However, differences in CT exam parameters may also introduce segmentation variability between readers. This is particularly true with certain texture features such as coarseness and contrast, which tend to be the least reproducible. First order features, particularly entropy, are found to be the most reproducible [18]. Leijenaar et al. [58] found that radiomic features with high test-retest repeatability suffered less from interobserver differences. A few studies have confirmed that tube current (mAs) or tube voltage (kVp) had no influence on feature reproducibility [59,60]. Varying slice thicknesses of CT scans can also introduce variability in the extracted features, with 1-2.5 mm being the recommended slice thickness when contouring tumors [17,61]. Our study used a publicly available online dataset with slice thickness varying from 1-5 mm (Supplemental Tables S1 and S2). We conducted in-depth analyses on the effect of CT parameters on the outcome of the selected features using the proposed approach and their final survival outcomes (Supplement Tables S3, S4, S5, S6 and S7). Our supplemental analyses testing the potential effects of CT parameters indicated that there was an overall similarity among segmentations between readers when considering contrast-enhancement, CT kernel, and slice thickness.
The degree of medical specialty training has been a concern for the introduction of variability in segmentations of tumors. Logue et al. [62] reported that radiologists tended to contour smaller gross tumor volumes compared to radiation oncologists in the segmentation of bladder cancers and concluded that a more correct anatomic gross tumor volume was provided by radiologists, likely due to clinical practice differences, since radiation oncologists typically select more inclusive volumes around tumors in practice so as not to underestimate tumor extent in radiation treatment planning [63]. Similar results were observed in NSCLC by Giraud et al. [64], who noted major discordances between radiation oncologists' and radiologists' tumor delineations, with radiologists tending to delineate smaller volumes. In this same study, junior physicians included as readers tended to delineate smaller and more homogeneous volumes compared to senior physicians regardless of their specialty. Van de Steene et al. [63] looked at specialty dependence between junior and senior radiation oncologists, one pulmonologist, and one radiologist in contouring lung cancer gross tumor volumes and noticed that the radiologist ended up with the smallest tumor volume. They also noted good agreement between the senior radiation oncologist and radiologist. Haga et al. [65] concluded that NSCLC tumor volumes should be contoured by a specialist, such as a radiation oncologist, in order to decrease tumor delineation uncertainty and overestimation of prognostic power in radiomic feature analysis. In this study we compared tumor segmentations across levels of training (i.e., medical student, radiology trainee, and radiology attending) and specialty type (i.e., data scientist). Interestingly, the 3D masks in the Harvard dataset for BY-SK (RS) had an overall higher correlation compared to the masks for MH-SK (RS) and LS-SK (RS) in the segmentation analysis.
However, the 3D masks in the Stanford dataset for MH-SK (RS) had an overall higher correlation compared to the masks for BY-SK (RS) and LS-SK (RS). The Pearson correlation coefficients, comparing three significant radiomic phenotypes for PCA, were all relatively equal amongst segmenters in the Harvard dataset, although the correlation coefficients were slightly more variable in the Stanford dataset. Overall, these differences are small and can probably be overlooked given overall high correlation of segmentations amongst all segmenters in the principal component analysis. It should be noted, however, that all readers in this study participated in a training set of cases supervised by the reference standard (SK) to ensure a standard approach to contouring.
Our study had several limitations. The CT scans in the dataset had varying slice thicknesses, ranging from 1-5 mm, which is known to introduce some variability as described above. Additionally, while all the readers used ITK-SNAP software for segmentation, there was some variability in methods of tumor contouring, such as the choice of purely manual or semi-automated tools and the exact window and level used to perform the contouring. However, while there was interobserver variability in contouring, the extracted radiomic features of the medical student, radiology trainee, and data scientist were overall well correlated with those of the experienced reader (RS). Another limitation is that the readers were all trained by the expert reader; however, the number of training cases was small, and the training consisted of feedback on the segmentations. Additionally, the training cases were from a different source than the databases that were used for analysis. Despite these limitations, the overall correlation of extracted features between readers supports the inclusion of readers of various levels of training in performing segmentations for NSCLC.
Future research would include testing interobserver variability based on level and type of experience against other publicly and readily available datasets and testing intraobserver variability. Other future directions should include determining how factors such as slice thickness, pixel spacing, window width/level, contrast enhancement, and pre-and postprocessing of CT imaging affect interobserver variability between readers of different experience.

Conclusions
Although there is some variability in tumor contouring between readers, the extracted radiomic features were overall well correlated among observers. Therefore, the level of training and clinical experience of the reader may not have a substantial impact on extracted radiomic features of NSCLC on CT, noting that all readers did have a supervised training set prior to contouring cases. Having more readers to perform tumor segmentations may accelerate the development of radiomic signatures in NSCLC that can provide added value to cancer management and precision medicine. This study shows that a wider range of personnel can be included in performing these tumor segmentations.
Supplementary Materials: The following are available online at https://www.mdpi.com/article/10.3390/cancers13235985/s1, Table S1. The CT parameters for Stanford NSCLC Radiogenomics dataset, Table S2. The CT parameters for Harvard NSCLC Radiomics-Genomics dataset, Table S3. Similarity of the radiomic signatures using Pearson correlation among different segmenters are presented for different stratifications based on CT, Table S4. Overall survival, Cox regression. Using the low-rank representation of the radiomic signatures survival prediction is measured for each segmenter while there is Contrast-Enhancement (CE), Table S5. Overall survival, Cox regression. Using the low-rank representation of the radiomic signatures survival prediction is measured for each segmenter while there is Non-Contrast-Enhanced (UN), Table S6. Overall survival, Cox regression. Using the low-rank representation of the radiomic signatures survival prediction is measured for each segmenter for higher convolutional kernel (CKh), Table S7. Overall survival, Cox regression. Using the low-rank representation of the radiomic signatures survival prediction is measured for each segmenter for slice thickness between 2 mm and 4 mm, Table S8. Intraclass correlation coefficient based on radiomics categories and with respect to different group means. For each segmenter, mean and standard deviation of correlation coefficient is calculated for every radiomics category, Table S9. Radiomic features with lesser stability with respect to different segmenters. Means and standard deviations of these radiomics are presented, Table S10. More detailed information about the radiomic features used in this study, Table S11. The hazard ratio for each covariate in the maximal Cox proportional hazard model.
Informed Consent Statement: Patient consent was waived due to the research posing no more than minimal risk to subjects and the waiver does not adversely affect the rights and welfare of the subjects who are involved in the research.
Data Availability Statement: Information on the publicly available datasets used in this study [19][20][21].

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A
Considering that X Ob1 = {x 1 , x 2 , x 3 , . . . , x n } is the result of radiomic features extracted from the first segmenter, where x 1 , x 1 ∈ R p , is a zero-mean (z-scored) vector (Z-score of a vector x 1 defined by x i , and σ is standard deviation of x 1 ), with the size of our radiomic features (p = 429) and n was 86 and 207 for the Harvard and Stanford NSCLC Radiogenomic datasets, respectively. There were four segmenters, hence there exists a set of different observers {X Ob1 , X Ob2 , X Ob3 , X Ob4 }. The problem is to reduce the collinearity among the features in each X Ob . For that, we propose using low-rank representation of each vector in the direction of maximum variance, using eigen decomposition method presented in the following section.
Appendix A.1. Low-Rank Representation of Radiomics
Principal component analysis (PCA) [66,67] is used for many applications such as dimension reduction, noise elimination, and classification, amongst others. PCA can be performed by calculating a covariance matrix with singular value decomposition (SVD) [68]. The decomposition is performed on the input matrix $X$, which is $p \times n$, where $p$ is the number of radiomic features and $n$ is the number of patients, and decomposes to:

$$X = U \Sigma V^{T} \quad \text{(A1)}$$

where $k > p$ and $\Sigma$ is a diagonal matrix with a dimension of $p \times p$ and either zero or positive elements. Its diagonal elements are the singular values of $X$, $U$ is the $p \times n$ matrix known as the eigenvector or basis matrix of $X$, and $V$ contains the right singular vectors. Each column of $X$ holds the radiomic feature vector of one patient, and each row holds one feature across patients. PCA is a linear transformation method, which decomposes the zero-mean input data matrix into the basis $U$ and the coefficient matrix $\Sigma$. The basis matrix is orthonormal and maximizes the variance of the projected data, which leads to the principal components (PCs) of the input matrix. Selecting $k = 3$ to reduce the dimensionality from 429 to 3 for each segmenter, we use Equation (A1) to convert $X_{Ob}$ to $U_{Ob}$, where $U_{Ob} \in \mathbb{R}^{3 \times n}$, giving $\{U_{Ob_1}, U_{Ob_2}, U_{Ob_3}, U_{Ob_4}\}$. The comparison of the radiomic signatures is thus facilitated by this dimensionality reduction. The first three PCs are used to measure the correlation of the radiomics of each segmenter corresponding to their Dice scores, while survival analysis uses only the initial PCs. PCA selects the initial predominant eigenvectors, known as the bases of the analysis, which capture the highest variance among the radiomics.
In other words, PCA finds the most representative signatures present in the radiomics; we compared the best representative radiomics of each segmenter with our reference to find the overall correlation of the radiomics.
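As an illustration only, the low-rank projection described in this section can be sketched with NumPy. This is not the code used in the study: the function name `low_rank_representation`, the choice to z-score each feature across patients, and the random demonstration matrix are all assumptions made for the example.

```python
import numpy as np


def low_rank_representation(X, k=3):
    """Project a p x n matrix (p features, n patients) onto its first k PCs.

    Hypothetical helper for illustration; returns a k x n matrix of scores.
    """
    X = np.asarray(X, dtype=float)
    # Assumption: z-score each feature (row) across patients.
    mu = X.mean(axis=1, keepdims=True)
    sigma = X.std(axis=1, keepdims=True)
    sigma[sigma == 0] = 1.0  # guard against constant features
    Z = (X - mu) / sigma
    # Thin SVD: Z = U @ diag(s) @ Vt, with the columns of U orthonormal.
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    # Coordinates of every patient along the first k basis vectors.
    return U[:, :k].T @ Z


# Synthetic data with the Harvard dataset dimensions (429 features, 86 patients).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(429, 86))
U_ob = low_rank_representation(X_demo, k=3)
print(U_ob.shape)  # (3, 86)
```

Each row of the returned matrix is one principal component evaluated for every patient, so the rows produced for two segmenters can be compared directly.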
Appendix A.2. Low-Rank Correlation of Interobserver's Radiomics
PCA between the four reader segmentations was performed on the extracted features, with a Pearson correlation coefficient (corr) calculated using the first principal components. A high degree of correlation between the extracted features was defined as between ±0.50 and ±1, a moderate degree of correlation as between ±0.30 and ±0.49, and a low degree of correlation as below ±0.29. The Pearson correlation coefficient (PCC), or bivariate correlation [69], measures the linear correlation between two variables or vectors, X and Y. The PCC is the covariance of the two variables divided by the product of their standard deviations (the product moment). The Pearson correlation coefficient, $r_{xy}$, is measured using the following formula:

$$r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} \quad \text{(A2)}$$

where $n$ is the sample size, $x_i$ and $y_i$ are the individual sample points indexed by $i$, and $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ is the sample mean, and analogously for $\bar{y}$.
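A minimal sketch of the Pearson coefficient and the correlation bands defined above is given here for illustration. The function names are hypothetical, and the handling of values that fall between the stated bands (for example 0.495 or 0.295) is an assumption: such values are assigned to the lower band.

```python
import numpy as np


def pearson_r(x, y):
    """Pearson correlation: covariance over the product of standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return float((dx @ dy) / (np.sqrt(dx @ dx) * np.sqrt(dy @ dy)))


def correlation_degree(r):
    """Map a coefficient to the bands used in the text (gap handling assumed)."""
    a = abs(r)
    if a >= 0.50:
        return "high"
    if a >= 0.30:
        return "moderate"
    return "low"


# Toy vectors standing in for the first PC of two segmenters (not study data).
pc_reader = [0.2, 1.1, -0.7, 0.4, -1.0]
pc_reference = [0.1, 1.3, -0.5, 0.2, -1.1]
r = pearson_r(pc_reader, pc_reference)
print(correlation_degree(r))
```

The result of `pearson_r` should agree with `np.corrcoef(x, y)[0, 1]`, which can serve as a cross-check.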
Interobserver variability in the 3D segmentations between the readers and the Reference Standard was also assessed using a Sørensen-Dice (SD) coefficient to evaluate the spatial agreement of the segmentations. High spatial agreement was defined as an SD between 0.7 and 1.0, moderate agreement as an SD between 0.5 and 0.7, and low spatial agreement as an SD < 0.5. The Dice results also indicate high spatial agreement among the segmenters, with SD > 0.7. The Sørensen-Dice coefficient (DSC) [25,70] is calculated by the following formula:

$$DSC = \frac{2\,|X \cap Y|}{|X| + |Y|} \quad \text{(A3)}$$

where $X$ and $Y$ denote the sets of voxels in the two segmentations being compared. These two methods of measuring similarity produce two coefficients that gauge the pairwise similarity between each pair of segmenters. Using Equations (A2) and (A3), we calculate $\mathrm{Corr}_{i,j}(U_{Ob_i}, U_{Ob_j})$ and $\mathrm{DSC}_{i,j}(U_{Ob_i}, U_{Ob_j})$, where $i \neq j$, respectively. The results of these two measures indicate the variability among the low-rank radiomic signatures. Table 3 shows the results of such correlation among the segmenters.
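For illustration, the Sørensen-Dice coefficient on binary 3D masks and its pairwise evaluation over segmenters can be sketched as follows. The function name, the toy masks, and the convention that two empty masks score 1.0 are assumptions for this example and are not taken from the study.

```python
from itertools import combinations

import numpy as np


def dice_coefficient(mask_a, mask_b):
    """Sørensen-Dice coefficient: 2|A and B| / (|A| + |B|) for binary masks."""
    a = np.asarray(mask_a, dtype=bool)
    b = np.asarray(mask_b, dtype=bool)
    total = a.sum() + b.sum()
    if total == 0:
        return 1.0  # assumption: two empty masks agree perfectly
    return float(2.0 * np.logical_and(a, b).sum() / total)


# Toy 3D masks (hypothetical, not study data): 8 and 12 voxels, 8 shared.
mask_1 = np.zeros((4, 4, 4), dtype=bool)
mask_1[1:3, 1:3, 1:3] = True
mask_2 = np.zeros((4, 4, 4), dtype=bool)
mask_2[1:3, 1:3, 1:4] = True
print(dice_coefficient(mask_1, mask_2))  # 0.8

# Pairwise comparison over every pair of segmenters (i != j).
masks = {"reader_1": mask_1, "reader_2": mask_2}
pairwise = {
    (i, j): dice_coefficient(masks[i], masks[j])
    for i, j in combinations(masks, 2)
}
```

In the toy case the overlap is 8 voxels out of 8 + 12, giving 2 × 8 / 20 = 0.8, which falls in the high spatial agreement band defined above.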