Role of Radiomics Features and Machine Learning for the Histological Classification of Stage I and Stage II NSCLC at [18F]FDG PET/CT: A Comparison between Two PET/CT Scanners

The aim of this study was to compare two different PET/CT tomographs for the evaluation of the role of radiomics features (RaF) and machine learning (ML) in the prediction of the histological classification of stage I and II non-small-cell lung cancer (NSCLC) at baseline [18F]FDG PET/CT. A total of 227 patients were retrospectively included and, after volumetric segmentation, RaF were extracted. All of the features were tested for significant differences between the two scanners and for both scanners considered together, and their performances in predicting the histology of NSCLC were analyzed by testing different ML approaches: Logistic Regressor (LR), k-Nearest Neighbors (kNN), Decision Tree (DT) and Random Forest (RF). In general, the models with the best performances for all the scanners were kNN and LR, and the kNN model performed better than the others. The impact of the PET/CT scanner used for the acquisition of the scans on the performances of RaF was evident: mean area under the curve (AUC) values for scanner 2 were generally lower than those for scanner 1 and for both scanners considered together. In conclusion, our study enabled the selection of some [18F]FDG PET/CT RaF and ML models that are able to predict the histological subtype of NSCLC with good performances. Furthermore, the type of PET/CT scanner may influence these performances.


Introduction
Non-small-cell lung cancer (NSCLC) is a frequent form of neoplasm with globally rising incidence, accounting for most cancer-related deaths worldwide [1][2][3]. The risk factors associated with the development of disease are mainly environmental, with cigarette smoking as the most important [4,5]. The three main histological types of NSCLC, according to World Health Organization (WHO), are adenocarcinoma (ADK), squamous cell carcinoma (SCC) and large cell carcinoma [4].
The correct diagnosis of NSCLC is often made in advanced stages, since the disease may become symptomatic only in these stages. In this setting, the main symptoms are cough, hemoptysis, chest pain and dyspnea [4,6].
The diagnostic and staging assessment of NSCLC relies on imaging with chest X-ray (XR) and chest computed tomography (CT), which are pivotal for the evaluation of disease. In this setting, such imaging modalities directly drive the treatment of the disease and are also able to perform distant staging, being useful even for its follow-up [4]. The clinical outcome of NSCLC is directly related to its stage at diagnosis: early stages are usually managed with surgical resection and have a good prognosis, while advanced and metastatic disease can benefit from adjuvant therapy [7][8][9][10][11].
In recent years, positron emission tomography/computed tomography (PET/CT) with different tracers has been emerging as a fundamental imaging modality for the assessment of a large number of neoplastic and infectious diseases [12,13]. In this scenario, 18F-fluorodeoxyglucose ([18F]FDG) PET/CT is routinely performed for the staging of NSCLC and the usefulness of some semiquantitative parameters in predicting the prognosis of patients has been proved [14,15]. In this setting, the possible role of [18F]FDG PET/CT radiomics features (RaF) in differentiating between malignant and benign lesions in various organs has recently emerged, and pulmonary nodules are no exception [16,17]. Furthermore, it has also been reported that texture analysis is able to differentiate between ADK and SCC, although with heterogeneous findings in terms of PET/CT acquisition, RaF extraction and results [18][19][20][21][22][23][24][25]. The populations considered in such studies were characterized by high heterogeneity in terms of clinicopathological features and, as mentioned, the prognosis of patients is related to the stage of disease at the time of diagnosis. Furthermore, it is known that different scanners and protocols used for the acquisition and reconstruction of PET images can influence the extraction of RaF and therefore affect the results of such analyses [16][17][18].
The aim of this study is therefore to analyze the value of baseline [18F]FDG PET/CT RaF and ML models for the prediction of the final histological diagnosis in patients with stage I and stage II NSCLC and also to assess the influence of different scanners in this scenario.

Patient Selection
We retrospectively screened our database in order to find patients referred to our center for [18F]FDG PET/CT for the initial staging of NSCLC. The screening covered the period from January 2014 until February 2022 and a total of 2332 subjects were selected. Inclusion criteria were a histologically proven diagnosis of stage I or stage II NSCLC, a baseline [18F]FDG PET/CT performed before any treatment and tracer uptake by NSCLC higher than liver uptake. After applying these inclusion criteria, 233 patients were included in the study.
Clinicopathological information including gender, age, size of NSCLC measured on histological evaluation, grading, lobe involved by the disease, therapy performed, TNM category and American Joint Committee on Cancer (AJCC) VIII Edition stage was collected. Furthermore, histological classification was collected and, since only 6 patients had adenosquamous carcinoma, they were excluded from the present study. A total of 227 patients were therefore finally included in the study.

[18F]FDG PET/CT Acquisition and Interpretation
Patients fasted for at least 6 h before tracer injection and had a blood glucose level below 150 mg/dL (mean: 116, standard deviation [SD]: 19, range: 83-148). To perform the PET/CT scan, 3.5-4.5 MBq/kg of [18F]FDG were intravenously injected, and patients were instructed to void before image acquisition. No contrast agent, intestinal preparation with purge or enteric contrast was used.
Images were acquired 60 min after radiotracer injection, from the vertex to the mid-thigh, on two different PET/CT tomographs. The first (scanner 1) was a Discovery 690 PET/CT (General Electric Company, Milwaukee, WI, USA), while the second (scanner 2) was a Discovery STE PET/CT (General Electric Company, Milwaukee, WI, USA). On both, standard acquisition parameters (CT: 80 mA, 120 kV without contrast; 2.5-4 min per bed PET-step, axial width 15 cm) and standard reconstruction parameters (256 × 256 matrix and 60 cm field of view) were used. Furthermore, scanner 1 had LYSO (cerium-doped lutetium yttrium oxyorthosilicate) scintillator crystals with a decay time of 45 ns, while scanner 2 had BGO (bismuth germanate) scintillator crystals with a decay time of 300 ns. Scanners were not harmonized with a cross-calibration program and all PET/CT scans were acquired during free breathing, with patients instructed to breathe regularly. For anatomical correlation and attenuation correction, a low-dose CT during free breathing and without contrast agent was acquired on both scanners. In more detail, CT acquisition parameters for scanner 1 were: 120 kV, fixed tube current ≈ 60 mAs (40-100 mAs), 64 slices × 3.75 mm and 3.27 mm interval, pitch 0.984:1, tube rotation 0.5 s. CT acquisition parameters for scanner 2 were: 120 kV, fixed tube current ≈ 73 mAs (40-160 mAs), 4 slices × 3.75 mm and 3.27 mm interval, pitch 1.5:1, tube rotation 0.8 s. On scanner 1, time of flight (TOF) and point spread function (PSF) algorithms were used for the reconstruction of images, with filter cut-off 5 mm, 18 subsets and 3 iterations. On scanner 2, an ordered subset expectation maximization (OSEM) algorithm with filter cut-off 5 mm, 21 subsets and 2 iterations was used.
PET images were visually and semiquantitatively analyzed by a nuclear physician with at least 10 years of experience and every focal tracer uptake deviating from physiological distribution and background was regarded as suggestive of disease localization.

Statistical Analysis
Statistical analyses were performed using R (version 3.6.3). The descriptive analysis of categorical variables comprised the calculation of simple and relative frequencies. The numeric variables were described as mean, SD, minimum and maximum (range).
The statistical analysis was structured in several steps and was aimed at training a predictive model by testing different Machine Learning (ML) approaches: Logistic Regressor (LR), k-nearest neighbors (kNN), Decision Tree (DT) and Random Forest (RF). Owing to the nature of these ML techniques, different feature selection strategies were used. For LR and kNN, to reduce the complexity of the feature space, we applied a Wilcoxon test to all RaF within a 50-fold cross-validation and removed the features poorly associated with the outcome. The 50-fold cross-validation was performed by randomly splitting the cohort into 80% for training and 20% for validation, 50 times. In this setting, a p-value ≤ 0.001 was considered as the cutoff. DT and RF, on the other hand, did not need any preliminary feature selection because they perform feature selection internally, using the Gini index to split each node into branches (an operation that remains feasible with a relatively high number of variables). The ML models were then trained to find the best model and the corresponding most representative RaF. The different models were trained in the following ways:
• LR: a bivariate Logistic Regressor was trained using the RaF that survived the feature selection step. All possible pairs of RaF with a Spearman's correlation coefficient lower than 0.3 were tested, and only the LR models in which both p-values were lower than 0.05 were considered for testing. This bivariate analysis was conducted in order to rank the pairs by the area under the curve (AUC) of the receiver operating characteristic (ROC) analysis. The entire process was repeated in a 50-fold cross-validation, in order to measure the mean and the SD of the AUCs for each tested pair of RaF.
• kNN: kNN was trained with a 50-fold cross-validation for each pair of RaF tested for LR. This was done to compare the performances of LR and kNN on the same pairs of RaF. Again, the mean and SD of the AUCs were measured.
• DT and RF: both were tested with a 50-fold cross-validation on all the available RaF. In this case, for each run of the cross-validation, only two models were trained (one for DT and one for RF) and the mean and the SD of the AUC were measured on the basis of the 50 different training runs.
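The Gini-based node splitting mentioned for DT and RF can be illustrated with a minimal sketch. It is not the code used in the study (the analyses were run in R); the function names and the toy feature values below are hypothetical and serve only to show how a candidate threshold is scored by the weighted Gini impurity of the two resulting branches.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a set of class labels: 1 - sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_impurity(values, labels, threshold):
    """Weighted Gini impurity of the two branches obtained by splitting at values <= threshold."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    return len(left) / n * gini_impurity(left) + len(right) / n * gini_impurity(right)


def best_threshold(values, labels):
    """Score the midpoints between consecutive distinct values and return (threshold, impurity)."""
    distinct = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    scored = [(t, split_impurity(values, labels, t)) for t in candidates]
    return min(scored, key=lambda pair: pair[1])


# Hypothetical toy data: one radiomic feature value per lesion and its histology.
feature = [0.2, 0.4, 0.5, 1.1, 1.3, 1.6]
histology = ["ADK", "ADK", "ADK", "SCC", "SCC", "SCC"]
threshold, impurity = best_threshold(feature, histology)
print(threshold, impurity)
```

A tree repeats this search over every available feature at every node, which is why no separate feature selection step is required before training.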
All of the aforementioned analyses were performed separately for scanner 1 alone, scanner 2 alone and both scanners together.
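The repeated 80%/20% splitting and the measurement of the mean and SD of the AUC can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the R code used in the study: the value of k, the random seeds, the absence of feature scaling and the synthetic two-feature data are all hypothetical, and the Wilcoxon and Spearman selection steps are omitted.

```python
import math
import random
import statistics


def auc_from_scores(scores, labels):
    """AUC as the probability that a positive case outscores a negative one (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        return None  # AUC is undefined when one class is missing from the split
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def knn_scores(train_x, train_y, test_x, k):
    """Score each test case by the fraction of positives among its k nearest training cases."""
    scores = []
    for point in test_x:
        order = sorted(range(len(train_x)), key=lambda i: math.dist(point, train_x[i]))
        scores.append(sum(train_y[i] for i in order[:k]) / k)
    return scores


def repeated_holdout_auc(x, y, k=5, runs=50, train_fraction=0.8, seed=0):
    """Repeat a random train/validation split `runs` times; return mean and SD of the AUC."""
    rng = random.Random(seed)
    indices = list(range(len(x)))
    aucs = []
    for _ in range(runs):
        rng.shuffle(indices)
        cut = int(train_fraction * len(indices))
        train, test = indices[:cut], indices[cut:]
        scores = knn_scores([x[i] for i in train], [y[i] for i in train],
                            [x[i] for i in test], k)
        auc = auc_from_scores(scores, [y[i] for i in test])
        if auc is not None:
            aucs.append(auc)
    return statistics.mean(aucs), statistics.stdev(aucs)


# Synthetic stand-in for one pair of RaF: 150 cases labelled 0 and 50 labelled 1.
data_rng = random.Random(42)
x = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(150)]
x += [(data_rng.gauss(2, 1), data_rng.gauss(2, 1)) for _ in range(50)]
y = [0] * 150 + [1] * 50
mean_auc, sd_auc = repeated_holdout_auc(x, y)
print(f"AUC: mean {mean_auc:.3f}, SD {sd_auc:.3f}")
```

Running the same function on the cases from scanner 1, from scanner 2 and from both scanners would give the three sets of results described above. Because kNN relies on distances, features on very different scales would normally need to be standardized first, a step this sketch leaves out.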

Results
Among the 227 patients included in the study, 147 were men (64.8%) and the mean age was 70 years (SD: 8, range: 38-87). Data about the grading of disease were available for only 100 patients: 1 patient (1.0%) had G1 disease, 42 (42.0%) had G2 disease and 57 (57.0%) had G3 disease. Furthermore, a total of 142 (62.6%) scans were performed on the Discovery 690 tomograph (scanner 1), while 85 (37.4%) were acquired on the Discovery STE tomograph (scanner 2). Considering the scans by tomograph, ADK was diagnosed in 107 (75.4%) of the scans performed on scanner 1 and SCC in 35 (24.6%). On scanner 2, ADK was present in 66 (77.6%) subjects while SCC was reported in 19 (22.4%) cases. No significant difference in terms of final diagnosis was reported between the two scanners (p-value 0.7).
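The comparison of histology between the two scanners can be checked from the counts reported above. The text does not state which test was applied, so the following sketch assumes a Pearson chi-squared test without continuity correction on the 2 × 2 table; the function name is hypothetical.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-squared test (no continuity correction) for a 2x2 table.

    Returns the test statistic and the p-value for 1 degree of freedom.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    # With 1 degree of freedom the chi-squared survival function is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Rows: scanner 1, scanner 2. Columns: ADK, SCC (counts taken from the text).
counts = [(107, 35), (66, 19)]
statistic, p_value = chi_square_2x2(counts)
print(f"chi2 = {statistic:.3f}, p = {p_value:.2f}")
```

Under this assumption the p-value rounds to 0.7, in line with the reported value.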
RaF selection analyses and cross-correlation matrices before applying the ML models for all the scanners are presented in Figure 1. After the cross-correlation selection, more variables were removed for scanner 2 than for scanner 1 or for both scanners considered together. The best results of the LR analysis for scanner 1, scanner 2 and both scanners together are presented in Table 2. For scanner 1, L_least, F_cm.2.5Dmerged.diff.entr, L_major and F_cm_2.5D.diff.entr were among the RaF with the best performances, and AUCs over 0.8 were reported. On scanner 2, the best performances were obtained by F_cm.clust.shade, F_cm_merged.inv.var, F_cm.inv.var and F_cm_merged.clust.shade, with AUCs between 0.6 and 0.8. When considering the two scanners together, L_major, F_rlm.2.5Dmerged.sre, F_stat.entropy and F_rlm_2.5D.sre were part of the pairs with the best performances, again with AUCs between 0.6 and 0.8.
The pairs of variables with the best performances at kNN analysis for all the scanners are presented in Table 3. For scanner 1, F_stat.entropy, F_cm_2.5D.clust.shade, F_cm.2.5Dmerged.clust.shade, F_morph.surface, F_cm_merged.clust.prom and F_cm.clust.prom had the best performances, with AUCs generally above 0.8. For scanner 2, the best performances were obtained by F_stat.median, F_cm.clust.shade, F_cm_merged.clust.shade and F_cm.joint.max, with AUC values between 0.6 and 0.8. In the general analysis considering all the scanners, F_stat.uniformity, F_cm.clust.shade, F_cm_merged.clust.shade, F_rlm.lre, F_cm.2.5Dmerged.sum.entr and F_cm_2.5D.clust.shade were part of the pairs of RaF with the best performances, with AUCs above 0.8. A visual representation of the best combinations of RaF for kNN and LR is presented in Figure 2.
Furthermore, RF and TM were also applied to our cohort, and a comparison of the performances of these analyses with LR and kNN for all the PET/CT scanners is presented in Figure 3 and Table 4. In this setting, TM had the lowest performances compared to the other models, and both TM and RF had AUC values that were lower and more heterogeneous compared to LR and kNN. In particular, for scanner 1 the mean AUCs for LR, kNN, RF and TM were 0.852, 0.882, 0.793 and 0.701, respectively; for scanner 2 the mean AUCs were 0.777, 0.870, 0.703 and 0.496, respectively, while when considering both scanners together the values were 0.784, 0.859, 0.775 and 0.682, respectively.
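The mean AUCs listed above can be tabulated to make the comparison across models and scanners explicit. The short sketch below only rearranges the values reported in the text; the dictionary layout and function names are illustrative choices.

```python
# Mean AUCs as reported in the text (TM is the label used there for the tree model).
reported_auc = {
    "scanner 1": {"LR": 0.852, "kNN": 0.882, "RF": 0.793, "TM": 0.701},
    "scanner 2": {"LR": 0.777, "kNN": 0.870, "RF": 0.703, "TM": 0.496},
    "both scanners": {"LR": 0.784, "kNN": 0.859, "RF": 0.775, "TM": 0.682},
}


def best_model(cohort):
    """Model with the highest mean AUC for a given cohort."""
    aucs = reported_auc[cohort]
    return max(aucs, key=aucs.get)


def weakest_cohort(model):
    """Cohort in which a given model has its lowest mean AUC."""
    return min(reported_auc, key=lambda cohort: reported_auc[cohort][model])


for cohort in reported_auc:
    print(f"{cohort}: best model = {best_model(cohort)}")
for model in ("LR", "kNN", "RF", "TM"):
    print(f"{model}: lowest mean AUC on {weakest_cohort(model)}")
```

With these values, kNN ranks first in every cohort, and scanner 2 gives the lowest mean AUC for LR, RF and TM, whereas for kNN the lowest value is the one obtained with both scanners together.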

Discussion
As previously underlined in the literature, the technology of the scanner used to acquire PET/CT images directly affects the subsequent extraction of RaF and, in this setting, the use of different tomographs in the same department is frequent in daily practice [7,17,18]. These insights suggest that different scanners can have different preferred features in terms of correlation with a clinical outcome, and that radiomics models coming from centers adopting different technologies should be critically considered. In our cohort, we had to deal with two different scanners and our results confirmed this point. For example, owing to the different technologies, the AUC values for LR decreased in order from scanner 1 to both scanners to scanner 2. As expected, the p-values of the features in the related models were lowest for scanner 1, followed by both scanners and then scanner 2. Moreover, the RaF selection performed before applying any ML model retained only a small number of features for scanner 2 compared with scanner 1 and with both scanners. This evidence was also confirmed for the kNN model, with generally higher AUC values and lower p-values for scanner 1 and for both scanners than for scanner 2. The higher performance of kNN is probably due to the linear limitations of the LR models; on the other hand, LR provides a more interpretable model, which can easily be plotted in a 2D space or shaped in the form of a nomogram. Our findings are therefore in line with the concept that different PET/CT technologies can influence the contouring and/or the feature values, and therefore the performances, of RaF. In contrast with our findings, Ma et al. [20] revealed that, in a large cohort of NSCLC, different scanners did not influence the extraction of most RaF. However, this is not completely unexpected: in their study they used two tomographs separated by a different technological gap, with different reconstruction protocols and acquisition parameters. Even if, in their case, the gap between the two scanners was not decisive, scanner dependence remains, in general, an open challenge in this kind of analysis.
Our study was performed by comparing different ML models. In this setting, the models with the best performances in the analyses for all the scanners were kNN and LR and, in general, the kNN model performed better than the others. A possible reason for the higher performances of kNN is its non-linearity in partitioning the feature space. This is certainly an advantage but, on the other hand, it carries a higher risk of overfitting. In addition, a kNN model is computationally expensive (to make a prediction, the decision support system (DSS) needs to compute the distances to all the cases in the database) and cannot be summarized in an easy-to-use graphical representation. On the other hand, LR has lower but similar performances, reduces the risk of overfitting, clearly shows the role of the features in terms of p-values and can easily be reshaped in the form of an easy-to-use nomogram. In general, in choosing the best predictor, all the aforementioned points should be qualitatively considered, to reach the best trade-off for the specific needs. RF and TM had lower mean AUCs compared to the aforementioned models, and these values were also highly heterogeneous in each single-scanner analysis. Furthermore, as previously mentioned, the impact of the scanner on such analyses is evident and is confirmed in this scenario as well: mean AUC values for scanner 2 were lower than those for scanner 1 for every model, and lower than those for both scanners considered together for all models except kNN.
In general, our results underline the ability of [18F]FDG PET/CT RaF to discriminate between ADK and SCC in stage I and stage II NSCLC. The first study to investigate the ability of RaF in this setting was proposed by Ha et al. [23], who reported that these NSCLC entities had different tumoral heterogeneity, with 15 RaF that were able to discriminate between them, and that a linear discriminant analysis with such parameters was able to classify them with high performances. More recently, Kim et al. [24] reported that tumor heterogeneity of [18F]FDG uptake was significantly different between ADK and SCC and that this parameter was able to predict recurrence of ADK but not SCC in patients who had undergone curative surgery. Orlhac et al. [22] reported the role of different resampling methods on RaF in distinguishing between ADK and SCC, underlining that textural parameters using absolute resampling can vary as a function of the cancer subtype more than with relative resampling. An interesting study by Ma et al. [20] investigated the role of texture and colour analyses in differentiating between NSCLC subtypes in a large cohort. They revealed that a combination of both methods had higher performances compared to either single method in differentiating between SCC and ADK, with an AUC of 0.89. More recently, Bianconi et al. [19] reported that SCC had a significantly higher degree of heterogeneity, stronger variability and lower uniformity of [18F]FDG uptake compared to other subtypes, while ADK had lower heterogeneity, weaker variability and higher uniformity compared to other subtypes of NSCLC. Lastly, Aydos et al. [21] revealed that kurtosis was the only RaF able to differentiate between such histological subtypes, given that in SCC it was significantly lower compared to ADK. Interestingly, some RaF were also able to differentiate between moderately and poorly differentiated ADK.
Generally speaking, even with a high degree of heterogeneity in terms of number of patients, number of scanners used, methods used for RaF extraction and analyses of such parameters, a promising role for texture analysis in differentiating between NSCLC subtypes has emerged. In this setting, our study confirms such evidence, reflecting also the influence of different scanners, which, as mentioned, are frequently used side by side in daily practice.
Our work is not without limitations. First of all, this is a retrospective study. Second, conventional PET/CT scanners rather than last-generation tomographs were used. Moreover, even if characterized by a relatively high number of patients, in particular compared to similar works, the sample of patients included still appears sub-optimal to clearly evaluate the predictive abilities of texture analysis. The fact that the two tomographs used in our work had different reconstruction algorithms can be another confounding factor. Lastly, the reproducibility of radiomics analysis in terms of multicentric evaluation is still an open issue and further research in this field is mandatory.

Conclusions
In conclusion, our study enabled the selection of some [18F]FDG PET/CT RaF and ML models that are able to predict the histological subtype of NSCLC with good performances. Furthermore, evident influences of the type of PET/CT scanner on such performances were underlined.
Institutional Review Board Statement: Ethical review and approval were waived for this study, due to the retrospective design of the study according to the local laws.
Informed Consent Statement: Informed consent was obtained from all subjects involved in the study.
Data Availability Statement: Data available on request due to privacy/ethical restrictions.

Conflicts of Interest: The authors declare no conflict of interest.