Therapeutic Decision Making in Prevascular Mediastinal Tumors Using CT Radiomics and Clinical Features: Upfront Surgery or Pretreatment Needle Biopsy?

Simple Summary
Initial management approaches for prevascular mediastinal tumors (PMTs) can be divided into two categories: direct surgery and core needle biopsy (CNB). Although the gold standard diagnostic method is histopathological examination, the selection of the initial management between direct surgery and CNB is more urgent for patients with PMTs than the definite diagnosis of PMT subtypes. The study aimed to develop clinical–radiomics machine learning (ML) classification models to differentiate patients who needed direct surgery from patients who needed CNB among patients with PMTs. An ensemble learning model combining five ML models had a classification accuracy of 90.4% (95% CI = 87.9 to 93.0%), which significantly outperformed clinical diagnosis (86.1%; p < 0.05). This model may be used as a clinical decision support system to facilitate the selection of the initial management of PMT.

Abstract
The study aimed to develop machine learning (ML) classification models for differentiating patients who needed direct surgery from patients who needed core needle biopsy among patients with prevascular mediastinal tumor (PMT). Patients with PMT who received a contrast-enhanced computed tomography (CECT) scan and initial management for PMT between January 2010 and December 2020 were included in this retrospective study. Fourteen ML algorithms were used to construct candidate classification models via the voting ensemble approach, based on preoperative clinical data and radiomic features extracted from the CECT. The classification accuracy of clinical diagnosis was 86.1%. The first ensemble learning model was built by randomly choosing seven ML models from a set of fourteen ML models and had a classification accuracy of 88.0% (95% CI = 85.8 to 90.3%). The second ensemble learning model was the combination of five ML models, including NeuralNetFastAI, NeuralNetTorch, RandomForest with Entropy, RandomForest with Gini, and XGBoost, and had a classification accuracy of 90.4% (95% CI = 87.9 to 93.0%), which significantly outperformed clinical diagnosis (p < 0.05). Owing to its superior performance, the voting ensemble learning clinical–radiomic classification model may be used as a clinical decision support system to facilitate the selection of the initial management of PMT.

Given that PMTs are highly heterogeneous diseases, it is imperative to tailor different management strategies to the specific subtypes [9,10]. For example, guidelines recommend medical treatment rather than surgical resection for lymphomas [11]. In contrast, resectable thymic epithelial tumors (TETs) require surgical resection to avoid the risk of tumor spread during a biopsy of an encapsulated thymoma [12]. Initial management approaches for PMTs can be divided into two categories: direct surgery and core needle biopsy (CNB). Category 1 includes patients with resectable TETs, cysts, thymic hyperplasia, teratomas, goiters, lymphangiomas, or thymolipomas, who are advised to undergo immediate surgical resection [13][14][15]. Category 2 includes patients with unresectable TETs, lymphomas, Castleman disease, or malignant germ cell tumors, who are recommended to start with CNB [15][16][17]. After a confirmed pathologic diagnosis, Category 2 patients may receive targeted therapy, chemotherapy, and/or radiotherapy [18,19].
Clinicians take age, preoperative clinical characteristics, and preoperative radiological features into consideration when making a clinical diagnosis. Thymic epithelial tumors are more common in adults aged 50-70 years; lymphoma and malignant germ cell tumors are more common in adults aged 20-40 years. Patients with malignant germ cell tumor often have higher serum levels of alpha-fetoprotein (AFP) and human chorionic gonadotropin (HCG). Regarding radiological features, benign teratoma often has a fat component, lymphoma often presents with multiple mediastinal lymphadenopathy, resectable thymoma often has a rounder shape with a regular margin, and unresectable thymoma and thymic carcinoma often show encasement of the great vessels. Clinicians then recommend the suitable initial management based on the clinical diagnosis. The accuracy of such binary classification is validated by the final pathology report, and misclassification results in unnecessary CNB or surgical resection, with a negative impact on patient health.
With the advancement of machine learning (ML) technology, various image-based ML models have been developed to predict breast cancer metastasis [20], lung diseases [21,22], and ductal carcinoma [23]. Furthermore, radiomics-based predictive models have been developed to differentiate between PMT subtypes [24,25]. Liu et al. [26] reported that a radiomics-based ML model for differentiating anterior mediastinal cysts from thymomas had a sensitivity of 89.3%, and that the sensitivity improved to 92.3% when both clinical and radiological features were used to construct the model. Several clinical-radiomics models have been developed to predict pathological subtypes of PMTs [27], to differentiate low-risk from high-risk thymomas [28], and to distinguish TETs from other PMT subtypes [29], thereby facilitating clinical diagnosis. However, no clinical-radiomics models have been developed to predict the selection of the initial management between direct surgery and CNB.
The gold standard diagnostic method for PMTs is histopathological examination. Compared to the definite diagnosis of PMT subtypes, the selection of the initial management between direct surgery and CNB is a more urgent issue that must be solved quickly in clinical practice. Imaging findings and clinical characteristics may be useful to differentiate surgical cases from nonsurgical cases [30]. Therefore, this retrospective study aimed to develop ML classification models using preoperative clinical features and radiomics to differentiate Category 1 patients who need immediate surgical resection from Category 2 patients who need CNB first. These expected classification models will contribute to the development of an ML-based clinical decision support system to facilitate the selection of the appropriate initial management for each individual patient with PMT.

Patient Selection and Classification
Consecutive patients aged 20 years or older with PMT who underwent surgical resection or CNB between January 2010 and December 2020 were initially selected. We enrolled patients who underwent a complete contrast-enhanced computed tomography (CECT) of the chest. Patients who underwent non-contrast-enhanced computed tomography alone or only had head and neck computed tomography without complete coverage of the entire thoracic cavity were excluded. Eligible patients were divided into two categories based on the final pathology reports. Category 1 patients had pathologically confirmed resectable TET, thymic hyperplasia, cyst, lymphangioma, thymolipoma, or teratoma and would undergo direct surgery. Category 2 patients had pathologically confirmed unresectable TET, lymphoma, Castleman disease, or malignant germ cell tumor and would receive CNB first. The protocol of this retrospective study was reviewed and approved by the Institutional Review Board (IRB) of the National Cheng Kung University Hospital (IRB No. A-ER-111-287; 4 October 2022), and the requirement for informed consent was waived due to the retrospective nature of this study.
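The reference labeling described above amounts to a lookup from the final pathology diagnosis to one of the two management categories. A minimal sketch follows; the subtype lists are taken from the paragraph above, while the dictionary layout, the label strings, and the function name `assign_category` are illustrative assumptions rather than the study's actual code.

```python
# Subtype lists follow the text; names and structure are hypothetical.
DIRECT_SURGERY = "Category 1"  # direct surgery
BIOPSY_FIRST = "Category 2"    # core needle biopsy (CNB) first

CATEGORY_BY_PATHOLOGY = {
    "resectable TET": DIRECT_SURGERY,
    "thymic hyperplasia": DIRECT_SURGERY,
    "cyst": DIRECT_SURGERY,
    "lymphangioma": DIRECT_SURGERY,
    "thymolipoma": DIRECT_SURGERY,
    "teratoma": DIRECT_SURGERY,
    "unresectable TET": BIOPSY_FIRST,
    "lymphoma": BIOPSY_FIRST,
    "Castleman disease": BIOPSY_FIRST,
    "malignant germ cell tumor": BIOPSY_FIRST,
}


def assign_category(final_pathology: str) -> str:
    """Return the reference label for a final pathology diagnosis."""
    try:
        return CATEGORY_BY_PATHOLOGY[final_pathology]
    except KeyError:
        # Subtypes outside the binary scheme are rejected, not guessed.
        raise ValueError(f"subtype not covered: {final_pathology!r}") from None
```

Raising an error for an unlisted subtype, instead of defaulting to one category, keeps rare diagnoses from silently entering the reference standard.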

Clinical Diagnosis and Clinical Data Collection
From January 2010 to December 2020, more than 20 clinicians specializing in thoracic surgery, thoracic medicine, oncology, and radiology participated in the clinical diagnosis. Clinical diagnosis and the subsequent treatment selection were based on the baseline clinical data and preoperative CECT.
A retrospective chart review was performed to collect patient demographic and baseline clinical characteristics, such as age, sex, presence of myasthenia gravis (MG) symptoms, pleural effusion or mediastinal lymphadenopathy (LAP) on CECT, and the serum levels of tumor markers including AFP, HCG, and lactate dehydrogenase (LDH). Serum tumor marker levels were categorized as normal, higher, or missing. It is worth noting that tumor markers are not part of routine blood tests, and clinicians may order them for patients with suspected malignant germ cell tumor.

Image Acquisition and Preprocessing
Four CT scanners, including a Siemens SOMATOM Definition Flash, Siemens SOMATOM Definition AS, Siemens SOMATOM Sensation 16, and GE Optima CT660, were used to acquire CECT images. Contrast medium (60 to 120 mL) was intravenously administered at a rate of 1.5 mL/s, followed by a 20 mL saline flush. All CECT images were obtained 90 s after contrast medium administration. The image size was 512 × 512 pixels. All images were reconstructed in 5 mm slices with a smooth standard convolution kernel (B40f), as previously described [31]. After CECT images were imported into the open-source software 3D Slicer version 4.10.2, the PMTs were manually contoured by a thoracic radiologist (C.Y.L.) with 9 years of experience, blinded to patient diagnosis, using the built-in paint tool as previously described [32]. Tumor segmentation was performed in the mediastinal setting (window level, 50 HU; window width, 350 HU) on the axial CT plane. For normalization, all CT voxels were resampled to 1 mm³ using cubic interpolation.
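The mediastinal window used for segmentation (level 50 HU, width 350 HU) corresponds to a displayed range of -125 to 225 HU. The sketch below shows that arithmetic and a simple clip-and-rescale for display. The function names and the rescaling to the range 0 to 1 are illustrative assumptions; the viewer's own implementation is not described in the text.

```python
def window_bounds(level_hu: float, width_hu: float) -> tuple:
    """Lower and upper HU limits of a display window."""
    half_width = width_hu / 2.0
    return (level_hu - half_width, level_hu + half_width)


def apply_window(hu_values, level_hu=50.0, width_hu=350.0):
    """Clip HU values to the window and rescale them to [0, 1].

    The [0, 1] rescaling is an assumption for display purposes only.
    """
    low, high = window_bounds(level_hu, width_hu)
    return [(min(max(v, low), high) - low) / (high - low) for v in hu_values]
```

With the default arguments, air (about -1000 HU) maps to 0.0, tissue at 50 HU maps to 0.5, and anything at or above 225 HU maps to 1.0.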

Radiomic Feature Extraction and Selection
The open-source platform PyRadiomics was used to extract 3D radiomic features from the segmentation of the PMTs on the images [33]. A total of 851 radiomic features were extracted, including 14 shape features, 18 intensity histogram features, 74 texture features, and 745 wavelet features. The least absolute shrinkage and selection operator (LASSO) regression analysis is a feature selection method based on a linear regression model. Both the 851 radiomic features and all the above clinical data were entered into the LASSO regression for variable selection, as previously described [28,29]. The extracted radiomic features reflect subtle characteristics of PMTs in images, such as the sphericity, diameter, homogeneity, and calcification. The variables with non-zero coefficients at the optimal lambda value were selected by the LASSO regression to build ML classification models to distinguish Category 1 patients from Category 2 patients.
Three different approaches were used to develop ML classification models to discriminate between Category 1 patients and Category 2 patients. First, all clinical variables and radiomic features were used to train and validate 14 different ML models using 10-fold cross-validation, a common approach for evaluating the performance of ML prediction models [35][36][37]. The dataset was divided into 10 subsets. The model was trained on 9 randomly selected subsets and was validated on the remaining subset. The model was then trained and validated in the same way 10 times. The macro F1 score, macro precision, macro recall, accuracy, and area under the receiver operating characteristic curve (AUROC) calculated in each of the 10 rounds were then averaged to estimate the generalization performance of the model, as previously described [38].
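The cross-validation procedure above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the study's pipeline: the helper names (`k_fold_indices`, `macro_f1`, `cross_validate`), the fixed random seed, and the `fit`/`predict` callables are all hypothetical, and only the macro F1 score is shown.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle sample indices and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def macro_f1(y_true, y_pred, labels=None):
    """Unweighted mean of the per-class F1 scores."""
    if labels is None:
        labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def cross_validate(fit, predict, X, y, k=10, seed=0):
    """Train on k-1 folds, score the held-out fold, and average over folds."""
    fold_scores = []
    for held_out in k_fold_indices(len(y), k, seed):
        held = set(held_out)
        train = [i for i in range(len(y)) if i not in held]
        model = fit([X[i] for i in train], [y[i] for i in train])
        preds = [predict(model, X[i]) for i in held_out]
        fold_scores.append(macro_f1([y[i] for i in held_out], preds))
    return sum(fold_scores) / k, fold_scores
```

For 375 patients, this split gives five folds of 38 and five folds of 37, and every patient appears in exactly one validation fold.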

Machine Learning Model Building
Additionally, LASSO regression was applied to all clinical variables and radiomic features, resulting in 20 combinations of the selected variables based on various lambda values. These combinations are labeled as Selection_1 to Selection_20. The features selected by LASSO selection, as well as all features without LASSO selection, were independently used to build the aforementioned models.
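To illustrate why different lambda values yield different feature subsets, the sketch below fits a LASSO by cyclic coordinate descent and returns the indices of the non-zero coefficients. It is a toy under stated assumptions: the objective is taken as (1/(2n))·||y − Xw||² + λ·||w||₁, no intercept or feature standardization is included, and the function names are hypothetical. The solver actually used in the study is not stated in the text.

```python
def soft_threshold(value, lam):
    """Shrink a value toward zero by lam, returning 0.0 inside [-lam, lam]."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=100):
    """Minimize (1/(2n))*||y - Xw||^2 + lam*||w||_1 (assumed objective)."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            # Correlation of feature j with the residual that ignores feature j.
            rho = sum(
                X[i][j] * (y[i] - sum(X[i][k] * w[k] for k in range(p) if k != j))
                for i in range(n)
            ) / n
            z = sum(X[i][j] ** 2 for i in range(n)) / n
            w[j] = soft_threshold(rho, lam) / z if z else 0.0
    return w


def selected_features(X, y, lam):
    """Indices of the features with non-zero LASSO coefficients."""
    return [j for j, wj in enumerate(lasso_coordinate_descent(X, y, lam)) if wj != 0.0]
```

On a small example where the outcome depends only on the first of two features, a small lambda keeps that feature and a large lambda removes every feature, which is the mechanism behind a sequence of selections such as Selection_1 to Selection_20.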
Subsequently, we constructed voting ensemble ML models using Python version 3.8.9. The final decision was made by majority voting. The features selected by Selection_3 and all features without LASSO selection were used independently to build voting ensemble learning models. The performance of all combinations of 3, 5, 7, or 9 ML models, which were randomly selected from a total of 14 ML models, was evaluated. Two voting ensemble approaches were used to select the optimal ML models. First, the classification performance was averaged separately over the combinations of 3, 5, 7, or 9 models. The number of combined ML models with the highest mean accuracy was identified as the final model. Second, the single combined ML model with the highest accuracy was identified as the optimal voting ensemble learning model. The construction workflow of the voting ensemble learning models is shown in Figure 1.
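Majority voting and the search over model combinations can be sketched as follows. This is an illustration under stated assumptions: each base model's predictions are supplied as a list per model name, the helper names are hypothetical, and hard (label) voting is assumed because the text describes the final decision as a majority vote.

```python
from collections import Counter
from itertools import combinations
from math import comb


def majority_vote(predictions):
    """Return the label predicted by the largest number of models."""
    return Counter(predictions).most_common(1)[0][0]


def ensemble_accuracy(model_preds, y_true, members):
    """Accuracy of the hard-voting ensemble formed by the named models."""
    correct = 0
    for i, truth in enumerate(y_true):
        votes = [model_preds[name][i] for name in members]
        correct += majority_vote(votes) == truth
    return correct / len(y_true)


def best_ensemble(model_preds, y_true, size):
    """Score every ensemble of the given size and return the most accurate."""
    return max(
        combinations(sorted(model_preds), size),
        key=lambda members: ensemble_accuracy(model_preds, y_true, members),
    )


# Number of candidate ensembles of each size drawn from 14 base models.
N_CANDIDATES = {k: comb(14, k) for k in (3, 5, 7, 9)}
```

Choosing an odd number of models guarantees that a two-class vote cannot tie. With 14 base models there are 364, 2002, 3432, and 2002 possible ensembles of size 3, 5, 7, and 9, respectively.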

Statistical Analyses
Age is expressed as the median (Q1, Q3), and the between-group difference was examined using the Mann-Whitney U test. Categorical variables are expressed as counts (percentages), and between-group differences were compared using the chi-square test or Fisher's exact test. As 10-fold cross-validation was used in this study, the macro F1 score, macro precision, macro recall, accuracy, and AUROC were expressed as the mean ± SD. Statistical significance was set at a p value of 0.05 or a 95% confidence interval (95% CI) [39]. The p value represents the probability of committing a Type I error, while the significance level denotes the upper limit of the acceptable Type I error probability, typically set at 0.05, 0.01, or 0.001. The choice of the significance level depends on the researcher's willingness to take on the risk of making a decisional error. Setting a significance level of 0.05 means accepting a probability of a false alarm of at most 0.05 (i.e., 1 occurrence in 20). Statistical analysis of the ML models was performed using statsmodels version 0.13.1.

Patient Selection and Grouping
A total of 375 eligible patients were included in the study. According to the final pathology reports, 182 patients were to undergo direct surgery (Category 1), and 193 patients were to undergo CNB first (Category 2). The flowchart of the patient selection and grouping is shown in Figure 2.

Baseline Demographic and Clinical Characteristics
The baseline demographic and clinical characteristics of the two categories are presented in Table 1. Category 1 patients were significantly older than Category 2 patients (p < 0.0001). Category 1 patients had a significantly higher percentage of MG symptoms but lower percentages of pleural effusion and mediastinal lymphadenopathy than Category 2 patients (all p < 0.0001). After excluding patients with missing data, Category 2 had significantly more patients with higher levels of LDH and AFP than Category 1 (both p ≤ 0.0147). There were no significant differences in sex and serum HCG between the two categories (Table 1). The centrality and dispersion for age, LDH, AFP, and HCG are presented in Supplementary Table S1.

Classification Accuracy for Clinical Diagnosis
With reference to the final pathology reports (gold standard), the classification accuracy of the clinical diagnosis was 86.1% (323/375). This means that 52 out of 375 patients were wrongly classified by the clinical diagnosis in the first place. Of these, 15 patients were misclassified by clinical diagnosis into Category 2, and 37 patients were misclassified by clinical diagnosis into Category 1.
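The counts above, together with the category sizes reported earlier (182 and 193), allow the accuracy and the per-category hit rates of clinical diagnosis to be worked out. The sketch assumes that "misclassified into Category 2" means the patient's pathology-based category was Category 1, and vice versa; the per-category rates are derived here and are not reported in the text.

```python
N_CATEGORY_1 = 182  # pathology-based direct-surgery patients
N_CATEGORY_2 = 193  # pathology-based CNB-first patients
CAT1_CALLED_CAT2 = 15  # assumed: true Category 1, clinically labeled Category 2
CAT2_CALLED_CAT1 = 37  # assumed: true Category 2, clinically labeled Category 1


def clinical_diagnosis_summary():
    """Accuracy and per-category recall of the clinical diagnosis."""
    total = N_CATEGORY_1 + N_CATEGORY_2
    correct_1 = N_CATEGORY_1 - CAT1_CALLED_CAT2
    correct_2 = N_CATEGORY_2 - CAT2_CALLED_CAT1
    return {
        "accuracy": (correct_1 + correct_2) / total,
        "category_1_recall": correct_1 / N_CATEGORY_1,
        "category_2_recall": correct_2 / N_CATEGORY_2,
    }
```

Under that reading, 167 + 156 = 323 patients were classified correctly, which reproduces the reported 86.1%, with roughly 91.8% of Category 1 and 80.8% of Category 2 patients labeled correctly.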

The Individual Machine Learning Model
Among 14 different ML models, the classification model based on the CatBoost algorithm had the best classification performance, with an accuracy of 0.8227 ± 0.0430 (Table 2). However, the accuracy of the CatBoost model was 82.3% (95% CI = 79.6 to 84.9%), which was significantly lower than that of clinical diagnosis (86.1%), because 86.1% did not fall within the 95% CI of the CatBoost model. The CatBoost model with the third LASSO selection (Selection_3) had the best classification performance, with an optimal lambda of 0.025354, a Ln (lambda) of −3.675, and an accuracy of 0.8422 ± 0.0423 (Table 3). Therefore, the variables selected by Selection_3 were subsequently used for classification modeling (Supplementary Table S2). Nevertheless, the accuracy of the CatBoost model with Selection_3 was 84.2% (95% CI = 81.6 to 86.8%), which was still significantly lower than that of the clinical diagnosis (86.1%).

Voting Ensemble Machine Learning Models
The results showed that the random selection of seven ML models from a total of fourteen ML models using the features selected by Selection_3 had the best average classification performance, with an accuracy of 0.8804 ± 0.0358 (Table 4). The accuracy of this combination was 88.0% (95% CI = 85.8 to 90.3%), which was not significantly different from that of the clinical diagnosis (86.1%). In addition, the classification performance of all possible combinations of three, five, seven, or nine ML models using the features selected by Selection_3 was evaluated by the voting process. The best combination with the highest classification accuracy was a combination of five ML models, including NeuralNetFastAI, NeuralNetTorch, RandomForest with Entropy, RandomForest with Gini, and XGBoost, with an accuracy of 0.9044 ± 0.0408. The accuracy of this ensemble learning classification model was 90.4% (95% CI = 87.9 to 93.0%), which was significantly higher than that of the clinical diagnosis (86.1%) (p < 0.05).
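The reported 95% CIs are numerically consistent with a normal-approximation interval around the 10-fold mean, that is, mean ± 1.96·SD/√10. This formula is inferred from the published numbers and is an assumption; the text does not state how the intervals were computed. The sketch below reproduces the intervals and applies the "reference value inside or outside the interval" criterion; the function and dictionary names are hypothetical.

```python
from math import sqrt


def fold_mean_ci(mean, sd, n_folds=10, z=1.96):
    """Normal-approximation CI for a mean over cross-validation folds (assumed)."""
    half_width = z * sd / sqrt(n_folds)
    return (mean - half_width, mean + half_width)


def compare_with_reference(mean, sd, reference, n_folds=10):
    """Classify a model as higher, lower, or not different from a reference."""
    low, high = fold_mean_ci(mean, sd, n_folds)
    if reference < low:
        return "higher"
    if reference > high:
        return "lower"
    return "not different"


CLINICAL_ACCURACY = 323 / 375  # accuracy of clinical diagnosis

# Mean and SD of accuracy as reported in the text.
REPORTED = {
    "CatBoost, all features": (0.8227, 0.0430),
    "CatBoost, Selection_3": (0.8422, 0.0423),
    "mean of 7-model ensembles": (0.8804, 0.0358),
    "best 5-model ensemble": (0.9044, 0.0408),
}
```

Applied to the reported values, this gives 79.6 to 84.9%, 81.6 to 86.8%, 85.8 to 90.3%, and 87.9 to 93.0%, matching the text. By this criterion alone, the best five-model ensemble is higher than clinical diagnosis and CatBoost on all features is lower, while 86.1% lies inside the intervals of both the seven-model average and CatBoost with Selection_3, so the interval criterion by itself does not separate those two from clinical diagnosis.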

Discussion
This retrospective study revealed that the clinical diagnosis had an accuracy of 86.1%. Of the 14 ML algorithms evaluated, the CatBoost model achieved the highest accuracy of 84.2%. The first voting ensemble ML model, which randomly selected seven ML models from the set of fourteen, had an average classification accuracy of 88.0%. The second voting ensemble ML model, a specific combination of five ML models, achieved an accuracy of 90.4%. In particular, the classification accuracy of the second voting ensemble ML model was significantly higher than that of the clinical diagnosis.
Although the gold standard diagnostic method is histopathological examination, the initial management for patients with PMT is determined based on clinical diagnosis rather than pathological diagnosis, indicating the importance of clinical diagnostic tests [40]. In clinical practice, a multidisciplinary team works together to reach a consensus on the clinical diagnosis and initial management for patients with PMT. This consensus-based decision-making process is time- and labor-consuming. Moreover, a discrepancy in clinical diagnostic suggestions might occur in patients aged 30-40 or older than 70 who have normal levels of tumor markers and no specific radiological features. Therefore, ML classification models with superior classification accuracy may serve as clinical decision support systems for differential diagnosis and management. Notably, radiomics-based ML models have been developed to predict overall survival in various cancer sites [41] and to predict management in lower-grade gliomas [42].
Accurate clinical diagnosis is crucial for making treatment decisions, especially for highly diverse PMTs [6,10]. While imaging findings have been widely used to differentiate PMT subtypes based on location, fat content, and calcification [8,30,43], biopsies are necessary when imaging results are inconclusive. CT-guided percutaneous CNB has been shown to be effective and safe for the diagnosis of PMTs, with diagnostic yields ranging from 75.7% to 96% across studies [16,44,45].
In this retrospective study, CECT images were generated by four different CT scanners. Because PMTs are rare and heterogeneous tumors, we were not able to evaluate the influence of different CT scanners on image quality and characteristics in patients with the same PMT subtype. Notably, the use of different CT scanners does not affect the performance of ML models for detecting lung nodules [46], but the reconstruction kernel affects radiomic feature selection and model performance [31]. Thus, in this study, all CECT raw data were reconstructed with a smooth standard convolution kernel (B40f) to reduce variations in image characteristics.
Automated extraction of a large number of quantitative radiomic features allows the analysis of subtle characteristics of tumors on images, such as the sphericity, diameter, homogeneity, and calcification, thereby facilitating differential diagnosis [47]. Resectable TETs usually have regular and smooth borders and smaller diameters, while lymphomas, malignant germ cell tumors, and unresectable TETs tend to have larger diameters and irregular borders. Lymphoma is more homogeneous, but TET and teratoma are more heterogeneous and may be calcified, resulting in differences in the Hounsfield Unit value.
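As a concrete example of one such shape feature, sphericity is commonly defined as the surface area of a sphere with the same volume as the tumor divided by the tumor's actual surface area, which equals (36πV²)^(1/3) / A. The sketch below implements that formula directly from a volume and a surface area; the mesh-based computation that a radiomics package performs on a segmentation is not reproduced here, and the function name is illustrative.

```python
from math import pi


def sphericity(volume, surface_area):
    """Surface area of the equal-volume sphere divided by the actual surface area.

    Equals 1.0 for a perfect sphere and decreases for irregular shapes.
    """
    return (36.0 * pi * volume ** 2) ** (1.0 / 3.0) / surface_area
```

A sphere gives a value of 1.0, while a cube of the same volume gives about 0.81, which is the direction of difference expected between a round, regular lesion and a more irregular one.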
Notably, only one clinical feature, the serum level of LDH, was selected by LASSO in Selection_3. Although patients with a malignant germ cell tumor often have higher levels of AFP and HCG, those clinical features with high specificity were not selected by LASSO, which may in part be due to the small number of patients with a malignant germ cell tumor in the study population. Although LDH has a low specificity for differential diagnosis of PMT subtype, PMTs with higher malignancy and fast growth, such as lymphoma, high-grade thymoma, thymic carcinoma, and malignant germ cell tumor, usually have higher LDH levels. Therefore, serum LDH may be considered for inclusion in routine preoperative blood tests. In support of this suggestion, an elevated serum LDH level has been reported in conditions such as mediastinal seminoma, mediastinal lymphoma, and leukocytosis [48].
The ML algorithms used in this study have been commonly utilized to develop classification models for various diseases [49][50][51][52][53]. Ensemble learning techniques have been widely used in clinical practice to combine several ML models to create a stronger model with superior performance. The ensemble model has a better predictive accuracy than the individual models [50]. RandomForest, CatBoost, and XGBoost were used to develop an ensemble model for predicting pre-cancer in pre- and post-menopausal women, with an accuracy of 94% [51]. In particular, voting ensemble methods have been used to improve the performance of prediction models for the detection of cervical cancer and brain tumors [54,55], prognosis of non-small cell lung cancer [53], and the diagnosis and prognosis of acute coronary syndrome [56]. In our research, we applied voting ensemble methods to construct an ensemble learning model with a classification accuracy of 90.4%, which significantly outperformed the clinical diagnosis.
Deep learning utilizes artificial neural networks to simulate the human brain. Generally speaking, deep learning models outperform conventional ML models; however, deep learning models are more complex and require more parameters, large amounts of data, and high computational cost. In our recent study, we found that the radiomics-based ML model had superior performance compared to the 3D convolutional neural network (CNN) model within a limited dataset [29]. Due to the limited sample size, the present study focused on the development of ML models using both clinical features and radiomics.
Compared with other clinical-radiomics model studies that were limited to specific PMT histology types [27][28][29], this study did not set restrictions on PMT subtypes and included 12 different PMT subtypes. In addition, unlike other clinical-radiomics model studies that aimed to improve differential diagnosis [27][28][29], the developed voting ensemble ML model may be utilized as a clinical decision support system to support the selection of initial management for patients with PMTs.
Some limitations need to be addressed. First of all, the findings of this single-institution retrospective study need to be confirmed by other studies conducted in distinct geographic areas. In addition, this retrospective study did not have a training dataset and a separate validation dataset because of the wide spectrum of PMT subtypes. Instead, 10-fold cross-validation was utilized to train and validate ML models. The mean classification failure rates of ensemble learning models could be calculated in the present retrospective study; however, patients at high risk of misclassification and the underlying reasons could not be investigated. Moreover, due to the etiologic diversity of PMTs [6,57], many PMT subtypes are extremely rare and may not be considered in the binary classification system for initial treatment. Therefore, large-scale multicenter studies are warranted to evaluate the feasibility of the ensemble ML classification model as a clinical decision support system to facilitate the selection of initial treatment for patients with PMTs.
Furthermore, some unresectable TET cases may receive neoadjuvant chemotherapy first. Once the tumor size is reduced, surgical resection will be performed to improve the proportion of R0 resection and the prognosis. However, due to the small number of patients undergoing neoadjuvant chemotherapy followed by surgery, large-scale multicenter studies are needed to develop a reliable model for predicting patients with TET who may undergo neoadjuvant chemotherapy followed by surgical resection.

Conclusions
In conclusion, based on preoperative clinical and radiomic features, we developed a voting ensemble learning classification model to discriminate patients with PMT who need direct surgery from those who need CNB, achieving a classification accuracy of 90.4% that was higher than the clinical diagnosis (86.1%). This voting ensemble learning clinical-radiomic classification model can serve as a clinical decision support system to assist clinicians in selecting the appropriate initial treatment for PMT patients to reduce unnecessary CNB or surgical resection and the harm to patients.

Figure 1. The construction workflow of voting ensemble learning models.

Figure 2. Flowchart of patient selection and grouping based on the final pathology reports.

Table 1. Baseline demographic and clinical characteristics.

Table 1 (notes). Age is reported as the median (Q1, Q3). The other characteristics are presented as the number of patients and percentage, n (%). † 228, 248, and 276 patients had missing data on LDH, AFP, and HCG, respectively.

Table 2. Performance of the 14 ML classification models. Best results are in bold.

Table 3. Classification performance of the CatBoost model with different combinations of selected features or all features without LASSO selection.

Table 4. Summary of results of voting ensemble.