Abstract
Background: The differentiation of pheochromocytoma (PCC) from other adrenal lesions, particularly in incidentalomas with non-benign radiological characteristics (size > 4 cm or density > 10 HU), remains a clinical challenge. The study aimed to develop and validate an interpretable machine learning (ML) model for pairwise differentiation of PCC from adrenocortical carcinomas (ACCs) and non-functioning adrenal adenomas (NAAs) and to identify the most important clinical features. Methods: We analyzed a dataset of 50 clinical, laboratory, and radiological parameters from 123 patients with histologically verified adrenal tumors (63 PCC, 30 ACC, 30 NAA). Four classifiers, namely Logistic Regression (LR), Random Forest (RF), Linear Discriminant Analysis (LDA), and Extreme Gradient Boosting (XGBoost), were trained for binary classification tasks (PCC vs. ACC, PCC vs. NAA, ACC vs. NAA) using a nested stratified cross-validation pipeline to support generalizability and limit overfitting. Results: All four models showed strong predictive performance, with discrimination (AUC) above 0.8. Our analysis, based on the interpretable LR model, identified the key discriminators that differentiated PCC from both ACC and NAA: maximum systolic blood pressure, grade 3 hypertension, headache, palpitation, tachycardia, male sex, and concomitant gastric and duodenal ulcers. In contrast, lower back pain and general weakness strongly indicated a lower probability of PCC. Tumor density specifically differentiated PCC from NAA, whereas tumor size was an important marker for distinguishing PCC from ACC. Conclusions: We developed robust ML models capable of accurately differentiating PCC from other adrenal tumors in complex cases. The models provide a clinically actionable tool for pre-surgical decision support. Furthermore, the identification of key discriminative features enhances the clinical understanding of PCC and facilitates its differential diagnosis prior to histological verification.
1. Introduction
In recent years, the widespread use of computed tomography (CT) and magnetic resonance imaging has led to the enhanced detection of adrenal incidentalomas in the general population [1,2,3]. According to the results of a Minnesota population-based cohort study, adrenal tumor standardized incidence rates increased 10 times from 1995 to 2017 [4]. Another study from a large UK university teaching hospital demonstrated that the number of scans performed increased by 7.7% year-on-year from 2015 to 2019, with a more pronounced increase in the number of scans showing adrenal incidentalomas (14.7% per year) [5]. Once discovered, adrenal tumors require subsequent evaluation for hormonal activity and malignancy risk [1,2].
The European Society of Endocrinology clinical practice guidelines give a strong recommendation that noncontrast CT findings are consistent with a benign adrenal mass if the tumor is <4 cm in size, has a homogeneous appearance, and has a native CT density of ≤10 Hounsfield units (HU) [1]. Adrenal tumors with a different CT appearance need additional consideration.
Benign non-functioning adrenal adenomas (NAAs) are the most common adrenal incidentalomas, with a reported frequency of up to 70–76% [6]. Among adrenal incidentalomas with non-benign CT characteristics, the most prevalent are pheochromocytomas (PCC), followed by adrenocortical carcinomas (ACC) [7].
PCC is a catecholamine-secreting non-epithelial neuroendocrine neoplasm. A recent systematic review and meta-analysis of observational studies showed that the pooled prevalence of pheochromocytoma was 19.8 (95% CI: 9.6–40.8) cases per 1,000,000 individuals, and the incidence rate was 1.9 (95% CI: 1.2–2.6) cases per million person-years [8]. According to the updated 2022 WHO classification of neuroendocrine tumors and tumors of endocrine organs, all PCCs are considered malignant and potentially metastatic tumors [9]. Therefore, early recognition of PCC among adrenal incidentalomas is crucial for determining the appropriate treatment strategy and monitoring to prevent tumor progression and metastasis. Furthermore, surgical intervention for PCC typically requires specific preoperative medical preparation because excessive catecholamine release during the procedure can lead to life-threatening cardiovascular complications [10].
Machine learning (ML) methods have become an increasingly important tool for enhancing the reliability of medical diagnosis. Various ML techniques are actively employed to differentiate PCC from lipid-poor adrenal adenomas and other adrenal tumors [11,12,13] and are further applied to stratify the risk of disease progression in cases of metastatic PCC [14,15]. Provided that sufficiently large datasets are available, ML can be leveraged to create tools that assist specialists in both medical research and clinical practice. Commonly used methods include algorithms such as Logistic Regression [11,12,13], Support Vector Machines [11,12,14], and Random Forests [11,12,15], as well as Neural Networks [16,17]. These ML methods have established themselves as reliable diagnostic aids, and their predictive quality can be further improved with larger amounts of data.
This work presents a pairwise comparison of diagnoses for PCC, ACC, and NAA in patients with adrenal incidentalomas with non-benign CT appearance using acquired data and ML methods. The study aims to advance knowledge and improve the reliability of ML models in differentiating PCC from ACC and NAA, based on clinical, laboratory, and instrumental findings. The training data comprised parameters related to adrenal tumors, with the goal of identifying clinical (pre-testing) predictors of PCC in patients exhibiting an adrenal tumor larger than 40 mm and/or a noncontrast CT density greater than 10 HU.
We developed an ML pipeline to perform feature selection, tune hyperparameters, and assess the distribution of quality metrics on a limited dataset. The pipeline is also robust to outliers. As a pilot study, this work lays the foundation for developing a reliable model to assist in clinical diagnosis as more data becomes available.
2. Materials and Methods
Patients. The study cohort was derived from patients who underwent adrenalectomy at the M.F. Vladimirskiy Moscow Regional Research and Clinical Institute over a 12-year period (2011–2023; n = 265).
The inclusion criteria were as follows: a histologically and immunohistochemically verified diagnosis of PCC, ACC, or NAA; an adrenal tumor with an unenhanced CT density of >10 HU and/or size of >40 mm; and available data from clinical and biochemical examinations.
Exclusion criteria were other adrenal tumors (or metastases from other malignancies) confirmed by histology and immunohistochemistry; lack of histological or immunohistochemical verification of adrenal tumor; and absence of data on tumor density and size from CT scans.
The selection process was as follows: first, patients with adrenal tumors exceeding 40 mm in diameter and/or demonstrating a pre-contrast attenuation of >10 HU on computed tomography were identified (n = 156). Subsequently, the following exclusion criteria were applied: (1) histological and immunohistochemical confirmation of other adrenal tumors or metastases from non-adrenal malignancies (n = 23); (2) absence of definitive data on tumor size or density from CT imaging (n = 2); and (3) lack of essential clinical or biochemical information (n = 4). Consequently, the final study group consisted of 123 patients (88 women and 35 men, aged 22–81 years). Based on the histological/immunohistochemical verification, patients were divided into three groups: 63 with PCC (40 women, 23 men, aged 22–81 years), 30 with ACC (23 women, 7 men, aged 28–68 years), and 30 with NAA (26 women, 4 men, aged 27–79 years).
Study design. A single-center, retrospective, observational non-interventional cohort study.
Clinical methods. Our study assessed about 50 clinical and biochemical indicators. The following clinical manifestations were assessed: headache, palpitations, sweating, facial paleness and redness, nausea, vomiting, abdominal pain, hand numbness, body tremors, anxiety, panic attacks, muscle weakness, heat sensations, lower back pain, general weakness, chills, tinnitus, seeing spots, unexplained weight loss, dyspnea, chest pain, constipation, dizziness, systolic and diastolic blood pressure, heart rate, and the presence, type (continuous or paroxysmal), and grade of arterial hypertension. A broad panel of concomitant diseases was also evaluated, including diabetes mellitus, thyroid nodules, medullary thyroid carcinoma, primary hyperparathyroidism, concomitant non-functioning adrenal adenomas, pituitary tumors, neuroendocrine tumors, chronic heart failure, cerebrovascular accident, cholelithiasis, gastroduodenal ulcers, urolithiasis, chronic pyelonephritis, respiratory diseases, other oncological diseases, autoimmune disorders, and chronic infectious diseases.
Daily urinary excretion of fractionated metanephrines and normetanephrines was measured using high-performance liquid chromatography–tandem mass spectrometry (HPLC-MS/MS). The reference values were <320 μg/day for metanephrines and <390 μg/day for normetanephrines. CT was performed using Aquilion Prime Canon (Toshiba, Japan), 160-slice, with the contrast agent Omnipaque 350 (GE HealthCare, Norway) according to the standard protocol. We collected data on the size, native (non-enhanced) density in HU, and contrast agent accumulation of the adrenal lesions. The volume of adrenal tumors was calculated using the formula: volume (cm³) = (length, mm) × (width, mm) × (thickness, mm) × 0.0005.
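As a brief illustration of the volume formula above, the following minimal sketch applies it to one set of diameters. The function name and the example dimensions are hypothetical and are not taken from the study data. The factor 0.0005 is close to the ellipsoid coefficient π/6 divided by 1000 (the mm³ to cm³ conversion), which is presumably where it comes from.

```python
def tumor_volume_cm3(length_mm: float, width_mm: float, thickness_mm: float) -> float:
    """Approximate tumor volume in cm^3 from three diameters given in mm.

    Implements volume = length x width x thickness x 0.0005, as stated in the text.
    """
    return length_mm * width_mm * thickness_mm * 0.0005


# Hypothetical lesion measuring 40 x 30 x 20 mm (illustrative values only).
example_volume = tumor_volume_cm3(40.0, 30.0, 20.0)
print(f"{example_volume:.1f} cm^3")
```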
Statistical Analysis and Predictive Modeling. To identify the features that most effectively distinguish the diagnoses in the collected data, statistical analyses were conducted using the R language (version 4.3.3) in RStudio 2024.04.2 (version 2024.04.2+764). We used non-parametric statistical tests for our data analysis, and results are presented as the median [interquartile range (IQR) 25–75%]. Quantitative variables were compared using the Kruskal–Wallis test for comparisons across the three groups, followed by post hoc pairwise Dunn tests with Holm–Bonferroni correction to control the error rate. Qualitative features were compared using the Chi-square test or Fisher’s exact test, as appropriate. A p-value < 0.05 was considered statistically significant. Predictive modeling was performed using Python (version 3.9.12). The criteria for selecting classifiers for this work were the ability to obtain feature importance for the model and the widespread use of these methods in similar studies. We trained four classification models: Logistic Regression (LR), Random Forest (RF) Classifier, Linear Discriminant Analysis (LDA), and Extreme Gradient Boosting (XGBoost) Classifier, for each pair of the three diagnostic groups (PCC vs. ACC, PCC vs. NAA, ACC vs. NAA). To address moderate class imbalance, class weights were balanced during model training. A stratified nested cross-validation pipeline was implemented to obtain robust performance estimates and limit overfitting. The pipeline included data preprocessing, hyperparameter optimization, and model validation. The dataset description (feature coding and percentage of missing data) is presented in Table S1. Missing data were imputed using the K-nearest neighbor (KNN) algorithm, and features were standardized to facilitate model convergence and enable the interpretation of LR coefficients.
Although the number of neighbors for KNN should be selected via validation for each specific dataset to improve metric quality, our pilot work did not aim to create the most accurate models (especially given the data volume). Instead, we focused on comparing the performance of the considered models on equally filled data and calculating feature importance in classifying different diagnosis pairs. Therefore, the number of neighbors was set to 3 as a common initial approximation.
Model Validation and Hyperparameters Optimization. Stratified nested cross-validation was employed to provide an unbiased estimate of the models’ performance and prevent data leakage from the hyperparameter tuning process. A flowchart demonstrating a pipeline for data processing, model training, and validation is shown in Figure 1.
Figure 1.
Stratified nested cross-validation flowchart.
The outer loop used stratified 3-fold cross-validation to assess the generalization performance of the tuned models and avoid optimistic bias. For each outer fold of nested cross-validation, the training subset underwent hyperparameter tuning via the inner loop. Hyperparameters were sampled randomly from their specified distributions during the tuning process. Table 1 presents the hyperparameter search spaces for the respective models, as well as the optimal sets of these parameters found in the inner loops. The set reported as optimal was the one whose values recurred most often among the sets selected in the inner loops. Hyperparameter optimization was performed via randomized search from the Scikit-learn library [18] (RandomizedSearchCV) with ten iterations for LR, RF, and LDA and twenty iterations for XGBoost within an inner stratified K-fold cross-validation scheme comprising three folds. This approach limits overfitting and provides a reliable estimate of optimal hyperparameters, while reducing computational cost compared to exhaustive grid search. Data leakage was mitigated as follows: the outer cross-validation loop divided the entire dataset into training and testing sets. The test data were used only for final evaluation and were not involved in model training or hyperparameter tuning. The inner cross-validation operated only on the training set of the current outer fold, so that hyperparameter tuning occurred strictly within the training data. All preprocessing steps, such as imputation and feature scaling, were fitted only on the training data of each fold and then applied to the validation data, preventing information leakage from the test set into the training process.
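The nested scheme described above can be sketched with Scikit-learn as follows. This is a minimal illustration under stated assumptions, not the study's code: the data are synthetic, the search space for C, the liblinear solver, and the use of ROC AUC as the tuning score are all assumptions, and only the LR pipeline is shown. Placing the imputer and scaler inside the pipeline means they are refitted on each training split, which is what keeps the test folds out of preprocessing.

```python
import numpy as np
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.impute import KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for one diagnosis pair (93 samples, imbalanced); NOT study data.
X, y = make_classification(n_samples=93, n_features=20, n_informative=5,
                           weights=[0.32, 0.68], random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.05] = np.nan  # inject roughly 5% missing values

# Preprocessing lives inside the pipeline, so it is fitted on training splits only.
pipe = Pipeline([
    ("imputer", KNNImputer(n_neighbors=3)),
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(class_weight="balanced", solver="liblinear")),
])

# Assumed search space; the actual ranges are those listed in Table 1.
param_distributions = {
    "classifier__C": loguniform(1e-3, 1e2),
    "classifier__penalty": ["l1", "l2"],
}

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)

# Inner loop: randomized search with ten sampled configurations.
search = RandomizedSearchCV(pipe, param_distributions, n_iter=10,
                            scoring="roc_auc", cv=inner_cv, random_state=0)

# Outer loop: each outer training fold is tuned independently, then scored on its test fold.
outer_auc = cross_val_score(search, X, y, cv=outer_cv, scoring="roc_auc")
print(f"outer-fold AUC: {outer_auc.mean():.3f} +/- {outer_auc.std():.3f}")
```

Passing the search object itself to `cross_val_score` is one common way to express nesting: the whole tuning procedure is treated as the estimator being evaluated.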
Table 1.
Hyperparameter search space and optimal values.
For the RF Classifier (RandomForestClassifier [18]), the following hyperparameters were tuned: the number of trees (n_estimators), the maximum tree depth (max_depth), the number of features considered at each split (max_features), the minimum samples required to split an internal node (min_samples_split), and the minimum samples required at a leaf node (min_samples_leaf). The class_weight parameter was set to ‘balanced’. For LR (LogisticRegression [18]), the following were tuned: the inverse of regularization strength (classifier__C), which controls the amount of regularization, and the type of regularization penalty (classifier__penalty), with options being ‘l1’ (Lasso) or ‘l2’ (Ridge). Tuning the penalty type enables implicit feature selection through sparsity (L1) and handles multicollinearity (L2). We also tuned the optimization algorithm (classifier__solver) and the maximum number of iterations for convergence (classifier__max_iter). The class_weight parameter was set to ‘balanced’. For LDA (LinearDiscriminantAnalysis [18]), the following were tuned: the solver for computing discriminant functions (classifier__solver) and the regularization parameter (classifier__shrinkage) to improve model stability, particularly with small datasets or in the presence of multicollinearity. For each algorithm, the optimal hyperparameters found were used to train a model on the entire training fold. This model was then evaluated on the corresponding test fold. This process was repeated for all outer folds, and the performance metrics were aggregated by averaging to assess the overall generalization capability. 
For XGBoost (XGBClassifier [19]), the following hyperparameters were tuned: the number of trees in the model (classifier__n_estimators); the maximum depth of each tree, which limits tree growth to prevent overfitting (classifier__max_depth); the step size for updating the model (classifier__learning_rate), which controls training speed and the balance between bias and variance; the fraction of training samples used for building each tree (classifier__subsample), which helps reduce overfitting; the fraction of features sampled for each tree (classifier__colsample_bytree) to increase model diversity; the minimum sum of instance weights needed in a leaf (classifier__min_child_weight); the minimum loss reduction required to make a split (classifier__gamma); the L1 regularization term (classifier__reg_alpha), which promotes sparsity in the model; and the L2 regularization term (classifier__reg_lambda), which helps prevent overfitting by shrinking weights.
Performance Metrics. Classifier performance was evaluated using multiple metrics to ensure a comprehensive assessment. The following metrics were calculated: accuracy, precision, recall, F1-score, and the Brier score for calibration. The formulas for these metrics are provided in reference [20].
TP: true positive;
TN: true negative;
FP: false positive;
FN: false negative;
N: number of dataset samples;
y_c, p_c: target value and model-estimated probability of belonging to a given class c.
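For orientation, the standard binary definitions of these metrics in terms of the confusion-matrix counts can be sketched as below. The counts are hypothetical, and the averaging scheme applied across classes in the study is not stated in this section, so this is only the textbook single-class form.

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Accuracy, precision, recall and F1 from binary confusion-matrix counts."""
    n = tp + tn + fp + fn
    accuracy = (tp + tn) / n
    # Guard against empty denominators (no predicted or no actual positives).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts for a single test fold (illustrative only).
metrics = classification_metrics(tp=18, tn=8, fp=2, fn=3)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Recall for the PCC class is the quantity that reflects missed PCC cases, since it falls as the number of false negatives rises.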
For probabilistic calibration, we implemented the Brier score as a custom scorer. Calibration visualization: For each outer fold, predicted probabilities for the positive class were binned into 10 intervals. For each bin, we calculated the mean predicted probability and the true fraction of positive samples in that bin. The calibration information from all folds was then combined by interpolating these points onto common bins.
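The Brier score and the per-fold binning step can be sketched in a few lines. This is an illustrative sketch with made-up labels and probabilities; equal-width bins on [0, 1] are an assumption, and the cross-fold interpolation step is not shown.

```python
def brier_score(y_true, p_pred):
    """Mean squared difference between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, p_pred)) / len(y_true)


def calibration_bins(y_true, p_pred, n_bins=10):
    """Return (mean predicted probability, observed positive fraction) per non-empty bin."""
    stats = [[0.0, 0, 0] for _ in range(n_bins)]  # [sum of p, positives, count]
    for y, p in zip(y_true, p_pred):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes into the last bin
        stats[idx][0] += p
        stats[idx][1] += y
        stats[idx][2] += 1
    return [(p_sum / count, pos / count) for p_sum, pos, count in stats if count > 0]


# Hypothetical outcomes and predicted probabilities for one outer fold.
y_true = [0, 0, 1, 1, 1, 0, 1, 1]
p_pred = [0.15, 0.35, 0.85, 0.95, 0.65, 0.45, 0.75, 0.25]

score = brier_score(y_true, p_pred)
bins = calibration_bins(y_true, p_pred)
print(f"Brier score: {score:.3f}")
for mean_p, frac_pos in bins:
    print(f"mean predicted {mean_p:.2f} -> observed fraction {frac_pos:.2f}")
```

A lower Brier score indicates better calibrated probabilities, and a well calibrated model has bin points lying close to the diagonal.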
The area under the Receiver Operating Characteristic (ROC) curve was measured to quantify model discrimination. ROC curve visualization: For each fold of the outer cross-validation, the values of False Positive Rate (FPR) and True Positive Rate (TPR) are calculated, which form the ROC curve. However, the points of FPR can differ for each fold (due to different prediction distributions). To average the ROC curves across folds, an interpolation of the TPR is performed at evenly spaced values of FPR.
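The fold-averaging step described above can be sketched as follows. The two ROC curves are hypothetical, the 11-point grid is an arbitrary choice for readability, and the helper name is illustrative. Where a fold has several points at the same FPR, the sketch takes the highest TPR at that FPR, which is one possible convention.

```python
from bisect import bisect_right


def interpolate_tpr(fpr, tpr, grid):
    """Linearly interpolate TPR at each grid value; fpr must be non-decreasing from 0 to 1."""
    out = []
    for g in grid:
        i = bisect_right(fpr, g) - 1  # rightmost ROC point with fpr[i] <= g
        if i >= len(fpr) - 1:
            out.append(tpr[-1])
        else:
            x0, x1 = fpr[i], fpr[i + 1]
            y0, y1 = tpr[i], tpr[i + 1]
            out.append(y0 + (y1 - y0) * (g - x0) / (x1 - x0))
    return out


# Hypothetical ROC points from two outer folds; their FPR values do not coincide.
fold_curves = [
    ([0.0, 0.0, 0.2, 0.5, 1.0], [0.0, 0.6, 0.8, 1.0, 1.0]),
    ([0.0, 0.1, 0.4, 1.0], [0.0, 0.7, 0.9, 1.0]),
]

grid = [k / 10 for k in range(11)]  # evenly spaced FPR values
per_fold = [interpolate_tpr(fpr, tpr, grid) for fpr, tpr in fold_curves]
mean_tpr = [sum(column) / len(column) for column in zip(*per_fold)]

# Area under the averaged curve by the trapezoidal rule.
mean_auc = sum((grid[k + 1] - grid[k]) * (mean_tpr[k] + mean_tpr[k + 1]) / 2
               for k in range(len(grid) - 1))
print(f"AUC of the averaged curve: {mean_auc:.3f}")
```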
To understand the real-world clinical impact and the types of mistakes that the models make, confusion matrices were calculated for all models for each pair of diagnoses. True positives, true negatives, false positives, and false negatives were averaged across all outer folds. Evaluating both discrimination and calibration, together with the other performance metrics, gives a fuller picture of model quality.
3. Results
3.1. Clinical Characteristics
Patients’ age at diagnosis varied significantly across the groups (p = 0.015) and was 51 years (IQR 43.5–59) in the PCC group, 50 years (IQR 44–61.5) in the ACC group, and 59.5 years (IQR 50.5–70.8) in the NAA group, with the NAA group being significantly older than PCC (p = 0.014 after correction). A trend was observed towards a higher frequency of male sex in the PCC group (23/63, 36.5%) compared to the ACC (7/30, 23.3%) and NAA (4/30, 13.3%) groups (p = 0.054). Among the clinical characteristics and symptoms assessed, the prevalence of headache, palpitation, weakness, and lower back pain differed significantly between groups (Table 2). Headache and palpitation were significantly more common in the PCC group compared to the ACC and NAA groups (p < 0.001 and p = 0.016, respectively). In contrast, general weakness and lower back pain were significantly less associated with PCC compared to ACC and NAA (both p < 0.001). The prevalence of other examined symptoms was similar in the groups.
Table 2.
Clinical symptoms in examined patients.
We found a high prevalence of arterial hypertension across all patient groups: PCC 92.0% (58/63), ACC 76.7% (23/30), and NAA 80% (24/30), with no statistically significant differences between groups (p = 0.093). Paroxysmal hypertension was the predominant type in all groups, and its prevalence was significantly different: 86.2% (50/58) in the PCC group, 78.2% (18/23) in the ACC group, and 58.3% (14/24) in the NAA group (p = 0.012). Both maximal systolic and diastolic blood pressure values were significantly higher in the PCC group compared to the ACC (p < 0.001 after correction) and NAA groups (p < 0.001 after correction) (Table 3). The severity of hypertension also differed, and the prevalence of grade 3 hypertension was significantly higher in the PCC group: 88.0% (51/58) vs. 56.5% (13/23) in the ACC group and 41.6% (10/24) in the NAA group (both p < 0.001). While the median heart rate did not differ between the groups, the prevalence of tachycardia (heart rate > 90 bpm) was significantly higher in the PCC group (23.8%, 15/63) than in the ACC (10%, 3/30) and NAA (3%, 1/30) groups (p = 0.016).
Table 3.
Clinical, laboratory, and instrumental characteristics of examined patients.
Laboratory examination revealed catecholamine hypersecretion in 95.2% (60/63) of PCC cases, 3.3% (1/30) of ACC cases, and 13.3% (4/30) of NAA cases (see Table 3). A mixed biochemical type of PCC (hypersecretion of both metanephrines and normetanephrines) was the most common finding in PCC, observed in 56.7% (34/60) of cases. Isolated normetanephrine elevation was found in 30% (18/60) of PCC cases, while isolated metanephrine elevation was present in 13.3% (8/60) of the cases. In the ACC group, one case exhibited elevated levels of both metanephrines and normetanephrines. In the NAA group, two patients had isolated normetanephrine elevation, and two others showed a mixed type of hypersecretion.
According to the CT results, there were statistically significant differences in maximum tumor size, volume, and unenhanced density among the groups (p < 0.001). Tumor size and volume were significantly lower in the PCC group compared to the ACC group (p < 0.001 after correction), but comparable to those in the NAA group (p = 0.953). In contrast, unenhanced CT density was similar between the PCC and ACC groups (p = 0.128 after correction), but was significantly higher in the PCC group than in the NAA group (p < 0.001 after correction).
Concomitant diseases occurred with similar frequency, except for gastroduodenal ulcers: their prevalence differed among the groups (p = 0.029), with a higher frequency in the PCC group compared to the ACC and NAA groups (both p = 0.047) (Table 4).
Table 4.
Prevalence of comorbidities in the examined patient groups.
Thus, we obtained data on clinical, biochemical, and morphological characteristics, which were then included in the development of the ML models.
3.2. Model Performance
The ML models for this study were selected based on the research objectives. Logistic Regression (LR) was chosen for its several advantages in statistical modeling. A primary strength is its interpretability; the model coefficients provide direct insight into the relationship between input features and the predicted outcome probability. Additionally, LR is computationally efficient. The next method was Random Forest (RF) Classifier, a robust and widely adopted method for binary classification. By constructing an ensemble of decision trees and aggregating their predictions, it significantly reduces overfitting compared to individual trees and achieves high predictive accuracy. It is also resilient to noisy data and outliers. Linear Discriminant Analysis (LDA) was also employed. This technique is widely used for binary classification due to its simplicity, interpretability, and solid theoretical foundation. A key advantage lies in its ability to find a linear combination of features that maximally separates the two classes. While computationally efficient, LDA has certain limitations, including sensitivity to outliers, assumptions about feature distributions, and potential issues with small sample sizes. In this work, LDA was included for comparison with other models, and was also utilized for patients’ data visualization. The last model evaluated in this study was Extreme Gradient Boosting (XGBoost), a scalable tree-boosting system renowned for its superior performance on structured data. Key advantages include built-in regularization to mitigate overfitting, native handling of missing values, and class imbalance.
To compare the performance of the ML algorithms used in this work, we calculated the following quality metrics: accuracy, precision, recall, F1-score, calibration, and discrimination. Stratified cross-validation was applied to improve the objectivity of the classification quality assessment. Visualizations of the ROC and calibration curves for the evaluated models are presented in Figure 2, Figure 3 and Figure 4, corresponding to each pair of the three diagnoses: PCC, ACC, and NAA.
Figure 2.
ROC and calibration: PCC vs. ACC classification.
Figure 3.
ROC and calibration: PCC vs. NAA classification.
Figure 4.
ROC and calibration: ACC vs. NAA classification.
The models’ metrics are presented in Table 5. The performance of the four machine learning models varied across the different classification tasks. For the PCC vs. ACC and ACC vs. NAA classifications, XGBoost demonstrated the highest performance across all metrics, indicating its superiority for these pairwise comparisons.
Table 5.
Models’ quality metrics.
For the PCC vs. NAA classification, the LDA model showed the best accuracy (0.8817 ± 0.0402), precision (0.8836 ± 0.0383), recall (0.8817 ± 0.0402), F1-score (0.8823 ± 0.0396), and Brier score (0.1060 ± 0.0380). The RF model, however, achieved the highest discriminative power with an AUC of 0.9230 ± 0.0243.
From the obtained confusion matrices (Table 6), for the PCC vs. ACC and PCC vs. NAA classifications, RF on average makes fewer errors than other models in cases of actual positive diagnosis. At the same time, XGBoost made the fewest errors of this type for the ACC vs. NAA pair. Even though no single model universally dominated all tasks in other metrics, RF exhibits the fewest false negative errors for PCC, which is clinically crucial.
Table 6.
Confusion matrices.
The dimensionality reduction technique LDA was applied to visualize patient data clustering based on ground truth labels (Figure 5). Due to dimensionality reduction, the four ACC patients clustered into very close points, which caused the visualization to appear as if there were only 27 ACC patients on the graph.
Figure 5.
Linear Discriminant Analysis for PCC, ACC, and NAA.
One of the key objectives was to identify clinical features that can be used for differentiating PCC from ACC and NAA. To rank the features of the standardized data by their importance for classification, we used the Logistic Regression coefficients due to the model’s interpretability. We used stratified nested cross-validation with 10 outer folds and 5 inner folds for a more accurate estimation of the expected value of the coefficients. The feature importance values were averaged across the outer cross-validation folds.
Figure 6, Figure 7 and Figure 8 display the LR coefficients (only those with absolute values ≥ 0.001 are shown for better readability) for the features used for pairwise differentiation of PCC, ACC, and NAA. The absolute value of a coefficient reflects the feature’s contribution to the separation. A positive coefficient value indicates a higher likelihood of the first diagnosis in a compared pair. Conversely, a negative coefficient value indicates a preference for the second diagnosis in a compared pair.
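The sign convention can be made explicit with the standard logistic model. The notation below is an assumption introduced for illustration: y = 1 denotes the first diagnosis in the pair, x_j is a raw feature, and z_j is its standardized value with training-fold mean μ_j and standard deviation σ_j.

```latex
\[
\log\frac{P(y=1\mid \mathbf{z})}{1-P(y=1\mid \mathbf{z})}
  = \beta_0 + \sum_{j}\beta_j z_j,
\qquad
z_j = \frac{x_j-\mu_j}{\sigma_j}
\]
\[
\text{odds ratio for a one-SD increase in } x_j \;=\; e^{\beta_j}
\]
```

Under this form, a positive β_j multiplies the odds of the first diagnosis by a factor greater than 1 for each standard deviation increase in the feature, and a negative β_j by a factor less than 1. Because the features share a common scale after standardization, the magnitudes of the coefficients can be compared with one another, although regularization shrinks them toward zero.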
Figure 6.
Logistic Regression feature importance for PCC vs. ACC.
Figure 7.
Logistic Regression feature importance for PCC vs. NAA.
Figure 8.
Logistic Regression feature importance for ACC vs. NAA.
According to the statistical analysis of clinical data (Table 2, Table 3 and Table 4), significant features distinguishing PCC from ACC and NAA were headache, lower back pain, palpitations, general weakness, maximum blood pressure, adrenal tumor volume and size, and tumor CT density. The LR coefficients were largely consistent with the statistical analysis, while highlighting key distinctions for each paired diagnosis. Overall, the LR model confirmed the importance of specific characteristics in differentiating these adrenal lesions.
4. Discussion
In our study, we compared patients with verified PCC diagnosis with patients with ACC (representing malignant adrenal lesions) and NAA (representing benign adrenal lesions). One aim of the current study was to validate ML models for pairwise differentiation of incidentally adrenal lesions with non-benign CT characteristics. Another key objective was to identify clinical features that could be relevant for recognizing PCC among adrenal incidentalomas.
The problem was framed as a pairwise classification task to distinguish PCC from ACC and NAA and to identify features enabling this differentiation. Even though XGBoost was most effective for distinguishing PCC from ACC and ACC from NAA in terms of accuracy, precision, recall, F1-score, ROC AUC, and Brier score, RF reducing false negative errors in PCC patient classification decreases the risk of missing truly ill PCC patients. LR provided the ability to extract feature importance from standardized data and made it possible to understand the sign of a feature’s contribution to the probability of a particular class, unlike, for example, RF feature importance. As a result, the model outcomes became clearer and more interpretable. These models could be further trained and applied in clinical practice. The identified markers could assist in better understanding the differences between PCC and other adrenal lesions, such as ACC and NAA. We combined traditional clinical and biochemical diagnostic methods with ML approaches to improve diagnostic performance.
Some recent publications that we have mentioned before demonstrated that ML models based on CT radiomics could differentiate lipid-poor adrenal adenomas from PCC. The study by Xiao et al. included 70 patients with lipid-poor adenomas and 60 PCC that was comparable the number of patients in our study, but the authors did not consider any clinical signs [11]. Liu et al. based their ML model on CT data from 188 lipid-poor adrenal adenomas and 92 PPC, but the authors mentioned an absence of clinical and biochemical data as a limitation of their study [12]. Altay et al. used ML to evaluate texture analysis for distinguishing various adrenal lesions including PCC and adrenal metastases from other tumors, but there were 19 cases of PCC only among 166 adrenal lesions included at the ML model [13]. So, we included clinical, anamnestic, biochemical, and radiological parameters of our 123 patients to develop an ML model for distinguishing PCC, ACC, and NAA among adrenal incidentalomas with non-benign characteristics according CT images.
The classical symptoms of PCC include hypertension, headaches, palpitations, and sweating. It is noteworthy that our model identified the importance of clinical PCC indicators: headaches and palpitation were important in all pairs, and the presence of lower back pain and general weakness strongly indicated against PCC. However, the landscape of other clinical signs was different. Although arterial hypertension is a primary clinical manifestation of PCC, it was equally prevalent across all examined groups. The ML algorithm did not consider hypertension per se or its paroxysmal type as specific indicators. High systolic and diastolic blood pressure, as well as grade 3 hypertension, appeared to be specific for PCC compared to ACC and NAA. Thus, the clinical indicator of PCC is the severity of arterial hypertension (grade 3), and not the presence or type of hypertension. (Figure 6 and Figure 7). The usual biochemical marker of PCC is a 2–3-fold increase in metanephrine and/or normetanephrine concentrations in blood plasma or daily urine. However, in our cohort of patients with verified PCC diagnosis, we observed 3/63 (4.8%) cases had both parameters within the reference range, so clinical markers of PCC are very important despite the ability of biochemical investigation.
In our data, male sex indicated a higher probability of PCC compared with ACC and NAA. In a previously reported multivariate analysis, male sex (among other factors) was a statistically significant predictor of adrenal tumor malignancy [6]. However, ACC has been reported to show a female predominance [21].
Our model assigned diagnostic significance to concomitant gastric and duodenal ulcers in PCC cases, which may raise clinicians’ awareness of this complication. Regarding the relationship between gastroduodenal ulcers and PCC, studies have shown that some PCC patients have elevated serum adrenaline (epinephrine) and gastrin levels, both basally and in response to food intake. These findings indicate that epinephrine stimulates gastrin secretion [22], providing a pathophysiological basis for the association between gastroduodenal ulcers and PCC. In addition, chronic or paroxysmal catecholamine excess causes vasospasm of the gastric mucosal vessels, leading to impaired microcirculation, ischemia, and the consequent formation of “stress” or “ischemic” ulcers. Thus, gastroduodenal ulcers represent a rare but pathophysiologically plausible feature of PCC that is often overlooked in standard diagnostic algorithms. Our results highlight the importance of a comprehensive differential diagnosis and consideration of associated comorbidities.
According to CT results, the size and volume of PCC were similar to those of NAA but significantly smaller than those of ACC. In contrast, the unenhanced CT density of PCC was comparable to that of ACC and higher than that of NAA. Our ML model showed clear separation of PCC from ACC and NAA based on different CT characteristics: high noncontrast CT density of the tumor was very important for distinguishing PCC from NAA, whereas tumor size was a notable marker for differentiating PCC from ACC among adrenal tumors with high noncontrast CT density.
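The direction of the discriminators discussed above (for example, severe hypertension favouring PCC and lower back pain arguing against it) corresponds to the sign of a coefficient in a logistic regression. The following minimal sketch illustrates that reading under explicit assumptions: the data are synthetic, the feature names (`max_systolic_bp`, `lower_back_pain`, `noise_feature`) are hypothetical labels chosen for illustration, and the features are standardized before fitting. It does not reproduce the study model or its coefficients.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 400

# Hypothetical feature names and synthetic values; not the study data.
feature_names = ["max_systolic_bp", "lower_back_pain", "noise_feature"]
X = np.column_stack([
    rng.normal(150, 25, n),   # continuous, blood-pressure-like scale
    rng.integers(0, 2, n),    # binary symptom flag (0 = absent, 1 = present)
    rng.normal(0, 1, n),      # unrelated to the simulated outcome
])

# Simulated outcome: higher pressure raises and the symptom flag lowers the log-odds.
logit = 0.06 * (X[:, 0] - 150) - 1.5 * X[:, 1]
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

# Standardizing puts coefficients on a comparable "per standard deviation" scale.
X_std = StandardScaler().fit_transform(X)
model = LogisticRegression(max_iter=1000).fit(X_std, y)

coef = dict(zip(feature_names, model.coef_[0]))
for name, beta in sorted(coef.items(), key=lambda kv: -abs(kv[1])):
    direction = "favours" if beta > 0 else "argues against"
    print(f"{name:>16}: beta={beta:+.2f}, OR per SD={np.exp(beta):.2f} "
          f"({direction} the positive class)")
```

A positive coefficient corresponds to an odds ratio above 1 and a negative coefficient to an odds ratio below 1. The absolute value gives a rough ranking of importance only when the features share a common scale, which is why the sketch standardizes them first.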
A recently published study by Iwamoto et al. developed an ML-based clinical model that combined CT imaging and clinical parameters for adrenal tumor classification [23]. The aim of that study was to evaluate whether the tumors were hormone-producing. This retrospective study involved 162 patients with different adrenal lesions, including 55 cases of NAA and 23 cases of PCC, verified after surgical treatment. In the cited study, maximal tumor size was similar in PCC and NAA, but the authors did not include patients with ACC and did not comment on the CT density of adrenal tumors. Some clinical parameters in that study were similar to ours (sex, systolic blood pressure, body mass index, catecholamine increase), but the authors included additional hormonal tests (aldosterone–renin ratio, cortisol–ACTH ratio) that we did not use. In contrast, we focused on a different, wide spectrum of clinical signs and concomitant disorders to evaluate their clinical significance in differentiating specific types of adrenal lesions at the preanalytical stage.
Our study has several limitations. The most notable is the relatively small sample size, particularly for the ACC group (n = 30). This is an inherent challenge in studying rare tumors such as ACC, which has an annual incidence of only 1–2 cases per million people [24]. As demonstrated in other medical ML studies [25,26], robust models can be developed even with limited data when appropriate validation techniques are used, such as the nested cross-validation employed here. Nevertheless, small sample sizes pose significant challenges for nested cross-validation. Vabalas et al. [27] demonstrated that nested CV mitigates bias better than standard k-fold CV, yielding nearly unbiased accuracies close to 50% (that is, chance level) on noise data, but that even nested procedures suffer from high variance with fewer than 200 samples because of limited resampling stability. Therefore, the small cohort size necessitates external validation on larger, multicenter datasets to confirm the generalizability of our findings.
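The nested procedure and the fold-to-fold variability described above can be illustrated with a minimal sketch. It assumes scikit-learn, a synthetic dataset of roughly the size of one pairwise task (93 cases, 50 features), and an arbitrary hyperparameter grid and fold counts. None of these settings are taken from the study pipeline.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for one pairwise task; not the study data.
X, y = make_classification(
    n_samples=93, n_features=50, n_informative=8,
    weights=[0.68, 0.32], random_state=0,
)

# Scaling lives inside the pipeline so it is refit on every training split.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=2000))
param_grid = {"logisticregression__C": [0.01, 0.1, 1.0, 10.0]}  # assumed grid

# Inner loop: hyperparameter selection. Outer loop: performance estimation.
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=2)

search = GridSearchCV(pipeline, param_grid, scoring="roc_auc", cv=inner_cv)
outer_auc = cross_val_score(search, X, y, scoring="roc_auc", cv=outer_cv)

print("AUC per outer fold:", np.round(outer_auc, 3))
print(f"mean={outer_auc.mean():.3f}, sd={outer_auc.std(ddof=1):.3f}")
```

Because each outer test fold holds fewer than 20 cases at this sample size, the per-fold AUC values can differ noticeably from one another. Reporting their spread alongside the mean makes that variance visible.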
We framed the problem as three separate binary classification tasks rather than a single multi-class model. This approach was chosen for two primary reasons. First, given the limited sample size, especially for ACC, a multi-class model would have been statistically underpowered and prone to overfitting. Second, from a clinical perspective, pairwise differentiation often reflects the immediate diagnostic dilemma faced by physicians when specific tumor characteristics are observed.
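The pairwise framing can be sketched as follows. This is an illustration only: the label array simply mirrors the reported group sizes (63 PCC, 30 ACC, 30 NAA), and the helper `pairwise_tasks` is a hypothetical name introduced here, not part of the study code.

```python
from itertools import combinations

import numpy as np

# Labels mirroring the reported group sizes; feature values are omitted.
labels = np.array(["PCC"] * 63 + ["ACC"] * 30 + ["NAA"] * 30)


def pairwise_tasks(labels):
    """Yield (positive, negative, row_indices, binary_target) for each class pair."""
    classes = list(dict.fromkeys(labels))  # unique classes in first-seen order
    for pos, neg in combinations(classes, 2):
        idx = np.flatnonzero(np.isin(labels, [pos, neg]))  # rows of the two classes
        target = (labels[idx] == pos).astype(int)          # 1 = first class of the pair
        yield pos, neg, idx, target


tasks = {
    f"{pos} vs. {neg}": (idx, target)
    for pos, neg, idx, target in pairwise_tasks(labels)
}

for name, (idx, target) in tasks.items():
    print(f"{name}: {len(idx)} cases, {int(target.sum())} positive")
```

Each task uses only the rows of its two classes, so the two PCC comparisons contain 93 cases each and the ACC vs. NAA comparison contains 60. A separate binary classifier can then be trained and validated on each subset.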
Another limitation of our study was selection bias, as we included only patients with histologically verified tumors after surgical treatment. Most NAA patients do not require surgical treatment because of the absence of hormonal hypersecretion and the presence of benign CT features. This limitation, however, can also be regarded as a strength of the study, as all diagnoses were surgically confirmed. In our cohort, the primary indication for surgery in adrenal lesions without proven hormonal activity was non-benign CT characteristics raising suspicion of malignancy. Adrenal tumors with non-benign CT characteristics represent the most challenging clinical scenarios, which would benefit most from decision support. We suggest that our data will help to recognize the type of adrenal tumor with more confidence, which could lead to better preoperative management in cases of PCC and/or help to avoid surgery in cases of NAA.
We also excluded patients with other types of adrenal lesions, such as metastases of extra-adrenal cancer, because PCC, ACC, and NAA are the most prevalent adrenal tumors according to the experience of many centers, including ours [3,4,5,6,7]. It is important to consider that the model’s performance might not generalize to all patients with incidentalomas and needs to be confirmed in independent, prospective cohorts from multicenter studies.
5. Conclusions
We developed and validated robust ML models for the differential diagnosis of PCC versus ACC and NAA in patients with adrenal tumors exhibiting non-benign CT characteristics. The key discriminators that differentiated PCC from both ACC and NAA were maximum systolic blood pressure, grade 3 hypertension, headache, palpitation, tachycardia, male sex, and concomitant gastric and duodenal ulcers. In contrast, lower back pain and general weakness were strong signs of lower probability of PCC. Furthermore, high tumor CT density specifically differentiated PCC from NAA, whereas tumor size was an important marker for distinguishing PCC and ACC. These findings highlight the potential of integrating clinical data into a diagnostic algorithm. The proposed models form a foundation for a future clinical decision support system (e.g., a script utilizing models validated in real clinical settings), which could aid endocrinologists and surgeons in risk stratification and preoperative planning for patients with challenging adrenal incidentalomas.
Supplementary Materials
The following supporting information can be downloaded at https://www.mdpi.com/article/10.3390/life16010164/s1. Table S1: Dataset description.
Author Contributions
Conceptualization, L.N., V.P. and I.I.; methodology, A.L. and T.N.; validation, T.N. and V.P.; formal analysis, A.L.; writing—original draft preparation, T.N., I.I. and L.N.; writing—review and editing, L.N., V.P. and I.I.; visualization, T.N.; supervision, L.N.; project administration, L.N.; funding acquisition, L.N. and V.P. All authors have read and agreed to the published version of the manuscript.
Funding
This research was funded by the MSU Program of Development, Project No. 23-SCH-04-34.
Institutional Review Board Statement
The study was conducted in accordance with the Declaration of Helsinki and approved by the Independent Ethics Committee of M.F. Vladimirskiy Moscow Regional Research Clinical Institute (protocol code No. 3 approved 11 February 2021).
Informed Consent Statement
Informed consent was obtained from all subjects involved in the study.
Data Availability Statement
Additional information regarding the manuscript is available from the authors upon request.
Acknowledgments
The authors have reviewed and edited the output and take full responsibility for the content of this publication.
Conflicts of Interest
The authors declare no conflicts of interest.
Abbreviations
The following abbreviations are used in this manuscript:
| ACC | Adrenocortical carcinoma |
| AUC | Area under the curve |
| CT | Computed tomography |
| FPR | False Positive Rate |
| HU | Hounsfield Units |
| IQR | Interquartile range |
| KNN | K-Nearest Neighbors |
| LDA | Linear Discriminant Analysis |
| LR | Logistic Regression |
| ML | Machine Learning |
| NAA | Non-functioning adrenal adenoma |
| PCC | Pheochromocytoma |
| RF | Random Forest |
| ROC | Receiver Operating Characteristic |
| TPR | True Positive Rate |
| XGBoost | Extreme Gradient Boosting |
References
- Fassnacht, M.; Tsagarakis, S.; Terzolo, M.; Tabarin, A.; Sahdev, A.; Newell-Price, J.; Pelsma, I.; Marina, L.; Lorenz, K.; Bancos, I.; et al. European Society of Endocrinology Clinical Practice Guidelines on the Management of Adrenal Incidentalomas, in Collaboration with the European Network for the Study of Adrenal Tumors. Eur. J. Endocrinol. 2023, 189, G1–G42.
- Sherlock, M.; Scarsbrook, A.; Abbas, A.; Fraser, S.; Limumpornpetch, P.; Dineen, R.; Stewart, P.M. Adrenal Incidentaloma. Endocr. Rev. 2020, 41, 775–820.
- Martin, C.S.; Andrei, M.; Voicu, B.A.; Riță, M.A.; Taralunga, A.A.; Sîrbu, A.E.; Cima, L.N.; Stoian, I.; Barbu, C.G.; Calu, V.; et al. The Spectrum of Adrenal Lesions in a Tertiary Referral Center. Biomedicines 2024, 12, 2214.
- Ebbehoj, A.; Li, D.; Kaur, R.J.; Zhang, C.; Singh, S.; Li, T.; Atkinson, E.; Achenbach, S.; Khosla, S.; Arlt, W.; et al. Epidemiology of Adrenal Tumours in Olmsted County, Minnesota, USA: A Population-Based Cohort Study. Lancet Diabetes Endocrinol. 2020, 8, 894–902.
- Hanna, F.W.F.; Hancock, S.; George, C.; Clark, A.; Sim, J.; Issa, B.G.; Powner, G.; Waldron, J.; Duff, C.J.; Lea, S.C.; et al. Adrenal Incidentaloma: Prevalence and Referral Patterns From Routine Practice in a Large UK University Teaching Hospital. J. Endocr. Soc. 2022, 6, bvab180.
- Nasiroğlu İmga, N.; Aslan, Y.; Çatak, M.; Aykanat, İ.C.; Tuncel, A.; Berker, D. Clinical, Radiological, and Surgical Outcomes of 431 Patients with Adrenal Incidentalomas: Retrospective Study of a 10-Year Single-Center Experience. Turk. J. Med. Sci. 2024, 54, 376–383.
- Cyranska-Chyrek, E.; Szczepanek-Parulska, E.; Olejarz, M.; Ruchala, M. Malignancy Risk and Hormonal Activity of Adrenal Incidentalomas in a Large Cohort of Patients from a Single Tertiary Reference Center. Int. J. Environ. Res. Public Health 2019, 16, 1872.
- Vitturi, G.; Crisafulli, S.; Alessi, Y.; Frontalini, S.; Stano, M.G.; Fontana, A.; Giuffrida, G.; Ferraù, F.; Trifirò, G.; Cannavò, S. Global Epidemiology of Pheochromocytoma: A Systematic Review and Meta-Analysis of Observational Studies. J. Endocrinol. Investig. 2025, 48, 2813–2825.
- Mete, O.; Asa, S.L.; Gill, A.J.; Kimura, N.; De Krijger, R.R.; Tischler, A. Overview of the 2022 WHO Classification of Paragangliomas and Pheochromocytomas. Endocr. Pathol. 2022, 33, 90–114.
- Fang, F.; Ding, L.; He, Q.; Liu, M. Preoperative Management of Pheochromocytoma and Paraganglioma. Front. Endocrinol. 2020, 11, 586795.
- Xiao, D.; Zhong, J.; Peng, J.; Fan, C.; Wang, X.; Wen, X.; Liao, W.; Wang, J.; Yin, X. Machine Learning for Differentiation of Lipid-Poor Adrenal Adenoma and Subclinical Pheochromocytoma Based on Multiphase CT Imaging Radiomics. BMC Med. Imaging 2023, 23, 159.
- Liu, H.; Guan, X.; Xu, B.; Zeng, F.; Chen, C.; Yin, H.L.; Yi, X.; Peng, Y.; Chen, B.T. Computed Tomography-Based Machine Learning Differentiates Adrenal Pheochromocytoma From Lipid-Poor Adenoma. Front. Endocrinol. 2022, 13, 833413.
- Altay, C.; Basara Akin, I.; Ozgul, A.H.; Adiyaman, S.C.; Yener, A.S.; Secil, M. Machine Learning Analysis of Adrenal Lesions: Preliminary Study Evaluating Texture Analysis in the Differentiation of Adrenal Lesions. Diagn. Interv. Radiol. 2023, 29, 234–243.
- Pamporaki, C.; Berends, A.M.A.; Filippatos, A.; Prodanov, T.; Meuter, L.; Prejbisz, A.; Beuschlein, F.; Fassnacht, M.; Timmers, H.J.L.M.; Nölting, S.; et al. Prediction of Metastatic Pheochromocytoma and Paraganglioma: A Machine Learning Modelling Study Using Data from a Cross-Sectional Cohort. Lancet Digit. Health 2023, 5, e551–e559.
- Zhou, Y.; Zhan, Y.; Zhao, J.; Zhong, L.; Tan, Y.; Zeng, W.; Zeng, Q.; Gong, M.; Li, A.; Gong, L.; et al. CT-Based Radiomics Analysis of Different Machine Learning Models for Discriminating the Risk Stratification of Pheochromocytoma and Paraganglioma: A Multicenter Study. Acad. Radiol. 2024, 31, 2859–2871.
- Luo, G.; Yang, Q.; Chen, T.; Zheng, T.; Xie, W.; Sun, H. An optimized two-stage cascaded deep neural network for adrenal segmentation on CT images. Comput. Biol. Med. 2021, 136, 104749.
- Wang, S.C.; Yin, S.N.; Wang, Z.Y.; Ding, N.; Ji, Y.D.; Jin, L. Evaluation of a fusion model combining deep learning models based on enhanced CT images with radiological and clinical features in distinguishing lipid-poor adrenal adenoma from metastatic lesions. BMC Med. Imaging 2025, 25, 219.
- Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-Learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830.
- Moore, A.; Bell, M. XGBoost, A Novel Explainable AI Technique, in the Prediction of Myocardial Infarction: A UK Biobank Cohort Study. Clin. Med. Insights Cardiol. 2022, 16, 11795468221133611.
- Sokolova, M.; Lapalme, G. A Systematic Analysis of Performance Measures for Classification Tasks. Inf. Process. Manag. 2009, 45, 427–437.
- Sharma, E.; Dahal, S.; Sharma, P.; Bhandari, A.; Gupta, V.; Amgai, B.; Dahal, S. The Characteristics and Trends in Adrenocortical Carcinoma: A United States Population Based Study. J. Clin. Med. Res. 2018, 10, 636–640.
- Majumder, M.; Chiang, C.; Kong, G.; Michael, M.; Sachithanandan, N.; Boehm, E. Approach to the Management of Gastrointestinal Manifestations in Patients With Phaeochromocytoma and Paraganglioma. Clin. Endocrinol. 2025, 103, 21–35.
- Iwamoto, Y.; Kimura, T.; Morimoto, Y.; Sugisaki, T.; Dan, K.; Iwamoto, H.; Sanada, J.; Fushimi, Y.; Shimoda, M.; Fujii, T.; et al. Development of a Prediction Model by Combining Tumor Diameter and Clinical Parameters of Adrenal Incidentaloma. Endocr. J. 2025, 72, 1115–1125.
- Else, T.; Kim, A.C.; Sabolch, A.; Raymond, V.M.; Kandathil, A.; Caoili, E.M.; Jolly, S.; Miller, B.S.; Giordano, T.J.; Hammer, G.D. Adrenocortical Carcinoma. Endocr. Rev. 2014, 35, 282–326.
- Yuan, H.; Kang, B.; Sun, K.; Qin, S.; Ji, C.; Wang, X. CT-Based Radiomics Nomogram for Differentiation of Adrenal Hyperplasia from Lipid-Poor Adenoma: An Exploratory Study. BMC Med. Imaging 2023, 23, 4.
- Rajput, D.; Wang, W.-J.; Chen, C.-C. Evaluation of a Decided Sample Size in Machine Learning Applications. BMC Bioinform. 2023, 24, 48.
- Vabalas, A.; Gowen, E.; Poliakoff, E.; Casson, A.J. Machine Learning Algorithm Validation With a Limited Sample Size. PLoS ONE 2019, 14, e0224365.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.







