1. Introduction
Metabolic dysfunction-associated steatotic liver disease (MASLD) has emerged as the most prevalent chronic liver disorder worldwide, paralleling the global epidemics of obesity, insulin resistance, and type 2 diabetes mellitus [1,2]. Hepatic steatosis represents the earliest and most common manifestation of MASLD and is a key determinant of progression to steatohepatitis, fibrosis, cirrhosis, and hepatocellular carcinoma. Beyond liver-related outcomes, steatosis is also strongly associated with increased cardiometabolic risk and all-cause mortality [3,4]. Therefore, early and scalable identification of hepatic steatosis is essential for effective risk stratification and preventive interventions.
Current diagnostic strategies rely predominantly on imaging modalities, including ultrasonography, controlled attenuation parameter (CAP), and magnetic resonance imaging–proton density fat fraction (MRI-PDFF), or, in selected cases, liver biopsy [5,6]. Although these methods provide reasonable diagnostic accuracy, their cost, limited accessibility, operator dependence, and, in the case of biopsy, invasiveness restrict their applicability for large-scale screening and routine use in primary care settings.
To overcome these limitations, simple clinical indices such as the fatty liver index (FLI) and hepatic steatosis index (HSI) have been developed using conventional regression-based approaches [7,8]. While attractive for their simplicity, these scores are constrained by predefined linear assumptions, limited variable interactions, and reduced performance across heterogeneous populations and metabolic phenotypes. As MASLD is a multifactorial and metabolically complex disease, such simplifications may fail to capture the nonlinear and synergistic relationships among routinely collected clinical and laboratory variables.
Recent advances in machine learning (ML) provide new opportunities to address these limitations. ML algorithms can integrate high-dimensional clinical data and model complex, nonlinear interactions among metabolic risk factors that are difficult to capture using traditional statistical approaches [6,9,10,11,12]. Importantly, these models can be developed using routinely available clinical and laboratory parameters, including anthropometric measures, glycemic indices, lipid profiles, and liver enzymes, thereby enabling non-invasive, low-cost, and scalable detection of hepatic steatosis.
Several classical supervised ML algorithms, including logistic regression, random forest, support vector machines, k-nearest neighbors, and boosting-based methods, have been applied to the prediction of fatty liver disease and ultrasound-based steatosis classification. For example, Weng et al. developed multiple ML models for fatty liver detection based on abdominal ultrasonography [13]. Similarly, Marques et al. evaluated ML classifiers for ultrasound-based hepatic steatosis classification [14]. A recent review by Mahzari et al. further summarized the application of these algorithms in fatty liver prediction, highlighting their established role in this clinical domain [15].
Despite growing interest, systematic comparisons of multiple ML models within leakage-free validation frameworks, together with clinically interpretable outputs, remain limited in the context of MASLD screening.
In this study, we developed and validated ML models to detect hepatic steatosis using exclusively routinely collected clinical and biochemical data. We systematically compared multiple algorithms, assessed model discrimination, and evaluated clinical interpretability and utility across actionable decision thresholds. To enhance transparency and facilitate clinical adoption, we further explored global feature importance and individual-level explanations.
Because the present study was based on structured tabular clinical data composed of routinely collected demographic, anthropometric, clinical, and biochemical variables, the evaluated algorithms were intentionally selected from established supervised learning approaches commonly used in clinical prediction tasks involving this type of input. This design enabled comparison across complementary modeling paradigms, including interpretable linear models and more flexible nonlinear and ensemble-based methods. Given the moderate sample size of the cohort, this modeling framework was considered methodologically appropriate, allowing robust internal validation while preserving clinical interpretability and applicability.
Accordingly, the aim of the present study was to develop an explainable, leakage-resistant, and clinically deployable predictive framework for ultrasound-detected steatosis within the MASLD spectrum using only routinely available clinical and biochemical parameters. This implementation-oriented approach was designed to facilitate integration into primary and metabolic care workflows.
2. Methods
2.1. Dataset and Ultrasonography Outcome Definition
This retrospective observational study used routinely collected clinical data from Ankara Etlik City Hospital, a tertiary referral center in Ankara, Türkiye. Adult patients (≥18 years) who underwent abdominal ultrasonography for any clinical indication between 1 May 2024 and 30 April 2025 were eligible for inclusion. The study protocol was reviewed and approved by the Scientific Research Evaluation and Ethics Committee of Ankara Etlik City Hospital, Ankara, Türkiye (Approval Date and ID: 22.10.2025, AEŞH-BADEK-2025-590).
Ultrasonography was performed using GE Healthcare ultrasound systems (GE Healthcare, Chicago, IL, USA) as part of standard clinical care and interpreted by experienced radiologists. Hepatic steatosis was assessed using established qualitative criteria based on increased hepatic echogenicity relative to the renal cortex and spleen, as documented in standardized radiology reports. The study outcome was ultrasound-detected hepatic steatosis, recorded as present or absent in the radiology report, and used as the supervised learning target representing hepatic steatosis within the MASLD spectrum.
Ultrasonography was deliberately selected as the reference imaging modality to define hepatic steatosis, as it remains the most widely accessible and routinely used first-line diagnostic tool in real-world clinical practice and population-based settings. Although its limitations in detecting mild steatosis, particularly near the 5% fat threshold, are well recognized, it represents the pragmatic standard for initial assessment within the MASLD spectrum. In the present study, ultrasonography was used exclusively to define the binary outcome variable (presence or absence of hepatic steatosis) for supervised ML model development. No ultrasound images were used as model inputs, and no manual or automated image-based feature extraction, quantitative attenuation measurements, radiomic analysis, or deep learning-based image processing were performed. All predictive variables were derived solely from routinely collected clinical, anthropometric, and biochemical data.
Patients were included if ultrasonography reports clearly documented the presence or absence of hepatic steatosis and if required predictor variables were available within 1 month of imaging. Individuals with missing ultrasonography outcomes were excluded. Additional exclusion criteria included significant alcohol consumption, known chronic viral hepatitis, other chronic liver diseases, or the use of hepatotoxic medications, when such information was available in the medical records, in order to align with the diagnostic framework of MASLD. The final dataset consisted of 644 patients, with 515 (80%) allocated to the training set and 129 (20%) to the test set using stratified random sampling.
The dataset used in the present study was derived from the institutional patient cohort contributed by our center to the previously published nationwide multicenter MASLD study by Kirik et al. [16]. In that multicenter investigation, the pooled dataset from several centers was used to estimate the national prevalence of MASLD and the risk of advanced fibrosis among individuals with cardiometabolic risk factors using conventional epidemiological analyses.
The present study represents a secondary, methodologically independent analysis of the institutional dataset from our center. Although a subset of patients overlaps with the cohort contributed to the previously published multicenter MASLD study, the research objective, analytical strategy, and outcomes of the present work are fundamentally different. Specifically, while the multicenter study focused on estimating epidemiological prevalence and fibrosis risk, the current study develops and evaluates ML models to predict ultrasound-detected hepatic steatosis using routinely available clinical and biochemical parameters.
Therefore, the present analysis addresses a distinct research question and should not be interpreted as a subdivision or reanalysis of the previously published study, and no analyses from that work were reused in the present manuscript.
In addition to the subset of patients that contributed to the multicenter dataset, additional eligible patients from the same institutional cohort who met identical inclusion criteria during the study period were incorporated into the present analysis. The inclusion of these additional cases was intended to increase the sample size and improve the statistical robustness of the ML models while maintaining identical variable definitions, outcome criteria, and data collection procedures. Accordingly, the dataset used in the present study represents an expanded institutional cohort rather than a structurally modified dataset.
2.2. Study Design and Modeling
The primary objective was to predict the presence or absence of ultrasound-detected hepatic steatosis within the MASLD spectrum using supervised ML algorithms applied to routinely collected clinical, anthropometric, and biochemical variables. All analyses were performed in Python 3.11.5 using standard scientific libraries, including pandas (v2.1.1), numpy (v1.26.0), scikit-learn (v1.3.1), and matplotlib (v3.8.0), together with xgboost (v2.0.0) for gradient boosting and SHapley Additive exPlanations (SHAP) (v0.44.0) for model explainability.
2.3. Preprocessing and Feature Selection (Single Pipeline)
The ML models in the present study were trained using tabular clinical data derived from routinely collected electronic health records. The original dataset included 86 candidate variables per patient, encompassing demographic characteristics, anthropometric measurements, clinical comorbidities, medication use, routine laboratory parameters, and derived metabolic or cardiovascular indices. These variables are summarized in Table 1 and Table 2. Ultrasonography was used exclusively to define the binary outcome variable (presence or absence of hepatic steatosis) and was not used as an input feature in the ML models.
To prevent information leakage and keep performance estimates unbiased, all preprocessing and feature-selection procedures were treated as integral components of the model rather than as separate analytical steps: they were implemented within a single scikit-learn pipeline and refitted inside each cross-validation iteration. Numeric variables were median-imputed and standardized to z-scores, whereas categorical variables were mode-imputed and one-hot encoded with unknown levels ignored at transform time.
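The preprocessing described above can be sketched with a scikit-learn `ColumnTransformer`. This is a minimal illustration, not the study's code: the column names and the toy data are hypothetical, and `sparse_threshold=0.0` is set only so the output is a dense array that is easy to inspect.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical column names standing in for the study's variables.
numeric_cols = ["weight_kg", "alt_iu_l"]
categorical_cols = ["weekly_exercise"]

# Numeric branch: median imputation, then z-score standardization.
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
# Categorical branch: mode imputation, then one-hot encoding that
# ignores levels not seen during fitting.
categorical_steps = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
preprocess = ColumnTransformer(
    [("num", numeric_steps, numeric_cols),
     ("cat", categorical_steps, categorical_cols)],
    sparse_threshold=0.0,  # always return a dense array
)

# Toy training and validation partitions (invented values).
train = pd.DataFrame({
    "weight_kg": [70.0, np.nan, 90.0, 80.0],
    "alt_iu_l": [20.0, 35.0, np.nan, 25.0],
    "weekly_exercise": pd.Series(["yes", "no", np.nan, "no"], dtype=object),
})
valid = pd.DataFrame({
    "weight_kg": [np.nan],
    "alt_iu_l": [30.0],
    "weekly_exercise": pd.Series(["sometimes"], dtype=object),  # unseen level
})

# All statistics (medians, modes, means, SDs, levels) come from `train` only.
preprocess.fit(train)
valid_matrix = preprocess.transform(valid)
```

Because the imputation and scaling statistics are learned only from the partition passed to `fit`, placing this transformer inside the cross-validated pipeline keeps validation data out of the preprocessing step. The unseen level `"sometimes"` is encoded as an all-zero one-hot block.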
Feature selection was performed using logistic regression with an Elastic Net penalty (SAGA solver), combining L1 regularization for sparsity and L2 regularization for stability under collinearity. Predictors with non-zero coefficients after regularization were considered selected. Because one-hot encoding generates multiple columns per original variable, variable-level importance was summarized as the L2 norm of all associated coefficients. Variables were ranked by this aggregated importance, and the top 20 variables were retained as the feature panel for subsequent modeling and evaluation, with rankings derived exclusively from the training partitions within each resampling iteration.
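The variable-level ranking can be sketched as follows. The synthetic data, the column-to-variable mapping, and the values `l1_ratio=0.5` and `C=0.1` are assumptions made for illustration; the source does not report the penalty settings used.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def aggregate_importance(coefs, column_to_variable):
    """L2 norm of the coefficients that belong to each original variable."""
    squared = {}
    for coef, variable in zip(coefs, column_to_variable):
        squared[variable] = squared.get(variable, 0.0) + float(coef) ** 2
    return {variable: total ** 0.5 for variable, total in squared.items()}


def top_k_variables(importance, k):
    """Variables with non-zero importance, highest first, truncated to k."""
    ranked = sorted(importance, key=importance.get, reverse=True)
    return [v for v in ranked if importance[v] > 0][:k]


# Synthetic, already-encoded design matrix (assumption: 4 encoded columns,
# two of which are one-hot levels of the same original variable).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=200) > 0).astype(int)
column_to_variable = ["weight", "exercise", "exercise", "noise"]

# Elastic Net logistic regression with the SAGA solver; penalty strength
# and mixing ratio are illustrative, not the study's values.
selector = LogisticRegression(
    penalty="elasticnet", solver="saga", l1_ratio=0.5, C=0.1, max_iter=5000
)
selector.fit(X, y)

importance = aggregate_importance(selector.coef_.ravel(), column_to_variable)
panel = top_k_variables(importance, k=20)
```

In a leakage-free workflow this ranking would be recomputed on the training partition of each resample rather than once on the full dataset.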
The evaluated algorithms were selected to represent complementary modeling strategies commonly used for structured tabular clinical data, allowing comparison between linear, nonlinear, and ensemble-based approaches within the same analytical pipeline.
2.4. Data Splitting and Resampling
To obtain an unbiased estimate of performance, patients were randomly allocated using stratified sampling to training (80%; n = 515) and test (20%; n = 129) sets, preserving the class distribution of ultrasound-detected hepatic steatosis vs. no ultrasound-detected hepatic steatosis. The same train–test partition was applied across all algorithms to ensure comparability. Within the training data, we used repeated stratified k-fold cross-validation with 5 folds and 5 repetitions (total 25 resamples; fixed random seed) to benchmark candidate models. At each fold, the full pipeline (imputation → encoding/standardization → Elastic-Net selection → classifier) was fitted on the training split and evaluated on the corresponding validation split.
To ensure fair and directly comparable evaluation across classifiers, the same stratified train–test split and identical random seed were used for all models. All cross-validation procedures were performed using fixed random states, and no classifier-specific resampling or data re-partitioning was applied. The held-out test set was reserved for independent evaluation, whereas the performance metrics reported in Section 3 primarily reflect the repeated cross-validation estimates obtained from the training data, which provide a more stable assessment of model discrimination across resampling iterations.
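The splitting and resampling scheme can be sketched as below. The data are synthetic, the seed value is hypothetical (the source states only that a fixed seed was used), and a scaler plus logistic regression stands in for the full pipeline.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (
    RepeatedStratifiedKFold,
    cross_val_score,
    train_test_split,
)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

SEED = 42  # hypothetical value

# Synthetic stand-in for the 644-patient cohort.
X, y = make_classification(n_samples=644, n_features=20, random_state=SEED)

# Stratified 80/20 split, shared by every algorithm.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=SEED
)

# Simplified pipeline; the study's version also imputes, encodes, and selects.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=2000)),
])

# 5 folds x 5 repetitions = 25 resamples, all inside the training data.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=5, random_state=SEED)
auroc = cross_val_score(pipeline, X_train, y_train, cv=cv, scoring="roc_auc")

print(f"AUROC: {auroc.mean():.2f} ± {auroc.std():.2f} over {len(auroc)} resamples")
```

Passing the whole `Pipeline` to `cross_val_score` means every step is refitted on the training portion of each fold, which is what keeps the validation folds free of leakage.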
2.5. Classifiers
The evaluated algorithms were selected to represent widely used classification approaches for tabular clinical datasets. This design allows for comparison between interpretable linear models, such as logistic regression, and more flexible nonlinear approaches, including ensemble and instance-based algorithms commonly applied in clinical prediction modeling. The evaluated classifiers comprised decision tree, AdaBoost [base estimator: decision tree; Stagewise Additive Modeling using a Multiclass Exponential loss function (SAMME)], random forest, XGBoost (eval_metric = logloss), gradient boosting, support vector machine with probability estimates enabled, k-nearest neighbors, multilayer perceptron (maximum iterations = 5000), Gaussian Naive Bayes, and logistic regression (maximum iterations = 2000). A unified benchmarking strategy was applied across all classifiers rather than extensive model-specific hyperparameter optimization. The classifiers were evaluated using standard scikit-learn and XGBoost implementations with largely predefined settings. Limited parameter adjustments were made only for technical or convergence-related reasons, including enabling probability estimates for support vector machines, setting the XGBoost evaluation metric to log loss, using the SAMME algorithm for AdaBoost, and increasing the maximum iteration limits for multilayer perceptron and logistic regression. No separate grid search, randomized search, Bayesian optimization, or model-specific hyperparameter tuning procedure was applied. This strategy was selected to ensure a consistent and comparable evaluation of different algorithm families within the same repeated stratified 5-fold cross-validation framework, while reducing the risk of overfitting in a moderate-sized retrospective cohort. For algorithms without native probability outputs, decision-function scores were used for receiver operating characteristic (ROC) analyses.
To provide methodological context, we evaluated a diverse set of supervised ML algorithms representing complementary modeling paradigms. Logistic regression was included as a transparent linear baseline with strong interpretability and well-calibrated probability estimates, but limited capacity to capture nonlinear relationships. Tree-based methods (a single decision tree and the random forest, gradient boosting, and XGBoost ensembles) were selected for their ability to model nonlinear effects and higher-order interactions without explicit feature engineering; random forest emphasizes variance reduction through bagging, whereas boosting-based methods prioritize bias reduction through sequential error correction, at the cost of increased complexity and reduced interpretability.
Support vector machines were included for their effectiveness in high-dimensional spaces and robustness to overfitting, although probability calibration may be less reliable. k-nearest neighbors was evaluated as a nonparametric, instance-based learner sensitive to local structure but prone to performance degradation in high-dimensional settings. Gaussian naïve Bayes provided a probabilistic baseline with strong bias assumptions and computational efficiency, while the multilayer perceptron represented a shallow neural-network approach capable of modeling nonlinearities but sensitive to sample size and hyperparameter choices. Collectively, this model set enabled a balanced comparison between interpretability, flexibility, and predictive performance.
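The classifier set and its limited parameter adjustments can be sketched as a dictionary of scikit-learn estimators. This is an illustrative reconstruction, not the study's code: the seed is hypothetical, and XGBoost is omitted to keep the block free of extra dependencies (with the xgboost package it would be added as `XGBClassifier(eval_metric="logloss")`).

```python
import inspect

from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

SEED = 42  # hypothetical value

# The `algorithm` argument exists in scikit-learn 1.3 (the version cited in
# the text) but is deprecated in later releases; pass it only if accepted.
ada_kwargs = {}
if "algorithm" in inspect.signature(AdaBoostClassifier.__init__).parameters:
    ada_kwargs["algorithm"] = "SAMME"

classifiers = {
    "decision_tree": DecisionTreeClassifier(random_state=SEED),
    "adaboost": AdaBoostClassifier(random_state=SEED, **ada_kwargs),
    "random_forest": RandomForestClassifier(random_state=SEED),
    "gradient_boosting": GradientBoostingClassifier(random_state=SEED),
    "svm": SVC(probability=True, random_state=SEED),  # probability estimates on
    "knn": KNeighborsClassifier(),
    "mlp": MLPClassifier(max_iter=5000, random_state=SEED),
    "naive_bayes": GaussianNB(),
    "logistic_regression": LogisticRegression(max_iter=2000),
}
```

All other hyperparameters are left at library defaults, mirroring the unified benchmarking strategy described above.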
2.6. Evaluation Metrics and Interpretability
Predictive performance was summarized using accuracy, sensitivity (recall), specificity, positive and negative predictive values (PPVs/NPVs), F1 score, Youden’s J index, balanced accuracy, and area under the receiver operating characteristic curve (AUROC). Receiver-operating-characteristic curves were generated to visualize diagnostic discrimination. For each algorithm, results are reported as mean ± standard deviation across the 25 resamples to enable robust, distribution-aware comparisons. For interpretability, we additionally fitted a logistic-regression model (solver = liblinear) on standardized predictors and applied SHAP to quantify feature contributions, highlighting clinical variables with consistently high impact on ultrasound-detected hepatic steatosis.
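For a linear model, SHAP values in log-odds space have a simple closed form when features are treated as independent: each attribution is the coefficient multiplied by the feature's deviation from its background mean. The sketch below computes this directly with NumPy on synthetic data, as an assumption-labeled illustration rather than a call into the SHAP package; the three features and their distributions are invented.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic predictors (hypothetical stand-ins for three clinical variables).
rng = np.random.default_rng(0)
X_raw = rng.normal(loc=[80.0, 30.0, 5.0], scale=[15.0, 10.0, 1.0], size=(300, 3))
signal = 0.05 * (X_raw[:, 0] - 80.0) - 0.5 * (X_raw[:, 2] - 5.0)
y = (signal + rng.normal(size=300) > 0).astype(int)

# Logistic regression (liblinear solver) on standardized predictors.
X = StandardScaler().fit_transform(X_raw)
model = LogisticRegression(solver="liblinear").fit(X, y)

# Closed-form linear SHAP under feature independence, in log-odds units.
background_mean = X.mean(axis=0)
shap_values = model.coef_[0] * (X - background_mean)  # (n_samples, n_features)
base_value = model.intercept_[0] + model.coef_[0] @ background_mean

# Global importance: mean absolute attribution per feature.
global_importance = np.abs(shap_values).mean(axis=0)
```

The attributions are additive: for every patient, `base_value` plus the row sum of `shap_values` equals the model's log-odds output, which is why the sign of each SHAP value follows the sign of the corresponding coefficient.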
In addition to commonly reported performance metrics such as accuracy, precision, recall, and F1 score, we report Youden’s J index as a single-number summary of diagnostic effectiveness. Youden’s J (sensitivity + specificity − 1) quantifies the separation between the true-positive and false-positive rates at a given decision threshold and is widely used in diagnostic and screening settings where balanced performance across classes is clinically important. In the context of MASLD screening, where both missed cases (false negatives) and unnecessary follow-up investigations (false positives) carry clinical and economic consequences, Youden’s J offers complementary information by emphasizing the sensitivity–specificity trade-off rather than overall correctness alone.
In addition to overall accuracy, we report balanced accuracy, defined as the average of sensitivity and specificity. In binary classification, balanced accuracy is equivalent to macro-averaged recall, as it weights both classes equally. Although the dataset is approximately balanced, balanced accuracy provides a class-symmetric assessment that is independent of class prevalence and facilitates comparison with diagnostic metrics such as sensitivity, specificity, and Youden’s J index. For ML models, classification-based performance metrics (sensitivity, specificity, accuracy, balanced accuracy, and Youden’s J) were computed using a fixed, pre-specified probability threshold of 0.5, applied consistently across all cross-validation resamples and the test set.
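The threshold-based metrics can be written out explicitly from the confusion-matrix counts. The sketch below is illustrative: the labels and probabilities are invented, and treating a probability exactly equal to the threshold as positive is an assumption.

```python
def threshold_metrics(y_true, y_prob, threshold=0.5):
    """Confusion-matrix metrics at a fixed probability threshold.

    Assumes both classes are present and at least one prediction falls on
    each side of the threshold, so that no denominator is zero.
    """
    tp = fp = tn = fn = 0
    for truth, prob in zip(y_true, y_prob):
        predicted = 1 if prob >= threshold else 0
        if predicted == 1 and truth == 1:
            tp += 1
        elif predicted == 1 and truth == 0:
            fp += 1
        elif predicted == 0 and truth == 0:
            tn += 1
        else:
            fn += 1
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "f1": 2 * tp / (2 * tp + fp + fn),
        "youden_j": sensitivity + specificity - 1,
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }


# Invented example: 4 patients with steatosis, 4 without.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_prob = [0.9, 0.7, 0.6, 0.4, 0.8, 0.6, 0.2, 0.1]
metrics = threshold_metrics(y_true, y_prob)
```

Youden's J and balanced accuracy carry the same information at a given threshold, since J = 2 × balanced accuracy − 1.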
2.7. Comparative Evaluation of Simple Steatosis Scores and ML Models
To contextualize the clinical performance of the proposed ML models, we evaluated two widely used rule-based steatosis indices, HSI and FLI, using their originally proposed diagnostic thresholds. Importantly, these indices were applied as intended, without recalibration, refitting, or threshold optimization [17,18].
For HSI, values < 30 were considered to rule out hepatic steatosis, values ≥ 36 to rule in steatosis, and values between 30 and 35.9 were classified as indeterminate and excluded from performance analyses. For FLI, values < 30 were considered to rule out steatosis, values ≥ 60 to rule in steatosis, and values between 30 and 59.9 were considered indeterminate and excluded. Diagnostic performance metrics for HSI and FLI were calculated only among classifiable patients, in accordance with their original design [18]. Sensitivity, specificity, accuracy, and AUROC were calculated using ultrasound-detected hepatic steatosis as the reference standard. For rule-based indices, AUROC was computed using the binary rule-based outputs after exclusion of indeterminate cases. In contrast, AUROC for ML models was calculated using continuous predicted probabilities across the entire cohort. This approach avoids implicit recalibration of established indices while allowing a fair, clinically meaningful comparison of discrimination and population coverage.
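The dual-threshold handling of HSI and FLI can be sketched as a small triage function. The cutoffs are those stated above; the scores and outcomes are invented, and the helper names are hypothetical.

```python
HSI_CUTOFFS = (30.0, 36.0)  # rule out below 30, rule in at 36 or above
FLI_CUTOFFS = (30.0, 60.0)  # rule out below 30, rule in at 60 or above


def triage(score, rule_out_below, rule_in_from):
    """Map an index value to rule_out, rule_in, or indeterminate."""
    if score < rule_out_below:
        return "rule_out"
    if score >= rule_in_from:
        return "rule_in"
    return "indeterminate"


def evaluate_index(scores, has_steatosis, cutoffs):
    """Sensitivity and specificity among classifiable patients, plus coverage.

    Assumes at least one classifiable patient in each outcome class.
    """
    tp = fp = tn = fn = 0
    for score, truth in zip(scores, has_steatosis):
        label = triage(score, *cutoffs)
        if label == "indeterminate":
            continue  # excluded from performance analyses
        if label == "rule_in":
            tp += truth == 1
            fp += truth == 0
        else:
            tn += truth == 0
            fn += truth == 1
    classified = tp + fp + tn + fn
    return {
        "coverage": classified / len(scores),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Invented HSI values and ultrasound outcomes for six patients.
hsi_scores = [28.0, 31.5, 36.0, 40.2, 29.9, 37.0]
steatosis = [0, 1, 1, 1, 1, 0]
hsi_result = evaluate_index(hsi_scores, steatosis, HSI_CUTOFFS)
```

Reporting coverage alongside sensitivity and specificity makes explicit how many patients the index leaves unclassified, which is the population-coverage aspect of the comparison with ML models.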
3. Results
3.1. Baseline Characteristics
Baseline demographic, anthropometric, clinical, and biochemical characteristics of participants according to hepatic steatosis status are summarized in Table 1. Among 644 individuals, 322 (50%) had ultrasound-detected hepatic steatosis and 322 (50%) had normal liver echogenicity.
Participants with hepatic steatosis were younger on average (59.8 ± 16.0 vs. 62.4 ± 19.1 years, p = 0.004). Sex distribution and educational status were comparable between groups (p > 0.05). Anthropometric measures revealed substantial differences: individuals with hepatic steatosis exhibited greater body weight (79.8 ± 18.2 vs. 70.4 ± 14.6 kg, p < 0.001), higher body mass index (BMI) (29.0 ± 6.6 vs. 26.2 ± 5.8 kg/m2, p < 0.001), and larger waist circumference (95.7 ± 18.2 vs. 87.7 ± 17.2 cm, p < 0.001). They were also slightly taller (165.9 ± 10.0 vs. 164.2 ± 10.2 cm, p = 0.043). Regular weekly exercise was significantly less frequent in the hepatic steatosis group (13.7% vs. 22.1%, p = 0.005).
Hemodynamic parameters showed modest but significant elevations in the hepatic steatosis cohort: systolic blood pressure (126.6 ± 17.3 vs. 124.3 ± 16.1 mmHg, p = 0.026), diastolic blood pressure (75.2 ± 11.1 vs. 72.6 ± 9.7 mmHg, p < 0.001), and heart rate (82.7 ± 13.2 vs. 80.4 ± 12.9 bpm, p = 0.031).
Regarding comorbidities, diabetes mellitus (51.2% vs. 36.0%, p < 0.001) and dyslipidemia (32.0% vs. 18.3%, p < 0.001) were significantly more prevalent in participants with ultrasound-detected hepatic steatosis, while hypertension and cardiovascular diseases did not differ significantly. Obstructive sleep apnea was more common among hepatic steatosis participants (1.6% vs. 0%, p = 0.025). Medication use paralleled disease prevalence: metformin (25.8% vs. 12.4%, p < 0.001), sodium–glucose cotransporter-2 inhibitors (SGLT2-i) (14.0% vs. 8.1%, p = 0.017), and statins (20.2% vs. 12.1%, p = 0.005) were all used more frequently in participants with ultrasound-detected hepatic steatosis.
Hematologic evaluation showed slightly higher hemoglobin levels (11.8 ± 2.6 vs. 11.1 ± 2.6 g/dL, p < 0.001) and modestly lower lymphocyte counts (p = 0.004) in the hepatic steatosis group, whereas other cell counts were comparable.
Biochemically, participants with ultrasound-detected hepatic steatosis had a more adverse metabolic profile characterized by higher fasting glucose (144.5 ± 81.2 vs. 126.1 ± 63.3 mg/dL, p < 0.001), glycated hemoglobin (HbA1c) (7.11 ± 2.78 vs. 6.51 ± 2.34%, p < 0.001), and triglycerides (170.9 ± 116.7 vs. 134.4 ± 153.6 mg/dL, p < 0.001). Total cholesterol was slightly higher (p = 0.039), while low-density lipoprotein (LDL) and high-density lipoprotein (HDL) cholesterol did not differ significantly.
Liver enzymes did not show a uniform pattern of elevation in the hepatic steatosis group. AST and GGT were numerically higher but did not reach statistical significance, whereas ALT did not demonstrate a higher mean value in participants with hepatic steatosis. Albumin was higher in participants with ultrasound-detected hepatic steatosis (37.7 ± 4.6 vs. 36.0 ± 5.1 g/L, p < 0.001), while bilirubin fractions, blood urea nitrogen (BUN), creatinine (Cr), estimated glomerular filtration rate (eGFR), and thyroid indices were comparable (p > 0.05).
Overall, participants with hepatic steatosis exhibited greater adiposity, insulin resistance, and metabolic derangements, consistent with metabolic dysfunction-related patterns observed within the MASLD spectrum (Table 1). Derived anthropometric, metabolic, cardiovascular, hepatic, hematologic, and renal indices are summarized in Table 2.
Participants with ultrasound-detected hepatic steatosis demonstrated consistently higher values across multiple body composition indices, including waist-to-height ratio, body fat percentage, ponderal index, conicity index, relative fat mass, and visceral adiposity index (all p < 0.01), reflecting excess central and visceral adiposity, key drivers of hepatic fat accumulation.
Metabolic indices integrating glycemic status and lipid metabolism were markedly elevated in the ultrasound-detected hepatic steatosis group. The triglyceride–glucose index (TyG), TyG-BMI, TyG–waist circumference (TyG-WC), TyG–waist-to-height ratio (TyG-WHtR), lipid accumulation product (LAP), and atherogenic index of plasma (AIP) were all significantly higher (all p < 0.001), indicating greater insulin resistance and dyslipidemia among individuals with steatosis. These composite indices capture nonlinear metabolic interactions that are not fully represented by isolated glucose or lipid measurements.
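For orientation, the TyG family of indices can be computed as sketched below. The formulas are the commonly published definitions and are an assumption here, since the text does not state the exact expressions used; the patient values are invented.

```python
import math


def tyg_index(triglycerides_mg_dl, fasting_glucose_mg_dl):
    """TyG = ln(triglycerides [mg/dL] x fasting glucose [mg/dL] / 2)."""
    return math.log(triglycerides_mg_dl * fasting_glucose_mg_dl / 2.0)


def tyg_composites(triglycerides_mg_dl, fasting_glucose_mg_dl,
                   bmi_kg_m2, waist_cm, height_cm):
    """TyG multiplied by BMI, waist circumference, or waist-to-height ratio."""
    tyg = tyg_index(triglycerides_mg_dl, fasting_glucose_mg_dl)
    return {
        "TyG": tyg,
        "TyG-BMI": tyg * bmi_kg_m2,
        "TyG-WC": tyg * waist_cm,
        "TyG-WHtR": tyg * waist_cm / height_cm,
    }


# Invented patient: TG 150 mg/dL, glucose 100 mg/dL, BMI 29, WC 96 cm, height 166 cm.
indices = tyg_composites(150.0, 100.0, 29.0, 96.0, 166.0)
```

The logarithm of a product followed by multiplication with an adiposity measure is what makes these indices nonlinear combinations of their inputs.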
Cardiovascular risk indices, including Castelli I and II ratios, non-HDL cholesterol, remnant cholesterol, and rate pressure product, were also significantly increased in the hepatic steatosis group, underscoring the close association between hepatic steatosis and adverse cardiometabolic risk profiles. In contrast, pulse pressure did not differ between groups.
Regarding liver-related indices, the HSI was substantially higher in participants with ultrasound-detected hepatic steatosis (p < 0.001), supporting internal consistency with the ultrasound-based outcome definition. While fibrosis-oriented scores such as aspartate aminotransferase (AST)-to-platelet ratio index (APRI), fibrosis-4 index (FIB-4), and nonalcoholic fatty liver disease (NAFLD) fibrosis score did not differ significantly, this likely reflects the predominance of early-stage steatosis rather than advanced fibrotic disease in the cohort. The albumin–bilirubin score and the hemoglobin–albumin–lymphocyte–platelet (HALP) score differed modestly but significantly, suggesting subtle alterations in hepatic synthetic function and systemic inflammatory–nutritional balance.
Finally, selected renal and metabolic ratios, including uric acid-to-HDL ratio, were higher in the hepatic steatosis group, consistent with emerging evidence linking hyperuricemia and renal–metabolic interactions to steatotic liver disease. Overall, Table 2 demonstrates that hepatic steatosis is associated with a broad constellation of adverse adiposity-related, metabolic, cardiovascular, and hepatic indices, many of which later emerged as key predictors in the ML models.
3.2. Model Development and Evaluation
We trained and evaluated multiple supervised classifiers to discriminate ultrasound-detected hepatic steatosis from its absence using routinely collected clinical, anthropometric, and biochemical variables. Model performance was assessed using repeated stratified cross-validation and summarized by accuracy, sensitivity, specificity, predictive values, F1 score, Youden’s J, balanced accuracy, and AUROC (Table 3, Figure 1).
To examine the contribution of preprocessing and feature selection, we compared model performance using the full original feature set and the Elastic Net-selected 20-feature panel within the same cross-validation framework. Across models, performance with the reduced feature panel was comparable to, and in some cases modestly improved over, the full feature set, while showing lower variance across cross-validation folds. These findings indicate that the primary performance gains were driven by the removal of redundant and collinear variables, resulting in more stable and interpretable models without loss of discrimination.
3.3. Elastic Net-Based Variable Selection
The Elastic Net procedure converged and produced a sparse solution, shrinking many coefficients to zero while retaining a compact set of informative predictors. To avoid information leakage, preprocessing and selection were implemented inside a single scikit-learn pipeline during cross-validation. Numeric variables were median-imputed and standardized (z-score); categorical variables were mode-imputed and one-hot encoded with unknown levels ignored at transform time. Predictors (or one-hot levels) with non-zero coefficients after regularization were considered selected. Because one-hot encoding yields multiple columns per original variable, variable-level importance was summarized as the L2 norm of all associated coefficients.
Based on aggregated variable-level importance, the top 20 variables associated with ultrasound-detected hepatic steatosis were, in descending order: weight (kg), ponderal index, FIB-4, BUN/Cr ratio, height (cm), APRI, Castelli II, visceral adiposity index (VAI), LDL cholesterol (mg/dL), TyG-WC, uric acid (UA)/Cr ratio, BUN (mg/dL), NAFLD fibrosis score, albumin (g/L), AST (IU/L), TyG-BMI, ALT (IU/L), weekly exercise history, Cr (mg/dL), and a body shape index (ABSI). This ranking is shown in Supplementary Figure S1, while encoded-level effects (e.g., per one-hot level) are summarized in Supplementary Figure S2. In the logistic framework, positive coefficients indicate higher log-odds of hepatic steatosis and negative coefficients indicate lower log-odds. This feature-selection step establishes a fixed 20-variable panel for all subsequent model development and reporting.
3.4. Accuracy
Overall accuracy was moderate. The highest mean accuracies were observed for Logistic Regression and Gradient Boosting (each 0.65 ± 0.03), followed by Random Forest and Support Vector Machine (SVM)/XGBoost (≈0.63 ± 0.03–0.04). Lower-tier models achieved ≤0.61 on average [Multilayer Perceptron (MLP) 0.61 ± 0.04; k-NN 0.59 ± 0.03; Decision Tree 0.58 ± 0.04; Naïve Bayes 0.58 ± 0.06; AdaBoost 0.57 ± 0.0]. Thus, Logistic Regression and Gradient Boosting were the most accurate learners.
3.5. F1 Score
F1 scores mirrored accuracy: Logistic Regression and Gradient Boosting led with 0.65 ± 0.04, while Random Forest/XGBoost were slightly lower (≈0.62 ± 0.03–0.04). SVM/MLP were around 0.60 ± 0.04; k-NN 0.59 ± 0.04, Decision Tree 0.58 ± 0.05, Naïve Bayes 0.59 ± 0.11, and AdaBoost 0.57 ± 0.05. Hence, Logistic Regression and Gradient Boosting achieved the best precision–recall balance.
3.6. Sensitivity
Mean sensitivity was highest for Gradient Boosting (0.65 ± 0.06) and Logistic Regression (0.64 ± 0.06). Naïve Bayes reached a similar mean (0.65 ± 0.20) but with very wide dispersion, indicating instability. SVM traded sensitivity (0.57 ± 0.04) for higher specificity (see below).
3.7. Specificity
Specificity peaked with SVM (0.69 ± 0.06), followed by Random Forest (0.67 ± 0.05) and Logistic Regression (0.67 ± 0.06). Gradient Boosting yielded 0.65 ± 0.06. Thus, the most conservative false-positive control was achieved by SVM.
3.8. Youden’s J
Youden’s J was highest for Logistic Regression and Gradient Boosting (each 0.30 ± 0.07), then Random Forest (0.27 ± 0.07) and SVM/XGBoost (≈0.25 ± 0.06–0.07). Lower-tier models were ≤0.21 (e.g., MLP 0.21 ± 0.07; k-NN 0.18 ± 0.07; Decision Tree 0.16 ± 0.08; Naïve Bayes 0.15 ± 0.12; AdaBoost 0.15 ± 0.08). Accordingly, Logistic Regression and Gradient Boosting maximized the sensitivity–specificity composite.
3.9. AUROC
Mean AUROC was highest for Logistic Regression (0.71 ± 0.04), followed by Random Forest (0.69 ± 0.04), Gradient Boosting (0.68 ± 0.04), SVM (0.68 ± 0.04), and XGBoost (0.67 ± 0.03). Remaining models were ≤0.65 (MLP 0.65 ± 0.03, Naïve Bayes 0.63 ± 0.06, k-NN 0.62 ± 0.04, Decision Tree 0.58 ± 0.04, AdaBoost 0.57 ± 0.04).
3.10. Balanced Accuracy
Balanced accuracy corroborated the above: Logistic Regression (0.66 ± 0.06) and Gradient Boosting (0.65 ± 0.06) led, with Random Forest (0.64 ± 0.04), SVM (0.63 ± 0.05), and XGBoost (0.62 ± 0.06) close behind; other models were ≤0.60.
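As an illustrative aside, the two composite metrics reported in Sections 3.8 and 3.10 can be derived from a single confusion matrix. The sketch below uses hypothetical counts for one validation fold (not study data), and the function name `composite_metrics` is introduced here for illustration only.

```python
def composite_metrics(tp: int, fn: int, tn: int, fp: int) -> dict:
    """Return sensitivity, specificity, Youden's J and balanced accuracy."""
    sensitivity = tp / (tp + fn)  # true-positive rate
    specificity = tn / (tn + fp)  # true-negative rate
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "youden_j": sensitivity + specificity - 1.0,
        "balanced_accuracy": (sensitivity + specificity) / 2.0,
    }


# Hypothetical fold with 100 steatosis cases and 100 controls.
m = composite_metrics(tp=64, fn=36, tn=67, fp=33)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

For any single confusion matrix, balanced accuracy equals (J + 1)/2, so the two metrics rank models identically within a fold and differ only in scale.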
3.11. ROC Visualization
Mean ROC curves (Figure 2) showed a reproducible lift above chance for the top-tier models, consistent with their AUROC ranking in Table 3. Upper-envelope curves corresponded to Logistic Regression, Random Forest, Gradient Boosting, SVM, and XGBoost (Figure 2; Table 3).
3.12. PPV/NPV
PPV/NPV reflected each model’s sensitivity–specificity balance. For example, Logistic Regression achieved PPV 0.66 ± 0.04 and NPV 0.65 ± 0.03, while SVM showed PPV 0.65 ± 0.05 with lower NPV (0.61 ± 0.03) owing to its specificity-oriented profile. Random Forest and Gradient Boosting remained in the mid-0.60s for both indices. Hence, Logistic Regression and Gradient Boosting offered the most balanced predictive value profiles among top performers (
Table 3).
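The dependence of PPV and NPV on sensitivity, specificity, and disease prevalence can be made explicit with Bayes' rule. The following minimal sketch takes the mean sensitivity and specificity reported above for Logistic Regression; the prevalence values are assumptions chosen for illustration and are not taken from the study cohort.

```python
def predictive_values(sensitivity: float, specificity: float, prevalence: float):
    """Return (PPV, NPV) implied by sensitivity, specificity and prevalence."""
    tp = sensitivity * prevalence                  # true-positive fraction
    fn = (1.0 - sensitivity) * prevalence          # false-negative fraction
    tn = specificity * (1.0 - prevalence)          # true-negative fraction
    fp = (1.0 - specificity) * (1.0 - prevalence)  # false-positive fraction
    return tp / (tp + fp), tn / (tn + fn)


# Assumed prevalences; 0.64 and 0.67 are the reported mean sensitivity
# and specificity for Logistic Regression.
for prevalence in (0.25, 0.50):
    ppv, npv = predictive_values(0.64, 0.67, prevalence)
    print(f"prevalence={prevalence:.2f}  PPV={ppv:.2f}  NPV={npv:.2f}")
```

Under the assumed balanced prevalence of 0.50, the implied values are close to the reported PPV and NPV, whereas at lower prevalence the same classifier would yield a lower PPV and a higher NPV. This matters when transferring the models to screening populations with a different case mix.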
3.13. SHAP Summaries and Directionality
To elucidate model behavior, we computed SHAP summary plots from a logistic-regression explainer fitted on standardized predictors (
Figure 3). The global importance profile showed a right-skewed distribution, indicating that a compact subset of the selected variables accounts for most of the predictive signal. The direction of effects was clinically coherent: features with higher SHAP magnitudes systematically increased (positive SHAP) or decreased (negative SHAP) the predicted probability of ultrasound-detected hepatic steatosis in line with their regularized coefficients. Local (beeswarm) patterns revealed heterogeneous but directionally stable effects across individuals, with no single feature exerting idiosyncratic influence limited to a narrow subgroup. Collectively, the Elastic Net ranking and SHAP attributions were concordant, reinforcing that the model relies on interpretable, routinely available parameters, and supporting the plausibility and robustness of observed discrimination. Although Elastic Net-based feature selection identified a clinically coherent predictor set, we did not perform a formal fold-wise feature stability analysis based on selection frequencies across all cross-validation resamples. Therefore, the reported feature panel should be interpreted as a parsimonious and clinically interpretable predictor set rather than a definitive ranking of biomarker importance. Given the presence of correlated anthropometric, metabolic, hepatic, and lipid-derived indices, some variability in selected predictors across resampling iterations would be expected. Future studies using larger external cohorts should further evaluate feature stability through repeated selection-frequency analyses or bootstrap-based stability selection approaches.
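The agreement between SHAP directionality and the regularized coefficients follows from the structure of a linear model. With an independent-feature background, the SHAP value of feature j in log-odds space is the coefficient multiplied by the deviation of the feature from its background mean, and on standardized predictors that mean is zero. The sketch below illustrates this closed form with hypothetical coefficients and a hypothetical patient; the numerical values are assumptions, as is the use of log-odds as the attribution scale.

```python
import math

# Hypothetical standardized-scale coefficients (not the fitted model).
coef = {"TyG-BMI": 0.40, "albumin": -0.25, "FIB-4": 0.15}
intercept = -0.10
# Standardized predictors have zero mean in the background data.
background_mean = {name: 0.0 for name in coef}


def linear_shap(z: dict) -> dict:
    """SHAP values (log-odds scale) for a linear model: beta_j * (z_j - mean_j)."""
    return {name: coef[name] * (z[name] - background_mean[name]) for name in coef}


# Hypothetical patient expressed in standard-deviation units.
patient = {"TyG-BMI": 1.5, "albumin": 1.0, "FIB-4": -0.5}
phi = linear_shap(patient)

# Local accuracy: base value plus attributions recovers the model output.
base_value = intercept + sum(coef[n] * background_mean[n] for n in coef)
log_odds = base_value + sum(phi.values())
probability = 1.0 / (1.0 + math.exp(-log_odds))

for name, value in phi.items():
    print(f"{name}: {value:+.3f}")
print(f"predicted probability: {probability:.3f}")
```

In this example the below-average FIB-4 value yields a negative attribution despite a positive coefficient, which is why beeswarm plots show both signs for a single feature while the overall direction of effect remains fixed by the coefficient.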
3.14. Comparison Between ML Models and Simple Clinical Scores for Ultrasound-Detected Hepatic Steatosis
When applied according to their original diagnostic rules, both HSI and FLI demonstrated modest discriminative ability and left a substantial proportion of patients in an indeterminate category (Table 4). Using rule-based classification, HSI was applicable to 71.3% of the cohort, excluding 28.7% as indeterminate, and achieved high sensitivity (0.84) but limited specificity (0.42), reflecting its primary role as a rule-out tool. FLI was applicable to 76.6% of patients, excluding 23.4% as indeterminate, and showed more balanced but still moderate performance (sensitivity 0.71, specificity 0.55). The AUROC values for rule-based HSI and FLI were both 0.63. In contrast, the best-performing ML models, logistic regression and gradient boosting, provided continuous risk estimates for 100% of patients, without an indeterminate zone. These models demonstrated higher overall discrimination, with AUROC values of approximately 0.71 for logistic regression and 0.68 for gradient boosting, alongside balanced accuracy values of approximately 0.65–0.66. Unlike rule-based indices, ML models maintained stable performance across repeated cross-validation and allowed flexible threshold selection depending on clinical context. Collectively, these findings indicate that while HSI and FLI retain utility as simple screening tools, ML models offer improved discrimination, complete cohort coverage, and greater flexibility for individualized risk stratification.
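To make the coverage issue concrete, the sketch below applies a dual cut-off rule to a small set of hypothetical scores and evaluates sensitivity and specificity only among classifiable individuals. The cut-offs of 30 and 60 are the commonly cited FLI rule-out and rule-in values and are an assumption here, since the thresholds are not restated in this section. The scores, labels, and the function name `rule_based_summary` are likewise illustrative.

```python
def rule_based_summary(scores, labels, rule_out, rule_in):
    """Apply dual cut-offs and evaluate only the classifiable subset."""
    tp = fp = tn = fn = indeterminate = 0
    for score, has_steatosis in zip(scores, labels):
        if score < rule_out:            # classified as no steatosis
            if has_steatosis:
                fn += 1
            else:
                tn += 1
        elif score >= rule_in:          # classified as steatosis
            if has_steatosis:
                tp += 1
            else:
                fp += 1
        else:                           # gray zone: excluded from metrics
            indeterminate += 1
    n = len(scores)
    return {
        "coverage": (n - indeterminate) / n,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Hypothetical index values and ultrasound labels (1 = steatosis).
scores = [12, 25, 28, 35, 45, 55, 62, 70, 85, 90]
labels = [0, 0, 1, 0, 1, 0, 1, 0, 1, 1]
summary = rule_based_summary(scores, labels, rule_out=30, rule_in=60)
print(summary)
```

In this toy example three of ten individuals fall in the gray zone, so the reported sensitivity and specificity describe only 70% of the sample. A probability-based model, by contrast, is evaluated on every individual.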
4. Discussion
Metabolic dysfunction-associated steatotic liver disease has emerged as the most prevalent chronic liver condition worldwide, paralleling the global rise in obesity, type 2 diabetes, and metabolic syndrome. Early identification of individuals at risk is critical, yet conventional diagnostic modalities such as ultrasonography, CAP, or MRI-PDFF remain costly, operator dependent, and impractical for population-level screening [
19,
20]. In this study, we developed and validated ML models leveraging routine clinical and biochemical parameters to predict ultrasound-detected hepatic steatosis. Among the evaluated algorithms, logistic regression and gradient boosting achieved the highest performance (AUROC of approximately 0.71 and 0.68, respectively, with balanced accuracy of approximately 0.65–0.66), demonstrating modestly improved discrimination compared with rule-based indices such as FLI and HSI in our cohort, while avoiding indeterminate classifications and ensuring complete population coverage [
7,
9,
17,
21]. An important methodological consideration when comparing ML models with established rule-based indices is population coverage. HSI and FLI are designed with rule-in and rule-out thresholds and therefore include an intermediate indeterminate range. Accordingly, their diagnostic performance is evaluated only among individuals classified outside this gray zone, whereas ML models generate probability-based predictions for all patients. This difference may influence the interpretability of direct metric-based comparisons, because exclusion of indeterminate cases alters the evaluable population. Therefore, comparisons between ML models and HSI/FLI should be interpreted not only in terms of discrimination metrics, but also in the context of clinical applicability and cohort coverage across the target population. These findings suggest that the moderate performance observed across models may be improved by incorporating larger and more diverse datasets, integrating imaging-derived or longitudinal features, and performing external validation in independent cohorts. The present model is designed to predict ultrasound-detected hepatic steatosis, which represents an early and central component of the MASLD spectrum, rather than establishing a comprehensive diagnosis of MASLD that requires broader clinical and metabolic evaluation. Although several cardiometabolic and cardiovascular indices were included among the candidate predictors, the present framework should not be interpreted as a model for predicting MASLD progression, fibrosis development, cardiovascular disease onset, or longitudinal cardiovascular risk trajectories. The study was cross-sectional in design, and the supervised learning target was limited to the binary classification of current ultrasound-detected hepatic steatosis. 
Therefore, lipid-, insulin resistance-, and cardiovascular-related markers included in the model should be interpreted as cross-sectional correlates that improve classification of current liver status rather than as predictors of future cardiovascular events or temporal disease progression. Longitudinal cohorts with time-to-event outcomes will be required to determine whether similar ML frameworks can predict longitudinal liver- and cardiovascular-related outcomes.
The comparable performance of Logistic Regression and Gradient Boosting suggests that a substantial proportion of the predictive signal may be captured by additive effects of routinely measured clinical variables, although more complex nonlinear interactions cannot be excluded. The modest yet consistent AUROC observed in the present study is broadly consistent with recent ML approaches for MASLD or steatosis prediction based on routinely available clinical and laboratory data, which similarly demonstrate moderate to good discrimination without reliance on imaging-derived features [
22]. For example, Cubillos et al. showed that a novel deep learning (DL) approach, which converts tabular clinical data into image-like representations, outperformed traditional ML models and the HSI for predicting steatotic liver disease (SLD). Using data from 2999 patients, their best DL model achieved high diagnostic accuracy (AUC = 0.87, sensitivity = 0.95, specificity = 0.64), demonstrating the superior predictive capability of DL-based methods for non-invasive SLD detection [
23]. Lim et al. developed and validated ML-based survival models to predict the time to onset of MASLD in individuals without baseline disease. Using data from over 25,000 Korean participants for model development and 16,000 Chinese participants for independent validation, they trained random survival forest and extra survival tree models based on routine clinical and laboratory variables. Both models demonstrated strong predictive performance, with c-indices around 0.75 in the external cohort. The study showed that ML survival models can accurately estimate individualized risk and timing of MASLD onset, supporting personalized prediction and tailored follow-up strategies in clinical practice [
24]. Our findings extend these observations by demonstrating model reproducibility across multiple algorithms and providing rigorous cross-validation estimates with embedded feature selection, thereby mitigating overfitting and information leakage.
Elastic Net regularization identified a concise set of 20 predictors reflecting both metabolic and hepatic injury pathways. Anthropometric indices (weight, height, Ponderal index, and A Body Shape Index) captured central and overall adiposity, consistent with the pivotal role of excess fat mass in hepatic lipid deposition [
25,
26]. Derived composite indices such as VAI, TyG-BMI, and TyG-WC integrate dyslipidemia and insulin resistance and are increasingly recognized as robust surrogates for visceral adiposity and hepatic steatosis [
27,
28].
Liver injury markers including ALT, AST, and fibrosis surrogates (FIB-4, APRI, NAFLD Fibrosis Score) contributed substantially, underscoring the continuum between metabolic steatosis and early fibrotic remodeling [
29,
30,
31]. Renal and nitrogenous markers (BUN, Cr, BUN/Cr ratio, UA/Cr ratio) emerged as informative correlates, an observation supported by emerging data linking hyperuricemia and renal–hepatic axis dysfunction to steatosis and metabolic syndrome progression [
32,
33]. The inclusion of Castelli II index (LDL/HDL ratio) and albumin further reflects the interplay between lipid transport, synthetic function, and systemic inflammation [
34,
35].
Our model’s discrimination is broadly consistent with prior ML frameworks for the prediction of hepatic steatosis within the MASLD spectrum using routine data. Xiao et al. developed and validated five ML models, logistic regression, random forest, XGBoost, gradient boosting, and SVM, to predict MASLD using clinical and biochemical variables. In a cohort of 578 ultrasound-evaluated participants and an external MRI-based validation set (n = 131), key predictors included VAI, abdominal circumference, BMI, ALT, ALT/AST ratio, age, HDL-C, and triglycerides. Among the models, XGBoost achieved the highest predictive accuracy (AUC = 0.94 after tuning), outperforming others. The authors concluded that XGBoost offers a reliable, noninvasive tool for early identification of high-risk NAFLD patients in clinical settings [
36]. Verschuren et al. developed a mechanism-based, non-invasive biomarker panel to detect fibrosis in MASLD [
37]. Using a translational approach that integrated findings from a diet-induced MASLD mouse model with human liver transcriptomics and serum proteomics, they identified three key biomarkers: Insulin-Like Growth Factor Binding Protein 7 (IGFBP7), Scavenger Receptor Cysteine-Rich Type 5 Domain-Containing Protein (SSc5D), and Semaphorin 4D (Sema4D). When modeled using light gradient boosting machine (LightGBM), this panel accurately predicted fibrosis stages (AUCs = 0.82 for F0/F1, 0.89 for F2, 0.87 for F3/F4), outperforming established markers such as FIB-4, APRI, and FibroScan. The findings demonstrate that this three-protein blood-based panel can reliably identify both mild and advanced MASLD fibrosis, offering a promising non-invasive diagnostic tool. Although emerging serum, genetic, and omics-based biomarkers have demonstrated considerable potential for improving the detection, staging, and prognostic stratification of MASLD, their current clinical utility remains limited. Many of these biomarkers are associated with high costs, lack standardized analytical assays or universally accepted cut-off values, and are not routinely available in primary care or general clinical settings. In contrast, the present study deliberately prioritized real-world clinical applicability by exclusively utilizing routinely collected clinical, anthropometric, and biochemical parameters that are already embedded in standard care pathways. This design choice represents a pragmatic and implementation-oriented strategy rather than a restriction in conceptual scope. Importantly, future incorporation of well-validated biomarkers into this framework may further enhance predictive performance, particularly with respect to disease staging and progression, while preserving the scalability, accessibility, and clinical feasibility of the proposed models. 
This level of discrimination should therefore be interpreted in the context of the intentionally pragmatic design of the present study, which relied exclusively on routinely available clinical and biochemical parameters and applied leakage-free validation procedures, both of which tend to yield more conservative but clinically realistic performance estimates [
38]. Importantly, the strict cross-validation strategy and embedded preprocessing design enhance model robustness and reduce optimistic bias, a limitation frequently observed in earlier single-split studies [
26]. In the present study, model performance was primarily summarized as mean ± standard deviation across the 25 cross-validation resamples in order to reflect variability across repeated training–validation partitions. Because repeated cross-validation folds are not fully independent observations, these summaries should be interpreted as descriptive indicators of model stability rather than as formal measures of sampling uncertainty. For completeness, 95% confidence intervals were also reported in
Table 3; however, these should be interpreted cautiously in the context of repeated resampling, where dependence between folds may limit the strict inferential meaning of conventional interval estimates.
In this context, reporting mean performance together with the standard deviation across resamples is widely used to characterize model robustness and variability across resampling iterations. Within this framework, the combined presentation of mean ± standard deviation and confidence intervals provides complementary descriptive information, while the overall interpretation should remain focused on consistency of performance across resampling iterations rather than on formal hypothesis-driven inference.
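The caution regarding interval estimates can be illustrated numerically. The sketch below summarizes 25 hypothetical AUROC values as mean ± standard deviation, then compares a naive t-based 95% interval, which treats resamples as independent, with the variance correction for resampled estimates proposed by Nadeau and Bengio. The simulated scores, the assumed 5-fold by 5-repeat layout (test-to-train ratio of 0.25), and the choice of correction are assumptions for illustration and do not describe the procedure used to obtain the intervals in Table 3.

```python
import math
import random
import statistics

# Hypothetical per-resample AUROC values (simulated, not study results).
rng = random.Random(0)
auroc = [rng.gauss(0.71, 0.04) for _ in range(25)]

k = len(auroc)
mean_auroc = statistics.mean(auroc)
sd_auroc = statistics.stdev(auroc)  # sample SD across resamples
t_crit = 2.064                      # two-sided 95% t quantile, 24 df

# Naive interval: assumes the 25 resamples are independent.
naive_half_width = t_crit * sd_auroc / math.sqrt(k)

# Corrected interval: inflates the variance by the test-to-train ratio
# to account for overlapping training sets (assumed 5-fold: 0.2 / 0.8).
test_to_train = 0.25
corrected_half_width = t_crit * sd_auroc * math.sqrt(1.0 / k + test_to_train)

print(f"mean ± SD: {mean_auroc:.3f} ± {sd_auroc:.3f}")
print(f"naive 95% half-width:     {naive_half_width:.3f}")
print(f"corrected 95% half-width: {corrected_half_width:.3f}")
```

Under these assumptions the corrected interval is roughly 2.7 times wider than the naive one, which is consistent with treating the reported intervals as descriptive rather than strictly inferential.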
Interpretability remains central to the translation of ML tools into clinical workflows. SHAP analysis confirmed that model decisions were biologically plausible and aligned with established MASLD pathophysiology [
39]. Higher weight, VAI, TyG-BMI, and FIB-4 consistently increased predicted risk, while higher albumin and lower Cr were protective. The concordance between Elastic Net coefficients and SHAP attributions reinforces model transparency and trustworthiness. Such interpretability facilitates clinician acceptance and supports integration into electronic health record (EHR)-based decision support systems, allowing automated, low-cost pre-screening for individuals warranting confirmatory imaging or lifestyle intervention.
From a public health perspective, scalable ML models based solely on routine clinical and biochemical data offer a pragmatic path toward early identification of hepatic steatosis within the MASLD spectrum in primary care. They can augment conventional scores by incorporating complex, multidimensional patterns without requiring novel biomarkers or imaging. Future work should focus on external validation across diverse ethnic groups and healthcare systems, incorporation of longitudinal data to predict disease progression such as fibrosis development, and integration of genetic and metabolomic predictors for enhanced precision [
13,
40].
While multiple prior studies have applied explainable ML models to this problem, the novelty of the present work lies not in introducing a new algorithm but in providing a methodologically rigorous and clinically deployable modeling framework for ultrasound-detected steatosis within the MASLD spectrum. In contrast to many earlier reports that rely on single train-test splits or post hoc feature interpretation, our approach integrates leakage-free preprocessing, embedded Elastic Net-based feature selection, repeated stratified cross-validation, and multi-model benchmarking under a unified pipeline. This design minimizes optimistic bias and enables fair comparison across distinct model families, prioritizing methodological rigor, interpretability, and real-world usability. Importantly, the primary objective of this study was not algorithmic innovation, but the construction of robust, transparent, and scalable tools that can operate on routinely collected clinical and biochemical data in everyday hepatology and metabolic practice. Although deep learning architectures such as convolutional neural networks or transformer-based models may achieve higher classification performance when applied to raw imaging or high-dimensional data, such approaches typically require large annotated datasets and specialized computational infrastructure, and they often provide limited interpretability. In contrast, the present study deliberately focused on routinely collected parameters to develop a transparent screening framework that is scalable, accessible, and immediately applicable in primary care and metabolic clinics where advanced imaging data or deep learning infrastructure are not consistently available.
Furthermore, the study emphasizes clinical realism and interpretability, leveraging exclusively routinely collected variables and demonstrating that transparent models such as logistic regression and gradient boosting can achieve competitive performance when developed under robust validation protocols. The concordance between Elastic Net rankings and SHAP attributions reinforces biological plausibility and supports clinical trust, an aspect often underexplored in prior work focused primarily on performance metrics. Although direct benchmarking against publicly available datasets would be informative, heterogeneity in outcome definitions, population characteristics, and available predictors across MASLD datasets limits the validity of direct comparisons. Future studies should focus on harmonized external validation and benchmarking across multiple open datasets to further evaluate generalizability and comparative performance.
In this study, we compared interpretable ML models with established rule-based steatosis indices under conditions that preserve the original intent of each method. Rather than recalibrating or optimizing FLI and HSI within the study cohort, we applied their predefined rule-in and rule-out thresholds, thereby reflecting real-world clinical use. Our findings demonstrate that HSI and FLI, when used as designed, achieve modest discrimination (AUROC = 0.63) and leave approximately one quarter of patients in an indeterminate category. This limitation is inherent to their rule-based structure and reflects a deliberate trade-off between diagnostic certainty and coverage. HSI prioritizes sensitivity and exclusion of disease, whereas FLI offers a more balanced profile but still lacks comprehensive applicability. By contrast, the proposed ML models provide continuous, probabilistic risk estimates for all individuals, avoiding indeterminate classifications altogether. Although the absolute improvement in AUROC is modest, this gain becomes clinically meaningful when combined with full population coverage, balanced sensitivity–specificity trade-offs, and transparent model behavior. This advantage is particularly relevant in screening settings, where probabilistic risk estimation may offer practical benefits over rule-based scores that leave a substantial proportion of individuals unclassified. The interpretability of the ML models, supported by Elastic Net feature selection and SHAP analyses, further demonstrates that predictions are driven by biologically plausible metabolic and hepatic injury pathways, addressing a common barrier to clinical adoption of ML-based tools. Taken together, these results suggest that ML models should be viewed not as replacements for simple indices, but as complementary decision-support tools that can augment existing screening strategies. 
In settings where indeterminate results from rule-based scores necessitate further testing or imaging, ML-based risk stratification may help guide prioritization, reduce uncertainty, and support screening for ultrasound-detected hepatic steatosis within routine clinical workflows. It should also be noted that rule-based indices such as HSI and FLI inherently include indeterminate zones; consequently, their performance metrics are calculated only among classifiable individuals. In contrast, ML models generate continuous risk estimates for all patients. Accordingly, the comparison should be interpreted primarily in terms of clinical applicability and population coverage rather than as a purely head-to-head statistical equivalence.
Consistent with this observation, the comparative analyses suggest that the primary performance gains were not attributable to a specific classifier alone, but rather to the combined effect of leakage-free preprocessing and targeted feature reduction. In particular, the Elastic Net-derived feature panel achieved similar discrimination to the full feature set while improving model stability and interpretability.
In the present study, preprocessing and feature selection were integrated within a single leakage-free modeling pipeline. All imputation, scaling, encoding, and Elastic Net-based selection steps were embedded within the cross-validation loop, ensuring that performance estimates reflected unbiased model development procedures and the combined effects of data structure, preprocessing, feature regularization, and classifier choice. This design minimizes information leakage and provides a more reliable estimate of model performance compared with simplified preprocessing strategies. Simplified ablation analyses were not pursued because they could produce methodologically invalid comparisons or clinically implausible data representations.
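A minimal sketch of such a leakage-free design in scikit-learn is shown below. It is not the study code: the synthetic data, the hyperparameters, the median imputation strategy, the 5-fold by 5-repeat layout, and the use of `SelectFromModel` to retain 20 predictors are all assumptions made for illustration. Encoding of categorical variables is omitted because the synthetic predictors are numeric.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the clinical table, with about 5% missing values.
X, y = make_classification(n_samples=400, n_features=40,
                           n_informative=8, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.05] = np.nan

# Elastic Net-penalized logistic regression used only to rank predictors.
elastic_net = LogisticRegression(penalty="elasticnet", solver="saga",
                                 l1_ratio=0.5, C=0.5, max_iter=5000,
                                 random_state=0)

# Every step is refit on the training portion of each resample, so no
# information from the held-out fold reaches imputation, scaling or selection.
pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("select", SelectFromModel(elastic_net, max_features=20,
                               threshold=-np.inf)),  # keep the top 20 by |coef|
    ("clf", LogisticRegression(max_iter=1000)),
])

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=5, random_state=0)
scores = cross_validate(pipeline, X, y, cv=cv,
                        scoring={"auroc": "roc_auc",
                                 "bal_acc": "balanced_accuracy"})

for metric in ("auroc", "bal_acc"):
    values = scores[f"test_{metric}"]
    print(f"{metric}: {values.mean():.3f} ± {values.std(ddof=1):.3f}")
```

Because the selector sits inside the pipeline, the retained predictors may differ between resamples. This is the expected behavior of embedded selection and the reason a fold-wise selection-frequency analysis is informative.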
A limitation of this study is the use of ultrasonography as the reference standard for hepatic steatosis. While ultrasonography is widely available and routinely used in clinical practice, its sensitivity for detecting mild steatosis is limited, particularly near the ≥5% threshold used in the current MASLD definition. As a result, some individuals with early steatosis may have been misclassified; however, such misclassification is expected to be largely non-differential and therefore more likely to attenuate, rather than inflate, model performance. This limitation may have particularly affected the distinction between mild steatosis and the absence of disease. Accordingly, the observed moderate yet consistent AUROC values should be interpreted as conservative estimates of the true predictive capability of the proposed models. Although the models demonstrated stable performance within the internal validation procedures, external validation in independent cohorts will be necessary to confirm generalizability across different populations and clinical settings. Importantly, this study was not intended to provide a definitive diagnostic alternative to advanced imaging modalities such as CAP or MRI-PDFF, but rather to develop a pragmatic and scalable pre-screening and risk stratification framework applicable in primary care and routine clinical settings to identify individuals who may benefit from confirmatory imaging or early preventive interventions. Accordingly, the proposed models should be interpreted as tools for prioritizing individuals for confirmatory imaging rather than as substitutes for ultrasonographic diagnosis.
Although deep learning architectures such as CNN- or transformer-based models may provide additional modeling flexibility, the present study focused on interpretable ML approaches using structured clinical data. Future studies with larger datasets may further explore deep learning methods to evaluate potential performance improvements.
Another consideration relates to the potential influence of medication use and temporal biological variability on model predictors. In the present cohort, glucose-lowering agents and statins were more frequently used among individuals with MASLD, reflecting the higher burden of metabolic comorbidities in this group. Such treatments, together with intra-individual fluctuations and transient metabolic states, may affect several biochemical parameters included in the model, including glucose levels, lipid profiles, and liver enzymes. Consequently, some predictors may partially reflect dynamic treatment-related or short-term metabolic conditions in addition to underlying disease biology. However, the study was intentionally designed to reflect real-world clinical conditions, where patients are commonly evaluated while receiving treatment for metabolic disorders. In this context, these factors represent part of the observable clinical phenotype rather than an artificial confounder to be fully removed. Nonetheless, future studies incorporating longitudinal measurements and treatment-stratified analyses are warranted to further clarify the relative contributions of stable hepatic pathology, metabolic variability, and therapeutic effects to model predictions.
This study was designed to maximize analytical repeatability and methodological transparency. All preprocessing, feature selection, and model training steps were implemented within a single, deterministic scikit-learn pipeline with fixed random seeds, ensuring that identical results can be reproduced when the same data and code are used. The use of repeated stratified cross-validation, standardized preprocessing procedures, and widely adopted open-source libraries further enhances repeatability and reduces susceptibility to implementation-specific variation.
Reproducibility across independent cohorts is supported by the exclusive use of routinely collected clinical, anthropometric, and biochemical variables, which are commonly available in electronic health records across healthcare systems. Although the underlying patient-level dataset cannot be publicly released due to ethical and institutional restrictions, the modeling workflow relies entirely on transparent, well-documented algorithms, allowing independent investigators to reproduce the analytical framework using their own datasets. To facilitate reproducibility, the analysis code and pipeline specifications can be made available upon reasonable request, enabling verification of results and extension of the methodology in external populations.
Future studies may further quantify the incremental contribution of individual preprocessing components through formal ablation and sensitivity analyses in independent external datasets.