1. Introduction
Non-alcoholic fatty liver disease (NAFLD) is the most prevalent chronic liver condition globally, affecting approximately 30% of the general population [
1]. Once considered a benign hepatic manifestation of obesity, NAFLD is now recognized as a multisystem disorder closely intertwined with insulin resistance, type 2 diabetes, dyslipidemia, and cardiovascular disease [
2,
3,
4]. Its clinical spectrum ranges from simple steatosis to non-alcoholic steatohepatitis (NASH), fibrosis, cirrhosis, and even hepatocellular carcinoma [
5]. Given its strong association with metabolic dysfunction, an international consensus panel recently proposed redefining NAFLD as metabolic dysfunction-associated fatty liver disease (MAFLD) [
6]. This paradigm shift emphasizes positive diagnostic criteria based on the presence of metabolic risk factors rather than the exclusion of other liver diseases, thereby improving clinical relevance and inclusivity [
7].
Epidemiological studies have consistently identified key risk factors for fatty liver, including central obesity, hypertriglyceridemia, low high-density lipoprotein cholesterol (HDL-C), elevated fasting plasma glucose, hypertension, sedentary behavior, and genetic predisposition [
8,
9,
10]. However, a substantial proportion of existing research treats NAFLD as a binary outcome—either present or absent—typically diagnosed via imaging or biomarkers with predefined cutoffs. While this approach facilitates case-control comparisons, it discards valuable information about the degree of hepatic fat accumulation and limits the ability to detect dose–response relationships or subtle metabolic gradients [
11].
To address this limitation, the fatty liver index (FLI) was developed as a continuous, noninvasive surrogate for hepatic steatosis, combining body mass index (BMI), waist circumference, serum triglycerides, and γ-glutamyltransferase (γ-GT) into a single score ranging from 0 to 100 [
12]. FLI has been validated against ultrasound and magnetic resonance imaging in multiple populations and demonstrates strong predictive performance for incident type 2 diabetes and cardiovascular events [
13,
14]. By modeling FLI as a continuous outcome, researchers can uncover more nuanced associations between metabolic, inflammatory, and lifestyle variables and the severity of fatty liver, even in ostensibly healthy individuals.
In parallel, advances in machine learning have opened new avenues for modeling complex biomedical relationships. Among these, multivariate adaptive regression splines (MARS) offer a unique balance between flexibility and interpretability [
15]. Unlike “black-box” algorithms such as deep neural networks, MARS constructs a piecewise linear model using hinge functions that automatically detect nonlinearities and interactions without requiring prespecified functional forms [
16]. The resulting model can be expressed as a transparent mathematical equation—making it particularly suitable for clinical translation and hypothesis generation. MARS has demonstrated superior predictive accuracy over traditional multiple linear regression (MLR) in diverse domains, including livestock weight estimation [
17] and cardiovascular risk prediction [
18], yet its application in hepatology remains scarce.
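To make the hinge-function idea concrete, the following minimal Python sketch evaluates a piecewise linear model built from one pair of mirrored hinges. The function names, the knot, and the coefficients are hypothetical values chosen purely for illustration; they are not estimates from this study, and the sketch is not the fitting procedure itself.

```python
def hinge(x, knot, direction=1):
    """Return max(0, x - knot) for direction=+1, or max(0, knot - x) for direction=-1."""
    return max(0.0, direction * (x - knot))


def toy_mars_prediction(x, intercept=10.0, knot=5.0,
                        slope_above=2.0, slope_below=-0.5):
    """Evaluate intercept + slope_above*max(0, x-knot) + slope_below*max(0, knot-x).

    All default parameters are hypothetical illustration values.
    """
    return (intercept
            + slope_above * hinge(x, knot, +1)
            + slope_below * hinge(x, knot, -1))


if __name__ == "__main__":
    # The slope changes at the knot, but the fitted line stays continuous.
    for value in (3.0, 5.0, 8.0):
        print(value, toy_mars_prediction(value))
```

Because each hinge is zero on one side of its knot, every term contributes only over part of the predictor's range, which is what lets the final equation be read term by term.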
Notably, because BMI, waist circumference, triglycerides, and γ-GT are already embedded in the FLI formula, including them as predictors would introduce circularity and inflate model performance. Therefore, in this study, we deliberately excluded these four variables to explore independent determinants of FLI—such as inflammatory markers (e.g., C-reactive protein, white blood cell count), liver enzymes, renal function, and lifestyle factors—in a cohort of healthy young Taiwanese men aged 20–50 years who were free of medications for metabolic conditions. This population is of particular interest because early metabolic perturbations may manifest before the onset of overt obesity or diabetes, offering a window into the initial drivers of fatty liver development.
The principal goals of this research were:
- (1)
To compare the predictive accuracy of MARS against traditional MLR in estimating FLI, using robust error metrics;
- (2)
To derive an interpretable MARS-based equation that ranks the relative importance of non-FLI variables in predicting hepatic steatosis risk in this understudied demographic.
While the primary aim of this study is predictive—to develop an accurate model for estimating FLI—the use of MARS is specifically intended to provide interpretable, mechanistic insight into the factors driving hepatic steatosis risk, beyond the components of the FLI itself. By modeling FLI as a continuous variable, this approach enables fine-grained risk stratification, identifying individuals across a spectrum of risk rather than a binary NAFLD classification. The resulting equation is designed to be practically implementable using common clinical variables, potentially aiding in early screening and personalized preventive strategies in primary care or health check-up settings by highlighting modifiable risk factors such as inflammation, dyslipidemia, and hyperglycemia.
2. Materials and Methods
2.1. Participants and Study Design
The data used in this investigation were derived from the Taiwan MJ Cohort, an established, ongoing prospective cohort encompassing health examinations performed by the MJ Health Screening Centers in Taiwan. This comprehensive dataset includes over 100 essential biological indicators, including anthropometric measurements, blood biochemical analyses, and imaging results. Participants also provided information on personal and family medical history, current health status, lifestyle, physical exercise, sleep patterns, and dietary habits via a self-administered questionnaire.
Study data were acquired from the MJ Clinic Database, maintained by the MJ Health Research Foundation. A general consent for future anonymous research was obtained from participants at the time of their original health check-up. The use of this data was authorized by the MJ Health Research Foundation (Authorization No.: MJHRF2024002A). As this is a secondary database analysis not involving new sample collection, a project-specific consent form was not required. Detailed procedures for the initial data collection are available in the annual technical report published by the MJ Health Research Foundation [
19].
This study was reviewed and approved by the Institutional Review Board of the Tri-Service General Hospital (IRB No.: A202405006), receiving an expedited review due to its nature as a secondary analysis.
The initial enrollment for the cohort included 1,498,312 subjects. Following the application of our predefined exclusion criteria, 5496 male subjects were retained for the final analysis (as detailed in the flow chart,
Figure 1).
Inclusion Criteria:
Men between 20 and 50 years old
No history of significant medical diseases such as stroke, myocardial infarction, or heart failure
No medications for metabolic syndrome
No alcohol consumption
Subjects aged from 20 to 50 years were selected to capture individuals in the preclinical and early metabolic dysfunction phase, prior to the development of overt cardiometabolic diseases [
20,
21]. This design facilitates the study of incipient risk factors for fatty liver. The sample was restricted to men to avoid the substantial confounding effects of female sex hormones (e.g., estrogen) on liver fat accumulation, lipid profiles, and inflammatory markers—factors that differ by menopausal status, contraceptive use, and menstrual cycle phase [
22,
23]. This approach ensures cohort homogeneity and enhances model interpretability by removing these complex, sex-specific variables.
2.2. Measurements and Biochemical Analysis
On the day of the health examination, trained personnel, typically a senior nurse, documented participants’ personal history details, including current and past habits such as tobacco, alcohol, and betel nut consumption, as well as their education level. Body weight (kg) was accurately recorded using a calibrated electronic scale. Both systolic blood pressure (SBP) and diastolic blood pressure (DBP) were measured using a standardized electric sphygmomanometer.
Blood samples were collected following a minimum 10 h fasting period. The plasma was promptly separated from the whole blood within one hour of collection and subsequently stored at degrees Celsius until lipid profile testing. Lipid profile analysis was performed as follows: Total cholesterol (TC) and triglyceride (TG) concentrations were determined using a dry, multi-layer analytical slide method with the Fuji Dri-Chem 3000 analyzer (Fuji Photo Film, Tokyo, Japan). High-density lipoprotein cholesterol (HDL-C) and low-density lipoprotein cholesterol (LDL-C) concentrations were quantified using an enzymatic cholesterol assay method subsequent to dextran sulfate precipitation. Further details on the methodology and standardized procedures may be found in our previous related work [
24].
2.3. Traditional Statistics
Data are presented as the means ± standard deviations. To evaluate differences in continuous variables between groups, specific statistical tests were used based on the nature of the compared variables:
T-tests were used to assess the difference in means between two independent groups, specifically between married and unmarried participants.
Analysis of Variance (ANOVA) was applied when comparing differences across groups categorized by ordinal variables, such as education and income levels.
Pearson correlation coefficient was calculated to analyze the linear relationships between all continuous variables and the primary outcome measure, the FLI.
Furthermore, multiple linear regression (MLR) was performed to serve as a benchmark against which the performance of the MARS model was compared. All statistical tests were two-sided, and p < 0.05 was considered statistically significant. All data analyses were executed using SPSS 10.0 for Windows (SPSS, Chicago, IL, USA).
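As an illustration of the correlation step, a minimal Python sketch of the Pearson coefficient is given here. It is not the SPSS routine used in the study; the function name and the input lists are assumptions made for the example.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length of at least 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sums of cross-products and squared deviations around the means.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)
```

In practice, `x` would hold one continuous predictor and `y` the corresponding FLI values.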
2.4. Description of the Study Data
Table 1 defines the 30 clinical variables used in this study. We gathered the following independent variables from our study participants: white blood cell (WBC) count, hemoglobin level, platelet count, total bilirubin (TBIL), total protein, albumin, globulin, aspartate aminotransferase (AST), alanine aminotransferase (ALT), gamma-glutamyltransferase (γ-GT), lactate dehydrogenase (LDH), creatinine, uric acid, TG, HDL-C, LDL-C, thyroid-stimulating hormone (TSH) level, C-reactive protein (CRP) level, educational level, marital status, sleep time, SBP, and DBP. Sleep time was an ordinal variable, as shown in
Table 1. Finally, the equation of FLI is $\mathrm{FLI} = \dfrac{e^{y}}{1 + e^{y}} \times 100$, where $y = 0.953 \times \ln(\text{triglycerides, mg/dL}) + 0.139 \times \text{BMI (kg/m}^2\text{)} + 0.718 \times \ln(\gamma\text{-GT, U/L}) + 0.053 \times \text{waist circumference (cm)} - 15.745$ [
25]. Because BMI, waist circumference, γ-GT, and triglycerides are used to calculate the FLI, they were excluded from the predictors entered into the MARS model.
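The FLI calculation can be sketched in a few lines of Python. The coefficients are those given in the equation above; the function and argument names are assumptions introduced for this example.

```python
import math


def fatty_liver_index(tg_mg_dl, bmi_kg_m2, ggt_u_l, waist_cm):
    """Compute the fatty liver index (0-100) from its four components.

    tg_mg_dl: triglycerides in mg/dL; bmi_kg_m2: BMI in kg/m^2;
    ggt_u_l: gamma-GT in U/L; waist_cm: waist circumference in cm.
    """
    y = (0.953 * math.log(tg_mg_dl)
         + 0.139 * bmi_kg_m2
         + 0.718 * math.log(ggt_u_l)
         + 0.053 * waist_cm
         - 15.745)
    # Logistic transform, rescaled to the 0-100 range.
    return math.exp(y) / (1.0 + math.exp(y)) * 100.0


if __name__ == "__main__":
    # Hypothetical input values, for illustration only.
    print(round(fatty_liver_index(150.0, 25.0, 30.0, 90.0), 1))
```

Because the score is a logistic transform of a linear predictor, it increases monotonically with each of the four components.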
2.5. Machine Learning Analysis: MARS and Model Evaluation
The dataset was analyzed using MARS, a non-parametric regression technique that builds flexible models and is well suited to data with many candidate predictors. The method uses an expansion based on product spline basis functions. Crucially, the model is built autonomously through data-driven mechanisms [
26]; this includes determining the number of basis functions, the attributes associated with each function (e.g., product degree), and the placement of knots. This strategy is inspired by recursive partitioning principles, similar to methods like Classification and Regression Trees (CART), allowing MARS to effectively capture complex higher-order interactions.
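For reference, the expansion described above is commonly written in the following generic form. The notation is the standard textbook one and is an assumption here, not taken from this study: $M$ is the number of basis functions, $K_m$ the product degree of the $m$-th basis function, $s_{km} = \pm 1$ the hinge direction, $v(k,m)$ the index of the predictor involved, and $t_{km}$ the knot location.

```latex
\hat{f}(x) = \beta_0 + \sum_{m=1}^{M} \beta_m B_m(x),
\qquad
B_m(x) = \prod_{k=1}^{K_m} \max\bigl(0,\; s_{km}\,\bigl(x_{v(k,m)} - t_{km}\bigr)\bigr)
```

A basis function with $K_m = 1$ is a single hinge (a main effect), whereas $K_m \geq 2$ represents an interaction between predictors.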
2.5.1. Model Training and Validation
For the analysis, the dataset was initially divided into two segments: an 80% training set used for model construction and a separate 20% testing set designated for final model assessment.
During the training phase, MARS models require the tuning of specific hyperparameters to ensure optimal performance. To achieve this, the 80% training dataset was further divided into two random segments: one for model formulation using a distinct set of hyperparameters, and the other for validation. A comprehensive grid search approach was implemented, systematically exploring all possible combinations of hyperparameters to identify the best configuration.
To establish a comparative context, the averaged performance metrics derived from the tuned MARS model were used to contrast its performance with that of the MLR model, which served as the benchmark. Both the MARS models and the MLR model were trained and tested using the exact same data partitions.
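The partitioning and grid-search logic described above can be sketched as follows. This is a Python illustration under stated assumptions, not the R workflow used in the study: the 80% inner split fraction, the random seeds, and the hyperparameter names `degree` and `nprune` are assumptions, and `fit_and_score` is a hypothetical user-supplied callable that fits a model for one hyperparameter combination and returns its validation error.

```python
import itertools
import random


def split_indices(n, first_fraction, seed):
    """Shuffle the row indices 0..n-1 and split them into two disjoint lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    cut = int(round(n * first_fraction))
    return indices[:cut], indices[cut:]


def grid_search(grid, fit_and_score):
    """Try every combination in `grid` and return (best_params, best_score).

    grid: dict mapping a hyperparameter name to a list of candidate values.
    fit_and_score: callable taking a params dict and returning a validation
    error where lower is better (for example, validation RMSE).
    """
    names = sorted(grid)
    best_params, best_score = None, float("inf")
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = fit_and_score(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


if __name__ == "__main__":
    # Outer 80/20 split into training and testing rows.
    train_idx, test_idx = split_indices(5496, 0.8, seed=1)
    # Inner split of the training rows (positions within train_idx);
    # the 0.8 fraction is an assumption, not reported in the text.
    fit_pos, val_pos = split_indices(len(train_idx), 0.8, seed=2)
    print(len(train_idx), len(test_idx), len(fit_pos), len(val_pos))
```

Keeping the testing rows out of both the fitting and validation subsets is what allows the final comparison with MLR to be made on data that played no part in hyperparameter selection.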
2.5.2. Model Evaluation: Performance Metrics
The predictive effectiveness of the MARS model was evaluated using the 20% testing dataset. Since the target variable in this study is a continuous, numerical parameter, the chosen evaluation metrics to compare model performance included:
Symmetric Mean Absolute Percentage Error (SMAPE)
Root Relative Squared Error (RRSE)
Root Mean Squared Error (RMSE)
SMAPE was calculated as: SMAPE = (100%/n) × Σ [|Actual − Predicted|/((|Actual| + |Predicted|)/2)], where n is the number of observations. This metric was selected for its robustness when dealing with values near zero, as it avoids the denominator instability of standard MAPE.
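The three metrics can be sketched in Python as shown here. SMAPE follows the formula stated above; RMSE and RRSE follow their standard definitions, which are assumed here because the text does not spell them out. The function names are illustrative.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)


def rrse(actual, predicted):
    """Root relative squared error: model error relative to predicting the mean.

    Undefined (division by zero) if all actual values are identical.
    """
    mean_actual = sum(actual) / len(actual)
    numerator = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    denominator = sum((a - mean_actual) ** 2 for a in actual)
    return math.sqrt(numerator / denominator)


def smape(actual, predicted):
    """Symmetric mean absolute percentage error, in percent.

    Undefined (division by zero) if an actual value and its prediction are both 0.
    """
    n = len(actual)
    total = sum(abs(a - p) / ((abs(a) + abs(p)) / 2.0)
                for a, p in zip(actual, predicted))
    return 100.0 * total / n
```

An RRSE below 1 indicates that the model predicts better than simply using the mean of the observed values, which makes it a convenient scale-free complement to RMSE.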
The model configuration that exhibited the lowest Root Mean Squared Error (RMSE) when applied to the validation dataset was selected as the optimal MARS model. This optimal MARS model was then compared against the benchmark MLR model using the testing dataset. The specific values for these metrics can be found in
Table 2.
All methods were performed using R software version 4.0.5 and RStudio version 1.1.453 with the required packages installed [
27,
28]. MARS was implemented using the “earth” R package version 5.3.3 [
29] together with the “caret” R package version 6.0–94 [
30]. The MLR was implemented using the “stats” R package version 4.0.5, and the default settings were used to construct the models.
4. Discussion
Using MARS, we built an equation to estimate FLI and identified the most important features related to FLI in healthy young Taiwanese men. To our knowledge, this work represents the first use of MARS in this field [
31,
32,
33,
34] and presents the following novel contributions: (1) using MARS to build an equation; (2) using FLI as a continuous outcome rather than binary data (i.e., NAFLD present or absent); (3) determining the relative importance of the features from the coefficients in the equation; and (4) focusing on healthy young men not taking medications that might otherwise have affected the independent variables. We thus consider our findings to be reliable.
Previous studies largely used the presence or absence of NAFLD as a categorical dependent variable, with results presented as the area under the receiver operating characteristic curve or as odds ratios. In contrast, the present study used FLI and machine learning methods. Because the FLI equation uses TG, γ-GT, waist circumference, and BMI, we excluded these four variables from the machine learning model to better reveal the deeper pathophysiology of NAFLD without the confounding effects of body weight.
Based on the coefficients, we discuss the equation features below in order of descending significance.
CRP was the most important factor. Elevated CRP levels, a marker of systemic inflammation, are consistently associated with the presence and progression of NAFLD. Multiple studies have found that higher CRP levels correlate with liver fat accumulation, disease severity, and the risk of developing NAFLD, even after adjusting for confounding factors such as obesity [
35,
36]. Foroughi et al. also found that CRP is related to the severity of NAFLD [
37]. The underlying pathophysiology of this relationship could be explained by chronic low-grade inflammation driven by cytokines such as interleukin-6, which stimulate CRP production in the liver and visceral fat [
37]. CRP is also known to upregulate nuclear factor κ-light-chain-enhancer of activated B cells (NF-κB) signaling, a central driver of inflammation, which promotes the release of cytokines [
38]. Our result further supports these hypotheses, confirming the role of CRP in NAFLD.
Although CRP and WBC count are both markers of inflammation, they differ fundamentally. An increased WBC count is a response to infection, injury, or other inflammatory stimuli, reflecting the body’s mobilization of immune defenses [
39]. In contrast, CRP participates in the immune response by activating the complement pathway, enhancing phagocytosis, and modulating cytokine production [
40]. These differences are consistent with our results and suggest that CRP and WBC count have independent associations with FLI.
Uric acid (UA) was the third most important factor contributing to FLI. In a prospective study, Xu et al. demonstrated that higher UA levels were a risk factor for NAFLD in 6890 subjects followed for 3 years [
41]. However, the authors treated NAFLD as a binary variable, so their results cannot show graded relationships with the degree of steatosis. Other studies also support our finding [
42,
43]. The mechanisms behind this relationship are hypothesized to involve (1) lipid metabolism dysregulation, (2) oxidative stress, and (3) fructose metabolism [
44,
45], but a detailed discussion of these proposed mechanisms is beyond the scope of the present study.
The fourth most important factor was HDL-C, which was negatively associated with FLI, suggesting that higher HDL-C might have a protective role in NAFLD. Xuan et al. pointed out that this relationship could be explained by reverse cholesterol transport, which removes cholesterol from the liver [
46], a finding supported by other studies [
47,
48].
The next key variable is ALT. Of note, AST is also in the equation, which indicates that these two enzymes have independent impacts on FLI. Their differences are clearly explained in
Table 8 [
49,
50] and are compatible with our equation.
Other than ALT (coefficient = 0.699), the following variables had coefficients below 0.5, compared with 35.3 for CRP.
- 1.
Age:
NAFLD prevalence and risk factors are age-dependent, increasing with age in women (especially around menopause), peaking at middle age in men, and tending to decline in very old age [
51,
52]. NAFLD is common in the elderly and tends to have a more severe course in older adults, with higher risks of complications like non-alcoholic steatohepatitis (NASH), cirrhosis, hepatocellular carcinoma, and cardiovascular disease.
- 2.
FPG:
The relationship between FPG and NAFLD is characterized by a positive, independent, and nonlinear association [
53]. The underlying mechanism is that elevated FPG reflects impaired glucose metabolism and insulin resistance, which promote hepatic fat accumulation [
54].
- 3.
SBP:
Similar to age and FPG, the relationship between SBP and NAFLD is bidirectional and involves complex metabolic and inflammatory mechanisms. Zhang et al. and Maeda et al. suggest that NAFLD is associated with higher SBP, DBP, and pulse pressure. This relationship may be mediated by insulin resistance and type 2 diabetes, which are common in NAFLD and contribute to the development of hypertension [
55,
56].
- 4.
LDL-C:
The last variable in the equation was LDL-C, which has an independent association with increased risk for NAFLD [
57]. Zhang et al. reported that patients with NAFLD also had higher LDL-C and TG and lower HDL-C [
58]. Excessive LDL-C might increase hepatic fat accumulation through mitochondrial dysfunction and Kupffer cell activation and might promote hepatic fibrosis [
59,
60,
61].
The continuous estimation of FLI via an interpretable MARS model offers several potential advantages for clinical translation. First, it provides a quantitative risk score that can identify individuals in the early or subclinical stages of fatty liver, facilitating earlier intervention. Second, the model highlights specific, modifiable biomarkers (e.g., CRP, UA, HDL-C) as key drivers of FLI, suggesting that interventions targeting systemic inflammation, uric acid metabolism, or lipid profiles may be beneficial even in the absence of overt obesity. Finally, the simplicity of the equation—requiring only routine laboratory and clinical measures—allows for easy integration into electronic health records or health screening platforms to automate FLI estimation and flag at-risk individuals for further evaluation or lifestyle counseling.
Our study is subject to certain limitations. First, its cross-sectional design is less definitive for establishing causality than a longitudinal design. Future longitudinal research would better clarify the importance of these variables for NAFLD development. Second, our study focused exclusively on young Taiwanese men (aged 20–50 years) without alcohol consumption or medications for metabolic conditions. This homogeneous cohort was deliberately selected to minimize confounding and to clearly identify early risk factors, but it necessarily limits the immediate generalizability of our predictive equation to women, older adults, other ethnic groups, or individuals with treated comorbidities or alcohol use. The equation should therefore be viewed as specifically developed for and validated in this demographic. Future studies are needed to validate and potentially adapt this model in more diverse populations, including women (given the well-documented sex differences in NAFLD epidemiology and pathophysiology), multi-ethnic cohorts, and individuals across a broader age range and health status spectrum. Third, several tables and figures referenced in the text (Tables 1–8 and Figures 1–3) are not included in this manuscript version but would be essential for full interpretation of the results in a published article. The absence of these visual aids limits the reader’s ability to fully appreciate the relationships described.