Diagnostics
  • Article
  • Open Access

11 December 2025

Explainable Machine Learning Models for Predicting FEV1 in Non-Smoking Taiwanese Men Aged 45–55 Years

1
Division of Chest Medicine, Department of Internal Medicine, Fu Jen Catholic University Hospital, Fu Jen Catholic University, New Taipei City 243089, Taiwan
2
School of Medicine, College of Medicine, Fu Jen Catholic University, New Taipei City 242062, Taiwan
3
Graduate Institute of Applied Science and Engineering, Fu-Jen Catholic University, New Taipei City 242062, Taiwan
4
Department of Internal Medicine, Fu Jen Catholic University Hospital, Fu Jen Catholic University, New Taipei City 243089, Taiwan
This article belongs to the Special Issue Machine-Learning-Based Disease Diagnosis and Prediction

Abstract

Background: Traditional regression explains only part of the variation in forced expiratory volume in one second (FEV1). Machine learning (ML) methods may capture nonlinear patterns beyond linear assumptions. Methods: We analyzed 23,943 non-smoking Taiwanese men aged 45–55 years from the MJ Health Screening Cohort. Random Forest (RF), Stochastic Gradient Boosting (SGB), and XGBoost were compared with multiple linear regression (MLR) using repeated train–test splits. Model performance was evaluated with RMSE, RAE, RRSE, and SMAPE. Shapley additive explanations (SHAP) were used to interpret variable effects. Results: ML models achieved slightly lower prediction errors than MLR. The most influential predictors across models were lactate dehydrogenase (LDH), body weight (BW), education level, leukocyte count, total bilirubin, and sport area. SHAP indicated negative effects of LDH and leukocyte count and positive associations for BW, bilirubin, education, and physical activity. Conclusions: ML approaches provided modest accuracy gains and clearer interpretability compared with MLR. Biochemical and lifestyle factors—including LDH, BW, education, inflammation markers, and physical activity—contribute meaningfully to FEV1 among healthy middle-aged men.

1. Introduction

Asthma, a prevalent chronic respiratory disease, affects 1–29% of adults worldwide [1,2]. In 2019, approximately 262 million people had asthma, and the disease caused 455,000 deaths worldwide [3]. In the United States, an estimated 24 million people were living with asthma in 2021, and 3517 deaths were attributed to it, according to the Centers for Disease Control and Prevention.
Chronic obstructive pulmonary disease (COPD) has also become one of the top three global causes of death, with over 90% of COPD-related mortality occurring in low- and middle-income countries [4,5]. The incidence of COPD peaks around age 45 [6], corresponding to our study population, and adult-onset asthma becomes the predominant phenotype by age 50 in men [7].
Forced expiratory volume in one second (FEV1)—a spirometric measure of the volume of air exhaled during the first second of a forced breath—is a cornerstone parameter for lung function assessment. Conventionally, FEV1 prediction equations are based primarily on age and height [8,9]. However, accumulating evidence suggests that FEV1 is influenced by additional factors, including body composition [10], smoking [11], and environmental exposure [12].
Most prior studies have employed traditional regression models, which may fail to capture nonlinear or complex interactions. Machine learning (ML) methods can overcome these limitations by learning intricate data patterns without a priori assumptions [13]. Although ML models have been increasingly applied in pulmonary research, they have mainly been used in mixed-age or disease-specific populations and with limited sets of predictors. In contrast, our study investigates a highly specific, healthy, non-smoking male cohort aged 45–55 years and incorporates a broader spectrum of demographic, biochemical, and lifestyle variables than previously examined in MJ cohort research.
Rather than predicting lung disease or reproducing reference equations, this study aims to clarify secondary, non-anthropometric determinants of FEV1 using interpretable ML techniques. By integrating Shapley additive explanations (SHAP) analysis with three ML models, we provide new insights into nonlinear relationships and lesser-known correlates of pulmonary function within a homogeneous midlife population.

2. Materials and Methods

2.1. Patient Selection

Some of the following content was adapted from our previous publication [14]. Data were obtained from the Taiwan MJ Cohort, an ongoing prospective health-examination study conducted by the MJ Health Screening Centers in Taiwan [15]. Each health examination includes biological and clinical indicators, such as anthropometric measurements, blood tests, and imaging studies. Participants complete a self-administered questionnaire detailing medical history, current health status, lifestyle, physical activity, sleep, and dietary habits [16].
At the time of examination, all participants provided general informed consent for future anonymous research use. The dataset is maintained by the MJ Interpretation Foundation. The data used in this research were authorized by and received from the MJ Health Research Foundation and MJ Interpretation Foundation (Authorization Code: MJHRF2023007A). The interpretations and conclusions presented here are solely those of the authors and do not represent the views of the MJ Health Research Foundation.
Because this was a secondary analysis of de-identified data, only a brief ethics review was required. Detailed data-collection procedures are described in the annual technical report [16]. The study protocol was approved by the Fu Jen Catholic University Institutional Review Board (IRB No.: FJUH112321) and conducted in accordance with the principles of the Declaration of Helsinki.
Initially, 100,040 subjects were identified. After applying the exclusion criteria, 23,943 non-smoking men aged 45–55 years remained available for analysis (Figure 1).
Figure 1. Patient selection scheme.
Inclusion criteria:
  • Men aged 45–55 years;
  • Never smoked;
  • Not taking medications known to affect pulmonary function;
  • Not taking medications for metabolic syndrome;
  • No major systemic or chronic diseases.

2.2. Measurements of Anthropometry and Biochemistry

Part of the following section has been published previously by our team [17]. On the day of examination, trained nurses recorded participants’ history of smoking, alcohol, and betel-nut use, as well as education level. Body weight (BW) was measured using a calibrated electronic scale, and chest circumference (CC) was measured horizontally at the nipple line. Systolic (SBP) and diastolic blood pressure (DBP) were measured with an automated electronic sphygmomanometer (Omron HEM-7155T, OHEM-7155T; MVN0032000 OMRON Healthcare Manufacturing Vietnam Co., Ltd., Thu Dau Mot City, Vietnam). Venous blood samples were collected after an overnight (≥10 h) fast. Plasma was separated within 1 h and stored at −30 °C until analysis. Triglycerides (TG) were quantified using a dry multilayer analytical slide method on a Fuji Dri-Chem 3000 analyzer (Fuji Photo Film, Tokyo, Japan). High-density lipoprotein cholesterol (HDL-C) and low-density lipoprotein cholesterol (LDL-C) were measured enzymatically after dextran-sulfate precipitation.

2.3. Measurement of FEV1

Qualified paramedical technicians conducted pulmonary function testing using two identical electronic flow-type pneumotachometer spirometers (Moose PFT System; Cybermedic, Louisville, CO, USA; software version 3.8D). All tests were performed with participants seated and wearing a nose clip to prevent air leakage.
To ensure measurement accuracy, each participant completed at least three technically acceptable FEV1 maneuvers, with a minimum of two meeting the reproducibility criteria recommended by the American Thoracic Society (ATS) [18]. The highest FEV1 value among reproducible trials was used for subsequent analysis.

2.4. Traditional Statistical Analysis

Continuous variables are presented as means ± standard deviations (SDs). Education level, treated as an ordinal variable, was analyzed using analysis of variance (ANOVA). Pearson’s correlation was applied to examine relationships between continuous variables and FEV1. Multiple linear regression (MLR) was conducted as a benchmark model for comparison with ML methods. All statistical tests were two-sided, and a p-value < 0.05 was considered statistically significant. Analyses were performed using SPSS version 10.0 for Windows (SPSS Inc., Chicago, IL, USA).

2.5. Description of the Study Dataset

Table 1 summarizes the definitions and measurement methods of the 25 clinical variables included in the analysis. Because age and height are already embedded in conventional FEV1 prediction equations, these two variables were excluded from model development.
Table 1. The demographic, biochemistry and lifestyle data of participants (n = 23,943).
The variables were grouped into three major categories:
  • Demographic: BW, CC, SBP, DBP, and education level.
  • Biochemical: leukocyte count, hemoglobin, platelet count, total bilirubin, total protein, albumin, aspartate aminotransferase (AST), alanine aminotransferase (ALT), gamma-glutamyltransferase (γ-GT), LDH, creatinine, uric acid, TG, HDL-C, LDL-C, thyroid-stimulating hormone (TSH), and C-reactive protein (CRP).
  • Lifestyle: drinking habits and sport area.
Drinking habits were classified as non-drinkers or drinkers. The sport area was calculated by multiplying exercise duration, frequency, and intensity to represent overall physical activity. All parameters listed above served as independent variables, whereas FEV1 was the dependent variable.

2.6. Data Preprocessing, Model Validation and Sensitivity Analyses

Missing values were assessed across all predictors (Supplementary Materials Table S1). Because missingness was low (<3% for most variables, with only sport area exceeding 10%), a complete-case analysis was applied, and the final analytic dataset included only participants with complete information.
Continuous predictors were visually inspected for extreme values using histograms and boxplots; observations exceeding ±4 standard deviations were winsorized to reduce undue influence while preserving sample size. Right-skewed biochemical markers (e.g., LDH, γ-GT, TG, CRP) were log-transformed to improve symmetry and stabilize variance.
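The winsorization and log-transformation steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function names, the LDH-like values, and the order of operations (winsorize first, then log-transform) are hypothetical, and the natural logarithm is assumed because the base is not specified.

```python
import math
import statistics


def winsorize_sd(values, n_sd=4.0):
    """Clamp observations lying more than n_sd sample SDs from the mean."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    lower, upper = mean - n_sd * sd, mean + n_sd * sd
    return [min(max(v, lower), upper) for v in values]


def log_transform(values):
    """Natural-log transform for strictly positive, right-skewed markers."""
    return [math.log(v) for v in values]


# Hypothetical LDH-like values (U/L): 200 ordinary readings plus one extreme one.
ldh = [150.0 + (i % 21) for i in range(200)] + [2000.0]
ldh_winsorized = winsorize_sd(ldh)       # the extreme reading is pulled to the upper bound
ldh_log = log_transform(ldh_winsorized)  # assumed order: winsorize, then log
```

Clamping rather than dropping the extreme reading keeps the participant in the dataset, which matches the stated goal of preserving sample size.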
To minimize overfitting, all ML models were trained using 10-fold cross-validation with internal hyperparameter tuning performed within each fold to prevent data leakage. Model performance metrics were averaged across folds. Because no independent dataset of healthy Taiwanese men aged 45–55 was available, external validation could not be conducted and is acknowledged as a limitation. The study therefore focuses on internal validation to assess robustness and generalizability within this population.
To evaluate the robustness of our findings, we repeated the entire modeling pipeline under several alternative specifications, including (i) adding age and height as predictors, (ii) switching the data split from 80/20 to 70/30, (iii) replacing 10-fold CV with 5-fold CV, and (iv) removing both winsorization and log-transformation steps. Across all scenarios, RMSE differences remained within a small tolerance (ΔRMSE < 0.05 across preprocessing variations), and most of the Wilcoxon signed-rank test p-values for ML vs. MLR remained within the same interpretation category (p < 0.05); the results are summarized in Supplementary Materials Table S2. These findings indicate that the model performance and comparative conclusions were stable under reasonable changes in preprocessing or validation strategy.

2.7. Proposed Machine Learning Scheme

We developed predictive models to identify and rank factors associated with FEV1 using three distinct ML methods. Part of the methodological framework has been reported previously by our research team [14].
The first algorithm applied was the Random Forest (RF) model, an ensemble learning approach based on decision trees. RF builds on bootstrap aggregation (bagging) [19]. It constructs multiple classification and regression trees (CART) using randomly selected subsets of the data and predictors. Splits are chosen by Gini impurity reduction in classification trees and by reduction in squared error (variance) in regression trees, which is the relevant case for a continuous outcome such as FEV1. Predictions from individual trees are then aggregated by averaging (for regression) or voting (for classification), resulting in a robust final model [20].
The second method was Stochastic Gradient Boosting (SGB), a tree-based boosting algorithm that integrates both bagging and boosting techniques to reduce overfitting [21]. In SGB, weak learners (typically shallow decision trees) are sequentially built so that each successive tree focuses on the residual errors of the previous one. This process continues iteratively until the model converges or reaches a predefined stopping criterion. The ensemble of trees collectively contributes to the final, high-performing model.
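The sequential residual-fitting idea described above can be sketched in a few lines. This is a toy illustration, not the implementation used in the study: it assumes squared-error loss, one predictor, one-split trees (stumps), and subsampling without replacement at each iteration; all names and data are hypothetical.

```python
import math
import random


def fit_stump(x, r):
    """Fit a one-split regression stump to residuals r (squared-error loss)."""
    thresholds = sorted(set(x))[:-1]
    if not thresholds:                      # no split possible: predict the mean
        m = sum(r) / len(r)
        return lambda v: m
    best = None
    for t in thresholds:
        left = [ri for xi, ri in zip(x, r) if xi <= t]
        right = [ri for xi, ri in zip(x, r) if xi > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((ri - lm) ** 2 for ri in left) + sum((ri - rm) ** 2 for ri in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda v: lm if v <= t else rm


def fit_sgb(x, y, n_trees=50, learning_rate=0.1, subsample=0.5, seed=0):
    """Boost stumps on residuals, each fitted to a random subsample of rows."""
    rng = random.Random(seed)
    base = sum(y) / len(y)                  # initial prediction: the overall mean
    pred = [base] * len(y)
    stumps = []
    k = max(2, int(subsample * len(y)))
    for _ in range(n_trees):
        idx = rng.sample(range(len(y)), k)  # the "stochastic" part of SGB
        xs = [x[i] for i in idx]
        rs = [y[i] - pred[i] for i in idx]  # residuals of the current ensemble
        stump = fit_stump(xs, rs)
        stumps.append(stump)
        pred = [p + learning_rate * stump(xi) for p, xi in zip(pred, x)]
    return lambda v: base + learning_rate * sum(s(v) for s in stumps)


def rmse(y, yhat):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, yhat)) / len(y))


# Hypothetical step-shaped relationship that a single straight line cannot fit.
x = [i / 10 for i in range(40)]
y = [3.0 if xi < 2.0 else 4.0 for xi in x]
model = fit_sgb(x, y)
baseline_rmse = rmse(y, [sum(y) / len(y)] * len(y))
model_rmse = rmse(y, [model(xi) for xi in x])
```

The small learning rate means each stump corrects only a fraction of the remaining error, which is why many shallow trees are combined.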
The third model was eXtreme Gradient Boosting (XGBoost), an advanced implementation of gradient boosting that improves computational efficiency and prediction accuracy [22]. XGBoost employs a second-order Taylor expansion to approximate the objective function and integrates regularization terms to control model complexity, thereby mitigating overfitting [23].
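For reference, the second-order approximation mentioned above is usually written as follows. The notation is the standard one from the XGBoost literature and is introduced here for illustration; it is not taken from the original text. At boosting round t, with f_t the new tree, T its number of leaves, and w_j its leaf weights:

```latex
\mathcal{L}^{(t)} \approx \sum_{i=1}^{n} \left[ l\left(y_i, \hat{y}_i^{(t-1)}\right) + g_i f_t(x_i) + \tfrac{1}{2} h_i f_t(x_i)^2 \right] + \Omega(f_t),
\qquad
\Omega(f_t) = \gamma T + \tfrac{1}{2} \lambda \sum_{j=1}^{T} w_j^2,

g_i = \frac{\partial\, l\left(y_i, \hat{y}_i^{(t-1)}\right)}{\partial\, \hat{y}_i^{(t-1)}},
\qquad
h_i = \frac{\partial^2 l\left(y_i, \hat{y}_i^{(t-1)}\right)}{\partial \left(\hat{y}_i^{(t-1)}\right)^2}.
```

For the squared-error loss l = ½(y_i − ŷ_i)², which is a natural choice for a continuous outcome such as FEV1, these reduce to g_i = ŷ_i^(t−1) − y_i and h_i = 1, so each new tree is effectively fitted to the current residuals while γ and λ penalize tree size and large leaf weights.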
While these ML methods effectively identify important predictors, they do not directly indicate whether each variable exerts a positive or negative influence. To address this limitation, Shapley Additive Explanations (SHAP) analysis was applied to the XGBoost model. SHAP is based on cooperative game theory and quantifies each feature’s contribution to individual predictions by comparing model outputs with and without the feature across all possible feature combinations [24,25,26]. This approach provides both a ranking of feature importance and the directionality (positive or negative) of each feature’s effect.
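The "with and without the feature across all possible feature combinations" idea can be made concrete with a brute-force Shapley calculation. This is a toy sketch: the three-feature model, its coefficients, and the participant values are hypothetical, and absent features are simply set to a baseline of zero. In practice the shap library uses much faster algorithms for tree models rather than this enumeration.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, features):
    """Exact Shapley values: weighted average marginal contribution over all coalitions."""
    n = len(features)
    phi = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                total += weight * (value_fn(s | {f}) - value_fn(s))
        phi[f] = total
    return phi


# Hypothetical standardized values for one participant; baseline is 0 for every feature.
participant = {"LDH": 1.0, "BW": 2.0, "WBC": 1.0}


def toy_model(v):
    """Hypothetical FEV1 model (litres) with one LDH x BW interaction."""
    return 3.0 - 0.5 * v["LDH"] + 0.2 * v["BW"] - 0.1 * v["WBC"] + 0.05 * v["LDH"] * v["BW"]


def value_fn(coalition):
    """Model output when only the features in `coalition` take the participant's values."""
    v = {name: (participant[name] if name in coalition else 0.0) for name in participant}
    return toy_model(v)


phi = shapley_values(value_fn, list(participant))
```

Here the interaction term is split equally between LDH and BW, and the three values sum to the difference between the participant's prediction and the baseline prediction, which is the additivity property that SHAP plots rely on.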
Figure 2 illustrates the overall ML workflow. The dataset was randomly divided into 80% training and 20% testing subsets. During model development, models were tuned using 10-fold cross-validation with a modest grid of hyperparameters. The model achieving the lowest root mean square error (RMSE) on the validation folds was selected as optimal for each algorithm (RF, SGB, and XGBoost). The full hyperparameter search space and the final selected values for RF, SGB, and XGBoost are provided in Supplementary Materials Table S3 to improve transparency and reproducibility.
Figure 2. Proposed machine learning prediction scheme.
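The split, tune, and test sequence of this workflow can be sketched as below. A one-feature ridge regression stands in for the tree ensembles purely to keep the example short; the penalty grid, the synthetic data, and all function names are assumptions made for illustration, and the actual search spaces are those reported in Supplementary Materials Table S3.

```python
import math
import random


def fit_ridge_1d(x, y, lam):
    """One-feature ridge regression; lam is the hyperparameter being tuned."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / (sxx + lam)
    return lambda v: my + slope * (v - mx)


def rmse(y, yhat):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, yhat)) / len(y))


def cv_rmse(x, y, lam, k=10):
    """Mean validation RMSE over k folds (fold f holds out rows with index % k == f)."""
    scores = []
    for fold in range(k):
        tr = [i for i in range(len(x)) if i % k != fold]
        va = [i for i in range(len(x)) if i % k == fold]
        model = fit_ridge_1d([x[i] for i in tr], [y[i] for i in tr], lam)
        scores.append(rmse([y[i] for i in va], [model(x[i]) for i in va]))
    return sum(scores) / k


# Synthetic data: a hypothetical predictor and a noisy FEV1-like outcome (litres).
rng = random.Random(42)
x = [rng.uniform(50.0, 100.0) for _ in range(200)]
y = [2.0 + 0.02 * xi + rng.gauss(0.0, 0.3) for xi in x]

# 80/20 random split; the test rows are never used during tuning.
idx = list(range(200))
rng.shuffle(idx)
train, test = idx[:160], idx[160:]
x_tr, y_tr = [x[i] for i in train], [y[i] for i in train]
x_te, y_te = [x[i] for i in test], [y[i] for i in test]

# Grid search: keep the setting with the lowest cross-validated RMSE.
grid = [0.0, 1.0, 10.0, 100.0, 1000.0]
cv_scores = {lam: cv_rmse(x_tr, y_tr, lam) for lam in grid}
best_lam = min(cv_scores, key=cv_scores.get)

# Refit on the full training set and evaluate once on the held-out test set.
final_model = fit_ridge_1d(x_tr, y_tr, best_lam)
test_rmse = rmse(y_te, [final_model(v) for v in x_te])
```

Keeping the test subset out of the tuning loop is what allows the final RMSE to serve as an estimate of out-of-sample error.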
In the testing phase, model performance was evaluated using RMSE as the primary accuracy metric, complemented by symmetric mean absolute percentage error (SMAPE), relative absolute error (RAE), and root relative squared error (RRSE) [27,28,29,30,31]. The definitions of these metrics are given in Table 2.
Table 2. Equation of performance metrics.
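The four metrics named in this section can be computed as in the sketch below. The formulas are the commonly used definitions and are an assumption here, since the contents of Table 2 are not reproduced in this text; SMAPE in particular has several variants, and the one shown divides by the mean of the absolute observed and predicted values and returns a fraction rather than a percentage. The example values are hypothetical.

```python
import math


def rmse(y, yhat):
    """Root mean square error, in the units of the outcome."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, yhat)) / len(y))


def rae(y, yhat):
    """Relative absolute error: total absolute error relative to a mean-only predictor."""
    ybar = sum(y) / len(y)
    return sum(abs(a - b) for a, b in zip(y, yhat)) / sum(abs(a - ybar) for a in y)


def rrse(y, yhat):
    """Root relative squared error: squared error relative to a mean-only predictor."""
    ybar = sum(y) / len(y)
    num = sum((a - b) ** 2 for a, b in zip(y, yhat))
    den = sum((a - ybar) ** 2 for a in y)
    return math.sqrt(num / den)


def smape(y, yhat):
    """Symmetric mean absolute percentage error, expressed as a fraction."""
    return sum(abs(a - b) / ((abs(a) + abs(b)) / 2) for a, b in zip(y, yhat)) / len(y)


# Hypothetical observed and predicted FEV1 values (litres).
observed = [3.0, 3.5, 4.0]
predicted = [3.2, 3.4, 3.9]
metrics = {
    "RMSE": rmse(observed, predicted),
    "RAE": rae(observed, predicted),
    "RRSE": rrse(observed, predicted),
    "SMAPE": smape(observed, predicted),
}
```

RAE and RRSE values below 1 indicate that the model does better than always predicting the mean of the observed values.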
All ML algorithms generated feature importance rankings, which were compared across models to identify consistent predictors. In the final analytical phase, SHAP was used to quantify both the relative importance and directionality of each variable’s contribution to the predicted FEV1.
All analyses were performed using R software (version 4.0.5) with RStudio (version 1.1.453), and relevant R packages (available at www.R-project.org and RStudio). SHAP analyses were conducted in Python 3.10 using the shap, pandas, numpy, and matplotlib libraries for computation and visualization, including summary, bar, and waterfall plots.

3. Results

A total of 23,943 men were included in the final analysis (Figure 1). Table 1 summarizes participant characteristics. The mean age was 49.6 ± 3.2 years.
From the Pearson correlation analysis (Table 3), FEV1 showed significant negative correlations with leukocyte count (r = −0.161, p < 0.0005), SBP (r = −0.150, p < 0.0005), and LDH (r = −0.337, p < 0.0005). Positive correlations were observed with BW (r = 0.156, p < 0.0005), total bilirubin (r = 0.146, p < 0.0005), HDL-C (r = 0.136, p < 0.0005), and education level (r = 0.298, p < 0.0005).
Table 3. The results of Pearson correlation between baseline demographic, biochemistry, lifestyle and FEV1. BW: Body weight; CC: Chest circumference; WBC: White blood cell; Hb: Hemoglobin; PLT: Platelet count; TB: Total bilirubin; TP: Total protein; AST: Aspartate aminotransferase; ALT: Alanine aminotransferase; γ-GT: Gamma-glutamyltransferase; LDH: Lactate dehydrogenase; Cr: Creatinine; UA: Uric acid; TG: Triglyceride; HDL-C: High density lipoprotein cholesterol; LDL-C: Low density lipoprotein cholesterol; TSH: Thyroid stimulating hormone; CRP: C-reactive protein; SBP: Systolic blood pressure; DBP: Diastolic blood pressure. * p < 0.05, *** p < 0.005.
Approximately 64% of participants held a higher education degree (college, university, or postgraduate). Significant differences in FEV1 were observed across educational levels, specifically between primary and junior high, junior and senior high, senior high and college, and between university and postgraduate levels (p < 0.0005). Most participants were non-drinkers (78.3%), and FEV1 differed significantly between drinkers and non-drinkers (p < 0.0005).
All three ML models (RF, SGB, and XGBoost) produced slightly lower prediction errors than the traditional MLR benchmark. The values for RF (SMAPE = 0.1446, RAE = 0.1381, RRSE = 0.1381, RMSE = 0.8778), SGB (SMAPE = 0.1429, RAE = 0.1370, RRSE = 0.8582), XGBoost (SMAPE = 0.1426, RAE = 0.1369, RRSE = 0.8576), and MLR (SMAPE = 0.1460, RAE = 0.1460, RRSE = 0.8762) are shown in Table 4.
Table 4. The performance of linear regression and three machine learning methods.
Across repeated train–test partitions, all three ML models showed lower mean RMSE than the MLR baseline (Table 5). Paired Wilcoxon signed-rank tests further confirmed that each ML model achieved significantly lower RMSE compared with MLR (RF: p = 0.0039; SGB: p = 0.0019; XGBoost: p = 0.0019). Although these results demonstrate statistically consistent improvements, the absolute performance gains were modest, indicating that the clinical significance of these differences should be interpreted with caution.
Table 5. RMSE ± SD and 95% Confidence Intervals and Wilcoxon Signed-Rank Tests Comparing RMSE of ML Models vs. MLR.
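The paired comparison can be illustrated with an exact Wilcoxon signed-rank test. This is a sketch under stated assumptions: the RMSE values are hypothetical, the number of repeated splits is not given in this section and is set to 10 for illustration, and the implementation assumes no zero differences and no tied absolute differences.

```python
from itertools import product


def wilcoxon_signed_rank_exact(a, b):
    """Return (W+, two-sided exact p-value) for paired samples a and b."""
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    order = sorted(range(n), key=lambda i: abs(d[i]))
    ranks = [0] * n
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank                      # rank 1 = smallest absolute difference
    w_plus = sum(r for r, di in zip(ranks, d) if di > 0)
    total = n * (n + 1) // 2
    w_obs = min(w_plus, total - w_plus)
    # Under the null hypothesis every sign pattern is equally likely,
    # so enumerate all 2**n patterns and count those at least as extreme.
    count = 0
    for signs in product((0, 1), repeat=n):
        w = sum(r for r, s in zip(range(1, n + 1), signs) if s)
        if min(w, total - w) <= w_obs:
            count += 1
    return w_plus, count / 2 ** n


# Hypothetical test-set RMSEs from 10 repeated splits (same splits for both models).
mlr_rmse = [0.880, 0.884, 0.879, 0.886, 0.882, 0.885, 0.881, 0.883, 0.887, 0.878]
xgb_rmse = [m - 0.001 * (i + 1) for i, m in enumerate(mlr_rmse)]

w_plus, p_value = wilcoxon_signed_rank_exact(xgb_rmse, mlr_rmse)
```

With 10 pairs, the smallest attainable two-sided exact p-value is 2/2^10 ≈ 0.00195, reached when one model is better on every split. A very small p-value therefore reflects how consistently one model wins across splits and says nothing about the size of the improvement, which is why the modest absolute gains still call for cautious interpretation.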
Table 6 presents the variable importance scores across ML methods, with the averaged rankings displayed in the rightmost column. The six most influential predictors of FEV1 were LDH, BW, education level, leukocyte count, total bilirubin, and sport area. The relative importance rankings from each model are visualized in Figure 3.
Table 6. The rank of the importance (%) of the factors derived from linear regression and machine learning methods.
Figure 3. The ranks of the factors derived from linear regression and the three machine learning methods.
As detailed in the Methods, SHAP analysis was applied to quantify each variable’s contribution and direction of effect on the XGBoost model. Figure 4 (SHAP summary plot) depicts the overall feature contributions across all participants: each dot represents a participant’s SHAP value for a given feature, with warmer colors (red) indicating higher feature values and cooler colors (blue) lower values. LDH emerged as the most influential variable.
Figure 4. The summary plot. Red indicates higher feature values and blue indicates lower values. Points positioned to the right reflect positive contributions to FEV1, and those to the left reflect negative contributions.
Figure 5 displays the mean absolute SHAP values, confirming the same order of global importance as Figure 4. Figure 6 presents the signed SHAP values, showing whether each variable positively or negatively influenced predicted FEV1. Consistent with earlier findings, LDH and leukocyte count exerted negative effects, while BW, total bilirubin, sport area, and education level had positive effects.
Figure 5. The importance plot (mean absolute SHAP values). Bars display the mean absolute SHAP value for each predictor in the final ML model, summarizing the global contribution of that variable to the model’s predictions; longer bars indicate greater overall importance.
Figure 6. The signed SHAP value plot (mean signed SHAP values with directionality). This plot displays the mean signed SHAP value for each predictor in the final model, summarizing both the strength and average direction of each feature’s effect on the model’s predicted outcome. The dashed vertical line marks SHAP = 0. Bar length reflects effect size; color is uniform to emphasize direction rather than magnitude.

4. Discussion

This study offers a targeted examination of the factors associated with FEV1 in a homogeneous population of healthy, non-smoking Taiwanese men aged 45–55 years. Rather than proposing new algorithms, the novelty lies in applying established ML methods to a uniquely constrained cohort to uncover secondary biochemical and lifestyle determinants that are not emphasized in conventional lung-function equations.
Despite modest performance differences among ML models, the consistent identification of LDH, BW, education level, leukocyte count, bilirubin, and physical activity highlights patterns that traditional regression alone may not fully capture. Using SHAP for model interpretability further clarifies the direction and relative contribution of each predictor, offering a clearer understanding of subtle, nonlinear relationships influencing pulmonary function in midlife males.
Our aim was not to replicate or replace population-reference FEV1 equations, but rather to explore secondary determinants of lung function within a homogeneous midlife male cohort where age- and height-related variability is minimized.
LDH emerged as the strongest negative predictor of FEV1. Elevated LDH activity in the airways may arise from necrosis or rupture of airway or alveolar epithelial cells and activated macrophages, reflecting cellular injury and inflammation [32]. Because LDH is an intracellular enzyme present in nearly all tissues, its appearance in extracellular fluid indicates tissue damage. Consistent with our results, a cross-sectional analysis of 3453 adults reported that each 1 U/L increase in LDH was associated with a 1.11 mL decline in FEV1 (95% CI: −1.82 to −0.39; p = 0.0025) [33]. The negative correlation observed in our study (r = −0.337, p < 0.0005) supports the role of LDH as a sensitive and easily measurable biomarker of subclinical pulmonary injury.
BW ranked as the second most influential factor. The relationship between body composition and lung function has not been widely explored in healthy populations. A Korean study found that underweight adults exhibited lower pulmonary function, potentially due to insufficient respiratory muscle mass—particularly in the diaphragm—leading to reduced ventilatory capacity [34]. Conversely, excessive weight and obesity have been associated with adverse effects on lung function and progressive decline in FEV1 [35]. Our study uniquely focuses on a healthy population, revealing a modest positive correlation between BW and FEV1 (r = 0.156, p < 0.0005), suggesting that optimal muscle mass and nutritional status may contribute to better pulmonary mechanics.
Education level was the third strongest predictor of FEV1, reflecting the broader influence of socioeconomic status on health. In a Dutch cohort of 2679 men aged 26–66 years, those with the lowest education had FEV1 values 221 mL lower than those with higher education [36]. Similarly, Polak et al. [37] found that sustained high socioeconomic status throughout life was associated with greater FEV1 compared with persistent low SES or downward mobility. Our findings align with these results: participants with the highest education level demonstrated mean FEV1 values 820 mL higher than those with the lowest level. Although differences between adjacent education categories (e.g., college vs. university) were non-significant—likely due to small sample sizes—the overall trend underscores the importance of socioeconomic and behavioral factors in respiratory health. Higher educational attainment may reflect better health literacy, healthier lifestyle patterns, and improved socioeconomic conditions, which are known to influence pulmonary health through reduced exposure to smoking, better nutrition, and access to preventive care.
Leukocyte count was inversely related to FEV1, supporting the concept that systemic inflammation contributes to impaired lung function. Elevated leukocytes may release proteolytic enzymes and reactive oxygen species, causing tissue injury and airway remodeling [38,39]. In the U.S. NHANES III cohort (n = 16,312), higher leukocyte counts were independently associated with lower FEV1 [40]. Our findings (r = −0.161, p < 0.0005) corroborate this relationship, suggesting that even in non-smokers, subtle systemic inflammation may adversely influence pulmonary capacity.
Total bilirubin exhibited a positive association with FEV1, consistent with its recognized antioxidant and cytoprotective properties [41,42]. Bilirubin has been proposed to mitigate oxidative stress and inflammation in tissues exposed to environmental oxygen, including the lungs. In a Korean community-based cohort of 7986 adults, higher bilirubin levels were significantly associated with better FEV1, FVC, and FEF25–75%, particularly among never-smokers [43]. Our study extends these findings to a Taiwanese male cohort, suggesting that moderate elevations in bilirubin may confer protection against subclinical airway inflammation and decline in lung function.
Physical activity (the sport area variable)—representing exercise frequency, duration, and intensity—ranked sixth in importance. Previous studies have reported that non-smokers exhibit a dose–response relationship between exercise level and FEV1 [44]. For instance, men with higher physical activity had mean FEV1 values 154 mL greater than sedentary peers (p = 0.001). Other studies likewise support positive associations between fitness and lung capacity [45,46]. Higher activity levels could improve respiratory muscle strength, reduce systemic inflammation, and enhance overall cardiometabolic fitness, all of which support better FEV1. Although our simple correlations did not reach statistical significance, ML methods detected a latent relationship between physical activity and pulmonary performance. This suggests that nonlinear modeling may uncover subtle or interaction-dependent associations that traditional statistics overlook.
This study has several limitations. First, the cross-sectional design precludes causal inference. Future longitudinal studies are needed to verify temporal relationships and assess predictive validity. Second, the sample consisted exclusively of Taiwanese men; caution should therefore be exercised in extrapolating these findings to other ethnicities or women. Third, external validation could not be conducted, because no independent dataset of healthy Taiwanese men aged 45–55 was available. Finally, despite careful variable selection, residual confounding from unmeasured factors (e.g., environmental exposures) cannot be excluded.

5. Conclusions

This study contributes to the existing literature by focusing on a highly specific and clinically relevant subgroup—healthy, non-smoking Taiwanese men aged 45–55 years—which is rarely analyzed independently, despite this age range being a key turning point for the onset of adult respiratory decline.
Unlike most ML studies that evaluate mixed-age or disease-specific populations, our analysis isolates secondary biochemical and lifestyle predictors while controlling for the dominant influences of age and height. By applying explainable ML methods and SHAP analysis, the study uncovers nonlinear relationships involving LDH, leukocyte count, bilirubin, BW, and physical activity that traditional regression may understate.
These findings extend prior MJ cohort research by demonstrating that even within a narrow, homogeneous age-band, measurable biochemical signals and lifestyle behaviors meaningfully shape pulmonary function. This refined understanding highlights potential targets for early screening and health-promotion strategies in midlife men.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/diagnostics15243152/s1, Table S1: Missing Values Summary (n = 23,943); Table S2: Scenario Model RMSE (mean ± SD) with ΔRMSE vs. Primary; Table S3: Hyperparameter Search Space and Selected Values for Machine-Learning Models.

Author Contributions

D.P. and C.-Z.W. conceived the idea and proposed the method. T.-W.C., H.-S.S. and C.-Y.H. collected the data. Y.-J.L. validated the results. C.-Y.C., Y.-L.K. and L.-N.L. wrote the manuscript. D.P. revised the manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

No funding was received to assist with the preparation of this manuscript.

Institutional Review Board Statement

All or part of the data used in this research were authorized by and received from MJ Health Research Foundation and MJ Interpretation Foundation (Authorization Code: MJHRF2023007A). As this was a secondary study, only a short review was made. The present study was approved by Fu Jen Catholic University (IRB No.: FJUH112321, 17 October 2024) and the methods were carried out in accordance with the principles of the Declaration of Helsinki.

Data Availability Statement

The data are not publicly available due to privacy restrictions.

Acknowledgments

Thank you to all office colleagues for their hard work and collaboration; special thanks to Ta-Wei Chu for his diligent efforts in data collection.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Mortimer, K.; Lesosky, M.; García-Marcos, L.; Asher, M.I.; Pearce, N.; Ellwood, E.; Bissell, K.; El Sony, A.; Ellwood, P.; Marks, G.B.; et al. The burden of asthma, hay fever and eczema in adults in 17 countries: GAN Phase I study. Eur. Respir. J. 2022, 60, 2102865. [Google Scholar] [CrossRef]
  2. Asher, M.I.; Rutter, E.C.; Bissell, K.; Chiang, C.-Y.; El Sony, A.; Ellwood, E.; Ellwood, P.; García-Marcos, L.; Marks, G.B.; Morales, E.; et al. Worldwide trends in the burden of asthma symptoms in school-aged children: Global Asthma Network Phase I cross-sectional study. Lancet 2021, 398, 1569–1580. [Google Scholar] [CrossRef] [PubMed]
  3. Vos, T.; Lim, S.S.; Abbafati, C.; Abbas, K.M.; Abbasi, M.; Abbasifard, M.; Bhutta, Z.A. Global burden of 369 diseases and injuries in 204 countries and territories, 1990–2019: A systematic analysis for the Global Burden of Disease Study 2019. Lancet 2020, 396, 1204–1222. [Google Scholar] [CrossRef]
  4. Halpin, D.M.G.; Celli, B.R.; Criner, G.J.; Frith, P.; López–Varela, M.V.L.; Salvi, S.; Vogelmeier, C.F.; Chen, R.; Mortimer, K.; Montes de Oca, M.; et al. The GOLD Summit on chronic obstructive pulmonary disease in low- and middle-income countries. Int. J. Tuberc. Lung Dis. 2019, 23, 1131–1141. [Google Scholar] [CrossRef]
  5. Meghji, J.; Mortimer, K.; Agusti, A.; Allwood, B.W.; Asher, I.; Bateman, E.D.; Bissell, K.; Bolton, E.C.; Bush, A.; Celli, B.; et al. Improving lung health in low-income and middle-income countries: From challenges to solutions. Lancet 2021, 397, 928–940. [Google Scholar] [CrossRef]
  6. Terzikhan, N.; Verhamme, K.M.C.; Hofman, A.; Stricker, B.H.; Brusselle, G.G.; Lahousse, L. Prevalence and incidence of COPD in smokers and non-smokers: The Rotterdam Study. Eur. J. Epidemiol. 2016, 31, 785–792. [Google Scholar] [CrossRef]
  7. Honkamaki, J.; Hisinger-Mölkänen, H.; Ilmarinen, P.; Piirilä, P.; Tuomisto, L.E.; Andersén, H.; Huhtala, H.; Sovijärvi, A.; Backman, H.; Lundbäck, B.; et al. Age- and gender-specific incidence of new asthma diagnosis from childhood to late adulthood. Respir. Med. 2019, 154, 56–62. [Google Scholar] [CrossRef]
  8. Janmeja, A.K.; Mohapatra, P.R.; Gupta, R.; Aggarwal, D. Spirometry Reference Values and Equations in North Indian Geriatric Population. Indian J. Chest Dis. Allied Sci. 2017, 59, 125–130. [Google Scholar] [CrossRef]
  9. Johnson, D.C.; Johnson, B.G. Spirometry Reference Equations Including Existing and Novel Parameters. Open Respir. Med. J. 2023, 17, e187430642212260. [Google Scholar] [CrossRef] [PubMed]
  10. Park, J.E.; Chung, J.H.; Lee, K.H.; Shin, K.C. The effect of body composition on pulmonary function. Tuberc. Respir. Dis. 2012, 72, 433–440. [Google Scholar] [CrossRef]
  11. Tantisuwat, A.; Thaveeratitham, P. Effects of smoking on chest expansion, lung function, and respiratory muscle strength of youths. J. Phys. Ther. Sci. 2014, 26, 167–170. [Google Scholar] [CrossRef] [PubMed]
  12. Marchetti, N.; Garshick, E.; Kinney, G.L.; McKenzie, A.; Stinson, D.; Lutz, S.M.; Lynch, D.A.; Criner, G.J.; Silverman, E.K.; Crapo, J.D. Association between occupational exposure and lung function, respiratory symptoms, and high-resolution computed tomography imaging in COPDGene. Am. J. Respir. Crit. Care Med. 2014, 190, 756–762. [Google Scholar] [CrossRef] [PubMed]
  13. Beverin, L.; Topalovic, M.; Halilovic, A.; Desbordes, P.; Janssens, W.; De Vos, M. Predicting total lung capacity from spirometry: A machine learning approach. Front. Med. 2023, 10, 1174631. [Google Scholar] [CrossRef]
  14. Wu, C.-Z.; Huang, L.-Y.; Chen, F.-Y.; Kuo, C.-H.; Yeih, D.-F. Using Machine Learning to Predict Abnormal Carotid Intima-Media Thickness in Type 2 Diabetes. Diagnostics 2023, 13, 1834. [Google Scholar] [CrossRef]
  15. Wu, X.; Tsai, S.P.; Tsao, C.K.; Chiu, M.L.; Tsai, M.K.; Lu, P.J.; Lee, J.H.; Chen, C.H.; Wen, C.; Chang, S.-S.; et al. Cohort Profile: The Taiwan MJ Cohort: Half a million Taiwanese with repeated health surveillance data. Int. J. Epidemiol. 2017, 46, 1744–1744g. [Google Scholar] [CrossRef]
  16. Foundation MHR. The Introduction of MJ Health Database. MJ Health Research Foundation Technical Report, MJHRF-TR-01. 2016. Available online: http://www.mjhrf.org/en/index.php?action=database&id=6 (accessed on 22 August 2016).
  17. Tzou, S.J.; Peng, C.-H.; Huang, L.-Y.; Chen, F.-Y.; Kuo, C.-H.; Wu, C.-Z.; Chu, T.-W. Comparison between linear regression and four different machine learning methods in selecting risk factors for osteoporosis in a Taiwanese female aged cohort. J. Chin. Med. Assoc. 2023, 86, 1028–1036. [Google Scholar] [CrossRef]
  18. Pellegrino, R.; Viegi, G.; Brusasco, V.; Crapo, R.O.; Burgos, F.; Casaburi, R.; Coates, A.; Van Der Grinten, C.P.M.; Gustafsson, P.; Hankinson, J.; et al. Interpretative strategies for lung function tests. Eur. Respir. J. 2005, 26, 948–968. [Google Scholar] [CrossRef]
  19. Chistiakov, D.A. Diabetic retinopathy: Pathogenic mechanisms and current treatments. Diabetes Metab. Syndr. 2011, 5, 165–172. [Google Scholar] [CrossRef] [PubMed]
  20. Nichols, G.A.; Vupputuri, S.; Lau, H. Medical care costs associated with progression of diabetic nephropathy. Diabetes Care 2011, 34, 2374–2378. [Google Scholar] [CrossRef]
  21. Chen, L.K.; Lin, M.-H.; Chen, Z.-J.; Hwang, S.-J.; Chiou, S.-T. Association of insulin resistance and hematologic parameters: Study of a middle-aged and elderly Taiwanese population in Taiwan. J. Chin. Med. Assoc. 2006, 69, 248–253. [Google Scholar] [CrossRef]
  22. Manns, B.; Hemmelgarn, B.; Tonelli, M.; Au, F.; Chiasson, T.C.; Dong, J.; Klarenbach, S. Population based screening for chronic kidney disease: Cost effectiveness study. Bmj 2010, 341, c5869. [Google Scholar] [CrossRef] [PubMed]
  23. Hex, N.; Bartlett, C.; Wright, D.; Taylor, M.; Varley, D. Estimating the current and future costs of Type 1 and Type 2 diabetes in the UK, including direct health costs and indirect societal and productivity costs. Diabet. Med. 2012, 29, 855–862. [Google Scholar] [CrossRef]
  24. Shapley, L.S. A value for n-person games. In Contributions to the Theory; Princeton University Press: Princeton, NJ, USA, 1953. [Google Scholar]
  25. Lundberg, S.M.; Lee, S.-I. A Unified Approach to Interpreting Model Predictions. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  26. Choi, H.-W.; Abdirayimov, S. Demonstrating the Power of SHAP Values in AI-Driven Classification of Marvel Characters. J. Multimed. Inf. Syst. 2024, 11, 167–172. [Google Scholar] [CrossRef]
  27. Armstrong, J.S. Long-Range Forecasting: From Crystal Ball to Computer, 2nd ed.; Wiley: New York, NY, USA, 1985. [Google Scholar]
  28. Makridakis, S. Accuracy measures: Theoretical and practical concerns. Int. J. Forecast. 1993, 9, 527–529. [Google Scholar] [CrossRef]
  29. Hamner, B.F.M.; Maleki, A. R package, version 0.1.4; Metrics: Machine Learning Evaluation Metrics for Regression and Classification. 2025. Available online: https://github.com/mfrasco/Metrics (accessed on 21 July 2025).
  30. Lang, M.B.M.; Richter, J.; Schratz, P.; Pfisterer, F.; Coors, S.; Binder, M.; Au, Q.; Casalicchio, G.; Kotthoff, L.; Bischl, B. mlr3: A modern object-oriented machine learning framework in R. J. Open Source Softw. 2019, 4, 1903. [Google Scholar] [CrossRef]
  31. Willmott, C.J.; Matsuura, K. Advantages of the mean absolute error (MAE) over the root mean square error (RMSE) in assessing average model performance. Clim. Res. 2005, 30, 79–82. [Google Scholar] [CrossRef]
  32. Roth, R.A. Effect of pneumotoxicants on lactate dehydrogenase activity in airways of rats. Toxicol. Appl. Pharmacol. 1981, 57, 69–78. [Google Scholar] [CrossRef]
  33. Hu, S.; Ye, J.; Guo, Q.; Zou, S.; Zhang, W.; Zhang, D.; Zhang, Y.; Wang, S.; Su, L.; Wei, Y. Serum lactate dehydrogenase is associated with impaired lung function: NHANES 2011–2012. PLoS ONE 2023, 18, e0281203. [Google Scholar] [CrossRef]
  34. Do, J.G.; Park, C.-H.; Lee, Y.-T.; Yoon, K.J. Association between underweight and pulmonary function in 282,135 healthy adults: A cross-sectional study in Korean population. Sci. Rep. 2019, 9, 14308. [Google Scholar] [CrossRef] [PubMed]
  35. Peralta, G.P.; Marcon, A.; Carsin, A.-E.; Abramson, M.J.; Accordini, S.; Amaral, A.F.; Antó, J.M.; Bowatte, G.; Burney, P.; Corsico, A.; et al. Body mass index and weight change are associated with adult lung function trajectories: The prospective ECRHS study. Thorax 2020, 75, 313–320. [Google Scholar] [CrossRef] [PubMed]
  36. Tabak, C.; Spijkerman, A.M.W.; Verschuren, W.M.M.; Smit, H.A. Does educational level influence lung function decline (Doetinchem Cohort Study)? Eur. Respir. J. 2009, 34, 940–947. [Google Scholar] [CrossRef]
  37. Polak, M.; Szafraniec, K.; Kozela, M.; Wolfshaut-Wolak, R.; Bobak, M.; Pająk, A. Socioeconomic status and pulmonary function, transition from childhood to adulthood: Cross-sectional results from the polish part of the HAPIEE study. BMJ Open 2019, 9, e022638. [Google Scholar] [CrossRef]
  38. Janoff, A. Biochemical links between cigarette smoking and pulmonary emphysema. J. Appl. Physiol. Respir. Environ. Exerc. Physiol. 1983, 55, 285–293. [Google Scholar] [CrossRef]
  39. Babior, B.M. The respiratory burst of phagocytes. J. Clin. Invest. 1984, 73, 599–601. [Google Scholar] [CrossRef] [PubMed]
  40. Yang, H.F.; Kao, T.W.; Wang, C.C.; Peng, T.C.; Chang, Y.W.; Chen, W.L. Serum white blood cell count and pulmonary function test are negatively associated. Acta Clin. Belg. 2015, 70, 419–424. [Google Scholar] [CrossRef] [PubMed]
  41. Lin, J.P.; Vitek, L.; Schwertner, H.A. Serum bilirubin and genes controlling bilirubin concentrations as biomarkers for cardiovascular disease. Clin. Chem. 2010, 56, 1535–1543. [Google Scholar] [CrossRef] [PubMed]
  42. Schwertner, H.A.; Vítek, L. Gilbert syndrome, UGT1A1*28 allele, and cardiovascular disease risk: Possible protective effects and therapeutic applications of bilirubin. Atherosclerosis 2008, 198, 1–11. [Google Scholar] [CrossRef]
  43. Leem, A.Y.; Kim, H.Y.; Kim, Y.S.; Park, M.S.; Chang, J.; Jung, J.Y. Association of serum bilirubin level with lung function decline: A Korean community-based cohort study. Respir. Res. 2018, 19, 99. [Google Scholar] [CrossRef]
  44. Holmen, T.L.; Barrett-Connor, E.; Clausen, J.; Holmen, J.; Bjermer, L. Physical exercise, sports, and lung function in smoking versus nonsmoking adolescents. Eur. Respir. J. 2002, 19, 8–15. [Google Scholar] [CrossRef]
  45. Doherty, M.; Dimitriou, L. Comparison of lung volume in Greek swimmers, land based athletes, and sedentary controls using allometric scaling. Br. J. Sports Med. 1997, 31, 337–341. [Google Scholar] [CrossRef]
  46. MacAuley, D.; McCrum, E.; Evans, A.; Stott, G.; Boreham, C.; Trinick, T. Physical activity, physical fitness and respiratory function--exercise and respiratory function. Ir. J. Med. Sci. 1999, 168, 119–123. [Google Scholar] [CrossRef] [PubMed]
