Solar chimneys represent an effective passive ventilation technology capable of improving indoor thermal comfort while reducing building energy consumption. In this study, the thermal and fluid dynamic performance of a solar chimney integrated into a residential building located in Bordj Bou Arréridj (Eastern Algeria) was investigated through a comprehensive numerical, predictive, and optimization framework. A transient mathematical model was developed to evaluate the influence of key geometric parameters, including chimney width and inlet opening width, as well as environmental factors such as solar radiation intensity and wind speed, on the system performance. The generated simulation database was subsequently employed to develop and compare four machine learning models, namely, Artificial Neural Networks with Bayesian Regularization (ANN-BR), Deep Neural Networks optimized by Improved Grey Wolf Optimization (DNN-IGWO), k-Nearest Neighbors (KNN), and Extreme Gradient Boosting (XGBoost), for predicting eight output parameters including glazing temperature, fluid temperature, absorber temperature, outlet temperature, thermal efficiency, air change rate (ACH), mass flow rate, and outlet velocity. The results demonstrated that increasing chimney and inlet widths significantly enhances ventilation performance by increasing airflow rate and ACH. Weather conditions and wind speed were also found to strongly affect thermal efficiency and buoyancy-driven airflow. Among the predictive models, XGBoost and DNN-IGWO exhibited the highest predictive accuracy, achieving coefficients of determination (R²) close to unity and very low prediction errors for all output variables, confirming their robustness and generalization capability. The proposed methodology provides a reliable tool for rapid performance prediction and design optimization of solar chimney systems under different climatic and operating conditions, thereby supporting the development of energy-efficient passive ventilation strategies for residential buildings. Full article
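Two of the eight outputs listed above, mass flow rate and air change rate (ACH), are linked by a standard ventilation relation: ACH is the hourly volumetric airflow divided by the room volume. The sketch below illustrates that relation only. The function name, the air density of 1.2 kg/m³, and the example room are assumptions for illustration and are not taken from the study.

```python
def air_changes_per_hour(mass_flow_kg_s, room_volume_m3, air_density_kg_m3=1.2):
    """Convert a chimney mass flow rate into air changes per hour (ACH).

    The default density of 1.2 kg/m^3 is an assumed value for room-temperature air.
    """
    volumetric_flow_m3_s = mass_flow_kg_s / air_density_kg_m3
    # 3600 s per hour turns m^3/s into m^3/h before dividing by the room volume.
    return 3600.0 * volumetric_flow_m3_s / room_volume_m3


# Hypothetical example: 0.06 kg/s of airflow through a 45 m^3 room.
ach = air_changes_per_hour(0.06, 45.0)
```

Under these assumed numbers the room air is replaced about four times per hour. Since ACH scales linearly with airflow, this is consistent with the abstract's finding that wider chimneys and inlets raise both.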

12 pages, 373 KB

Open Access Article

Extreme Event Modelling and Forecasting: Empirical Evidence from Predicting GDP and Unemployment in the USA

by R. Shankar, Azzam Alroomi, V. Bougioukos and K. Nikolopoulos

Forecasting 2026, 8(3), 46; https://doi.org/10.3390/forecast8030046 (registering DOI) - 9 Jun 2026

Abstract

This paper contributes to the stream of literature on extreme event modelling and forecasting by comparing various forecasting methods for predicting extreme movements in GDP and unemployment in the United States. The data were obtained from multiple open sources for the USA, including CNBC, the U.S. National Library of Medicine, the National Institutes of Health, the Centers for Disease Control and Prevention, the Bureau of Transportation Statistics site, Investing.com, the U.S. Bureau of Labor Statistics, Yahoo Finance, The Balance and Wikipedia. The research focuses on identifying whether machine learning or time-series forecasting algorithms are better suited to predicting extreme values of GDP and unemployment, accounting for natural disasters and industrial and economic factors. The statistical and analytical insights derived from this study, if used judiciously, can inform policymaking and planning. Full article

23 pages, 14756 KB

Open Access Article

CT-Based Liver Segmentation for Liver Surgery: A Hybrid Approach Based on 3D U-Net–ELM Model

by Zeki Ogut, Eser Sert, Ertugrul Kaya and Muhammed Yildirim

Biomedicines 2026, 14(6), 1298; https://doi.org/10.3390/biomedicines14061298 - 7 Jun 2026

Viewed by 134

Abstract

Background: Accurate liver segmentation from abdominal computed tomography (CT) images is an important task for surgical planning, volumetric analysis, and tumor assessment. Although recent deep learning-based three-dimensional segmentation approaches provide high segmentation performance, these models generally require high computational resources and long training times. Methods: In this study, a hybrid liver segmentation framework combining the 3D U-Net architecture with the extreme learning machine (ELM) method was proposed. In the proposed approach, deep volumetric feature maps extracted from the bottleneck layer of the trained 3D U-Net were used as input to an ELM-based classifier for final segmentation refinement. All experiments were performed on the Task03_Liver_rs dataset, which is a rescaled version of the Medical Segmentation Decathlon liver dataset. To provide a more reliable evaluation, fivefold cross-validation experiments were conducted using the same preprocessing pipeline, training protocol, and hyperparameter settings for all comparison models. In addition to overlap-based metrics, boundary-based and clinically relevant metrics including HD95, ASD, surface Dice, and volumetric error were also evaluated. Results: Experimental results demonstrated that the proposed 3D U-Net–ELM framework achieved competitive and stable segmentation performance compared with nnU-Net, standard 3D U-Net, SwinUNet, and SwinUNet–ELM models. The proposed model achieved a mean Dice score of 0.9399 ± 0.0210 and an IoU score of 0.8874 ± 0.0358 under fivefold cross-validation. Furthermore, the proposed approach produced lower HD95 and ASD values together with higher surface Dice scores, indicating improved boundary consistency and volumetric segmentation quality. In addition, the hybrid ELM-based structure provided advantages in computational efficiency and training cost. 
Conclusions: The obtained findings indicate that the proposed 3D U-Net–ELM framework provides a balanced and computationally efficient alternative for volumetric liver segmentation. Nevertheless, the absence of independent multicenter external validation remains an important limitation of the study. Future studies will focus on evaluating the proposed framework using larger and more diverse multicenter datasets to further investigate its clinical applicability and generalizability. Full article
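The overlap metrics reported above (Dice and IoU) can be computed from a predicted and a reference mask as in the following minimal sketch. It assumes flattened binary masks held in plain Python lists. The function name and the toy masks are illustrative and are not the authors' evaluation code.

```python
def dice_and_iou(pred, truth):
    """Return (Dice, IoU) for two equally sized binary masks given as flat 0/1 sequences."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    pred_size = sum(1 for p in pred if p)
    truth_size = sum(1 for t in truth if t)
    union = pred_size + truth_size - intersection
    if union == 0:
        return 1.0, 1.0  # both masks empty: treated here as perfect agreement
    dice = 2 * intersection / (pred_size + truth_size)
    iou = intersection / union
    return dice, iou


# Toy flattened masks (hypothetical values, not data from the study).
pred = [1, 1, 1, 0, 0, 1, 0, 0]
truth = [1, 1, 0, 0, 0, 1, 1, 0]
dice, iou = dice_and_iou(pred, truth)
```

For a single mask pair the two metrics are tied by Dice = 2·IoU / (1 + IoU). The reported means (Dice 0.9399, IoU 0.8874) are averages over folds, so they satisfy this identity only approximately.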

(This article belongs to the Special Issue Applications of Biomedical Engineering and Biomaterials in Human Diseases)

27 pages, 9604 KB

Open Access Article

An Evaluation of Machine Learning Methods for Leaf Area Index Retrieval

by Dong Wang, Lijuan Miao, Yutian Lu, Hanyang Jiang and Qiang Liu

Remote Sens. 2026, 18(12), 1884; https://doi.org/10.3390/rs18121884 - 7 Jun 2026

Viewed by 254

Abstract

The Leaf Area Index (LAI) serves as a vital biophysical parameter for quantifying vegetation dynamics and ecosystem functioning. While traditional LAI retrieval methods face challenges in handling nonlinear spectral-vegetation relationships, machine learning (ML) approaches offer promising alternatives through their data-driven adaptability. This study presents a comprehensive cross-site assessment of 13 ML algorithms for LAI estimation, leveraging ground observations from 98 sites worldwide. Our systematic assessment reveals three key findings. First, ensemble methods consistently outperformed other approaches, with Gradient Boosted Tree Regression (GBTR) achieving superior accuracy (R² = 0.647, RMSE = 0.899) and robustness (ΔR² < 0.05 beyond n = 69 training samples). Second, Gaussian Process Regression (GPR) showed exceptional stability across varying training sizes (R² = 0.607 ± 0.012), highlighting its reliability for data-limited scenarios. Third, all tested ML models substantially outperformed operational LAI products; the GBTR model (external validation R² = 0.647) exceeded the R² of the MODIS product by 0.489. This balance of accuracy, computational efficiency, and resistance to overfitting positions GBTR as a reasonable choice for large-scale LAI mapping. These findings underscore ML’s promising potential in vegetation monitoring while highlighting the need for hybrid approaches that combine physical principles with data-driven learning to address current limitations in extreme-value estimation and ecological generalizability. Full article
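The two accuracy figures quoted above, R² and RMSE, can be computed as in the sketch below. It assumes the common definition R² = 1 − SS_res / SS_tot. The abstract does not say which definition the authors used (a squared correlation coefficient is another possibility), and the sample values are invented for illustration.

```python
import math


def r2_and_rmse(observed, predicted):
    """Return (R^2, RMSE), with R^2 defined as 1 - SS_res / SS_tot."""
    n = len(observed)
    mean_obs = sum(observed) / n
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    r2 = 1.0 - ss_res / ss_tot
    rmse = math.sqrt(ss_res / n)
    return r2, rmse


# Hypothetical LAI observations and model predictions.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 1.5, 3.5, 3.5]
r2, rmse = r2_and_rmse(observed, predicted)
```

RMSE is expressed in the units of LAI itself, whereas R² is relative to the variance of the observations, so the two figures are best read together.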

25 pages, 1041 KB

Open Access Article

Adaptive Meta-Weighting Learning Model for Financial Distress Prediction in Listed Corporations

by Zhanbo Chen, Haoyang Huang and Jun Zhang

Mathematics 2026, 14(11), 2013; https://doi.org/10.3390/math14112013 - 5 Jun 2026

Viewed by 176

Abstract

Corporate debt crises are a critical source of financial distress and instability, making their early prediction essential for market regulators and investors. However, corporate debt crisis prediction is severely hindered by extreme class imbalance, as actual crisis samples are far fewer than normal ones. This imbalance greatly undermines the robustness and generalization ability of conventional forecasting models. To address it, we propose an adaptive meta-weighting learning model (AMetaW) for corporate debt crisis prediction. Specifically, the model incorporates an adaptive meta-weighting mechanism to alleviate class imbalance, ensuring that rare crisis samples receive sufficient attention during training. Moreover, AMetaW integrates multiple financial characteristics into a unified framework, while employing explainable machine learning techniques to reveal the heterogeneous importance of indicators across regions. Empirical analysis using firm-level data across multiple provinces in China demonstrates that: (1) AMetaW achieves superior predictive performance compared with state-of-the-art baselines under imbalanced conditions; (2) short-term benchmark interest rate, equity concentration degree, and operating profit margin are consistently the strongest predictors of debt crises; and (3) the relative importance of indicators varies across regions, with eastern firms more sensitive to equity concentration degree and cash ratio, while western firms are more exposed to risks from short-term benchmark interest rate and operating profit margin. These findings provide both a methodological contribution to corporate debt crisis forecasting and practical guidance on region-specific debt crisis prevention for group enterprises and regulators. Full article
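To make the class-imbalance problem concrete, the sketch below computes static inverse-frequency sample weights. This is a deliberately simpler stand-in and not the adaptive meta-weighting mechanism of AMetaW, whose details the abstract does not give. The function name and the toy labels are assumptions.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Map each class to n / (k * n_c), so every class carries equal total weight."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * cnt) for cls, cnt in counts.items()}


# Hypothetical sample: 8 normal firms (0) and 2 crisis firms (1).
labels = [0] * 8 + [1] * 2
weights = inverse_frequency_weights(labels)
sample_weights = [weights[y] for y in labels]
```

With these weights each crisis sample counts four times as much as a normal one, so both classes contribute equally to a weighted loss. An adaptive scheme would instead adjust such weights during training.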

(This article belongs to the Special Issue Statistical Analysis and AI Models in the Big Data Era)

21 pages, 2966 KB

Open Access Article

Pipeline Leakage Detection Using Machine Learning Techniques in Multiphase Flow Systems

by Hassan Naanouh and Manus Henry

Digital 2026, 6(2), 45; https://doi.org/10.3390/digital6020045 - 5 Jun 2026

Viewed by 153

Abstract

Pipelines remain the primary mode of oil and gas transportation but are vulnerable to leaks that pose environmental and safety risks, particularly in two-phase flow systems. Conventional detection methods often struggle under transient multiphase conditions, while many data-driven studies rely on static evaluation metrics that do not reflect continuous monitoring requirements. This study develops a machine learning framework for leak detection using OLGA-simulated datasets from a previously published study, comprising approximately 180,000 labelled samples across nine leak scenarios and one no-leak case. Pressure, temperature, and mass-flow variables were enhanced through feature engineering to capture nonlinear leak behaviour. Random forest and extreme gradient boosting (XGBoost) classifiers were trained using an 80/20 stratified split with synthetic minority oversampling technique (SMOTE)-based balancing applied only to training data. XGBoost achieved 99.2% accuracy and reduced false positives by 53% relative to random forest while maintaining near-zero false negatives. A sliding-window suspicion framework extended static classification into time-dependent detection, producing delays of between 9.81 s and 82.04 s with zero false alarms in the no-leak scenario. Physical validation using pressure, flow, and fast Fourier transform (FFT) analysis confirmed that detections correspond to genuine hydraulic disturbances, demonstrating the reliability and physical credibility of the proposed framework. Full article
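The step from per-sample classification to time-dependent detection can be pictured as a rolling vote over recent classifier outputs. The sketch below is one plausible reading of a sliding-window suspicion rule. The window length, hit threshold, sampling period, and example flag sequences are all hypothetical and are not taken from the paper.

```python
from collections import deque


def first_alarm_index(flags, window=5, min_hits=4):
    """Return the index at which an alarm is first raised, or None if it never is.

    An alarm fires once at least `min_hits` of the last `window` per-sample
    classifier outputs are leak flags (both parameters are assumed values).
    """
    recent = deque(maxlen=window)
    for i, flag in enumerate(flags):
        recent.append(1 if flag else 0)
        if len(recent) == window and sum(recent) >= min_hits:
            return i
    return None


# Hypothetical per-sample classifier outputs (1 = leak predicted).
no_leak = [0, 1, 0, 0, 0, 1, 0, 0, 0, 0]  # isolated false positives only
leak = [0, 0, 1, 0, 1, 1, 1, 1, 1, 1]     # leak assumed to start at index 4

leak_start = 4
sample_period_s = 0.5  # assumed sampling period
alarm = first_alarm_index(leak)
delay_s = (alarm - leak_start) * sample_period_s
```

Isolated false positives never fill the window, so the no-leak sequence raises no alarm. The price is a detection delay of a few samples after the leak begins, which is the trade-off the abstract reports as delays with zero false alarms.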

(This article belongs to the Special Issue Applications of Artificial Intelligence and Data Management in Data Analysis)

12 pages, 996 KB

Open Access Article

Lack of Evidence for Well-Separated Clinical Phenotypes in Surgically Treated Infective Endocarditis Using Routine Clinical Variables: A Machine Learning Approach

by Diego Sangiorgi, Elisa Mikus, Mariafrancesca Fiorentino, Antonino Costantino, Simone Calvi, Elena Tenti, Anna Milione and Carlo Savini

Mach. Learn. Knowl. Extr. 2026, 8(6), 154; https://doi.org/10.3390/make8060154 - 4 Jun 2026

Viewed by 161

Abstract

Background: Infective endocarditis (IE) is characterized by marked heterogeneity in microbiological etiology, clinical presentation, valvular involvement, and patient complexity, which complicates risk stratification. Unsupervised machine learning has been proposed to identify latent clinical phenotypes in complex diseases; however, whether IE exhibits a natural cluster structure remains unclear. Methods: In a cohort of 739 patients undergoing surgery for IE, unsupervised clustering was performed using K-medoids based on Gower distance to account for mixed-type variables, which is a common scenario in clinical settings. The optimal number of clusters was selected by maximizing the average silhouette width and the gap statistic. Density and semi-parametric algorithms (K-prototypes, KAMILA, hierarchical clustering, and HDBSCAN) were applied as a sensitivity analysis. Differences in postoperative outcomes across clusters were explored using logistic regression. Results: K-medoids clustering identified three patient groups; however, the average silhouette width was low (0.129), indicating very weak separation between clusters. Sensitivity analysis confirmed the absence of a natural cluster structure. Despite this, a descriptive comparison of forced clusters revealed a gradient of clinical severity, with one group characterized by older age, higher comorbidity burden, complex infection features, and worse postoperative outcomes. Conclusions: Unsupervised clustering did not identify natural clinical phenotypes in surgically treated IE, likely reflecting the extreme intrinsic heterogeneity of the disease. Although forced clustering highlighted clinically interpretable gradients of risk, these groups should not be considered true latent phenotypes. Alternative approaches, such as continuous risk modeling, may be more appropriate for patient stratification in IE. Full article
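Gower distance handles mixed-type variables by averaging a per-feature dissimilarity: a range-normalized absolute difference for numeric features and a simple mismatch indicator for categorical ones. The sketch below shows the unweighted form, with no missing-value handling. The patient records, the feature list, and the age range of 60 years are hypothetical.

```python
def gower_distance(a, b, kinds, ranges):
    """Unweighted Gower distance between two mixed-type records.

    kinds[i] is "num" or "cat"; ranges[i] is the observed range of a numeric
    feature across the cohort (ignored for categorical features).
    """
    total = 0.0
    for x, y, kind, rng in zip(a, b, kinds, ranges):
        if kind == "num":
            total += abs(x - y) / rng if rng else 0.0
        else:
            total += 0.0 if x == y else 1.0
    return total / len(kinds)


# Hypothetical patients: (age in years, prosthetic valve, causative pathogen).
kinds = ["num", "cat", "cat"]
ranges = [60.0, None, None]
p1 = (70, "yes", "staphylococcus")
p2 = (40, "yes", "streptococcus")
d = gower_distance(p1, p2, kinds, ranges)
```

A matrix of such pairwise distances is what a K-medoids algorithm would consume. The reported average silhouette width of 0.129 then indicates that patients are barely closer to their own cluster than to the neighboring one.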

(This article belongs to the Section Learning)

22 pages, 13923 KB

Open Access Article

Use of Machine Learning Techniques for Fertilization Traceability Discrimination via Core Quality Indicators of Korla Fragrant Pear Fruits

by Junkai Zeng, Haixia Wang, Mingyang Yu, Yan Chen and Jianping Bao

Foods 2026, 15(11), 2003; https://doi.org/10.3390/foods15112003 - 4 Jun 2026

Viewed by 192

Abstract

Rational fertilization directly affects the fruit quality of the Korla fragrant pear. However, the variation patterns of fruit appearance and texture indicators under different N-P₂O₅-K₂O ratios are complex, and redundancy among high-dimensional indicators restricts the practical application of quality discrimination and fertilization traceability. In this study, Korla fragrant pear fruits harvested under eight fertilization treatments (including the control) were selected as research materials. Significant differences existed in nutrient composition and application rate among treatments: no N-P₂O₅-K₂O was applied in the CK treatment; for treatments H1–H7, nitrogen (N) application rate ranged from 396.36 to 524.2 g·plant⁻¹, phosphorus (P₂O₅) from 326.08 to 652.17 g·plant⁻¹, and potassium (K₂O) from 450.67 to 1200.08 g·plant⁻¹, with the most prominent differences observed in P-K ratios and application rates. On this basis, 12 appearance and flesh texture indicators were determined, including single-fruit weight, longitudinal diameter, transverse diameter, fruit shape index, pericarp thickness, sclereid content, hardness, adhesiveness, cohesiveness, springiness, gumminess and chewiness. Three machine-learning algorithms, namely Random Forest (RF), Extreme Learning Machine (ELM) and K-Nearest Neighbor (KNN), were used to construct fruit quality discriminant models. The results showed that the RF model achieved the optimal discriminative performance, with accuracy values of 0.876 and 0.865 for the training and validation sets, respectively. Seven core quality indicators, including sclereid content and longitudinal diameter, were screened via feature-importance intersection analysis. 
The reconstructed RF model based on this indicator set exhibited nearly no loss in discriminative accuracy despite a ~42% reduction in indicator quantity, providing theoretical and technical support for quality grading, fertilization traceability and precision fertilization of Korla fragrant pear. Full article

(This article belongs to the Special Issue Advanced Analytical Methods for Food Safety and Composition Analysis)

18 pages, 4410 KB

Open Access Article

Stochastic Risk Assessment of Cotton Pest Outbreaks in Tropical India: Entropy, Gini Coefficients, and Machine Learning for Sustainable Agroecosystem Management

by Guhan Velusamy, Sheshakumar Goroshi, Dharma Raju Akasapu, Nagaratna Kopparthi, Mansour Almazroui and Mohamed Elhag

Sustainability 2026, 18(11), 5673; https://doi.org/10.3390/su18115673 - 3 Jun 2026

Viewed by 213

Abstract

This study developed an integrated stochastic framework to forecast cotton pest outbreaks across six tropical Indian agroecosystems. Methodologically, the approach fused entropy and Gini inequality indices, Random Forest machine learning, SHAP-based feature interpretation, fuzzy logic risk assessment, and climate scenario simulations (+2 °C, +20% rainfall) to quantify outbreak variability, driver importance, and system resilience. Findings revealed extreme pest stochasticity (mean = 15.7, variance > 4200), with low entropy (0.06) and a high Gini coefficient (0.82) confirming highly concentrated spatial and temporal outbreaks. While Random Forest demonstrated limited predictive skill (RMSE = 68.9, R² = 0.07), SHAP analysis transparently identified evaporation, wind speed, and humidity as dominant drivers. Fuzzy logic yielded an average risk score of 1.0, reflecting frequent exceedance of biological thresholds. Scenario simulations demonstrated pronounced climate sensitivity: a +2 °C temperature increase raised mean incidence to 18.7, and +20% rainfall increased it to 18.6, resulting in a resilience index of 1.51 that indicates disproportionate vulnerability. In conclusion, combining stochastic variability metrics, explainable machine learning, and threshold-based risk modeling significantly advances tropical pest forecasting under uncertainty. Importantly, this framework contributes to sustainability by enabling climate-resilient cotton production, reducing reliance on chemical pesticides, and supporting adaptive advisory systems that strengthen long-term agroecosystem resilience. These results emphasize the critical need for adaptive, location-specific management strategies to mitigate climate-driven pest intensification and enhance resilience in cotton production systems. Full article
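The pairing of low entropy with a high Gini coefficient is the signature of outbreaks concentrated in a few places or periods. The sketch below illustrates that pairing with textbook definitions: Shannon entropy normalized by its maximum, and the Gini coefficient in its sorted-rank form. The abstract does not state which normalization the authors used, and the pest counts are invented.

```python
import math


def normalized_entropy(counts):
    """Shannon entropy of the count distribution divided by its maximum, log(n)."""
    if len(counts) < 2:
        return 0.0
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    h = -sum(p * math.log(p) for p in probs)
    return h / math.log(len(counts))


def gini(values):
    """Gini coefficient of non-negative values (0 = perfectly even)."""
    xs = sorted(values)
    n, total = len(xs), sum(xs)
    if total == 0:
        return 0.0
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


# Hypothetical pest counts at four sites.
even = [25, 25, 25, 25]         # outbreaks spread evenly
concentrated = [0, 0, 0, 100]   # all incidence at one site
```

The even case gives entropy 1 and Gini 0, while the concentrated case gives entropy 0 and Gini 0.75, the largest value possible for four sites under this formula. The study's reported values (entropy 0.06, Gini 0.82) sit near the concentrated end.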

16 pages, 971 KB

Open Access Article

HS-SPME-GC-MS Coupled with Chemometrics for Detecting HFCS and Invert Sugar Adulteration in Coriander Honey

by Amir Pourmoradian, Mohsen Barzegar, Luis Noguera-Artiaga and Ángel A. Carbonell-Barrachina

Foods 2026, 15(11), 1988; https://doi.org/10.3390/foods15111988 - 3 Jun 2026

Viewed by 189

Abstract

This study presents a novel analytical approach combining headspace solid-phase microextraction (HS-SPME) with gas chromatography–mass spectrometry (GC–MS) and advanced chemometric techniques to detect adulteration in coriander honey. A total of 34 volatile compounds were identified and quantified, revealing a progressive decrease in both profile complexity and compound abundance with increasing levels of invert sugar and high-fructose corn syrup (HFCS) adulteration. Chromatographic and chemometric analyses effectively distinguished authentic from adulterated samples, with the Extreme Gradient Boosting (XGBoost) model achieving a high classification performance of 95.83% accuracy. The study highlights the critical impact of adulteration on honey’s chemical composition and confirms the efficacy of integrating modern analytical and machine learning tools for rapid, sensitive, and reliable honey authenticity assessment. This methodology offers a valuable framework for food quality control and fraud prevention, addressing current challenges in the honey market and protecting consumer interests. Full article

(This article belongs to the Special Issue Advances in Food Analytical Chemistry, Bioactive Compounds, Microbiology, and Probiotics: Bridging Quality, Safety, and Health)

15 pages, 3013 KB

Open Access Article

Forecasting of Macroclimatic Phases Through Stochastic Modeling and Machine Learning: Implications for Regional Hydrological Analysis

by Fernando Oñate-Valdivieso, Paúl Piedra Faicán and Arianna Oñate-Paladines

Water 2026, 18(11), 1358; https://doi.org/10.3390/w18111358 - 3 Jun 2026

Viewed by 221

Abstract

Droughts are complex extreme phenomena that severely impact regional development and water availability. Although the influence of interannual and decadal macroclimatic patterns, such as the El Niño–Southern Oscillation (ENSO) and the Pacific Decadal Oscillation (PDO), on precipitation alteration is widely recognized, current water management systems lack multivariate predictive approaches to anticipate their phases with sufficient operational lead time. This study developed a predictive framework to project ENSO and PDO phases, establishing an optimal temporal window to forecast drought-triggering conditions. Using monthly historical records, teleconnections were evaluated through cross-correlation and Granger causality. Subsequently, Vector Autoregression (VAR) models and machine learning algorithms (Random Forest) were implemented to project anomalies and classify climatic phases. The Granger causality test demonstrated that ENSO variations statistically precede PDO phase shifts, establishing an optimal forecasting window of three to four months. The VAR model exhibited robust joint explanatory capacity for a continuous four-month projection, while the Random Forest algorithm achieved a predictive accuracy of 52.2% specifically for categorical phase classification at a three-month lead time. It is concluded that this lagged interaction allows for reliable mathematical anticipation, providing an essential analytical framework for exploring regional hydrological dynamics and supporting local preventive water management. Full article
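The cross-correlation step described above can be sketched as a search for the lag at which one index best predicts another. The code below covers only lagged Pearson correlation. It is not a Granger causality test, which additionally asks whether past values of the driver improve a forecast beyond the response's own history. The two series are synthetic, with a three-step lag built in.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def best_lead_lag(driver, response, max_lag):
    """Return (lag, correlation) where driver[t] best matches response[t + lag]."""
    scores = {
        lag: pearson(driver[: len(driver) - lag], response[lag:])
        for lag in range(max_lag + 1)
    }
    best = max(scores, key=lambda lag: abs(scores[lag]))
    return best, scores[best]


# Synthetic monthly anomalies: the response repeats the driver three steps later.
driver = [2, -1, 4, 0, -3, 5, 1, -2, 3, -4, 6, 0]
response = [0, 0, 0] + driver[:-3]
lag, corr = best_lead_lag(driver, response, max_lag=5)
```

On this synthetic pair the search recovers the built-in lag of three steps. With real ENSO and PDO records the peak would be far less sharp, which is one reason to confirm a lead-lag relationship with a formal test.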

(This article belongs to the Special Issue Advanced Hydrological Modeling for Extreme Events: Floods, Droughts, and Risk Assessment)

28 pages, 4088 KB

Open Access Article

Research on the Flat Field Measurement Method of Coronagraph

by Yulong Feng, Xuefei Zhang, Hongfei Liang, Yu Liu, Mingzhe Sun, Tengfei Song and Mingyu Zhao

Universe 2026, 12(6), 165; https://doi.org/10.3390/universe12060165 - 3 Jun 2026

Viewed by 156

Abstract

The solar corona has an extremely low density, and its brightness is only about one millionth of that of the photosphere. High-dynamic-range imaging of its faint structure is therefore essential for studying coronal heating, coronal mass ejections, and space weather. Quantitative coronagraph imaging requires flat-field measurement and calibration, which underpin intensity calibration, small-scale feature detection, and long-term cyclic analysis. This paper analyzes the coronagraph imaging chain (baffle–optical system–detector) and the origins of flat-field errors, including optical aberrations, stray light, and pixel-response non-uniformity, and summarizes the resulting calibration requirements of next-generation coronagraphs. On this basis, ground-based and space-based flat-fielding methods are systematically reviewed: the ground-based methods include integrating-sphere uniform light sources, opal glass/diffuser plates, clear-sky and thin-cloud backgrounds, and solar disk scanning, while the space-based methods include internal light sources and diffuser plates, attitude-roll and off-corona offset observations, and multi-phase statistical self-consistent flat-fielding. Their accuracy, resource cost, and applicability are compared. The review shows that no single method is simultaneously high-precision, easy to update, and engineer-friendly; a hierarchical, multi-method calibration framework is therefore recommended. Finally, a new method is proposed in which lithographically generated structured light fields, combined with Fourier optics and machine learning inversion, are used to estimate the pixel-response function. Preliminary experiments show that this method achieves a lower residual error than the integrating-sphere and opal glass methods, providing a high-precision reference for future wide-band, high-resolution coronagraph calibration. Full article

(This article belongs to the Section Solar and Stellar Physics)

28 pages, 7559 KB

Open Access Article

GA-GBDT: A Spatio-Temporal Graph-Augmented Gradient Boosting Framework for GNSS Network–Based Landslide Event Warning in Mining Areas

by Jinhua Wu, Liang Fei, Wei Dong, Chengdu Cao, Bo Zhang, Xiangyang Han, Ting On Chan, Yuli Wang and Joseph Awange

Appl. Sci. 2026, 16(11), 5569; https://doi.org/10.3390/app16115569 - 2 Jun 2026

Viewed by 244

Abstract

Landslide event warning in mining areas is essential for geohazard risk mitigation and infrastructure safety. With the increasing use of Global Navigation Satellite System (GNSS) monitoring networks, warning decisions are often derived from abnormal deformation responses in continuous displacement records. However, deriving stable and transferable warning decisions from GNSS networks is challenged by spatially coupled station responses, time-varying displacement patterns, and incomplete or disturbed observations. To address these issues, this study proposes a graph-augmented gradient boosting decision tree framework, termed GA-GBDT (Graph-Augmented Gradient Boosting Decision Trees), for multi-station landslide event warning in mining areas. The framework first constructs a weighted station graph to encode spatial dependence across stations. Based on this graph, a Gated Recurrent Unit (GRU) and a Graph Convolutional Network (GCN) are integrated to learn spatio-temporal embeddings, which are then fused with station-wise features and fed into XGBoost (eXtreme Gradient Boosting) for warning decision-making. Experiments on a 90-station GNSS network show that GA-GBDT outperforms representative rule-based, machine-learning, and deep-learning baselines, achieving more robust warning performance with improved generalization and false-alarm control. These results indicate that GA-GBDT improves warning robustness, decision stability, and cross-zone generalization for GNSS-based landslide warning in mining areas, with potential transferability to other slope warning scenarios. Full article

(This article belongs to the Section Earth Sciences)

19 pages, 2734 KB

Open Access Article

Predicting Shield Machine Penetration Rate Using the CTCM-DELM Algorithm

by Da Yuan, Dong Huang, Yu Lei, Minhao Wang, Ji Lu, Xude Li, Xuedong Luo and Yong Liu

Appl. Sci. 2026, 16(11), 5549; https://doi.org/10.3390/app16115549 - 2 Jun 2026

Viewed by 87

Abstract

The penetration rate (PR) is a critical indicator affecting the safety and cost of shield tunnel construction. However, due to the complexity of geological conditions and the nonlinear nature of tunneling parameters, traditional methods struggle to predict it accurately. To address this issue, six hybrid deep extreme learning machine (DELM) models were developed for PR prediction. Normalized mutual information (NMI) was employed to select key features, and an isolation forest (IForest) algorithm was used to remove outliers and construct a valid dataset. Subsequently, deep extreme learning machines optimized using six metaheuristic algorithms were applied to predict the penetration rate. Finally, the key factors influencing PR prediction were identified based on SHAP analysis. Across the six optimized models and the BP neural network, uniaxial compressive strength (UCS), rock quality designation (RQD), and cutterhead torque were identified as the key factors influencing PR. The CTCM-DELM model is applied to shield tunneling PR prediction for the first time. Combined with SHAP analysis, it quantitatively reveals that geological parameters contribute more than equipment parameters, which provides novel insight for engineering practice. Full article

25 pages, 1201 KB

Open Access Article

Gradient Boosting Framework with Weight of Evidence Encoding for Vehicle Credit Default Prediction Under Extreme Class Imbalance

by Zehra Keskin and Vildan Özkır

Mathematics 2026, 14(11), 1935; https://doi.org/10.3390/math14111935 - 2 Jun 2026

Viewed by 202

Abstract

Accurate prediction of loan defaults is essential for financial institutions seeking to minimize credit losses and maintain portfolio stability. In the vehicle financing segment of emerging markets, real-world datasets frequently exhibit extreme class imbalance ratios that far exceed those encountered in standard benchmark corpora, posing severe challenges for conventional machine learning pipelines. This study introduces a gradient boosting framework integrating Weight of Evidence (WoE) transformation, Bayesian hyperparameter optimization, and three complementary classifiers, namely Extreme Gradient Boosting (XGBoost), Light Gradient Boosting Machine (LightGBM), and Categorical Boosting (CatBoost), to predict vehicle loan default risk. The methodology is evaluated on a large-scale, fully anonymized Turkish vehicle loan dataset (N = 207,572) with an extreme imbalance ratio of 1:1133 (183 defaults versus 207,389 non-defaults). A strict three-way data partition (60% training, 20% validation, 20% test) is adopted to ensure leakage-free model selection and unbiased performance estimation. A multi-stage experimental pipeline is developed encompassing: (i) statistical feature selection via Mann–Whitney U and chi-square tests with adaptive thresholding, (ii) a comparative analysis of seven resampling strategies including Synthetic Minority Oversampling Technique (SMOTE) variants, Adaptive Synthetic Sampling (ADASYN), and focal loss weighting, (iii) a greedy forward selection ensemble procedure for heterogeneous model fusion, and (iv) a systematic training-set size sensitivity analysis across eight majority undersampling ratios. Under the leakage-free evaluation protocol, the highest-AUC individual model (LightGBM with SMOTE-ENN) achieves an area under the receiver operating characteristic curve (AUC-ROC) of 0.710 (95% bootstrap CI: 0.614–0.798), while CatBoost with cost-sensitive weighting exhibits superior operational metrics (KS = 0.389, PR-AUC = 0.011). The greedy ensemble procedure exhibits high selection instability with only 37 validation-set positives, providing a methodological finding on the minimum sample requirements for reliable ensemble construction under extreme scarcity. Ablation results confirm that WoE encoding contributes 3.1 percentage points to the overall AUC gain. Tree SHAP-based interpretability analysis identifies the financing-to-age ratio, WoE-encoded occupation group, and log financing amount as the primary predictive drivers, with cross-model stability confirmed via Spearman rank correlation. A decision support analysis provides precision–recall curves, a Brier score of 0.0082, reliability diagrams, and threshold-dependent performance at operationally plausible review rates. Fairness evaluation across gender and marital status subgroups demonstrates that threshold-dependent metrics such as Disparate Impact Ratio and Equalized Odds Gap are inherently compromised under extreme minority scarcity, whereas rank-based subgroup AUC analysis with bootstrap 95% confidence intervals preserves meaningful discriminative assessment. These findings provide an empirically validated framework for credit default prediction in highly imbalanced and data-scarce financial environments. Full article
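Weight of Evidence replaces each category or bin with the log ratio of its share of non-defaults to its share of defaults. The sketch below follows that common convention, although some references flip the sign. It adds a small additive smoothing term so that bins with zero defaults, which are likely with only 183 positives, stay finite. The smoothing value and the bin counts are assumptions and not the authors' settings. The last lines recompute the imbalance ratio from the counts quoted in the abstract.

```python
import math


def woe_table(bin_counts, smoothing=0.5):
    """Map each bin label to its Weight of Evidence.

    bin_counts maps a bin label to (non_defaults, defaults). `smoothing` is an
    assumed additive constant that keeps empty cells from producing log(0).
    """
    k = len(bin_counts)
    good_total = sum(g for g, _ in bin_counts.values()) + smoothing * k
    bad_total = sum(b for _, b in bin_counts.values()) + smoothing * k
    table = {}
    for label, (goods, bads) in bin_counts.items():
        good_share = (goods + smoothing) / good_total
        bad_share = (bads + smoothing) / bad_total
        table[label] = math.log(good_share / bad_share)
    return table


# Hypothetical occupation bins: (non-defaults, defaults).
unsmoothed = woe_table({"A": (80, 10), "B": (20, 10)}, smoothing=0.0)
sparse = woe_table({"A": (80, 10), "B": (20, 0)})

# Counts quoted in the abstract: 183 defaults versus 207,389 non-defaults.
defaults, non_defaults = 183, 207_389
imbalance_ratio = non_defaults / defaults
```

A positive WoE marks a bin with proportionally fewer defaults than the portfolio as a whole, and a negative WoE marks the opposite. The recomputed ratio is about 1133 non-defaults per default, which matches the stated 1:1133.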

(This article belongs to the Special Issue Application of Machine Learning and Data Analysis in Personal Finance and Financial Services Industry)
