Search Results (373)

Search Parameters:
Keywords = random oversampling

34 pages, 4013 KB  
Article
Machine Learning-Based Cyber Fraud Detection: A Comparative Study of Resampling Methods for Imbalanced Credit Card Data
by Eyad Btoush, Thaeer Kobbaey, Hatem Tamimi and Xujuan Zhou
Appl. Sci. 2026, 16(2), 850; https://doi.org/10.3390/app16020850 - 14 Jan 2026
Abstract
The prevalence of online transactions and the extensive adoption of credit card payments have contributed to the escalation of credit card cyber fraud in modern society. These trends are propelled by technological advancements, which provide fraudulent actors with more opportunities. Fraudsters exploit victims’ financial vulnerabilities by obtaining illegal access to sensitive credit card information through deceptive means, such as phishing, fraudulent phone calls, and fraudulent SMS messages. This study predicts and detects potential instances of cyber fraud in credit card transactions by employing Machine Learning (ML) techniques, including Decision Tree (DT), Random Forest (RF), Logistic Regression (LR), Support Vector Machine (SVM), K-Nearest Neighbors (KNN), XGBoost, and CatBoost, together with sampling techniques such as Tomek Link, Synthetic Minority Oversampling Technique (SMOTE), Edited Nearest Neighbor (ENN), Tomek+ENN, and SMOTE+ENN. We conducted a comparative analysis of these ML techniques to determine their performance in terms of accuracy, precision, recall, F1 score, and ROC-AUC for credit card cyber fraud detection.
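The resampling comparison described here maps directly onto imbalanced-learn. A minimal sketch, assuming scikit-learn and imbalanced-learn with a synthetic stand-in for the credit card data (the authors' datasets and hyperparameters are not reproduced):

```python
# Hedged sketch: the samplers named in the abstract, each paired with a
# Random Forest inside an imbalanced-learn Pipeline so that resampling is
# applied only to the training folds during cross-validation.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import TomekLinks, EditedNearestNeighbours
from imblearn.combine import SMOTEENN

# Synthetic stand-in for imbalanced credit card transactions (~1% fraud).
X, y = make_classification(n_samples=5000, weights=[0.99], random_state=0)

configs = {
    "Tomek Link": [TomekLinks()],
    "SMOTE": [SMOTE(random_state=0)],
    "ENN": [EditedNearestNeighbours()],
    "Tomek+ENN": [TomekLinks(), EditedNearestNeighbours()],  # chained samplers
    "SMOTE+ENN": [SMOTEENN(random_state=0)],
}
for name, samplers in configs.items():
    steps = [(f"s{i}", s) for i, s in enumerate(samplers)]
    pipe = Pipeline(steps + [("rf", RandomForestClassifier(random_state=0))])
    auc = cross_val_score(pipe, X, y, scoring="roc_auc", cv=5).mean()
    print(f"{name}: ROC-AUC = {auc:.3f}")
```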

22 pages, 2526 KB  
Article
Evaluating Machine Learning Models for Classifying Diabetes Using Demographic, Clinical, Lifestyle, Anthropometric, and Environmental Exposure Factors
by Rifa Tasnia and Emmanuel Obeng-Gyasi
Toxics 2026, 14(1), 76; https://doi.org/10.3390/toxics14010076 - 14 Jan 2026
Abstract
Diabetes develops through a mix of clinical, metabolic, lifestyle, demographic, and environmental factors. Most current classification models focus on traditional biomedical indicators and do not include environmental exposure biomarkers. In this study, we develop and evaluate a supervised machine learning classification framework that integrates heterogeneous demographic, anthropometric, clinical, behavioral, and environmental exposure features to classify physician-diagnosed diabetes using data from the National Health and Nutrition Examination Survey (NHANES). We analyzed NHANES 2017–2018 data for adults aged ≥18 years, addressed missingness using Multiple Imputation by Chained Equations, and corrected class imbalance via the Synthetic Minority Oversampling Technique. Model performance was evaluated using stratified ten-fold cross-validation across eight supervised classifiers: logistic regression, random forest, XGBoost, support vector machine, multilayer perceptron neural network (artificial neural network), k-nearest neighbors, naïve Bayes, and classification tree. Random Forest and XGBoost performed best on the balanced dataset, with ROC AUC values of 0.891 and 0.885, respectively, after imputation and oversampling. Feature importance analysis indicated that age, household income, and waist circumference contributed most strongly to diabetes classification. To assess out-of-sample generalization, we conducted an independent 80/20 hold-out evaluation. XGBoost achieved the highest overall accuracy and F1-score, whereas random forest attained the greatest sensitivity, demonstrating stable performance beyond cross-validation. These results indicate that incorporating environmental exposure biomarkers alongside clinical and metabolic features yields improved classification performance for physician-diagnosed diabetes. The findings support the inclusion of chemical exposure variables in population-level diabetes classification and underscore the value of integrating heterogeneous feature sets in machine learning-based risk stratification.
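The imputation-plus-oversampling workflow is reproducible with standard tooling. A minimal sketch, assuming scikit-learn's IterativeImputer as the MICE-style step and synthetic data in place of NHANES (variable names and tuned settings are assumptions):

```python
# Hedged sketch: chained-equations imputation, SMOTE, and stratified
# ten-fold cross-validation in one leakage-safe pipeline.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

# Synthetic stand-in for NHANES rows, with ~5% of cells set missing.
X, y = make_classification(n_samples=2000, n_features=20, weights=[0.9], random_state=1)
rng = np.random.default_rng(1)
X[rng.random(X.shape) < 0.05] = np.nan

pipe = Pipeline([
    ("mice", IterativeImputer(random_state=1)),  # MICE-style imputation
    ("smote", SMOTE(random_state=1)),            # balances training folds only
    ("clf", LogisticRegression(max_iter=1000)),
])
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
print("ROC AUC:", cross_val_score(pipe, X, y, scoring="roc_auc", cv=cv).mean())
```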

41 pages, 80556 KB  
Article
Why ROC-AUC Is Misleading for Highly Imbalanced Data: In-Depth Evaluation of MCC, F2-Score, H-Measure, and AUC-Based Metrics Across Diverse Classifiers
by Mehdi Imani, Majid Joudaki, Ayoub Bagheri and Hamid R. Arabnia
Technologies 2026, 14(1), 54; https://doi.org/10.3390/technologies14010054 - 10 Jan 2026
Abstract
This study re-evaluates ROC-AUC for binary classification under severe class imbalance (<3% positives). Despite its widespread use, ROC-AUC can mask operationally salient differences among classifiers when the costs of false positives and false negatives are asymmetric. Using three benchmarks, credit-card fraud detection (0.17%), yeast protein localization (1.35%), and ozone level detection (2.9%), we compare ROC-AUC with the Matthews Correlation Coefficient (MCC), F2-score, H-measure, and PR-AUC. Our empirical analyses span 20 classifier–sampler configurations per dataset: four classifiers (Logistic Regression, Random Forest, XGBoost, and CatBoost) crossed with four oversampling methods plus a no-resampling baseline (SMOTE, Borderline-SMOTE, SVM-SMOTE, ADASYN, and no resampling). ROC-AUC exhibits pronounced ceiling effects, yielding high scores even for underperforming models. In contrast, MCC and F2 align more closely with deployment-relevant costs and achieve the highest Kendall’s τ rank concordance across datasets; PR-AUC provides threshold-independent ranking, and the H-measure integrates cost sensitivity. We quantify uncertainty and differences using stratified bootstrap confidence intervals, DeLong’s test for ROC-AUC, and Friedman–Nemenyi critical-difference diagrams, which collectively underscore the limited discriminative value of ROC-AUC in rare-event settings. The findings recommend a shift to a multi-metric evaluation framework: ROC-AUC should not be used as the primary metric in ultra-imbalanced settings; instead, MCC and F2 are recommended as primary indicators, supplemented by PR-AUC and H-measure where ranking granularity and principled cost integration are required. This evidence encourages researchers and practitioners to move beyond sole reliance on ROC-AUC when evaluating classifiers in highly imbalanced data.
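The metric comparison is easy to replicate in miniature. A sketch, assuming scikit-learn and a synthetic rare-event dataset (~0.5% positives) rather than the paper's three benchmarks:

```python
# Hedged illustration of why ROC-AUC can flatter a classifier on rare-event
# data while MCC, F2, and PR-AUC remain discriminating. H-measure has no
# scikit-learn implementation and is omitted here.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (average_precision_score, fbeta_score,
                             matthews_corrcoef, roc_auc_score)
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=20000, weights=[0.995], random_state=2)
Xtr, Xte, ytr, yte = train_test_split(X, y, stratify=y, random_state=2)
clf = LogisticRegression(max_iter=1000).fit(Xtr, ytr)
proba = clf.predict_proba(Xte)[:, 1]
pred = clf.predict(Xte)

print("ROC-AUC:", roc_auc_score(yte, proba))            # often near-ceiling
print("PR-AUC :", average_precision_score(yte, proba))  # sensitive to rare positives
print("MCC    :", matthews_corrcoef(yte, pred))
print("F2     :", fbeta_score(yte, pred, beta=2))       # recall-weighted F-score
```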

12 pages, 1953 KB  
Article
Prognosis from Pixels: A Vendor-Protocol-Specific CT-Radiomics Model for Predicting Recurrence in Resected Lung Adenocarcinoma
by Abdalla Ibrahim, Eduardo J. Ortiz, Stella T. Tsui, Cameron N. Fick, Kay See Tan, Binsheng Zhao, Michelle Ginsberg, Lawrence H. Schwartz and David R. Jones
Cancers 2026, 18(2), 200; https://doi.org/10.3390/cancers18020200 - 8 Jan 2026
Abstract
Background: Radiomics can provide quantitative descriptors of tumor phenotype, but translation is often limited by feature instability across scanners and protocols. We aimed to develop and internally validate a protocol-specific CT-radiomics model using preoperative imaging to predict 5-year recurrence in patients with stage I lung adenocarcinoma after complete surgical resection. Methods: The retrospective study included 270 patients with completely resected stage I lung adenocarcinoma treated between January 2010 and December 2021, among whom 23 (8.5%) experienced recurrence within five years. Radiomic features were extracted from routine preoperative CT scans. After preprocessing to remove near-constant and highly correlated features, the Synthetic Minority Over-sampling Technique addressed class imbalance in the training set. Recursive Feature Elimination identified the most predictive radiomic features. An XGBoost classifier was trained using optimized hyperparameters identified through RandomizedSearchCV with cross-validation. Model performance was evaluated using the ROC curve and predictive metrics. Results: Five radiomic features differed significantly between recurrence groups (p = 0.007 to <0.001): Shape Sphericity, first-order 90Percentile, GLCM Autocorrelation, GLCM Cluster Shade, and GLDM Large Dependence Low Gray Level Emphasis. The radiomics model showed excellent discriminatory ability with AUC values of 0.99 (95% CI: 0.98–1.00), 0.97 (95% CI: 0.91–1.00), and 0.96 (95% CI: 0.85–1.00) on the training, validation, and test sets, respectively. On the test set, the model achieved sensitivity of 100% (95% CI: 51–100%), specificity of 94% (95% CI: 81–98%), PPV of 67% (95% CI: 30–90%), NPV of 100% (95% CI: 90–100%), and overall accuracy of 95% (95% CI: 83–99%). Conclusions: Under protocol-homogeneous imaging conditions, CT radiomics accurately predicted recurrence in patients with completely resected stage I lung adenocarcinoma. External multi-vendor validation is needed before broader deployment.
(This article belongs to the Section Methods and Technologies Development)
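For readers who want to try the analogous modeling chain, here is a minimal sketch: SMOTE applied only within training folds, recursive feature elimination, and a randomized XGBoost search. The radiomic feature extraction itself is out of scope, and all data and settings below are illustrative assumptions:

```python
# Hedged sketch of the described chain on a simulated 270-patient cohort
# (~8.5% recurrence), not the authors' radiomic data.
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from xgboost import XGBClassifier

X, y = make_classification(n_samples=270, n_features=30, weights=[0.915], random_state=3)
Xtr, Xte, ytr, yte = train_test_split(X, y, stratify=y, random_state=3)

pipe = Pipeline([
    ("smote", SMOTE(random_state=3)),  # resamples training folds only
    ("rfe", RFE(XGBClassifier(n_estimators=50), n_features_to_select=5, step=5)),
    ("clf", XGBClassifier()),
])
search = RandomizedSearchCV(
    pipe,
    {"clf__max_depth": randint(2, 6), "clf__n_estimators": randint(50, 300)},
    n_iter=10, scoring="roc_auc", cv=5, random_state=3,
).fit(Xtr, ytr)
print(search.best_params_, "test ROC-AUC:", search.score(Xte, yte))
```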

15 pages, 239 KB  
Article
Race, Breastfeeding Support, and the U.S. Infant Formula Shortage: An Exploratory Cross-Sectional Study
by John P. Bartkowski, Katherine Klee, Stephen Bartkowski, Ginny Garcia-Alexander, Jacinda B. Roach and Shakeizia (Kezi) Jones
Healthcare 2026, 14(2), 148; https://doi.org/10.3390/healthcare14020148 - 7 Jan 2026
Abstract
Background/Objectives: African American women are less likely to breastfeed in general and to breastfeed exclusively for the first six months of infancy. Racial and ethnic breastfeeding disparities are especially pronounced in the South, particularly in rural communities. These differences are attributed largely to structural lactation impediments that include less breastfeeding support in healthcare settings, workplaces, and communities. While a great deal of research has explored racial differences in breastfeeding, minimal attention has been paid to the social correlates and racial disparities associated with the 2022 U.S. infant formula shortage. Our study explores racial distinctions in the formula shortage’s effect on breastfeeding support among Gulf Coast Mississippians. Methods: We use data from the second wave of the Mississippi REACH (Racial and Ethnic Approaches to Community Health) Social Climate Survey to determine if racial differences are evident in the formula shortage’s influence on breastfeeding support. We predict that the infant formula shortage will have prompted African American respondents to become much more supportive of breastfeeding than their White counterparts, net of sociodemographic controls. This hypothesis is based on the lower prevalence of exclusive breastfeeding among African Americans, thereby indicating a greater reliance on formula. The study uses a general population (random digit dial) sample and a purposive (exclusively African American) oversample to analyze validated data from a cross-sectional survey. Sampling took place between September and December 2023, with a sample population of adult male and female Mississippians. A series of binary logistic regression models were employed to measure the association of race with breastfeeding support changes resulting from the infant formula shortage. Results: The study results support the hypothesis, as seen in a positive association between African American race and increased breastfeeding support directly related to the infant formula shortage. Further, the baseline statistical model reveals African American respondents to be five times more likely than White respondents (p < 0.001) to report that the formula shortage increased their support of breastfeeding. Conclusions: We conclude by discussing this study’s implications and promising directions for future research.
34 pages, 6621 KB  
Article
Oversampling Algorithm Based on Improved K-Means and Gaussian Distribution
by Wenhao Xie and Xiao Huang
Information 2026, 17(1), 28; https://doi.org/10.3390/info17010028 - 31 Dec 2025
Abstract
Oversampling is a common and effective way of resolving the classification problem of imbalanced data. Traditional oversampling methods are prone to generating overlapping or noisy samples. Clustering can alleviate these problems to a certain extent; however, the quality of the clustering results has a significant impact on the final classification performance. To address this problem, this paper proposes an oversampling algorithm (CSKGO) that combines Gaussian distribution oversampling with a K-means clustering algorithm incorporating compactness and separateness. The algorithm first uses the K-means clustering algorithm, combining compactness and separateness, to cluster the minority samples; it constructs a cluster compactness index and an inter-cluster separateness index to obtain the optimal number of clusters and the clustering results, thereby capturing the local distribution characteristics of the minority samples. Secondly, the sampling ratio for each cluster is assigned based on the compactness of the clustering results to determine the number of samples to generate for each minority-class cluster. Then, the mean vectors and covariance matrices of each cluster are calculated, and Gaussian distribution oversampling is used to generate new samples that match the distribution of the real minority samples; these are combined with the majority samples to form balanced data. To verify the effectiveness of the proposed algorithm, 24 datasets were selected from the University of California Irvine (UCI) Repository and oversampled using the proposed CSKGO algorithm and other oversampling algorithms. Finally, these datasets were classified using Random Forest, Support Vector Machine, and K-Nearest Neighbor classifiers. The results indicate that the proposed algorithm achieves higher accuracy, F-measure, G-mean, and AUC values and can effectively improve classification performance on imbalanced datasets.
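The core generation step can be sketched compactly. Assuming a fixed number of clusters and a proportional per-cluster quota in place of the paper's compactness/separateness indices, a toy version of cluster-wise Gaussian oversampling looks like this:

```python
# Hedged sketch: cluster the minority class with K-means, then draw synthetic
# points from a Gaussian fitted to each cluster's mean and covariance.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification

rng = np.random.default_rng(4)
X, y = make_classification(n_samples=1000, weights=[0.9], random_state=4)
X_min = X[y == 1]
n_needed = (y == 0).sum() - (y == 1).sum()  # samples required to reach balance

k = 3  # fixed here; the paper selects k via its compactness/separateness indices
labels = KMeans(n_clusters=k, n_init=10, random_state=4).fit_predict(X_min)
synthetic = []
for c in range(k):
    cluster = X_min[labels == c]
    n_c = int(round(n_needed * len(cluster) / len(X_min)))  # proportional quota
    mu = cluster.mean(axis=0)
    cov = np.cov(cluster, rowvar=False) + 1e-6 * np.eye(X.shape[1])  # regularize
    synthetic.append(rng.multivariate_normal(mu, cov, size=n_c))

X_bal = np.vstack([X] + synthetic)
y_bal = np.concatenate([y, np.ones(sum(len(s) for s in synthetic))])
print(X_bal.shape, "minority share:", y_bal.mean())
```

The proportional quota stands in for the paper's compactness-based allocation; the Gaussian draw is what keeps the synthetic points consistent with each cluster's local distribution rather than with the minority class as a whole.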

46 pages, 3751 KB  
Article
Wangiri Fraud Detection: A Comprehensive Approach to Unlabeled Telecom Data
by Amirreza Balouchi, Meisam Abdollahi, Ali Eskandarian, Kianoush Karimi Pour Kerman, Elham Majd, Neda Azouji and Amirali Baniasadi
Future Internet 2026, 18(1), 15; https://doi.org/10.3390/fi18010015 - 27 Dec 2025
Abstract
Wangiri fraud is a pervasive telecommunications scam that exploits missed calls to lure victims into dialing premium-rate numbers, resulting in significant financial losses for operators and consumers. This paper presents a comprehensive machine learning framework for detecting Wangiri fraud in highly imbalanced and unlabeled Call Detail Record (CDR) datasets. We introduce a novel unsupervised labeling approach using domain-driven heuristics, coupled with advanced feature engineering to capture temporal, geographic, and behavioral patterns indicative of fraud. To address severe class imbalance, we evaluate multiple sampling strategies, such as the Synthetic Minority Over-sampling Technique (SMOTE) and undersampling, and compare the performance of Logistic Regression, Decision Trees, Random Forest, XGBoost, and Multi-Layer Perceptron (MLP) classifiers. Our results demonstrate that ensemble methods, particularly Random Forest and XGBoost, achieve near-perfect accuracy (e.g., Receiver Operating Characteristic Area Under the Curve (ROC-AUC) >0.99) on balanced data while maintaining interpretability. The proposed pipeline offers a scalable and practical solution for real-time fraud detection, providing telecom operators with an effective tool to mitigate Wangiri fraud risks.
(This article belongs to the Special Issue Cybersecurity in the Age of AI, IoT, and Edge Computing)

27 pages, 904 KB  
Article
An Interpretable Hybrid RF–ANN Early-Warning Model for Real-World Prediction of Academic Confidence and Problem-Solving Skills
by Mostafa Aboulnour Salem and Zeyad Aly Khalil
Math. Comput. Appl. 2025, 30(6), 140; https://doi.org/10.3390/mca30060140 - 18 Dec 2025
Abstract
Early identification of students at risk for low academic confidence, poor problem-solving skills, or poor academic performance is crucial to achieving equitable and sustainable learning outcomes. This research presents a hybrid artificial intelligence (AI) framework that combines feature selection using a Random Forest (RF) algorithm with data classification via an Artificial Neural Network (ANN) to predict risks related to Academic Confidence and Problem-Solving Skills (ACPS) among higher education students. Three real-world datasets from Saudi universities were used: MSAP, EAAAM, and MES. Data preprocessing included Min–Max normalisation, class balancing using SMOTE (Synthetic Minority Oversampling Technique), and recursive feature elimination. Model performance was evaluated using five-fold cross-validation and a paired t-test. The proposed model (RF-ANN) achieved an average accuracy of 98.02%, outperforming benchmark models such as XGBoost, TabNet, and an Autoencoder–ANN. Statistical tests confirmed the significant performance improvement (p < 0.05; Cohen’s d = 1.1–2.7). Feature importance and explainability analysis using Random Forest and Shapley Additive Explanations (SHAP) showed that psychological and behavioural factors—particularly study hours, academic engagement, and stress indicators—were the most influential drivers of ACPS risk. Hence, the findings demonstrate that the proposed framework combines high predictive accuracy with interpretability, computational efficiency, and scalability. Practically, the model supports Sustainable Development Goal 4 (Quality Education) by enabling early, transparent identification of at-risk students, thereby empowering educators and academic advisors to deliver timely, targeted, and data-driven interventions.
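A minimal sketch of the RF→ANN hybrid, assuming scikit-learn components and synthetic data (the MSAP/EAAAM/MES datasets and the tuned architecture are not reproduced):

```python
# Hedged sketch: Min-Max scaling, SMOTE balancing, RF-importance feature
# selection, and an MLP classifier, evaluated with five-fold CV.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import MinMaxScaler
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

X, y = make_classification(n_samples=1500, n_features=30, weights=[0.8], random_state=5)
pipe = Pipeline([
    ("scale", MinMaxScaler()),
    ("smote", SMOTE(random_state=5)),
    ("rf_select", SelectFromModel(RandomForestClassifier(n_estimators=200, random_state=5))),
    ("ann", MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=5)),
])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=5)
print("accuracy:", cross_val_score(pipe, X, y, scoring="accuracy", cv=cv).mean())
```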

20 pages, 1504 KB  
Article
Early Prediction of Acute Respiratory Distress Syndrome in Critically Ill Polytrauma Patients Using Balanced Random Forest ML: A Retrospective Cohort Study
by Nesrine Ben El Hadj Hassine, Sabri Barbaria, Omayma Najah, Halil İbrahim Ceylan, Muhammad Bilal, Lotfi Rebai, Raul Ioan Muntean, Ismail Dergaa and Hanene Boussi Rahmouni
J. Clin. Med. 2025, 14(24), 8934; https://doi.org/10.3390/jcm14248934 - 17 Dec 2025
Abstract
Background/Objectives: Acute respiratory distress syndrome (ARDS) represents a critical complication in polytrauma patients, characterized by diffuse lung inflammation and bilateral pulmonary infiltrates with mortality rates reaching 45% in intensive care units (ICU). The heterogeneous nature of ARDS and complex clinical presentation in severely injured patients pose substantial diagnostic challenges, necessitating early prediction tools to guide timely interventions. Machine learning (ML) algorithms have emerged as promising approaches for clinical decision support, demonstrating superior performance compared to traditional scoring systems in capturing complex patterns within high-dimensional medical data. Based on the identified research gaps in early ARDS prediction for polytrauma populations, our study aimed to: (i) develop a balanced random forest (BRF) ML model for early ARDS prediction in critically ill polytrauma patients, (ii) identify the most predictive clinical features using ANOVA-based feature selection, and (iii) evaluate model performance using comprehensive metrics addressing class imbalance challenges. Methods: This retrospective cohort study analyzed 407 polytrauma patients admitted to the ICU of the Center of Traumatology and Major Burns of Ben Arous, Tunisia, between 2017 and 2021. We implemented a comprehensive ML pipeline that incorporates Tomek Links undersampling, ANOVA F-test feature selection for the top 10 predictive variables, and SMOTE oversampling with a conservative sampling rate of 0.3. The BRF classifier was trained with class weighting and evaluated using stratified 5-fold cross-validation. Performance metrics included AUROC, PR-AUC, sensitivity, specificity, F1-score, and Matthews correlation coefficient. Results: Among 407 patients, 43 developed ARDS according to the Berlin definition, representing a 10.57% incidence. The BRF model demonstrated exceptional predictive performance with an AUROC of 0.98, a sensitivity of 0.91, a specificity of 0.80, an F1-score of 0.84, and an MCC of 0.70. Precision–recall AUC reached 0.86, demonstrating robust performance despite class imbalance. During stratified cross-validation, AUROC values ranged from 0.93 to 0.99 across folds, indicating consistent model stability. The top 10 selected features included procalcitonin, PaO2 at ICU admission, 24-h pH, massive transfusion, total fluid resuscitation, presence of pneumothorax, alveolar hemorrhage, pulmonary contusion, hemothorax, and flail chest injury. Conclusions: Our BRF model provides a robust, clinically applicable tool for early prediction of ARDS in polytrauma patients using readily available clinical parameters. The comprehensive two-step resampling approach, combined with ANOVA-based feature selection, successfully addressed class imbalance while maintaining high predictive accuracy. These findings support integrating ML approaches into critical care decision-making to improve patient outcomes and resource allocation. External validation in diverse populations remains essential for confirming generalizability and clinical implementation.
(This article belongs to the Section Respiratory Medicine)
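The two-step resampling and feature-selection pipeline maps onto imbalanced-learn directly. A sketch under those assumptions, with a simulated cohort of 407 patients at roughly the reported 10.6% ARDS incidence:

```python
# Hedged sketch: Tomek Links undersampling, ANOVA F-test top-10 selection,
# SMOTE at a 0.3 sampling rate, and a Balanced Random Forest, evaluated
# with stratified 5-fold cross-validation. Clinical variables are simulated.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import StratifiedKFold, cross_val_score
from imblearn.ensemble import BalancedRandomForestClassifier
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from imblearn.under_sampling import TomekLinks

X, y = make_classification(n_samples=407, n_features=30, weights=[0.894], random_state=6)
pipe = Pipeline([
    ("tomek", TomekLinks()),                    # drop borderline majority samples
    ("anova", SelectKBest(f_classif, k=10)),    # top 10 predictive variables
    ("smote", SMOTE(sampling_strategy=0.3, random_state=6)),  # conservative rate
    ("brf", BalancedRandomForestClassifier(random_state=6)),
])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=6)
print("AUROC:", cross_val_score(pipe, X, y, scoring="roc_auc", cv=cv).mean())
```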

36 pages, 1582 KB  
Article
A Deep Random Forest Model with Symmetry Analysis for Hyperspectral Image Data Classification Based on Feature Importance
by Jie Lian, Wei Feng, Qing Wang, Yuhang Dong, Gabriel Dauphin and Jian Bai
Symmetry 2025, 17(12), 2172; https://doi.org/10.3390/sym17122172 - 17 Dec 2025
Abstract
Hyperspectral imagery (HSI), as a core data carrier in remote sensing, plays a crucial role in many fields, yet it faces numerous challenges, including the curse of dimensionality, noise interference, and small sample sizes. These problems severely affect the generalization ability and classification accuracy of traditional machine learning and deep learning algorithms. Existing solutions suffer from bottlenecks such as unknown cost matrices and excessive computational overhead, and ensemble learning fails to fully exploit the deep semantic features and feature-importance relationships of high-dimensional data. To address these issues, this paper proposes a dual ensemble classification framework (DRF-FI) based on feature importance analysis and a deep random forest. This method integrates feature selection and two-layer ensemble learning. First, it identifies discriminative spectral bands through feature importance quantification. Then, it constructs a balanced training subset through random oversampling. Finally, it integrates four different ensemble strategies. Experimental results on three benchmark hyperspectral datasets demonstrate that DRF-FI exhibits outstanding performance across multiple datasets, particularly excelling in handling highly imbalanced data. Compared to traditional random forests, the proposed method achieves stable improvements in both overall accuracy (OA) and average accuracy (AA). On specific datasets, OA and AA were enhanced by up to 0.84% and 1.24%, respectively. This provides an effective solution to the class imbalance problem in hyperspectral images.
(This article belongs to the Section Computer)
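Two of the framework's ingredients, importance-based band selection and random oversampling, can be sketched with standard tooling; the deep-forest layers and the four ensemble strategies are paper-specific and omitted here, with a single Random Forest standing in:

```python
# Hedged sketch: rank spectral bands by Random Forest importance, keep the
# top ones, then build a balanced training subset via random oversampling.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from imblearn.over_sampling import RandomOverSampler

# Simulated "pixels x spectral bands" with imbalanced land-cover classes.
X, y = make_classification(n_samples=3000, n_features=100, n_informative=20,
                           n_classes=4, n_clusters_per_class=1,
                           weights=[0.7, 0.15, 0.1, 0.05], random_state=7)

# Quantify band importance and keep the most discriminative spectral bands.
imp = RandomForestClassifier(n_estimators=100, random_state=7).fit(X, y).feature_importances_
top_bands = np.argsort(imp)[::-1][:30]
X_sel = X[:, top_bands]

# Random oversampling to build a balanced training subset.
X_bal, y_bal = RandomOverSampler(random_state=7).fit_resample(X_sel, y)
print("class counts after oversampling:", np.bincount(y_bal))
```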

26 pages, 1087 KB  
Article
Sustainable Road Safety: Predicting Traffic Accident Severity in Portugal Using Machine Learning
by José Cunha, José Silvestre Silva, Ricardo Ribeiro and Paulo Gomes
Sustainability 2025, 17(24), 11199; https://doi.org/10.3390/su172411199 - 14 Dec 2025
Abstract
Road traffic accidents remain a major global challenge, contributing to significant human and economic losses each year. In Portugal, the analysis and prevention of severe accidents are critical for optimizing the allocation of law enforcement resources and improving emergency response strategies. This study aims to develop and evaluate predictive models for accident severity using real-world data collected by the Portuguese Guarda Nacional Republicana (GNR) between 2019 and 2023. Four algorithms, Random Forest, XGBoost, Multilayer Perceptron (MLP), and Deep Neural Networks (DNN), were implemented to capture both linear and non-linear relationships within the dataset. To address the natural class imbalance, class weighting, Synthetic Minority Oversampling Technique (SMOTE), and Random Undersampling were applied. The models were assessed using Recall, F1-score, and G-Mean, with particular emphasis on detecting severe accidents. Results showed that DNNs achieved the best balance between sensitivity and overall performance, especially under SMOTE and class weighting conditions. The findings highlight the potential of classical machine learning and deep learning models to support proactive road safety management and inform resource allocation decisions in high-risk scenarios. This research contributes to sustainability by enabling data-driven road safety management, which reduces human and economic losses associated with traffic accidents and supports more efficient allocation of public resources. By improving the prediction of severe accidents, the study reinforces sustainable development goals related to safe mobility, resilient infrastructure, and effective disaster prevention and response policies.
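A compact sketch of the three imbalance treatments compared above, scored with recall, F1, and G-mean; a Random Forest stands in for the neural models, and the GNR records are replaced by synthetic data:

```python
# Hedged comparison: class weighting vs. SMOTE vs. random undersampling,
# each evaluated on the same untouched test split.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, recall_score
from sklearn.model_selection import train_test_split
from imblearn.metrics import geometric_mean_score
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

X, y = make_classification(n_samples=4000, weights=[0.95], random_state=8)
Xtr, Xte, ytr, yte = train_test_split(X, y, stratify=y, random_state=8)

configs = {
    "class weights": (Xtr, ytr,
                      RandomForestClassifier(class_weight="balanced", random_state=8)),
    "SMOTE": (*SMOTE(random_state=8).fit_resample(Xtr, ytr),
              RandomForestClassifier(random_state=8)),
    "undersampling": (*RandomUnderSampler(random_state=8).fit_resample(Xtr, ytr),
                      RandomForestClassifier(random_state=8)),
}
for name, (Xb, yb, clf) in configs.items():
    pred = clf.fit(Xb, yb).predict(Xte)
    print(f"{name}: recall={recall_score(yte, pred):.2f} "
          f"F1={f1_score(yte, pred):.2f} G-mean={geometric_mean_score(yte, pred):.2f}")
```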

20 pages, 12133 KB  
Article
Lithofacies Identification by an Intelligent Fusion Algorithm for Production Numerical Simulation: A Case Study on Deep Shale Gas Reservoirs in Southern Sichuan Basin, China
by Yi Liu, Jin Wu, Boning Zhang, Chengyong Li, Feng Deng, Bingyi Chen, Chen Yang, Jing Yang and Kai Tong
Processes 2025, 13(12), 4040; https://doi.org/10.3390/pr13124040 - 14 Dec 2025
Abstract
Lithofacies, as an integrated representation of key reservoir attributes including mineral composition and organic matter enrichment, provides crucial geological and engineering guidance for identifying “dual sweet spots” and designing fracturing strategies in deep shale gas reservoirs. However, reliable lithofacies characterization remains particularly challenging owing to significant reservoir heterogeneity, scarce core data, and imbalanced facies distribution. Conventional manual log interpretation tends to be cost-prohibitive and inaccurate, while existing intelligent algorithms suffer from inadequate robustness and suboptimal efficiency, failing to meet demands for both precision and practicality in such complex reservoirs. To address these limitations, this study developed a super-integrated lithofacies identification model termed SRLCL, leveraging well-logging data and lithofacies classifications. The proposed framework synergistically combines multiple modeling advantages while maintaining a balance between data characteristics and optimization effectiveness. Specifically, SRLCL incorporates three key components: Newton-Weighted Oversampling (NWO) to mitigate data scarcity and class imbalance, the Polar Light Optimizer (PLO) to accelerate convergence and enhance optimization performance, and a Stacking ensemble architecture that integrates five heterogeneous algorithms—Support Vector Machine (SVM), Random Forest (RF), Light Gradient Boosting Machine (LightGBM), Convolutional Neural Network (CNN), and Long Short-Term Memory (LSTM)—to overcome the representational limitations of single-model or homogeneous ensemble approaches. Experimental results indicated that the NWO-PLO-SRLCL model achieved an overall accuracy of 93% in lithofacies identification, exceeding conventional methods by more than 6% while demonstrating remarkable generalization capability and stability. Furthermore, production simulations of fractured horizontal wells based on the lithofacies-controlled geological model showed only a 6.18% deviation from actual cumulative gas production, underscoring how accurate lithofacies identification facilitates development strategy optimization and provides a reliable foundation for efficient deep shale gas development.
(This article belongs to the Special Issue Numerical Simulation and Application of Flow in Porous Media)
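NWO and PLO are bespoke to this paper and are not reproduced here; the Stacking layer itself, however, follows a standard pattern. A sketch with three of the five named base learners, assuming scikit-learn and LightGBM and synthetic stand-ins for the log curves:

```python
# Hedged sketch: heterogeneous base learners (SVM, RF, LightGBM) combined
# by a logistic-regression meta-model via out-of-fold stacking.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from lightgbm import LGBMClassifier

# Synthetic multi-class stand-in for well-log feature vectors and facies labels.
X, y = make_classification(n_samples=1200, n_features=8, n_informative=6,
                           n_classes=3, random_state=9)
stack = StackingClassifier(
    estimators=[
        ("svm", SVC(probability=True, random_state=9)),
        ("rf", RandomForestClassifier(random_state=9)),
        ("lgbm", LGBMClassifier(random_state=9)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,  # out-of-fold predictions feed the meta-learner
)
print("accuracy:", cross_val_score(stack, X, y, scoring="accuracy", cv=3).mean())
```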

27 pages, 797 KB  
Article
Predicting Segment-Level Road Traffic Injury Counts Using Machine Learning Models: A Data-Driven Analysis of Geometric Design and Traffic Flow Factors
by Noura Hamdan and Tibor Sipos
Future Transp. 2025, 5(4), 197; https://doi.org/10.3390/futuretransp5040197 - 12 Dec 2025
Abstract
Accurate prediction of road traffic crash severity is essential for developing data-driven safety strategies and optimizing resource allocation. This study presents a predictive modeling framework that utilizes Random Forest (RF), Gradient Boosting (GB), and K-Nearest Neighbors (KNN) to estimate segment-level frequencies of fatalities, serious injuries, and slight injuries on Hungarian roadways. The model integrates an extensive array of predictor variables, including roadway geometric design features, traffic volumes, and traffic composition metrics. To address class imbalance, each severity class was modeled using resampled datasets generated via the Synthetic Minority Over-sampling Technique (SMOTE), and model performance was optimized through grid-search cross-validation for hyperparameter optimization. For the prediction of serious- and slight-injury crash counts, the Random Forest (RF) ensemble model demonstrated the most robust performance, consistently attaining test accuracies above 0.91 and coefficient of determination (R²) values exceeding 0.95. In contrast, for fatality count prediction, the Gradient Boosting (GB) model achieved the highest accuracy (0.95), with an R² value greater than 0.87. Feature importance analysis revealed that heavy vehicle flows consistently dominate crash severity prediction. Horizontal alignment features primarily influenced fatal crashes, while capacity utilization was more relevant for slight and serious injuries, reflecting the roles of geometric design and operational conditions in shaping crash occurrence and severity. The proposed framework demonstrates the effectiveness of machine learning approaches in capturing non-linear relationships within transportation safety data and offers a scalable, interpretable tool to support evidence-based decision-making for targeted safety interventions.
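The per-class tuning loop described above follows a standard pattern: an imbalanced-learn pipeline with SMOTE feeding a classifier, wrapped in grid-search cross-validation. A sketch with assumed toy data and grid (one such model would be fitted per severity class):

```python
# Hedged sketch: SMOTE + Random Forest pipeline tuned by grid-search CV.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

X, y = make_classification(n_samples=2500, n_features=25, weights=[0.9], random_state=10)
pipe = Pipeline([("smote", SMOTE(random_state=10)),
                 ("rf", RandomForestClassifier(random_state=10))])
grid = GridSearchCV(pipe,
                    {"rf__n_estimators": [100, 300],
                     "rf__max_depth": [None, 10, 20]},
                    scoring="f1", cv=5).fit(X, y)
print(grid.best_params_, "CV F1:", grid.best_score_)
```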

31 pages, 1941 KB  
Article
Boosting Traffic Crash Prediction Performance with Ensemble Techniques and Hyperparameter Tuning
by Naima Goubraim, Zouhair Elamrani Abou Elassad, Hajar Mousannif and Mohamed Ameksa
Safety 2025, 11(4), 121; https://doi.org/10.3390/safety11040121 - 9 Dec 2025
Abstract
Road traffic crashes are a major global challenge, resulting in significant loss of life, economic burden, and societal impact. This study seeks to enhance the precision of traffic accident prediction using advanced machine learning techniques. It employs an ensemble learning approach combining the Random Forest, Bagging (Bootstrap Aggregating), Extreme Gradient Boosting (XGBoost), and Light Gradient Boosting Machine (LightGBM) algorithms. To address class imbalance and feature relevance, we implement feature selection using the Extra Trees Classifier and oversampling using the Synthetic Minority Over-sampling Technique (SMOTE). Rigorous hyperparameter tuning is applied to optimize model performance. Our results show that the ensemble approach, coupled with hyperparameter optimization, significantly improves prediction accuracy. This research contributes to the development of more effective road safety strategies and can help to reduce the number of road accidents.
(This article belongs to the Special Issue Road Traffic Risk Assessment: Control and Prevention of Collisions)
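A minimal sketch of the selection–balancing–tuning chain, assuming Extra Trees importances via SelectFromModel and XGBoost as the tuned ensemble member; the grid is deliberately tiny and all settings are illustrative:

```python
# Hedged sketch: Extra Trees feature selection, SMOTE rebalancing, and a
# small XGBoost hyperparameter grid searched with cross-validation.
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import GridSearchCV
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from xgboost import XGBClassifier

X, y = make_classification(n_samples=3000, n_features=40, weights=[0.85], random_state=11)
pipe = Pipeline([
    ("select", SelectFromModel(ExtraTreesClassifier(n_estimators=100, random_state=11))),
    ("smote", SMOTE(random_state=11)),
    ("xgb", XGBClassifier(random_state=11)),
])
grid = GridSearchCV(pipe,
                    {"xgb__max_depth": [3, 6], "xgb__learning_rate": [0.1, 0.3]},
                    scoring="f1", cv=5).fit(X, y)
print(grid.best_params_, "CV F1:", grid.best_score_)
```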

19 pages, 8434 KB  
Article
Predicting Persistent Forest Fire Refugia Using Machine Learning Models with Topographic, Microclimate, and Surface Wind Variables
by Sven Christ, Tineke Kraaij, Coert J. Geldenhuys and Helen M. de Klerk
ISPRS Int. J. Geo-Inf. 2025, 14(12), 480; https://doi.org/10.3390/ijgi14120480 - 5 Dec 2025
Abstract
Persistent forest fire refugia are areas within fire-prone landscapes that remain fire-free over long periods of time and are crucial for ecosystem resilience. Modelling to develop maps of these refugia is key to informing fire and land use management. We predict persistent forest fire refugia using variables linked to the fire triangle (aspect, slope, elevation, topographic wetness, convergence and roughness, solar irradiation, temperature, surface wind direction, and speed) in machine learning algorithms: Random Forest and XGBoost (two ensemble models) and K-Nearest Neighbour. All models were run with and without ADASYN over-sampling and grid-search hyperparameterisation. Six iterations were run per algorithm to assess the impact of omitting variables. Aspect is twice as influential as any other variable across all models. Solar radiation and surface wind direction are also highlighted, although the order of importance differs between algorithms. The predominant importance of aspect relates to solar radiation received by sun-facing slopes and resultant heat and moisture balances and, in this study area, the predominant fire wind direction. Ensemble models consistently produced the most accurate results. The findings highlight the importance of topographic and microclimatic variables in persistent forest fire refugia prediction, with ensemble machine learning providing reliable forecasting frameworks.
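ADASYN is available off the shelf in imbalanced-learn; unlike plain SMOTE, it concentrates synthetic points where the minority class is hardest to learn. A minimal sketch on synthetic stand-ins for the terrain rasters:

```python
# Hedged sketch: ADASYN over-sampling of a rare "refugia" class.
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import ADASYN

# Simulated grid cells: ~5% are persistent fire refugia (class 1).
X, y = make_classification(n_samples=5000, n_features=10, weights=[0.95], random_state=12)
X_res, y_res = ADASYN(random_state=12).fit_resample(X, y)
print("before:", Counter(y), "after:", Counter(y_res))
```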