Search Results (768)

Search Parameters:
Keywords = SMOTE

22 pages, 2186 KB  
Article
Prediction of Large-Scale Traffic Accident Severity in Qatar: A Binary Reformulation Approach for Extreme Class Imbalance with Interpretable AI
by Mohammed Alshriem and Yin Yang
Future Transp. 2026, 6(2), 88; https://doi.org/10.3390/futuretransp6020088 - 15 Apr 2026
Viewed by 69
Abstract
Road traffic injuries represent one of the most critical public health challenges in the Gulf region. Predicting traffic accident severity is therefore a critical component of evidence-based road safety management. In this study, we develop machine learning frameworks for predicting traffic accident severity using Qatar’s national dataset (2020–2025), addressing extreme class imbalance and interpretability. A dataset of 588,023 accident records was systematically preprocessed from 1,000,500 raw reports. We compare three approaches: multi-class (four severity levels), binary (Safe vs. Severe), and cascaded two-stage (combining both). Six classifiers were evaluated across two encoding methods and three balancing strategies. Systematic hyperparameter tuning with 5-fold stratified cross-validation was performed for all models. The binary LightGBM classifier achieved BA = 71.04%, AUC-ROC = 0.772, Sensitivity = 61.03%, and Specificity = 81.05%, demonstrating superior performance over multi-class approaches. Temporal validation on 2025 data (trained on 2020–2024 data) supported good temporal generalization. Analysis of 10,000 test instances identified the time period as the dominant predictor of accident severity. The binary LightGBM framework provides an interpretable and effective approach for severe accident identification and risk prioritization, with SHAP findings supporting targeted temporal enforcement and pedestrian safety as evidence-based policy priorities. Full article
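The reported balanced accuracy is consistent with the stated sensitivity and specificity: for a binary classifier, balanced accuracy is simply their unweighted mean. A minimal sketch (not the authors' code) verifying the arithmetic with the figures quoted in the abstract:

```python
def balanced_accuracy(sensitivity: float, specificity: float) -> float:
    """Balanced accuracy for a binary classifier: the unweighted mean of
    the per-class recalls (true-positive rate and true-negative rate)."""
    return (sensitivity + specificity) / 2

# Reported: Sensitivity = 61.03%, Specificity = 81.05%
ba = balanced_accuracy(0.6103, 0.8105)
print(round(ba, 4))  # 0.7104, matching the reported BA of 71.04%
```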
21 pages, 1058 KB  
Article
Cross-Disease Breathomics by PTR-TOF-MS: Multiclass Machine Learning and Network Remodeling across Asthma, COPD, Cystic Fibrosis, and Lymphangioleiomyomatosis
by Malika Mustafina, Artemiy Silantyev, Aleksandr Suvorov, Stanislav Krasovskiy, Marina Makarova, Alexander Chernyak, Olga Suvorova, Anna Shmidt, Daria Gognieva, Aleksandra Bykova, Nana Gogiberidze, Andrei Akselrod, Andrey Belevskiy, Sergey Avdeev, Vladimir Betelin, Abram Syrkin and Philipp Kopylov
Int. J. Mol. Sci. 2026, 27(8), 3483; https://doi.org/10.3390/ijms27083483 - 13 Apr 2026
Viewed by 256
Abstract
Chronic obstructive and inflammatory lung diseases share overlapping clinical manifestations and spirometric features, complicating differential diagnosis and monitoring. In this study, we performed an integrative real-time proton-transfer-reaction time-of-flight mass spectrometry (PTR-TOF-MS) breathomics analysis to assess whether exhaled volatile organic compound (VOC) profiles enable multiclass discrimination among bronchial asthma (BA), chronic obstructive pulmonary disease (COPD), cystic fibrosis (CF), and lymphangioleiomyomatosis (LAM), with healthy individuals as controls. Breath VOC data from 843 subjects were analyzed using a stratified 70/30 train/test split. An ensemble feature selection strategy based on gradient boosting (XGBoost with SMOTE within cross-validation) identified stable VOC panels (top 25% selection probability), yielding 29 VOCs and 31 features including clinical covariates. On the independent test set, the VOC-only model achieved a macro-averaged one-vs-one (OvO) AUC of 0.866 (95% CI 0.829–0.903), while the combined model improved to 0.888 (95% CI 0.853–0.919), indicating modest value of clinical variables. Pairwise analysis demonstrated highest discrimination for CF (AUC up to 0.988), whereas BA and LAM showed lower sensitivity (<0.60), likely reflecting heterogeneity and limited sample size. Given differences in age, sex, BMI, and smoking status across cohorts, confounding effects were assessed, confirming that VOC signatures retain independent diagnostic information. Disease-specific VOC interaction networks revealed distinct remodeling patterns, with central metabolites not captured by univariate analysis. Overall, PTR-TOF-MS breathomics demonstrates proof-of-concept multiclass discrimination across chronic lung diseases. Full article
18 pages, 2652 KB  
Article
Eavesdropping Detection and Classification in Passive Optical Networks Using Machine Learning
by Hussain Shah Syed Bukhari, Jie Zhang, Yajie Li, Wei Wang, Asif Ali Wagan and Saifullah Memon
Photonics 2026, 13(4), 369; https://doi.org/10.3390/photonics13040369 - 13 Apr 2026
Viewed by 185
Abstract
Passive Optical Networks (PONs) play a vital role in providing high-speed broadband access in the 5G and F5G era. However, their shared medium makes them vulnerable to physical-layer attacks such as fiber bending, tapping, and fiber cuts. The problem is more serious in high-density PONs, where high split ratios cause high optical loss and overlapping back-scattered light, making it difficult to distinguish small attacks from background noise. Unlike most existing works, which neglect class imbalance and signal interference in high-density networks, this paper proposes a robust hierarchical two-stage attack-detection scheme. First, a binary classifier distinguishes eavesdropping attacks from normal traffic; a second stage then identifies the specific eavesdropping category (C1–C4). To address the small number of attack samples, SMOTE is used to oversample the minority class, and PCA-SVM refines feature selection. Finally, the outputs of both stages are combined using probability scores to obtain a reliable decision. Experimental results show the effectiveness of the approach, achieving a classification accuracy of 89.07%. Evaluated on the same data, it outperforms conventional machine learning algorithms, including decision tree (86.3%), k-nearest neighbors (79%), logistic regression (60%), and Naïve Bayes (52.6%). Full article
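This abstract, like most in these results, relies on SMOTE to oversample a scarce minority class. At its core, SMOTE synthesizes a new minority point by interpolating between a real minority sample and one of its minority-class nearest neighbors. A pure-Python sketch of that interpolation step (an illustration of the general technique, not the paper's implementation):

```python
import random

def smote_sample(x, neighbor, rng=random):
    """Generate one synthetic minority sample by interpolating between a
    minority point and one of its minority-class nearest neighbors:
    new = x + u * (neighbor - x), with u drawn uniformly from [0, 1)."""
    u = rng.random()
    return [xi + u * (ni - xi) for xi, ni in zip(x, neighbor)]

random.seed(0)
x, nb = [1.0, 2.0], [3.0, 4.0]
synthetic = smote_sample(x, nb)
# Each coordinate of the synthetic point lies on the segment between
# the two parent points.
assert all(min(a, b) <= s <= max(a, b) for s, a, b in zip(synthetic, x, nb))
```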
25 pages, 2747 KB  
Article
An Ensemble Learning-Based Early Warning Framework for Brucellosis Outbreaks in High-Altitude Pastoral Systems
by Liu Xi, Faez Firdaus Abdullah Jesse, Bura Thlama Paul, Eric Lim Teik Chung and Mohd Azmi Mohd Lila
Appl. Biosci. 2026, 5(2), 32; https://doi.org/10.3390/applbiosci5020032 - 13 Apr 2026
Viewed by 159
Abstract
Brucellosis poses a persistent threat to livestock health in high-altitude pastoral regions of China, where harsh environments and semi-nomadic grazing increase transmission risk. Existing surveillance systems rely mainly on periodic serological testing and lack effective early warning capability. This study proposes an ensemble learning-based early warning framework integrating veterinary epidemiological indicators with environmental and herd-movement data. A total of 4826 herd-level records collected over five years (2019–2024) were analyzed, with an overall positivity rate of 11.4%. Multi-source data, including serological, clinical, reproductive, vaccination, meteorological, pasture-management, and herd-movement information (from GPS tracking and structured surveys), were integrated through epidemiology-guided feature engineering. To address class imbalance and temporal dynamics, Synthetic Minority Over-sampling Technique (SMOTE) resampling and sliding time-window features were applied. The proposed ensemble model combines Random Forest, XGBoost, and LightGBM using a soft-voting strategy, with logistic regression as a baseline. Results show that the ensemble model outperforms single models, achieving an AUC of 0.86 and a PR-AUC of 0.65. After threshold optimization, sensitivity increased from 0.78 to 0.87. Under field conditions, the system provided herd-level early warnings with an average lead time of approximately 12 days before confirmed outbreaks, demonstrating its feasibility and practical value for proactive brucellosis surveillance in high-altitude pastoral systems. Full article
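The ensemble here combines Random Forest, XGBoost, and LightGBM by soft voting, i.e., averaging the per-class probabilities of the base models and predicting the class with the largest mean. A toy sketch of the combination rule (the probability values below are invented for illustration):

```python
def soft_vote(prob_lists):
    """Average class-probability vectors from several classifiers
    (soft voting); the predicted class is the argmax of the mean."""
    n = len(prob_lists)
    mean = [sum(p[i] for p in prob_lists) / n
            for i in range(len(prob_lists[0]))]
    return mean, max(range(len(mean)), key=mean.__getitem__)

# Hypothetical per-model probabilities for classes [negative, positive]
rf, xgb, lgbm = [0.40, 0.60], [0.55, 0.45], [0.30, 0.70]
mean, label = soft_vote([rf, xgb, lgbm])
# mean ≈ [0.417, 0.583] → predicted class 1 (herd flagged as at-risk)
```

In practice a tuned decision threshold can replace the plain argmax, which is how the abstract's sensitivity gain from 0.78 to 0.87 after threshold optimization would be realized.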
23 pages, 1520 KB  
Article
Explainable Machine Learning Reveals Time-Dependent Cognitive Risk in Minor Neurocognitive Disorder: Implications for Health Promotion and Early Risk Stratification
by Anna Tsiakiri, Christos Kokkotis, Dimitrios Tsiptsios, Leonidas Panos, Nikolaos Aggelousis, Konstantinos Vadikolias and Foteini Christidi
Biomedicines 2026, 14(4), 880; https://doi.org/10.3390/biomedicines14040880 - 12 Apr 2026
Viewed by 421
Abstract
Background/Objectives: Minor neurocognitive disorder (minor NCD) represents a heterogeneous and potentially modifiable stage along the continuum from normal aging to dementia, offering a critical window for targeted health promotion interventions. Early identification of individuals at increased risk of progression is essential for implementing preventive strategies that may delay functional decline. This study developed a transparent machine learning (ML) framework to predict diagnostic change from minor to major NCD at 12 and 24 months using baseline demographic, clinical, and multidomain neuropsychological data. Methods: A retrospective cohort of 162 memory clinic patients was analyzed using a rigorously controlled pipeline incorporating nested stratified cross-validation, SMOTE-based imbalance correction, and sequential forward feature selection. Logistic regression, support vector machines (SVMs), and XGBoost were evaluated, with SHapley Additive exPlanations (SHAPs) applied to ensure interpretability. Results: SVM achieved the most balanced predictive performance at both 12 months (accuracy = 0.90) and 24 months (accuracy = 0.81). Short-term progression was primarily driven by subtle multidomain cognitive inefficiencies, while longer-term risk reflected continued cognitive vulnerability modulated by metabolic factors such as diabetes. Conclusions: These findings highlight the potential of explainable ML as a health promotion tool and suggest that explainable ML can uncover clinically meaningful cognitive risk signatures at the earliest stages of NCD. By identifying modifiable systemic contributors alongside cognitive risk profiles, this framework supports precision-oriented preventive strategies and proactive longitudinal monitoring in minor NCD. Full article
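This study applies SMOTE-based imbalance correction inside nested cross-validation, which matters: resampling must touch only each training fold, never the held-out data, or performance estimates become optimistically biased. A stdlib-only sketch of the leakage-safe pattern, using simple minority duplication in place of SMOTE for brevity:

```python
import random

def oversample_minority(rows):
    """Duplicate minority-class rows until both classes are equally
    represented. rows: list of (features, label) with labels in {0, 1}."""
    pos = [r for r in rows if r[1] == 1]
    neg = [r for r in rows if r[1] == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    rng = random.Random(0)
    extra = [rng.choice(minority)
             for _ in range(len(majority) - len(minority))]
    return rows + extra

# Leakage-safe pattern: resample *inside* each fold, touching only the
# training split; the held-out split is evaluated untouched.
data = [([i], 1 if i % 5 == 0 else 0) for i in range(20)]  # 4 of 20 positive
train_part, held_out = data[:15], data[15:]
train_bal = oversample_minority(train_part)
assert len(held_out) == 5  # evaluation split unchanged
counts = [sum(1 for _, y in train_bal if y == c) for c in (0, 1)]
assert counts[0] == counts[1]  # training split now balanced
```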
17 pages, 629 KB  
Article
A Hybrid Feature-Weighting and Resampling Model for Imbalanced Sentiment Analysis in User Game Reviews
by Thao-Trang Huynh-Cam, Long-Sheng Chen, Hsuan-Jung Huang and Hsiu-Chia Ko
Mathematics 2026, 14(8), 1273; https://doi.org/10.3390/math14081273 - 11 Apr 2026
Viewed by 173
Abstract
Sentiment analysis of online game reviews has become increasingly important for understanding player experiences and supporting data-driven game development. However, research in this domain has continually faced two unresolved challenges: (1) the extreme imbalance between positive and negative feedback, and (2) the inefficiency of existing feature-weighting schemes in capturing sentiment signals embedded in informal gaming discourse. Prior works demonstrated that negative feedback, though small in number, is highly influential, usually containing richer emotional content and longer textual structures; yet prevailing classification models often perform poorly on this minority class. Numerous studies have explored multimodal imbalance issues, class imbalance in cross-lingual ABSA (Aspect-Based Sentiment Analysis), reinforcement-learning-based architectures for imbalanced extraction tasks, and oversampling strategies such as SMOTE (Synthetic Minority Over-sampling Technique) variants. However, few investigations have specifically addressed imbalanced sentiment classification in the context of online game reviews, where user-generated content exhibits unique lexical, structural, and emotional characteristics. To address these gaps, this study integrated TF-IDF (Term Frequency-Inverse Document Frequency), VADER (Valence Aware Dictionary and Sentiment Reasoner) lexicon features, and IGM (Inverse Gravity Moment) weightings with advanced oversampling methods such as ADASYN (Adaptive Synthetic Sampling Approach for Imbalanced Learning) and Borderline-SMOTE to improve the detection of minority sentiment classes. Ensemble models, including XGBoost (Extreme Gradient Boosting) and LightGBM (Light Gradient-Boosting Machine), were further employed to enhance robustness under class imbalance.
Using a large-scale dataset of Steam game reviews, the proposed framework demonstrated substantial improvement in identifying negative sentiments, addressing a critical limitation in the existing computational game-analysis literature, and advancing the modeling for detecting the emotion-rich but imbalance-prone user feedback. Full article
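Of the three weighting schemes combined in this study, TF-IDF is the most standard: a term's weight is its frequency within a document scaled by how rare the term is across the corpus, so distinctive words outweigh common ones. A small sketch using a smoothed IDF variant (the toy reviews are invented):

```python
import math

def tfidf(term, doc, corpus):
    """Classic TF-IDF weight: term frequency in the document times a
    smoothed inverse document frequency across the corpus."""
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in corpus if term in d)
    idf = math.log((1 + len(corpus)) / (1 + df)) + 1  # smoothed IDF
    return tf * idf

corpus = [["laggy", "servers", "refund"],
          ["great", "gameplay"],
          ["fun", "but", "laggy"]]
w_laggy = tfidf("laggy", corpus[0], corpus)
w_refund = tfidf("refund", corpus[0], corpus)
assert w_refund > w_laggy  # the rarer term earns the larger weight
```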
37 pages, 1133 KB  
Article
Artificial Intelligence, Academic Resilience, and Gender Equity in Education Systems: Ethical Challenges, Predictive Bias, and Governance Implications
by Francisco R. Trejo-Macotela, Mayra Fabiola González-Peralta, Gregoria C. Godínez-Flores and Mayte Olivares-Escorza
Educ. Sci. 2026, 16(4), 605; https://doi.org/10.3390/educsci16040605 - 10 Apr 2026
Viewed by 206
Abstract
The rapid integration of artificial intelligence (AI) into educational systems is transforming how student performance is analysed and how educational policies are informed by large-scale data. Within this context, machine learning techniques are increasingly used to identify patterns associated with academic success and educational inequality. However, the use of predictive algorithms in education also raises important questions regarding transparency, fairness, and potential algorithmic bias. This study examines the predictive performance and fairness implications of machine learning models used to identify academically resilient students using data from the Programme for International Student Assessment (PISA) 2022. The analysis is based on a dataset containing more than 600,000 student observations across multiple national education systems. Academic resilience is operationalised following the OECD framework, identifying students who belong to the lowest quartile of the socioeconomic status index (ESCS) within their country while simultaneously achieving mathematics performance in the top quartile (PV1MATH). A predictive framework incorporating six supervised learning algorithms—Logistic Regression, Random Forest, Gradient Boosting, XGBoost, LightGBM, and CatBoost—was implemented. The modelling pipeline includes data preprocessing, missing value imputation, class imbalance correction using SMOTE, and model evaluation through multiple classification metrics, including accuracy, F1-score, and the area under the ROC curve (AUC). In addition, fairness diagnostics are conducted to examine potential disparities in prediction outcomes across gender groups, while feature importance analysis and SHAP-based explanations are used to interpret the contribution of key predictors. The results indicate that ensemble-based models achieve the highest predictive performance, particularly those based on gradient boosting techniques. 
At the same time, the analysis reveals that socioeconomic status, migration background, and school repetition constitute the most influential predictors of academic resilience. Although gender displays relatively low predictive importance, measurable differences in positive prediction rates across gender groups suggest the presence of potential algorithmic disparities. These findings highlight the importance of integrating fairness evaluation, transparency, and interpretability into educational data science workflows. The study contributes to ongoing discussions on the responsible use of artificial intelligence in education by emphasising the need for governance frameworks capable of ensuring that algorithmic systems support equity-oriented educational policies. Full article
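One common way to quantify "differences in positive prediction rates across gender groups" is a demographic-parity-style gap: the absolute difference between each group's rate of positive predictions. A toy sketch of that diagnostic (the predictions and group labels are hypothetical, and this is only one of several fairness metrics the study could be computing):

```python
def positive_rate_gap(preds, groups):
    """Absolute difference in positive-prediction rates between two
    groups: a simple demographic-parity-style fairness diagnostic."""
    def rate(g):
        members = [p for p, gr in zip(preds, groups) if gr == g]
        return sum(members) / len(members)
    return abs(rate("f") - rate("m"))

# Hypothetical predictions (1 = predicted academically resilient)
preds  = [1, 0, 1, 1, 0, 0, 1, 0]
groups = ["f", "f", "f", "f", "m", "m", "m", "m"]
gap = positive_rate_gap(preds, groups)  # |3/4 - 1/4| = 0.5
```

A gap near zero means both groups are flagged at similar rates; a large gap is the kind of "algorithmic disparity" the abstract warns about even when the group variable has low feature importance.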
27 pages, 524 KB  
Article
Synthetic Data Augmentation for Imbalanced Tabular Protein Subcellular Localization: A Comparative Study of SMOTE, CTGAN, TVAE, and TabDDPM Methods
by Ali Fatih Gündüz and Canan Batur Şahin
Appl. Sci. 2026, 16(8), 3694; https://doi.org/10.3390/app16083694 - 9 Apr 2026
Viewed by 289
Abstract
Class imbalance is a persistent challenge in supervised machine learning, particularly in biological datasets where minority classes represent functionally critical categories. Synthetic data generation has emerged as a principal strategy for mitigating this problem, yet systematic comparisons of classical and modern deep generative approaches remain limited. This study presents a comprehensive benchmark evaluation of four synthetic data generation methods—SMOTE, CTGAN, TVAE, and TabDDPM—across two well-established biological datasets from the UCI Machine Learning Repository: the E. coli protein localization dataset (307 samples, 6 features, 4 classes) and the yeast protein localization dataset (1299 samples, 8 features, 4 classes). Synthetic data quality was rigorously assessed using a multi-dimensional evaluation framework encompassing distributional fidelity (Fréchet Distance, Wasserstein Distance), machine learning utility (Train-on-Synthetic-Test-on-Real and Train-on-Real-Test-on-Real protocols using XGBoost version 3.2.0, Logistic Regression, Support Vector Machines, and Random Forest), and distinguishability (Classifier Two-Sample Test). Both datasets are markedly imbalanced. During the experiments, each dataset was augmented to three times its original size while preserving the imbalanced class-sample ratio. To evaluate the quality of synthetic data, the max(AUC, 1−AUC) score is proposed: scores near 0.5 indicate that a two-sample classifier cannot easily distinguish synthetic from real data. Per-class analysis reveals that minority classes remain the primary challenge across all generative methods. SMOTE and TabDDPM obtained the highest predictive utility F1-scores across both datasets. TVAE offers the strongest distributional fidelity among deep generative models, producing synthetic samples that are most difficult to distinguish from real data (lowest C2ST scores).
CTGAN exhibits significant performance degradation on both small- and medium-scale datasets, with F1 utility ratios below 0.50. Full article
(This article belongs to the Section Computing and Artificial Intelligence)
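The proposed max(AUC, 1−AUC) score folds the two-sample classifier's ROC AUC so that 0.5 always means "indistinguishable" and values near 1.0 mean "easily separated", regardless of which class the classifier happens to favor. A sketch using the Mann-Whitney formulation of AUC (an illustration of the metric's definition, not the paper's code):

```python
def auc(scores_real, scores_synth):
    """Probability that a randomly chosen real sample receives a higher
    'looks real' score than a randomly chosen synthetic one (the
    Mann-Whitney formulation of ROC AUC; ties count as 0.5)."""
    wins = sum((r > s) + 0.5 * (r == s)
               for r in scores_real for s in scores_synth)
    return wins / (len(scores_real) * len(scores_synth))

def distinguishability(scores_real, scores_synth):
    """max(AUC, 1 - AUC): 0.5 means the two-sample classifier cannot
    tell real from synthetic; values near 1.0 mean it easily can."""
    a = auc(scores_real, scores_synth)
    return max(a, 1 - a)

# Identical score distributions → 0.5; fully separable scores → 1.0
assert distinguishability([0.5, 0.5], [0.5, 0.5]) == 0.5
assert distinguishability([0.9, 0.8], [0.2, 0.1]) == 1.0
```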
29 pages, 3152 KB  
Article
Enhancing Darknet Traffic Classification: Integrating Traffic-Aware SMOTE and Adaptive Weighted Feature Aggregation
by Javeriah Saleem, Rafiqul Islam, Irfan Altas and Md Zahidul Islam
J. Cybersecur. Priv. 2026, 6(2), 68; https://doi.org/10.3390/jcp6020068 - 7 Apr 2026
Viewed by 204
Abstract
With the widespread adoption of anonymity networks such as Tor, I2P, and JonDonym, reliably classifying darknet traffic remains challenging due to feature redundancy and severe class imbalance in encrypted flows. Existing approaches often rely on static feature-selection strategies and generic oversampling methods, which limit robustness and may distort traffic semantics. This study proposes an adaptive classification framework integrating Adaptive Weighted Feature Aggregation (AWFA) for reliability-aware feature selection and Traffic-Aware SMOTE (TA-SMOTE) for semantically constrained perturbations of packet-size and timing features while preserving flow-level structure. The framework is evaluated on a two-layer hierarchy comprising browser-level (L1) and application-level (L2) classification. At the application level (L2), the proposed AWFA and TA-SMOTE pipeline attains a macro-F1 score of 73.81%, significantly exceeding PCA-based reduction and traditional RF-based selection with SMOTE. At the browser level (L1), macro-F1 rises from 91.58% to 96.09% while reducing the feature space from 84 to 40 attributes, highlighting both performance improvements and structural efficiency gains. Additional semantic validation confirms that the balancing process preserves the statistical and structural characteristics of genuine darknet traffic. These results indicate that reliability-aware feature aggregation and traffic-aware balancing provide a practical, trustworthy approach to modern darknet traffic classification. Full article
(This article belongs to the Section Privacy)
35 pages, 2740 KB  
Article
Prediction of Depression Risk on Social Media Using Natural Language Processing and Explainable Machine Learning
by Ronewa Mabodi, Elliot Mbunge, Tebogo Makaba and Nompumelelo Ndlovu
Appl. Sci. 2026, 16(7), 3489; https://doi.org/10.3390/app16073489 - 3 Apr 2026
Viewed by 329
Abstract
Major Depressive Disorder (MDD) is a significant global health burden that contributes to disability and reduced quality of life. Its impact extends beyond individuals, placing emotional, social, and economic strain on families and healthcare systems worldwide. Despite its prevalence, MDD remains widely misunderstood, with limited mental health literacy and persistent stigma often preventing individuals from seeking help. This research explored the prediction of MDD utilising social media data via Natural Language Processing (NLP), Machine Learning (ML), and explainable Machine Learning (xML) techniques. The research aimed at identifying depressive indicators on X (formerly Twitter) and developing interpretable models for depression risk detection. The study’s methodology followed the Cross-Industry Standard Process for Data Mining (CRISP-DM) framework to ensure a systematic approach to data analysis. Data was collected via X’s API and processed using regex-based noise removal, normalisation, tokenisation, and lemmatisation. Symptoms were mapped to DSM-5-TR criteria at the post-level, with user-level MDD risk assessed based on symptom persistence over a two-week period. Risk levels were classified as No Risk, Monitor, and High Risk to facilitate early intervention. Six ML models were trained and tested, while the Synthetic Minority Over-sampling Technique (SMOTE) was applied to mitigate class imbalance. The dataset was partitioned into training and testing sets using an 80:20 split. ML models were evaluated, and the Extreme Gradient Boosting model outperformed the others. Extreme Gradient Boosting achieved an accuracy of 0.979, F1-score of 0.970, and ROC-AUC of 0.996, surpassing benchmark results reported in prior studies. Explainability techniques, such as LIME and tree-based feature importance, enhance model transparency and clinical interpretability. Depressed mood consistently emerged as the highest-weighted predictor across different models. 
The findings highlight the value of aligning ML models with validated diagnostic frameworks to improve trustworthiness and reduce false positives. Future research can expand beyond text-based analysis by incorporating multimodal features to broaden diagnostic depth. Full article
(This article belongs to the Special Issue Deep Learning and Machine Learning in Information Systems)
25 pages, 12554 KB  
Article
An Explainable Artificial Intelligence-Driven Framework for Predicting Groundwater Irrigation Suitability in Hard-Rock Aquifers: Moving Beyond Traditional Bivariate Diagnostics
by Mohamed Hussein Yousif, Quanrong Wang, Anurag Tewari, Abara A. Biabak Indrick, Hafizou M. Sow, Yousif Hassan Mohamed Salh and Wakeel Hussain
Water 2026, 18(7), 854; https://doi.org/10.3390/w18070854 - 2 Apr 2026
Viewed by 453
Abstract
Groundwater is the primary source of irrigation in many semi-arid hard-rock aquifer regions. Yet, its suitability assessment is often hindered by the nonlinear hydrochemical dynamics that traditional bivariate tools, such as the U.S. Salinity Laboratory (USSL) diagram, cannot adequately resolve. To overcome this limitation, we developed an explainable artificial intelligence (XAI) framework that predicts irrigation suitability categories directly from hydrochemical variables, without relying on calculated indices. Using 1872 post-monsoon groundwater samples from Telangana, India, we trained three ensemble tree-based classifiers (Random Forest, LightGBM, and XGBoost) on 11 hydrochemical variables (Na+, K+, Ca2+, Mg2+, HCO3, Cl, F, NO3, SO42−, pH, and total hardness). Class imbalance was addressed using the Synthetic Minority Over-sampling Technique (SMOTE), and model hyperparameters were optimized with Optuna. Among the tested models, LightGBM achieved the best performance (balanced accuracy = 0.938). Model interpretability was enabled using Shapley Additive Explanations (SHAP), supported by Piper and Gibbs diagrams, revealing a critical distinction between sodicity-driven salinity and hardness-driven mineralization, identifying calcium-saturated waters for which gypsum amendment can be chemically futile. To bridge the gap between algorithmic accuracy and operational simplicity, we distilled SHAP explanations into linear heuristics and quantified the trade-off between accuracy and simplicity. Accordingly, we proposed a tiered hydrochemical triage framework in which quantitative heuristics handled approximately 62.5% of the routine samples, while XAI resolved the complex and ambiguous cases. Overall, the proposed framework transforms classic suitability assessment tools into an adaptable, evidence-informed, proactive decision-support system for sustainable agricultural water management under increasing environmental stress. Full article
28 pages, 4366 KB  
Article
Temporal Transformer with Conditional Tabular GAN for Credit Card Fraud Detection: A Sequential Deep Learning Approach
by Jiaying Chen, Yiwen Liang, Jingyi Liu and Mengjie Zhou
Mathematics 2026, 14(7), 1183; https://doi.org/10.3390/math14071183 - 1 Apr 2026
Viewed by 556
Abstract
Credit card fraud detection remains a critical challenge in financial security, characterized by severe class imbalance and the need to capture complex temporal patterns in transaction sequences. Traditional machine learning approaches treat transactions as independent events, failing to model the sequential nature of user behavior and suffering from inadequate handling of minority class samples. In this paper, we propose an integrated framework that combines generative modeling and time-aware sequential learning for credit card fraud detection. Our approach addresses two fundamental limitations: (1) we model transaction histories as temporal sequences using a Transformer-based architecture that captures both long-term dependencies and abrupt behavioral changes through multi-head self-attention mechanisms, and (2) we employ CTGAN to generate high-quality synthetic fraudulent samples, providing more effective oversampling than conventional techniques like SMOTE. The Time-Aware Transformer incorporates temporal encoding and position-aware attention to preserve transaction order and time intervals, while CTGAN learns the complex conditional distributions of fraudulent transactions to produce realistic synthetic samples. We evaluate our framework on the IEEE-CIS Fraud Detection dataset, demonstrating significant improvements over representative classical and sequential deep-learning baselines. Experimental results show that our method achieves superior performance with an AUC-ROC of 0.982, precision of 0.891, recall of 0.876, and F1-score of 0.883, outperforming the representative baselines considered in this study, including traditional machine learning models, standalone deep learning architectures, and supervised sequential neural models. Ablation studies confirm the individual contributions of both the sequential modeling component and the generative oversampling strategy. 
Our work demonstrates that combining temporal sequence modeling with generative synthesis provides a robust solution for imbalanced fraud detection, with potential applications extending to other domains requiring sequential pattern recognition under extreme class imbalance. Full article
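The time-aware attention idea sketched in this abstract rests on encoding real inter-transaction time gaps, not just sequence positions. As a hedged illustration of that encoding step only (the function name, dimension, and frequency base are assumptions, not the authors' implementation):

```python
import math

def temporal_encoding(dt_seconds, d_model=8):
    """Sinusoidal encoding of the time gap between consecutive
    transactions: the same sin/cos scheme as positional encoding,
    but driven by the real elapsed time rather than the index."""
    enc = []
    for i in range(0, d_model, 2):
        freq = 1.0 / (10000 ** (i / d_model))
        enc.append(math.sin(dt_seconds * freq))
        enc.append(math.cos(dt_seconds * freq))
    return enc[:d_model]
```

In a full model, such a vector would be added to each transaction's feature embedding before the multi-head self-attention layers, letting the network distinguish a burst of purchases minutes apart from the same purchases spread over weeks.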

22 pages, 2730 KB  
Article
Ensemble Learning Based on Bagging and Hybrid Sampling for Food Safety Risk Prediction
by Dafang Li, Zhengyong Zhang, Qingchun Wu and Xin Chen
Foods 2026, 15(7), 1176; https://doi.org/10.3390/foods15071176 - 31 Mar 2026
Viewed by 282
Abstract
Food safety sampling inspections are critical for risk prevention in complex supply chains, yet the extremely low frequency of high-risk samples poses substantial challenges for accurate risk prediction. To address the limitations of conventional machine learning models under severe class imbalance, this study proposes a unified Bagging–Stacking framework that integrates stacking ensembles, bagging, and SMOTE–Tomek hybrid resampling to enhance minority-class detection in food safety risk prediction. The stacking ensemble serves as the core of the framework, combining five tree-based base learners with Logistic Regression as the meta-learner to enhance classification robustness. Balanced bootstrap subsets generated through bagging and SMOTE–Tomek hybrid resampling further improve minority-class representation, while a probability-based threshold optimization mechanism is incorporated to refine high-risk classification. Experiments on real-world inspection data show that the proposed framework substantially improves high-risk recall while simultaneously increasing precision, yielding the highest F1 among all compared models. It also maintains a stable overall performance across varying test set proportions, demonstrating strong robustness and consistent generalization under varying evaluation conditions. SHAP analysis identifies storage conditions, production month, shelf life, package, and food category as key contributors to risk prediction, aligning with established mechanisms of food safety risk formation. Overall, the proposed framework provides accurate, robust, and interpretable support for food safety risk prediction, offering practical value for proactive risk prevention and more efficient regulatory resource allocation. Full article
(This article belongs to the Section Food Engineering and Technology)
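The SMOTE–Tomek hybrid resampling described above combines two steps: synthesize minority samples by interpolation, then remove majority points involved in Tomek links (opposite-class mutual nearest neighbours). A minimal pure-Python sketch of both ideas, under the assumption of small 2-D toy data (in practice one would use `imbalanced-learn`'s `SMOTETomek`):

```python
import random

def smote_interpolate(minority, k_new, rng=random.Random(0)):
    """Create k_new synthetic minority samples by linear interpolation
    between random pairs of existing minority samples (the core SMOTE idea)."""
    synth = []
    for _ in range(k_new):
        a, b = rng.sample(minority, 2)
        lam = rng.random()
        synth.append(tuple(ai + lam * (bi - ai) for ai, bi in zip(a, b)))
    return synth

def tomek_majority_links(X, y):
    """Indices of majority-class points (label 0) that form a Tomek link:
    a pair of opposite-class points that are each other's nearest neighbour."""
    def nn(i):
        return min((j for j in range(len(X)) if j != i),
                   key=lambda j: sum((X[i][d] - X[j][d]) ** 2
                                     for d in range(len(X[i]))))
    drop = set()
    for i in range(len(X)):
        j = nn(i)
        if y[i] == 0 and y[i] != y[j] and nn(j) == i:
            drop.add(i)
    return drop
```

Dropping the returned indices cleans the class boundary after oversampling, which is what lets the framework raise minority-class recall without flooding the majority region with synthetic points.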

22 pages, 2045 KB  
Article
GA-SMOTE-RF Enhanced Kalman Filter with Adaptive Noise Reduction
by Yiming Wang, Hui Zou, Yuzhou Liu, Tianchang Qiao, Xinyuan Xu, Yihang Li, Changxun He, Shunv Zhou, Hanjie Wang, Qingqing Geng and Qiqi Song
Sensors 2026, 26(7), 2165; https://doi.org/10.3390/s26072165 - 31 Mar 2026
Viewed by 279
Abstract
Low-noise free-space laser communication has widespread applications in military and rescue fields, but atmospheric turbulence severely affects communication quality. This paper proposes an intelligent classification and adaptive noise reduction system that integrates genetic algorithms (GA), the synthetic minority oversampling technique (SMOTE), random forest (RF), and Kalman filtering, significantly improving turbulence channel interference classification accuracy and communication quality. Simulation results show that the system achieves a classification accuracy of 98.27%, with a corresponding F1-score of 0.9732 and MCC of 0.9653, far exceeding algorithms such as SVM and KNN. After noise reduction, the average RMSE for 400 signal groups is 0.6983, with zero estimated delay, and the mean and standard deviation of the innovation sequence are −0.0049 and 0.6960, respectively, demonstrating excellent signal quality and efficient real-time processing capabilities. Beyond synthetic simulations, we conducted real-world FSO data studies to validate practical applicability. A 24-hour field experiment collected 283 real FSO measurement windows, on which the proposed GA–SMOTE–RF method achieves 0.308 RMSE and 0.75% Average Regret in Kalman filter parameter selection, outperforming KNN and SVM, confirming practical applicability for real-world FSO systems. Full article
(This article belongs to the Special Issue Antenna Technology for Advanced Communication and Sensing Systems)
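In the scheme above, the RF classifier's job is to select Kalman filter noise parameters per turbulence class; the filter itself then denoises the received signal. A minimal one-dimensional Kalman filter sketch showing which parameters are being selected (illustrative only; the paper's state model and tuning are not reproduced here):

```python
def kalman_denoise(z, q, r, x0=0.0, p0=1.0):
    """One-dimensional Kalman filter smoothing a noisy signal z.
    q (process noise variance) and r (measurement noise variance) are
    the parameters a per-class selector would choose; the difference
    (zk - x) at each step is the innovation whose mean and spread
    diagnose filter consistency."""
    x, p = x0, p0
    out = []
    for zk in z:
        p = p + q             # predict: prior variance grows by q
        k = p / (p + r)       # Kalman gain
        x = x + k * (zk - x)  # update state with the innovation
        p = (1 - k) * p       # posterior variance
        out.append(x)
    return out
```

A near-zero innovation mean (as the reported −0.0049) indicates the filter is unbiased, while the innovation standard deviation tracks how well q and r match the actual channel noise.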

27 pages, 13483 KB  
Article
Research on a Prediction Method for Maintenance Decision of Expressway Asphalt Pavement Based on Random Forest
by Chunguang He, Ya Duan, Tursun Mamat, Xinglin Zhu, Mahjoub Dridi, Yazan Mualla and Abdeljalil Abbas-Turki
Appl. Sci. 2026, 16(7), 3343; https://doi.org/10.3390/app16073343 - 30 Mar 2026
Viewed by 212
Abstract
This study predicts expressway asphalt pavement maintenance decisions using machine learning to overcome the information loss inherent in traditional composite indices like PQI and PCI. Using ten years of inspection data from the G3012 Expressway in Xinjiang, an interpretable Random Forest (RF) model was developed. The methodology integrates permutation-based feature selection, three imbalance mitigation strategies (Balanced Weighting, SMOTE, and Cost-Sensitive Learning), and a rigorous time-aware validation framework. Results indicate that raw distress features—specifically strip repairs, block cracking, transverse and longitudinal cracking—are the most influential predictors, significantly outperforming aggregated indices. The optimized model, using Balanced Weighting and mean imputation, achieved an accuracy of 0.826 and ROC-AUC of 0.853 under strict temporal validation, effectively identifying the minority “repair” class. This research demonstrates that leveraging raw distress data through an interpretable ensemble framework provides a robust, data-driven alternative to threshold-based planning, supporting the transition from reactive to preventive maintenance in complex infrastructure management. Full article
