MDPI - Publisher of Open Access Journals

36 pages, 3400 KB

Open Access Article

Identifying Pre-Existing Diabetes at ICU Admission with Machine Learning on Public GOSSIS Data

by Lily Popova Zhuhadar

Diabetology 2026, 7(5), 100; https://doi.org/10.3390/diabetology7050100 - 21 May 2026

Viewed by 178

Background: Pre-existing diabetes mellitus is prevalent among critically ill adults and can influence initial glycemic targets, therapeutic decisions, and early risk stratification in the intensive care unit (ICU). However, diabetes status may be distributed across heterogeneous electronic health record (EHR) sources and may be incomplete at the time of ICU admission, particularly for inter-facility transfers. Methods: Using the public WiDS Datathon 2021 tabular release derived from the Global Open-Source Severity of Illness Score (GOSSIS) initiative, we conducted a retrospective machine-learning benchmarking study for admission-time identification of documented diabetes status in ICU patients. Candidate predictors included demographics, admission characteristics, anthropometrics, day-1 physiologic and laboratory summaries, APACHE-related variables, comorbidity indicators, and site descriptors. We compared CatBoost, random forest, tuned XGBoost, tuned LightGBM, histogram-based gradient boosting, and a soft-voting ensemble combining XGBoost, LightGBM, and histogram-based gradient boosting. Because class imbalance was a central concern, the final workflow emphasized model-intrinsic class weighting and threshold-aware evaluation rather than synthetic oversampling. Results: In the primary leakage-mitigated random validation split, the voting ensemble achieved the highest overall balance, with AUROC 0.8539, precision 0.5671, recall 0.6690, and F1-score 0.6138. Tuned LightGBM was the most sensitivity-oriented individual model, achieving recall 0.7677 and AUROC 0.8537, although with lower precision and a less favorable Brier score. 
Ablation analyses clarified the source of this performance: removing leakage-prone and APACHE-related variables caused only modest decreases in discrimination, whereas the strict reduced model that also excluded glucose-like predictors produced a marked decline, with LightGBM AUROC falling to 0.7432 and the voting ensemble AUROC falling to 0.7448. These findings, together with SHAP analyses identifying day-1 glucose maximum, day-1 glucose minimum, BMI, age, hemoglobin, and related clinical variables as major contributors, indicate that glucose-related admission variables remained the dominant predictive signal. In grouped hospital validation, tuned LightGBM maintained recall of 0.7684 while AUROC decreased modestly to 0.8443, indicating preserved case detection under stricter site separation but reduced precision. Precision–recall analysis further showed that average precision decreased from 0.622 under random validation to 0.551 under grouped validation; at a high-sensitivity grouped-site operating point, a probability threshold of 0.4537 achieved recall of 0.8001 with precision of 0.4314. Calibration curves and Brier scores showed that predicted probabilities were imperfectly calibrated. Conclusions: Although the dominance of glucose-related predictors is clinically plausible for identifying documented diabetes status, early glycemic measurements in critically ill patients may also partly capture acute stress physiology, treatment-related effects, monitoring intensity, or other forms of acute dysglycemia rather than chronic diabetes status alone. Therefore, these findings support gradient-boosted and ensemble models as reproducible tools for ICU admission-time phenotyping of documented diabetes status, but the proposed system should be interpreted primarily as a screening-oriented phenotyping aid for chart review, cohort enrichment, or workflow support, not as a stand-alone diagnostic tool. 
Further external validation, recalibration, threshold selection matched to intended use, and clinical review are needed before deployment.
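The abstract reports a high-sensitivity operating point (threshold 0.4537, recall 0.8001, precision 0.4314). As a minimal sketch of how such a point can be located, the code below scans candidate thresholds and keeps the highest one that still reaches a target recall. The function names and the toy labels and scores are illustrative assumptions, not the authors' code or GOSSIS data.

```python
from typing import Sequence, Tuple


def precision_recall_at(y_true: Sequence[int], scores: Sequence[float],
                        threshold: float) -> Tuple[float, float]:
    """Precision and recall when `score >= threshold` is called positive."""
    tp = fp = fn = 0
    for label, score in zip(y_true, scores):
        predicted_positive = score >= threshold
        if predicted_positive and label == 1:
            tp += 1
        elif predicted_positive and label == 0:
            fp += 1
        elif not predicted_positive and label == 1:
            fn += 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def highest_threshold_for_recall(y_true: Sequence[int],
                                 scores: Sequence[float],
                                 target_recall: float) -> float:
    """Largest observed score that, used as a threshold, reaches the target recall."""
    for threshold in sorted(set(scores), reverse=True):
        _, recall = precision_recall_at(y_true, scores, threshold)
        if recall >= target_recall:
            return threshold
    raise ValueError("target recall is not reachable on this data")


# Toy, made-up labels and predicted probabilities (hypothetical values).
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
probs = [0.9, 0.8, 0.6, 0.45, 0.2, 0.7, 0.5, 0.3, 0.1, 0.05]
threshold = highest_threshold_for_recall(labels, probs, target_recall=0.8)
precision, recall = precision_recall_at(labels, probs, threshold)
print(threshold, round(precision, 3), recall)
```

Choosing the highest qualifying threshold keeps precision as large as the recall target allows, which suits a screening-oriented use where missed cases cost more than extra chart reviews.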

19 pages, 1890 KB

Open Access Article

Machine Learning-Driven Prediction of Plant Water Potential in Kiwifruit Under Mediterranean Conditions

by Panagiotis Patseas, Anastasios Katsileros, Efthymios Kokkotos, Angelos Patakas and Anastasios Zotos

Agronomy 2026, 16(10), 1005; https://doi.org/10.3390/agronomy16101005 - 20 May 2026

Viewed by 131

Abstract

Kiwifruit (Actinidia deliciosa cv. Hayward) is a high-demand crop due to its nutritional value. Climate change increasingly challenges its cultivation, particularly under Mediterranean conditions, due to limited water resources. Therefore, the early detection of water stress onset is crucial for optimizing irrigation water use and enhancing kiwi productivity. In this context, advanced sensors capable of continuously monitoring critical hydrodynamic parameters, combined with machine learning approaches, offer a promising solution for reliable prediction of plant water status, supporting irrigation decision-making systems. This study develops and evaluates machine learning (ML) models to predict trunk water potential (Ψtrunk), integrating soil moisture, climatic variables, and plant-based measurements, including sap flow. Various ML models were evaluated, including Ridge Regression, Lasso Regression, Random Forest, Support Vector Machine (SVM), Extreme Gradient Boosting (XGBoost), and Light Gradient Boosting Machine (LightGBM), using soil moisture, trunk water potential (Ψtrunk), sap flow, and microclimatic variables (relative humidity, wind speed, temperature, solar radiation, vapor pressure deficit, and reference evapotranspiration). Among the tested models, XGBoost demonstrated the best performance, achieving an accuracy of approximately 0.80, followed by Ridge, Lasso, and SVM, which showed similar accuracy.
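Vapor pressure deficit, one of the microclimatic predictors listed above, is usually derived from air temperature and relative humidity. The sketch below uses the FAO-56 saturation vapor pressure equation; the abstract does not say which formula the authors used, so this is an assumption, and the function names are illustrative.

```python
import math


def saturation_vapor_pressure_kpa(temp_c: float) -> float:
    """Saturation vapor pressure (kPa) at air temperature temp_c (deg C), FAO-56 form."""
    return 0.6108 * math.exp(17.27 * temp_c / (temp_c + 237.3))


def vapor_pressure_deficit_kpa(temp_c: float, rh_percent: float) -> float:
    """VPD (kPa) = saturation vapor pressure minus actual vapor pressure."""
    if not 0.0 <= rh_percent <= 100.0:
        raise ValueError("relative humidity must be between 0 and 100")
    return saturation_vapor_pressure_kpa(temp_c) * (1.0 - rh_percent / 100.0)


# Hypothetical reading: 25 deg C at 50 % relative humidity.
vpd = vapor_pressure_deficit_kpa(25.0, 50.0)
print(round(vpd, 2))
```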

(This article belongs to the Special Issue Crop Production in the Era of Climate Change)

27 pages, 2283 KB

Open Access Article

Mining Customer Satisfaction from Online Reviews: An Explainable Kano-Based Framework for Product Improvement

by Huiru Yu and Yanlai Li

Systems 2026, 14(5), 585; https://doi.org/10.3390/systems14050585 - 20 May 2026

Viewed by 185

Abstract

Improving customer satisfaction (CS) and gaining competitive advantages are central goals of product improvement, both of which rely on accurate classification of product attributes. Online reviews on e-commerce platforms provide firms with abundant customer feedback, but accurately classifying and prioritizing product attributes remains challenging. To address this issue, we propose an interpretable Kano model. In this method, the Biterm Topic Model (BTM) is first used to identify product attributes from reviews. Then, the Enhanced BERT Model with Attribute-Aware and Convolutional Mechanisms (BERT-A-Conv) is employed to classify the sentiment categories of these attributes. Given the critical role of neutral sentiment, it is incorporated into the Light Gradient Boosting Machine (LightGBM) model to quantify the impact of AS on CS. The Shapley additive explanations (SHAP) method is then adopted to construct the marginal contribution difference (MCD) between adjacent categories, which this study uses to classify product attributes into five Kano categories. On this basis, we calculate the attribute improvement priority score (AIPS) by combining each attribute’s MCD and improvement potential, thereby offering firms a systematic analytical framework to support product iteration and improvement. A case study on smartwatches demonstrates the applicability and feasibility of the proposed method.

(This article belongs to the Section Systems Practice in Social Science)

16 pages, 3681 KB

Open Access Article

Application of Machine Learning Models for Predicting pIC50 Values of Plasticizers Against Cytochrome P450 Aromatase

by Itumeleng Lucky Mongadi, Nomasonto Rapulenyane, Walter Bonke Mahlangu and Jean-Nazaire Oyourou

Chemistry 2026, 8(5), 68; https://doi.org/10.3390/chemistry8050068 - 20 May 2026

Viewed by 191

Abstract

This study investigated the application of six machine learning regression algorithms (Random Forest, CatBoost, K-Nearest Neighbours, XGBoost, LightGBM, and Gradient Boosting), paired with Molecular ACCess System (MACCS) key fingerprints, for the quantitative prediction of aromatase (CYP19A1) inhibitory potency, expressed as pIC₅₀. A dataset of 187 compounds was assembled from the ChEMBL database (version 33, Target ID: CHEMBL1978), followed by a systematic data curation workflow encompassing duplicate removal, pIC₅₀ transformation, and activity-based filtering. Model performance was rigorously evaluated using an 80/20 stratified train/test split, 5-fold cross-validation, and Y-randomisation testing to ensure unbiased assessment of predictive generalisation. Feature selection via CatBoost permutation importance on the held-out test set identified the top 20 predictive MACCS keys from an initial 166-bit space, substantially reducing dimensionality and improving generalisation across all models. Among the algorithms evaluated, CatBoost trained on the top 20 features achieved the strongest test-set performance (R² = 0.693, RMSE = 0.794, MAE = 0.659) with the most stable cross-validation R² (0.062 ± 0.304), outperforming all other algorithms. Y-randomisation testing returned an empirical p-value of <0.01, confirming that model performance reflects genuine structure–activity relationships rather than statistical chance. Permutation importance and SHAP analysis identified nitrogen-containing heterocyclic fragments (MACCS_41, MACCS_145) and halide-bearing substructures (MACCS_109) as the primary structural determinants of aromatase inhibitory potency, consistent with established CYP19A1 pharmacophoric requirements. Application of the model to ten representative plasticizers demonstrated that the refined applicability domain (h* = 0.423) accommodated eight of the ten compounds, enabling reliable potency predictions across phthalate esters and bisphenol analogues.
These findings establish a transparent and reproducible QSAR framework for first-tier endocrine disruption risk screening of plasticizers and highlight the importance of permutation-based feature selection and applicability domain assessment in QSAR model development.
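Two quantities in this abstract follow simple formulas. The sketch below shows the pIC₅₀ transformation (the negative base-10 logarithm of IC₅₀ in mol/L) and the conventional leverage warning threshold h* = 3(p + 1)/n used in Williams-plot applicability domains. The abstract does not state how h* was computed or the training-set size, so the h* formula and the value n = 149 (roughly 80% of 187 compounds) are assumptions made only for illustration.

```python
import math


def pic50_from_ic50_nm(ic50_nm: float) -> float:
    """pIC50 = -log10(IC50 in mol/L); with IC50 given in nM this is 9 - log10(IC50)."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be positive")
    return 9.0 - math.log10(ic50_nm)


def leverage_threshold(n_features: int, n_training: int) -> float:
    """Conventional warning leverage h* = 3 (p + 1) / n (assumed form)."""
    if n_training <= 0:
        raise ValueError("training set must not be empty")
    return 3.0 * (n_features + 1) / n_training


# An IC50 of 1000 nM (1 micromolar) corresponds to pIC50 = 6.
example_pic50 = pic50_from_ic50_nm(1000.0)

# Assumed sizes: 20 selected MACCS keys and 149 training compounds (hypothetical n).
h_star = leverage_threshold(n_features=20, n_training=149)
print(round(example_pic50, 3), round(h_star, 3))
```

Under these assumed sizes the threshold comes out close to the reported h* = 0.423, though that agreement does not confirm how the authors derived their value.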

(This article belongs to the Special Issue AI and Big Data in Chemistry)

31 pages, 4570 KB

Open Access Article

An IWMA-Optimized LightGBM Model for Early Ketosis Risk Screening in Dairy Cows Using DHI Data

by Yang Yang, Yongqiang Dai, Huan Liu and Rui Guo

Appl. Sci. 2026, 16(10), 5050; https://doi.org/10.3390/app16105050 - 19 May 2026

Viewed by 95

Abstract

Ketosis is a prevalent metabolic disorder in early-lactation dairy cows, significantly affecting animal health, milk production, and farm profitability. Developing accurate and non-invasive methods for early risk detection is therefore of critical importance. In this study, a hybrid optimization framework integrating an Improved Whale Migration Algorithm (IWMA) with a Light Gradient Boosting Machine (LightGBM) is proposed to predict ketosis risk based on the milk fat-to-protein ratio (F/P) using Dairy Herd Improvement (DHI) records. The proposed IWMA enhances optimization performance through cubic chaotic initialization, elite opposition-based learning, and a Cauchy–Gaussian hybrid mutation strategy, enabling improved global exploration and convergence stability. A dataset comprising 25,155 DHI records collected from multiple commercial dairy farms over seven months was used for model development and evaluation. Experimental results demonstrate that the IWMA–LightGBM model achieves a classification accuracy of 0.8997 and a mean squared error of 0.289, consistently outperforming six benchmark optimization methods. Feature analysis identifies Herd Within Index (WHI), Energy Corrected Milk (ECM), Days in Milk (DIM), Milk Urea Nitrogen, and Foremilk as key predictors associated with metabolic risk. Overall, the proposed approach provides a robust and effective non-invasive solution for early-stage metabolic risk screening at the herd level, offering practical value for precision dairy management. It should be noted that the model is intended for risk assessment rather than clinical diagnosis of ketosis.
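Cubic chaotic initialization, one of the IWMA components named above, replaces uniform random sampling with a chaotic sequence when seeding the population. The sketch below uses one form of the cubic map common in the metaheuristics literature, x ← ρ·x·(1 − x²) with ρ = 2.595. The map form, the control parameter, the starting value, and the two hyperparameter ranges are all assumptions, since the abstract gives no such details.

```python
from typing import List, Sequence


def cubic_chaotic_population(pop_size: int,
                             lower: Sequence[float],
                             upper: Sequence[float],
                             rho: float = 2.595,
                             start: float = 0.3) -> List[List[float]]:
    """Seed a population by iterating the cubic map and scaling into [lower, upper]."""
    if len(lower) != len(upper):
        raise ValueError("lower and upper bounds must have the same length")
    if not 0.0 < start < 1.0:
        raise ValueError("start must lie strictly between 0 and 1")
    x = start
    population = []
    for _ in range(pop_size):
        individual = []
        for lo, hi in zip(lower, upper):
            # For 0 < x < 1 and rho = 2.595 the next value stays inside (0, 1).
            x = rho * x * (1.0 - x * x)
            individual.append(lo + x * (hi - lo))
        population.append(individual)
    return population


# Hypothetical search ranges for two LightGBM hyperparameters:
# learning_rate in [0.01, 0.3] and num_leaves in [8, 256].
population = cubic_chaotic_population(20, lower=[0.01, 8.0], upper=[0.3, 256.0])
print(population[0])
```

In an optimizer, integer-valued hyperparameters such as `num_leaves` would be rounded before a model is trained on each candidate.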

(This article belongs to the Special Issue Advanced Agricultural Technologies: Monitoring, Modeling, and Machine Learning Techniques)

18 pages, 2447 KB

Open Access Article

Integrated Machine Learning and Health Risk Assessment for Groundwater Nitrate Contamination in Handan City, China

by Yuanchao Zhao, Jing Liu, Xiaokai Zhang, Qun Li and Jin Wu

Water 2026, 18(10), 1174; https://doi.org/10.3390/w18101174 - 13 May 2026

Viewed by 258

Abstract

Groundwater nitrate (NO₃⁻) pollution is a critical environmental challenge with direct implications for human health. In this work, we propose a comprehensive analytical framework that integrates multi-model intercomparison, interpretable machine learning techniques, and quantitative health risk evaluation to tackle the pressing groundwater nitrate governance dilemmas in Handan City, a representative urban area in North China. Based on 157 groundwater samples and 17 hydrochemical parameters, comparative analysis of three state-of-the-art machine learning algorithms showed that the Light Gradient Boosting Machine (LightGBM) algorithm outperformed all counterparts, delivering the optimal predictive performance (R² = 0.753, RMSE = 3.67). SHapley Additive exPlanations (SHAP) analysis identified F⁻, Ca²⁺, Cl⁻, K⁺, total hardness, and Mg²⁺ as dominant factors influencing groundwater NO₃⁻ concentrations, reflecting the combined effects of carbonate dissolution, nitrification, and anthropogenic inputs. Subsequently, we performed a health risk assessment based on the standard methodological framework issued by the United States Environmental Protection Agency (USEPA), and the results indicated that children were the most vulnerable group, with hazard quotient (HQ, a non-carcinogenic risk indicator) values reaching 1.07 in the western mountainous region, exceeding the safety threshold (HQ > 1). These findings clarify the pollution mechanisms and spatial heterogeneity, and provide targeted policy guidance for groundwater protection as well as the safeguarding of public health.
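The hazard quotient in a USEPA-style oral exposure assessment is typically the chronic daily intake divided by a reference dose. The sketch below follows that general form. The exposure parameters in the example are hypothetical placeholders, and the reference dose of 1.6 mg/kg/day is the value commonly used for nitrate in such assessments; the abstract does not list the values the authors applied.

```python
def chronic_daily_intake(conc_mg_per_l: float,
                         intake_l_per_day: float,
                         exposure_freq_days_per_year: float,
                         exposure_duration_years: float,
                         body_weight_kg: float,
                         averaging_time_days: float) -> float:
    """CDI (mg/kg/day) = C * IR * EF * ED / (BW * AT) for drinking-water ingestion."""
    if body_weight_kg <= 0 or averaging_time_days <= 0:
        raise ValueError("body weight and averaging time must be positive")
    return (conc_mg_per_l * intake_l_per_day * exposure_freq_days_per_year
            * exposure_duration_years) / (body_weight_kg * averaging_time_days)


def hazard_quotient(cdi_mg_per_kg_day: float,
                    reference_dose_mg_per_kg_day: float = 1.6) -> float:
    """HQ = CDI / RfD; HQ > 1 flags potential non-carcinogenic risk."""
    return cdi_mg_per_kg_day / reference_dose_mg_per_kg_day


# Hypothetical child scenario (all values are placeholders, not from the study).
cdi = chronic_daily_intake(conc_mg_per_l=16.0, intake_l_per_day=1.0,
                           exposure_freq_days_per_year=365.0,
                           exposure_duration_years=6.0,
                           body_weight_kg=10.0,
                           averaging_time_days=6.0 * 365.0)
hq = hazard_quotient(cdi)
print(round(cdi, 3), round(hq, 3))
```

Because intake scales with the intake-to-body-weight ratio, children usually reach HQ > 1 at lower concentrations than adults, which is consistent with the abstract's finding that children were the most vulnerable group.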

(This article belongs to the Section Hydrogeology)

24 pages, 3507 KB

Open Access Article

A Comparative Study on Rice Diversity Mapping with PlanetScope and Sentinel-2 Red Edge Bands Based on Key Phenological Characteristics

by Yujun Wang, Yating Zhan, Ke Song, Yin Li, Ziqiao Xu, Hui Mu, Yingshi Xu, Yanmei Cui and Liang Hang

AgriEngineering 2026, 8(5), 187; https://doi.org/10.3390/agriengineering8050187 - 10 May 2026

Viewed by 289

Abstract

Precise mapping of rice cultivars is of great significance for crop management and food security evaluation. Nevertheless, differentiating between Indica and Japonica rice remains a formidable task, mainly due to subtle discrepancies in spectral characteristics and scattered planting distributions. This study evaluated the synergistic effect of spatial resolution and red-edge information in rice variety classification using PlanetScope (PS) and Sentinel-2 (S2) images from the Tillering and Jointing stage and the Heading and Flowering stage in Huai’an, Jiangsu Province. A total of eight feature schemes were constructed from spectral bands, vegetation indices, and texture features, with and without red-edge variables; the schemes are divided according to sensor, growth stage, and the inclusion of red-edge features. We fine-tune three classification models, Random Forest (RF), Light Gradient Boosting Machine (LightGBM), and TabNet, to enhance classification performance. Additionally, we employ Shapley Additive Explanations (SHAP) to quantitatively measure the contribution of each feature to the prediction of distinct rice varieties. Results demonstrate that the classification accuracy of both sensors is highest at the Heading and Flowering stage. The overall accuracy of the PS scheme is 98.14%, with F1 scores for Japonica and Indica rice of 97.67% and 98.41%; the overall accuracy of the S2 scheme is 97.87%, with F1 scores for Japonica and Indica rice of 98.62% and 98.68%, respectively. Incorporating red-edge features leads to a notable improvement in F1-scores for both Indica and Japonica rice under all experimental configurations.
Although PS has only one red-edge band, its classification performance is similar to that of S2, and the boundaries between rice varieties and between rice and non-rice plots are more refined than those obtained with S2. Feature attribution analysis reveals that red-edge indices exert a dominant influence on the decision-making process of the models, especially during the Heading–Flowering period. These findings suggest that high-accuracy discrimination of rice varieties relies heavily on the synergistic optimization of phenological timing, red-edge spectral information, and spatial resolution, rather than merely increasing spectral dimensionality, and that future work on high-precision rice variety mapping should prioritize this combination.
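The abstract does not name the red-edge indices it found dominant. As an illustration of what such a feature looks like, the sketch below computes the Normalized Difference Red Edge index (NDRE), a widely used red-edge index, from near-infrared and red-edge reflectance. Using NDRE here is an assumption, and the reflectance values are made up.

```python
from typing import List, Sequence


def ndre(nir: float, red_edge: float) -> float:
    """NDRE = (NIR - RedEdge) / (NIR + RedEdge); returns 0.0 when both bands are 0."""
    denominator = nir + red_edge
    if denominator == 0:
        return 0.0
    return (nir - red_edge) / denominator


def ndre_per_pixel(nir_band: Sequence[float],
                   red_edge_band: Sequence[float]) -> List[float]:
    """Apply NDRE to two equally long lists of surface reflectance values."""
    if len(nir_band) != len(red_edge_band):
        raise ValueError("bands must have the same number of pixels")
    return [ndre(n, r) for n, r in zip(nir_band, red_edge_band)]


# Hypothetical reflectance values for three pixels.
values = ndre_per_pixel([0.5, 0.4, 0.0], [0.3, 0.4, 0.0])
print([round(v, 3) for v in values])
```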

27 pages, 2660 KB

Open Access Article

Strategic Risk Based Forecasting of Brent Crude Oil Prices: A Comparative Analysis of Econometric and Machine Learning Models

by Tuğçe Ekiz Yılmaz and Cemal Zehir

Entropy 2026, 28(5), 539; https://doi.org/10.3390/e28050539 - 9 May 2026

Viewed by 475

Abstract

Brent crude oil prices are strategically important due to their sensitivity to geopolitical developments, financial market stress, and global monetary conditions. This study examines whether strategic risk indicators improve the forecasting performance of Brent crude oil returns within an integrated econometric and machine learning framework. Monthly data from January 2001 to December 2025 are employed, using the Global Geopolitical Risk Index (GPR), the CBOE Volatility Index (VIX), and the U.S. 10-year Treasury yield (DGS10) as key explanatory variables. Methodologically, the analysis first estimates benchmark econometric models, including ARIMAX (AutoRegressive Integrated Moving Average with Explanatory Variable) and ARIMAX-gjrGARCH (Glosten-Jagannathan-Runkle Generalized Autoregressive Conditional Heteroscedasticity), and then implements machine learning models, namely XGBoost (eXtreme Gradient Boosting), LightGBM (Light Gradient Boosting Machine), and Random Forest, to capture potential nonlinear relationships. Using sMAPE (Symmetric Mean Absolute Percentage Error), forecast performance is assessed over multiple forecast horizons under a rolling-origin framework. Across several forecasting horizons and train-test split configurations, the empirical results consistently show that machine learning techniques, especially LightGBM, offer superior out-of-sample forecasting accuracy. These findings suggest that the dynamics of Brent crude oil returns are influenced by complex and nonlinear relationships between macro-financial conditions, financial uncertainty, and geopolitical risk. The study concludes that flexible data-driven forecasting frameworks offer stronger predictive performance than benchmark econometric models under strategic risk conditions and provide useful implications for energy market risk management and policy decision-making.
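The evaluation design described above combines sMAPE with rolling-origin forecasting. The sketch below shows one common sMAPE definition and an expanding-window rolling-origin split generator. sMAPE has several variants and the abstract does not say which one, or which window scheme, the authors used, so both choices and the function names are assumptions.

```python
from typing import Iterator, Sequence, Tuple


def smape(actual: Sequence[float], forecast: Sequence[float]) -> float:
    """sMAPE in percent: mean of 2|F - A| / (|A| + |F|), times 100."""
    if len(actual) != len(forecast) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for a, f in zip(actual, forecast):
        denominator = abs(a) + abs(f)
        if denominator == 0:
            continue  # both values are zero: count this point as zero error
        total += 2.0 * abs(f - a) / denominator
    return 100.0 * total / len(actual)


def rolling_origin_splits(n_obs: int, initial_train: int, horizon: int,
                          step: int = 1) -> Iterator[Tuple[range, range]]:
    """Yield (train_indices, test_indices) with an expanding training window."""
    origin = initial_train
    while origin + horizon <= n_obs:
        yield range(0, origin), range(origin, origin + horizon)
        origin += step


# Small made-up example: 10 observations, 6 initial training points, horizon 2.
splits = list(rolling_origin_splits(n_obs=10, initial_train=6, horizon=2, step=2))
error = smape([100.0], [110.0])
print(len(splits), round(error, 3))
```

For the monthly sample in the abstract (January 2001 to December 2025, 300 observations), each split would train only on months before the forecast origin, so no future information reaches the model.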

(This article belongs to the Section Multidisciplinary Applications)

19 pages, 1099 KB

Open Access Systematic Review

Machine Learning Models for Predicting Post-Hepatectomy Liver Failure: A Systematic Review

by Calin Muntean, Vasile Gaborean, Razvan Constantin Vonica, Sebastian Aurelian Stefaniga, Alaviana Monique Faur and Catalin Vladut Ionut Feier

AI 2026, 7(5), 166; https://doi.org/10.3390/ai7050166 - 9 May 2026

Viewed by 1116

Abstract

Background and Objectives: Post-hepatectomy liver failure (PHLF) remains the leading cause of mortality following hepatic resection, with reported incidence rates ranging from 1.2% to 32%. Traditional scoring systems such as the Child–Pugh score, Model for End-Stage Liver Disease (MELD), and Albumin–Bilirubin (ALBI) grade have demonstrated limited predictive accuracy for PHLF. Machine learning (ML) algorithms have emerged as promising tools capable of integrating complex, multidimensional clinical data to improve predictive performance. This systematic review aims to evaluate the current evidence on ML-based prediction models for PHLF, assessing their predictive accuracy, methodological quality, clinical applicability, and the key variables utilized across models. Methods: A systematic literature search was conducted across PubMed, Embase, Web of Science, and the Cochrane Library from inception to January 2026. Studies that developed or validated ML models for predicting PHLF after hepatic resection were included. The Prediction Model Risk of Bias Assessment Tool (PROBAST) was used to evaluate the risk of bias. Data on model performance, algorithms employed, sample sizes, predictor variables, and validation strategies were extracted. The review was conducted in accordance with the PRISMA 2020 guidelines and registered in PROSPERO. Results: Twelve PubMed-verified studies involving 6913 patients were retained in the final analysis. Publication years ranged from 2020 to 2025, with five studies published in 2025. Gradient boosting approaches (LightGBM/XGBoost or phase-specific boosting models) were the most frequent best-performing architectures, while ANN/deep learning, radiomics-integrated, and ensemble approaches also showed clinically relevant discrimination. Best reported non-training AUCs ranged from 0.7927 to 0.981 (median, 0.873). The strongest generalization signals came from studies with temporal, external, or prospective validation structures. 
Common predictor domains included bilirubin-based liver function measures, coagulation variables, platelet count, volumetry or extent of resection, imaging-derived radiomics features, and perioperative dynamic data. Conclusions: Machine learning models remain promising for PHLF prediction, but the evidence base is smaller and more heterogeneous than the original draft suggested. Performance is highest in studies that combine clinical liver-reserve markers with imaging or perioperative temporal data; however, widespread clinical adoption is still limited by retrospective design predominance, inconsistent outcome definitions, and incomplete external validation.

(This article belongs to the Section Medical & Healthcare AI)

33 pages, 3894 KB

Open Access Article

Predictive Accuracy of Statistical and Machine Learning Models on Perceived Feelings of Safety in South Africa

by Boitumelo Mooketsi, Johannes Tshepiso Tsoku and Patrick Malose Leeto Shogole

Safety 2026, 12(3), 66; https://doi.org/10.3390/safety12030066 - 8 May 2026

Viewed by 1051

Abstract

This study compares the predictive performance of traditional multivariate time series models and machine learning (ML) techniques in modelling perceived feelings of safety among South African residents. The analysis uses secondary data from the Governance, Public Safety, and Justice Survey conducted by Statistics South Africa, covering 2013/2014 to 2023/2024 and comprising 215,301 observations. Perceived safety while walking alone in the neighbourhood during the day and after dark served as the response variables, while socio-economic characteristics such as age, sex, province, and main source of income were included as predictors. A Vector Autoregressive Moving Average (VARMA) model was estimated alongside Random Forest (RF) and Light Gradient Boosting Machine (LightGBM) algorithms. VARMA (2,2) and VARMA (3,1) provided the best statistical fit for daytime and after-dark safety perceptions, respectively. However, ML models achieved higher predictive accuracy overall, with RF outperforming both LightGBM and VARMA in capturing nonlinear relationships and short-term dynamics. The findings underscore the value of integrating ML into public safety modelling to enhance evidence-based planning and socio-economic policy development in South Africa. Future research should consider integrating higher-frequency and alternative data sources, such as administrative crime statistics and real-time behavioural data, to improve model sensitivity and forecasting accuracy.

20 pages, 4698 KB

Open Access Article

Prediction of High-Abundance Fishing Grounds for Chub Mackerel (Scomber japonicus) in the Northwest Pacific Ocean and Its Environmental Drivers Based on Interpretable Machine Learning Model

by Leilei Zhang, Wei Fan, Fenghua Tang, Yongchuang Shi and Shengmao Zhang

Fishes 2026, 11(5), 274; https://doi.org/10.3390/fishes11050274 - 6 May 2026

Viewed by 349

Abstract

Accurate prediction of fishing grounds plays a crucial role in supporting the efficient operation of ocean-going fishing vessels. Based on catch data of Chub Mackerel (Scomber japonicus) and multiple concomitant oceanographic variables from 2014 to 2022 in the Northwest Pacific Ocean, we employed four machine learning methods, namely Random Forest (RF; scikit-learn v1.7.2), Extreme Gradient Boosting (XGBoost; xgboost v3.1.3), Light Gradient Boosting Machine (LightGBM; lightgbm v4.6.0), and Categorical Boosting (CatBoost; catboost v1.2.8), to construct a prediction model for high-abundance fishing grounds of Chub Mackerel. After selecting the optimal model through evaluation metrics, we applied the SHapley Additive exPlanations (SHAP; shap v0.44.1) method to visualize and interpret the optimal model, quantifying the importance of environmental factors on high-abundance fishing grounds, thus enhancing the interpretability and credibility of the machine learning model. The results indicated that the catch exhibited significant fluctuations at both interannual and intramonthly scales (p < 0.05). The annual catch showed a phased increasing trend, peaking in 2017 and 2018. Monthly catches were highest in September and October. Evaluated against established performance metrics, the RF model demonstrated the highest predictive performance (accuracy 76.33%, F1-score 77.73%, precision 72.81%, recall 83.36%, ROC-AUC 0.8393) and was therefore selected as the most suitable for predicting Chub Mackerel fishing grounds. SHAP analysis identified the temporal variables year and month as the most influential predictors, followed by chlorophyll-a concentration (Chl-a), sea surface salinity (SSS), and sea surface temperature (SST). SHAP analysis can comprehensively reveal the degree and direction of influence of each variable at both global and local levels.
These findings indicate that integrating machine learning with explainability techniques can enhance the scientific robustness and transparency of fishing ground forecasts, providing data-driven support for ecosystem-based fishery management.
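The reported metrics for the RF model can be cross-checked, because the F1-score is the harmonic mean of precision and recall. The short sketch below recomputes it from the precision and recall given in the abstract; the function name is illustrative.

```python
def f1_score(precision: float, recall: float) -> float:
    """F1 = 2 * P * R / (P + R), the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


# Precision and recall as reported for the RF model (72.81 % and 83.36 %).
rf_f1 = f1_score(0.7281, 0.8336)
print(round(rf_f1, 4))
```

The recomputed value agrees with the reported F1-score of 77.73% to the precision given, which indicates that the three figures are internally consistent.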

(This article belongs to the Special Issue Technology for Fish and Fishery Monitoring—2nd Edition)

16 pages, 1407 KB

Open Access Article

Methods of Machine Learning for Prediction and Anomaly Detection in Pipeline Systems

by Dana Satybaldina, Nurdaulet Teshebayev, Nurbol Shmitov, Aina Zakarina, Korlan Kulniyazova and Nurgul Kissikova

Appl. Sci. 2026, 16(9), 4437; https://doi.org/10.3390/app16094437 - 1 May 2026

Cited by 1 | Viewed by 547

Abstract

The detection of anomalies and the prediction of corrosion defects in oil and gas pipelines are critical tasks for ensuring industrial safety and improving operational reliability. This study addresses the regression-based prediction of corrosion defect (CR) levels from operational process parameters. Machine learning methods, including Decision Tree, Random Forest, LightGBM, and CatBoost, were employed to develop predictive models. Features were standardized during preprocessing, and hyperparameters were tuned using K-fold cross-validation. Model performance was primarily evaluated using the Root Mean Square Error (RMSE) on both training and test datasets, as this metric is more sensitive to large prediction errors, which is particularly important in the context of corrosion defect analysis. Additionally, Mean Squared Error (MSE), Mean Absolute Error (MAE), and the coefficient of determination (R²) were used to provide a comprehensive assessment of model accuracy and robustness. Experimental results demonstrate that the CatBoost model achieved the best performance, yielding the lowest RMSE on the test dataset (0.02040) with a close value on the training dataset (0.01682), indicating strong generalization capability. Furthermore, this model outperformed the others in terms of MSE, MAE, and R² on the test dataset (MSE = 0.000418, MAE = 0.006319, R² = 0.695544). The obtained results confirm the effectiveness of ensemble methods and gradient boosting algorithms for regression modeling of corrosion defect development processes in pipeline systems. Full article
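
The four regression metrics named in this abstract are closely related, and a small example makes the relationships concrete. The following is a minimal Python sketch using their standard definitions; the function name `regression_metrics` and the toy values are hypothetical and are not taken from the article or its data.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE, MAE and R² for two equal-length sequences of numbers."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / n      # mean squared error
    rmse = math.sqrt(mse)                     # same units as the target
    mae = sum(abs(e) for e in errors) / n     # mean absolute error
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    # R² is undefined when the target is constant; report NaN in that case.
    r2 = 1 - (mse * n) / ss_tot if ss_tot > 0 else float("nan")
    return {"mse": mse, "rmse": rmse, "mae": mae, "r2": r2}


# Toy data (hypothetical): one large error dominates the squared metrics.
metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 6.0])
print(metrics)

# RMSE is the square root of MSE, so the two reported test values can be compared.
print(math.sqrt(0.000418))
```

In the toy data, the single large error pushes RMSE (1.0) well above MAE (0.5), which illustrates why RMSE is described as more sensitive to large prediction errors. The square root of the reported test MSE is about 0.0204, close to the reported test RMSE.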

23 pages, 13014 KB

Open AccessArticle

Seasonal Estimation of Net Surface Shortwave Radiation Using Multiple Machine Learning Algorithms, Remote Sensing Observation, and In-Situ Station

by Nuan Wang, Shisong Cao, Mingyi Du, Jingyi Chen, Ling Li, Yang Liu and Huiping Sun

Appl. Sci. 2026, 16(9), 4370; https://doi.org/10.3390/app16094370 - 29 Apr 2026

Viewed by 274

Abstract

Net surface shortwave radiation (NSSR) is a key parameter in the Earth’s energy cycle, greatly affecting global water and heat balance. Currently, a comprehensive comparative analysis of the accuracy of different models is lacking, as is a systematic exploration of seasonal radiative drivers. Therefore, we developed a machine learning-based seasonal NSSR estimation model. By integrating in-situ observational data with multi-source remote sensing datasets, we achieved precise quantification of radiative fluxes. The proposed framework employed three algorithms, namely Random Forest (RF), eXtreme Gradient Boosting (XGBoost), and Light Gradient Boosting Machine (LightGBM), to capture the non-linear interactions among radiative drivers across the four seasons. Through mechanistic sensitivity analysis, we quantified the impacts of key variables on NSSR prediction. The results showed that the RF algorithm performed best, with seasonal R² values of 0.95 (spring), 0.89 (summer), 0.95 (autumn), and 0.96 (winter). The Solar Zenith Angle (SZA) dominated in spring and winter; its absence reduced R² by 0.23 and raised RMSE by 20.66–26.42 W/m². Meteorological factors mattered most in summer; excluding them reduced R² by 0.17 and raised RMSE by 23.82 W/m². This study provides actionable insights for terrestrial radiation budget research. Full article

(This article belongs to the Topic Machine Learning and Data Mining: Theory and Applications)

35 pages, 9480 KB

Open AccessArticle

Battery State of Charge Estimation in Electric Vehicles Using Machine Learning with Feature Engineering and Seasonal Analysis Under On-Road Conditions

by Feristah Dalkilic, Kadriye Filiz Balbal, Kokten Ulas Birant, Elife Ozturk Kiyak, Yunus Dogan, Semih Utku and Derya Birant

Batteries 2026, 12(5), 159; https://doi.org/10.3390/batteries12050159 - 29 Apr 2026

Viewed by 336

Abstract

Estimating the state of charge (SoC) is a critical task for effective management of electric vehicle batteries. Simple machine learning methods (LR, KNN, etc.) often suffer from limited prediction accuracy, while deep learning approaches (LSTM, CNN, etc.) generally require high computational resources and behave as black-box models with limited explainability. To overcome these limitations, the present work proposes a SoC estimation approach based on the Light Gradient Boosting Machine (LightGBM). The proposed model provides a balanced trade-off between prediction accuracy and computational efficiency. Furthermore, feature engineering is performed to derive additional informative features, improving the model’s ability to learn driving conditions and battery dynamics. In addition, the study incorporates a seasonal analysis by evaluating the model under both summer and winter conditions, allowing the impact of environmental variations on SoC estimation performance to be investigated. Moreover, Explainable Artificial Intelligence (XAI) techniques are employed to interpret the model predictions. Evaluation on real-world on-road data demonstrated that the proposed model achieved substantial improvements in estimation performance compared to recent studies. Full article

(This article belongs to the Special Issue Battery Degradation: Behavior, Mechanisms, Modeling, Estimation, and Optimization Strategies)

33 pages, 6754 KB

Open AccessArticle

Warming and Drying Intensification Across Iran’s River Basins (1950–2040): Historical Trends and LightGBM-Based Projections

by Iman Rousta, Safoora Izadian, Haraldur Olafsson, Marjan Dalvi and Jaromir Krzyszczak

Atmosphere 2026, 17(5), 446; https://doi.org/10.3390/atmos17050446 - 28 Apr 2026

Viewed by 477

Abstract

Understanding long-term hydroclimatic variability in arid and semi-arid regions is essential for sustainable water resource management in the context of accelerating climate change. This study examines historical trends (1950–2024) and data-driven extrapolations to 2040 for precipitation and temperature across 30 secondary river basins in Iran using the ERA5 reanalysis dataset and the Light Gradient Boosting Machine (LightGBM) model. Results reveal pronounced spatial heterogeneity in precipitation, with more than two-thirds of basins showing median values of 0 mm, reflecting extreme rainfall intermittency. Long-term analysis indicates significant precipitation increases in northern basins, whereas decadal trends show widespread drying since the early 2000s, particularly in eastern regions (30–60 mm per decade). Mean, maximum, and minimum temperatures exhibit significant upward trends (0.015–0.045 °C yr⁻¹), with stronger warming in northern and northwestern basins; however, minimum temperatures increased faster than maximum temperatures, reducing the diurnal temperature range and indicating a shift in regional thermal dynamics. Maximum temperature is negatively correlated with precipitation (R ≈ −0.27 to −0.34), suggesting enhanced evapotranspiration under warming conditions. LightGBM extrapolations to 2040 indicate continued warming (1–3 °C) and precipitation declines across more than 80% of Iran, underscoring intensifying hydroclimatic stress and increasing challenges for water resource management in dryland environments. Full article

(This article belongs to the Section Climatology)

Search Results (543)
