Search Results (2,569)

Search Parameters:
Keywords = tree boosting

17 pages, 414 KB  
Article
DQMAF—Data Quality Modeling and Assessment Framework
by Razan Al-Toq and Abdulaziz Almaslukh
Information 2025, 16(10), 911; https://doi.org/10.3390/info16100911 - 17 Oct 2025
Abstract
In today’s digital ecosystem, where millions of users interact with diverse online services and generate vast amounts of textual, transactional, and behavioral data, ensuring the trustworthiness of this information has become a critical challenge. Low-quality data—manifesting as incompleteness, inconsistency, duplication, or noise—not only undermines analytics and machine learning models but also exposes unsuspecting users to unreliable services, compromised authentication mechanisms, and biased decision-making processes. Traditional data quality assessment methods, largely based on manual inspection or rigid rule-based validation, cannot cope with the scale, heterogeneity, and velocity of modern data streams. To address this gap, we propose DQMAF (Data Quality Modeling and Assessment Framework), a generalized machine learning–driven approach that systematically profiles, evaluates, and classifies data quality to protect end-users and enhance the reliability of Internet services. DQMAF introduces an automated profiling mechanism that measures multiple dimensions of data quality—completeness, consistency, accuracy, and structural conformity—and aggregates them into interpretable quality scores. Records are then categorized into high, medium, and low quality, enabling downstream systems to filter or adapt their behavior accordingly. A distinctive strength of DQMAF lies in integrating profiling with supervised machine learning models, producing scalable and reusable quality assessments applicable across domains such as social media, healthcare, IoT, and e-commerce. The framework incorporates modular preprocessing, feature engineering, and classification components using Decision Trees, Random Forest, XGBoost, AdaBoost, and CatBoost to balance performance and interpretability. We validate DQMAF on a publicly available Airbnb dataset, showing its effectiveness in detecting and classifying data issues with high accuracy. The results highlight its scalability and adaptability for real-world big data pipelines, supporting user protection, document and text-based classification, and proactive data governance while improving trust in analytics and AI-driven applications.
(This article belongs to the Special Issue Machine Learning and Data Mining for User Classification)
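
The paper does not ship code, but the profile-then-classify loop it describes is easy to sketch. Below is a minimal, hypothetical Python example: hand-rolled completeness and conformity features feed a tree ensemble (Random Forest here, one of the five learners the abstract lists), with quality labels bucketed from an assumed aggregate score.

```python
# Illustrative profile-then-classify sketch (not the authors' DQMAF code).
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Toy records with injected quality problems: missing and malformed fields.
df = pd.DataFrame({
    "price": rng.choice([100.0, 250.0, np.nan], size=500),
    "city": rng.choice(["NYC", "nyc ", None], size=500),
})

# Per-record quality profile: completeness and structural conformity.
profile = pd.DataFrame({
    "completeness": df.notna().mean(axis=1),
    "conformity": df["city"].fillna("").str.fullmatch(r"[A-Z]{3}").astype(float),
})

# Hypothetical labels: bucket an assumed aggregate score into low/medium/high.
labels = pd.cut(profile.mean(axis=1), bins=[-0.01, 0.4, 0.8, 1.01],
                labels=["low", "medium", "high"])

X_tr, X_te, y_tr, y_te = train_test_split(profile, labels, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
print("holdout accuracy:", clf.score(X_te, y_te))
```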

36 pages, 3174 KB  
Review
A Bibliometric-Systematic Literature Review (B-SLR) of Machine Learning-Based Water Quality Prediction: Trends, Gaps, and Future Directions
by Jeimmy Adriana Muñoz-Alegría, Jorge Núñez, Ricardo Oyarzún, Cristian Alfredo Chávez, José Luis Arumí and Lien Rodríguez-López
Water 2025, 17(20), 2994; https://doi.org/10.3390/w17202994 - 17 Oct 2025
Abstract
Predicting the quality of freshwater, both surface and groundwater, is essential for the sustainable management of water resources. This study collected 1822 articles from the Scopus database (2000–2024) and filtered them using Topic Modeling to create the study corpus. The B-SLR analysis identified exponential growth in scientific publications since 2020, indicating that this field has reached a stage of maturity. The results showed that the predominant techniques for predicting water quality, both for surface and groundwater, fall into three main categories: (i) ensemble models, with Bagging and Boosting representing 43.07% and 25.91%, respectively, particularly random forest (RF), light gradient boosting machine (LightGBM), and extreme gradient boosting (XGB), along with their optimized variants; (ii) deep neural networks such as long short-term memory (LSTM) and convolutional neural network (CNN), which excel at modeling complex temporal dynamics; and (iii) traditional algorithms like artificial neural network (ANN), support vector machines (SVMs), and decision tree (DT), which remain widely used. Current trends point towards the use of hybrid and explainable architectures, with increased application of interpretability techniques. Emerging approaches such as Generative Adversarial Network (GAN) and Group Method of Data Handling (GMDH) for data-scarce contexts, Transfer Learning for knowledge reuse, and Transformer architectures that outperform LSTM in time series prediction tasks were also identified. Furthermore, the most studied water bodies (e.g., rivers, aquifers) and the most commonly used water quality indicators (e.g., WQI, EWQI, dissolved oxygen, nitrates) were identified. The B-SLR and Topic Modeling methodology provided a more robust, reproducible, and comprehensive overview of AI/ML/DL models for freshwater quality prediction, facilitating the identification of thematic patterns and research opportunities.
(This article belongs to the Special Issue Machine Learning Applications in the Water Domain)
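
As a rough illustration of the corpus-filtering step described above (screening articles by dominant topic), here is a toy LDA sketch in scikit-learn; the study's actual Scopus corpus, vocabulary, and topic settings are of course different.

```python
# Toy corpus-screening step with LDA topic modeling (illustrative only).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

abstracts = [
    "random forest predicts river water quality index",
    "lstm network forecasts dissolved oxygen in a lake",
    "review of irrigation policy and municipal water pricing",
]
X = CountVectorizer(stop_words="english").fit_transform(abstracts)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# Keep documents whose dominant topic matches the ML/water-quality theme.
dominant = lda.transform(X).argmax(axis=1)
print(dominant)
```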

31 pages, 1941 KB  
Review
Machine Learning in Slope Stability: A Review with Implications for Landslide Hazard Assessment
by Miguel Trinidad and Moe Momayez
GeoHazards 2025, 6(4), 67; https://doi.org/10.3390/geohazards6040067 - 16 Oct 2025
Abstract
Slope failures represent one of the most serious geotechnical hazards, which can have severe consequences for personnel, equipment, infrastructure, and other aspects of a mining operation. Deterministic and stochastic conventional methods of slope stability analysis are useful; however, some limitations in applicability may arise due to the inherent anisotropy of rock mass properties and rock mass interactions. In recent years, Machine Learning (ML) techniques have become powerful tools for improving prediction and risk assessment in slope stability analysis. This review provides a comprehensive overview of ML applications for analyzing slope stability and delves into the performance of each technique as well as the interrelationship between the geotechnical parameters of the rock mass. Supervised learning methods such as decision trees, support vector machines, random forests, gradient boosting, and neural networks have been applied by different authors to predict the safety factor and classify slopes. Unsupervised learning techniques such as clustering and Gaussian mixture models have also been applied to identify hidden patterns. The objective of this manuscript is to consolidate existing work by highlighting the advantages and limitations of different ML techniques, while identifying gaps that should be analyzed in future research.
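
Many of the supervised studies reviewed regress a factor of safety on a handful of geotechnical parameters. A toy sketch of that setup, with made-up parameter ranges and a synthetic response, assuming gradient boosting as the learner:

```python
# Toy factor-of-safety regression (synthetic data, invented coefficients).
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# cohesion (kPa), friction angle (deg), slope angle (deg), unit weight, height (m)
X = rng.uniform([5, 20, 20, 16, 10], [50, 45, 60, 22, 100], size=(300, 5))
fos = 0.02 * X[:, 0] + 0.05 * X[:, 1] - 0.03 * X[:, 2] \
      + rng.normal(scale=0.05, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X, fos, random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr)
print("test R2:", model.score(X_te, y_te))
```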

25 pages, 2877 KB  
Article
Integration of Field Data and UAV Imagery for Coffee Yield Modeling Using Machine Learning
by Sthéfany Airane dos Santos Silva, Gabriel Araújo e Silva Ferraz, Vanessa Castro Figueiredo, Margarete Marin Lordelo Volpato, Danton Diego Ferreira, Marley Lamounier Machado, Fernando Elias de Melo Borges and Leonardo Conti
Drones 2025, 9(10), 717; https://doi.org/10.3390/drones9100717 - 16 Oct 2025
Abstract
The integration of machine learning (ML) techniques with unmanned aerial vehicle (UAV) imagery holds strong potential for improving yield prediction in agriculture. However, few studies have combined biophysical field variables with UAV-derived spectral data, particularly under conditions of limited sample size. This study evaluated the performance of different ML algorithms in predicting Arabica coffee (Coffea arabica) yield using field-based biophysical measurements and spectral variables extracted from multispectral UAV imagery. The research was conducted over two crop seasons (2020/2021 and 2021/2022) in a 1.2-hectare experimental plot in southeastern Brazil. Three modeling scenarios were tested with Random Forest, Gradient Boosting, K-Nearest Neighbors, Multilayer Perceptron, and Decision Tree algorithms, using Leave-One-Out cross-validation. Results varied considerably across seasons and scenarios. KNN performed best with raw data, while Gradient Boosting was more stable after variable selection and synthetic data augmentation with SMOTE. Nevertheless, limitations such as small sample size, seasonal variability, and overfitting, particularly with synthetic data, affected overall performance. Despite these challenges, this study demonstrates that integrating UAV-derived spectral data with ML can support yield estimation, especially when variable selection and phenological context are carefully addressed.
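
Leave-One-Out cross-validation, as used in this study, fits the model n times and predicts each held-out sample once, which suits very small plots. A minimal sketch with synthetic stand-ins for the field and spectral variables:

```python
# LOO evaluation of a yield regressor (synthetic stand-in data).
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import LeaveOneOut, cross_val_predict

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 5))                         # e.g. spectral + biophysical vars
y = 2.0 * X[:, 0] + rng.normal(scale=0.3, size=30)   # toy yield response

pred = cross_val_predict(GradientBoostingRegressor(random_state=0),
                         X, y, cv=LeaveOneOut())
rmse = np.sqrt(np.mean((pred - y) ** 2))
print(f"LOO RMSE: {rmse:.3f}")
```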

21 pages, 7603 KB  
Article
Non-Invasive Inversion and Characteristic Analysis of Soil Moisture in 0–300 cm Agricultural Soil Layers
by Shujie Jia, Yaoyu Li, Boxin Cao, Yuwei Cheng, Abdul Sattar Mashori, Zheyu Bai, Mingyi Cui, Zhimin Zhang, Linqiang Deng and Wuping Zhang
Agriculture 2025, 15(20), 2143; https://doi.org/10.3390/agriculture15202143 - 15 Oct 2025
Abstract
Accurate profiling of deep (20–300 cm) soil moisture is crucial for precision irrigation but remains technically challenging and costly at operational scales. We systematically benchmark eight regression algorithms—including linear regression, Lasso, Ridge, elastic net, support vector regression, multi-layer perceptron (MLP), random forest (RF), and gradient boosting trees (GBDT)—that use easily accessible inputs of 0–20 cm surface soil moisture (SSM) and ten meteorological variables to non-invasively infer soil moisture at fourteen 20 cm layers. Data from a typical agricultural site in Wenxi, Shanxi (2020–2022), were divided into training and testing datasets based on temporal order (2020–2021 for training, 2022 for testing) and standardized prior to modeling. Across depths, non-linear ensemble models significantly outperform linear baselines. Ridge Regression achieves the highest accuracy at 0–20 cm, SVR performs best at 20–40 cm, and MLP yields consistently optimal performance across deep layers from 60 cm to 300 cm (R2 = 0.895–0.978, KGE = 0.826–0.985). Although ensemble models like RF and GBDT exhibit strong fitting ability, their generalization performance under temporal validation is relatively limited. Model interpretability combining SHAP, PDP, and ALE shows that surface soil moisture is the dominant predictor across all depths, with a clear attenuation trend and a critical transition zone between 160 and 200 cm. Precipitation and humidity primarily drive shallow to mid-layers (20–140 cm), whereas temperature variables gain relative importance in deeper profiles (200–300 cm). ALE analysis eliminates feature correlation biases while maintaining high predictive accuracy, confirming surface-to-deep information transmission mechanisms. We propose a depth-adaptive modeling strategy by assigning the best-performing model at each soil layer, enabling practical non-invasive deep soil moisture prediction for precision irrigation and water resource management.
(This article belongs to the Section Agricultural Soils)
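
The depth-adaptive strategy amounts to benchmarking several regressors per layer and keeping the winner at each depth. A compact sketch of that selection loop on synthetic data (three toy layers, three of the paper's eight model families):

```python
# Depth-adaptive model selection sketch (synthetic data, not the Wenxi site).
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.svm import SVR
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 11))   # 0-20 cm SSM + ten meteorological variables

# Toy moisture per layer: surface signal attenuating with depth plus noise.
layers = {f"{20 * i}-{20 * (i + 1)} cm":
          X[:, 0] * (0.9 ** i) + rng.normal(scale=0.1, size=200)
          for i in range(1, 4)}

models = {"ridge": Ridge(), "svr": SVR(),
          "mlp": MLPRegressor(max_iter=2000, random_state=0)}
for depth, y in layers.items():
    best = max(models,
               key=lambda m: cross_val_score(models[m], X, y, cv=5).mean())
    print(depth, "->", best)
```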

25 pages, 5066 KB  
Article
PM2.5: Air Quality Index Prediction Using Machine Learning: Evidence from Kuwait’s Air Quality Monitoring Stations
by Huda Alrashidi, Fadi N. Sibai, Abdullah Abonamah, Mufreh Alrashidi and Ahmad Alsaber
Sustainability 2025, 17(20), 9136; https://doi.org/10.3390/su17209136 - 15 Oct 2025
Abstract
Air pollution poses a significant threat to public health and the environment, particularly fine particulate matter (PM2.5). Machine learning (ML) models have proven their accuracy in classifying and predicting air pollution levels. This research trains and compares the performance of eight machine learning regression models on a time series air quality dataset containing data from 12 dispersed air quality stations in Kuwait, to predict the PM2.5 Air Quality Index (AQI). After cleaning the large dataset and trimming it to about 13.4% of its original size, we performed thorough data visualization and analysis of the dataset to identify important patterns. Next, in a set of five experiments exploring feature pruning, the tree-based models, namely Gradient Boosting and AdaBoost, generated mean square errors below 1.5 and R2 values above 0.998, outperforming the other ML models. By integrating meteorological data, pollution source information, and geographical factors specific to Kuwait, these models provide a precise prediction of air quality levels. This research contributes to a deeper understanding and visualization of Kuwait’s air pollution challenges, and draws some public policy recommendations to mitigate environmental and health impacts.
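
A stripped-down version of the regression comparison is easy to reproduce on synthetic data; the sketch below reports MSE and R2 for Gradient Boosting and AdaBoost, the two models the study found strongest (the features here are invented stand-ins, not the Kuwaiti station data):

```python
# Tree-based AQI regression comparison on toy data (illustrative only).
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, AdaBoostRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 6))   # stand-ins for PM10, temperature, wind, ...
y = 3 * X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.1, size=1000)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for model in (GradientBoostingRegressor(random_state=0),
              AdaBoostRegressor(random_state=0)):
    pred = model.fit(X_tr, y_tr).predict(X_te)
    print(type(model).__name__,
          "MSE:", round(mean_squared_error(y_te, pred), 3),
          "R2:", round(r2_score(y_te, pred), 3))
```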

22 pages, 12379 KB  
Article
Evaluation of Spatial Variability of Soil Nutrients in Saline–Alkali Farmland Using Automatic Machine Learning Model and Hyperspectral Data
by Meiyan Xiang, Qianlong Rao, Xiaohang Yang, Xiaoqian Wu, Dexi Zhan, Jin Zhang, Miao Lu and Yingqiang Song
ISPRS Int. J. Geo-Inf. 2025, 14(10), 403; https://doi.org/10.3390/ijgi14100403 - 15 Oct 2025
Abstract
Saline–alkali soils represent a significant reserve of arable land, playing a vital role in ensuring national food security. Given that saline–alkali soil has low soil organic matter (SOM) and soil nutrient contents, and that soil quality degradation poses a threat to regional high-quality agricultural development and ecological balance, this study took coastal saline–alkali land as a case study. It adopted the extreme gradient boosting (XGB) model optimized by the tree-structured Parzen estimator (TPE) algorithm, combined with in situ hyperspectral (ISH) and spaceborne hyperspectral (SBH) data, to predict and map SOM and three soil nutrients: alkali-hydrolyzable nitrogen (AN), available phosphorus (AP), and available potassium (AK). The results show that the TPE-XGB model achieves superior predictive performance with in situ hyperspectral data. Among the targets, available phosphorus (R2 = 0.67) exhibits the highest prediction accuracy, followed by organic matter (R2 = 0.65), alkali-hydrolyzable nitrogen (R2 = 0.56), and available potassium (R2 = 0.51). In addition, the spatial continuity mapping results based on spaceborne hyperspectral data show that SOM, AN, AP, and AK in the study area are concentrated in the northern, eastern, southern, and riverbank and estuarine delta areas, respectively. Ranked from largest to smallest, the variability of the soil nutrients is phosphorus, potassium, nitrogen, and organic matter. The SHAP (SHapley Additive exPlanations) analysis reveals that the bands contributing most to the fitting of SOM, AN, AP, and AK are 612 nm, 571 nm, 1493 nm, and 1308 nm, respectively. Hierarchical partitioning (HP) and variation partitioning (VP) further show that climatic factors (CLI) and vegetation (VEG) dominate the spatial differentiation of nutrients, while terrain (TER) and soil properties (SOIL) contribute comparatively little. In summary, this study effectively assessed the significant variation patterns of soil nutrient distribution in coastal saline–alkali soils using the TPE-XGB model, providing a scientific basis for the sustainable advancement of agricultural development in saline–alkali coastal regions.
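
TPE is the sampler behind several common hyperparameter optimizers. The sketch below uses Optuna's TPESampler to tune an XGBoost regressor, which mirrors the TPE-XGB idea in spirit only; the paper's search space and hyperspectral data are not reproduced here.

```python
# TPE-tuned XGBoost regression sketch (Optuna's TPESampler as one common
# TPE implementation; synthetic stand-ins for spectral bands).
import numpy as np
import optuna
import xgboost as xgb
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))
y = X[:, 0] - 0.5 * X[:, 3] + rng.normal(scale=0.2, size=300)

def objective(trial):
    model = xgb.XGBRegressor(
        n_estimators=trial.suggest_int("n_estimators", 50, 400),
        max_depth=trial.suggest_int("max_depth", 2, 8),
        learning_rate=trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
    )
    return cross_val_score(model, X, y, cv=3, scoring="r2").mean()

study = optuna.create_study(direction="maximize",
                            sampler=optuna.samplers.TPESampler(seed=0))
study.optimize(objective, n_trials=20)
print(study.best_params)
```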

17 pages, 3651 KB  
Article
Optofluidic Lens Refractometer
by Yifan Zhang, Qi Wang, Yuxiang Li, Junjie Liu, Ziyue Lin, Mingkai Fan, Yichi Zhang and Xiang Wu
Micromachines 2025, 16(10), 1160; https://doi.org/10.3390/mi16101160 - 13 Oct 2025
Abstract
In the face of increasingly severe global environmental challenges, the development of low-cost, high-precision, and easily integrable environmental monitoring sensors is of paramount importance. Existing optical refractive index sensors are often limited in application due to their complex structures and high costs, or their bulky size and difficulty in automation. This paper proposes a novel optical microfluidic refractometer, consisting solely of a laser source, an optical microfluidic lens, and a CCD detector. Through an innovative “simple structure + algorithm” design, the sensor achieves high-precision measurement while significantly reducing cost and size and enhancing robustness. With the aid of signal processing algorithms, the device currently enables the detection of refractive index gradients as low as 1.4 × 10⁻⁵ within a refractive index range of 1.33 to 1.48.
(This article belongs to the Special Issue Optofluidic Devices and Their Applications)

17 pages, 1106 KB  
Article
Calibrated Global Logit Fusion (CGLF) for Fetal Health Classification Using Cardiotocographic Data
by Mehret Ephrem Abraha and Juntae Kim
Electronics 2025, 14(20), 4013; https://doi.org/10.3390/electronics14204013 - 13 Oct 2025
Abstract
Accurate detection of fetal distress from cardiotocography (CTG) is clinically critical but remains subjective and error-prone. In this research, we present a leakage-safe Calibrated Global Logit Fusion (CGLF) framework that couples TabNet’s sparse, attention-based feature selection with XGBoost’s gradient-boosted rules and fuses their class probabilities through global logit blending followed by per-class vector temperature calibration. Class imbalance is addressed with SMOTE–Tomek for TabNet and one XGBoost stream (XGB–A), and class-weighted training for a second stream (XGB–B). To prevent information leakage, all preprocessing, resampling, and weighting are fitted only on the training split within each outer fold. Out-of-fold (OOF) predictions from the outer-train split are then used to optimize blend weights and fit calibration parameters, which are subsequently applied once to the corresponding held-out outer-test fold. Our calibration-guided logit fusion (CGLF) matches top-tier discrimination on the public Fetal Health dataset while producing more reliable probability estimates than strong standalone baselines. Under nested cross-validation, CGLF delivers comparable AUROC and overall accuracy to the best tree-based model, with visibly improved calibration and slightly lower balanced accuracy in some splits. We also provide interpretability and overfitting checks via TabNet sparsity, feature stability analysis, and sufficiency (k95) curves. Finally, threshold tuning under a balanced-accuracy floor preserves sensitivity to pathological cases, aligning operating points with risk-aware obstetric decision support. Overall, CGLF is a calibration-centric, leakage-controlled CTG pipeline that is interpretable and suited to threshold-based clinical deployment.
(This article belongs to the Special Issue Advances in Algorithm Optimization and Computational Intelligence)
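
The core CGLF recipe (blend two models' logits with one global weight chosen on out-of-fold predictions, then fit a per-class temperature vector) can be sketched in a few lines. This simplified example skips the nested CV and substitutes synthetic probabilities for the TabNet and XGBoost streams:

```python
# Simplified global-logit-blend + per-class temperature sketch (illustrative).
import numpy as np
from scipy.optimize import minimize
from scipy.special import softmax

def nll(p, y):
    # Negative log-likelihood of the true class.
    return -np.mean(np.log(p[np.arange(len(y)), y] + 1e-12))

# Stand-ins for out-of-fold class probabilities from two base models.
rng = np.random.default_rng(0)
y = rng.integers(0, 3, size=200)
p1 = softmax(rng.normal(size=(200, 3)) + 1.0 * np.eye(3)[y], axis=1)
p2 = softmax(rng.normal(size=(200, 3)) + 0.5 * np.eye(3)[y], axis=1)
logit1, logit2 = np.log(p1), np.log(p2)

# 1) Global logit blend: pick alpha minimizing OOF negative log-likelihood.
alpha = min(np.linspace(0, 1, 21),
            key=lambda a: nll(softmax(a * logit1 + (1 - a) * logit2, axis=1), y))
blended = alpha * logit1 + (1 - alpha) * logit2

# 2) Per-class (vector) temperature calibration on the same OOF predictions.
res = minimize(lambda t: nll(softmax(blended / t, axis=1), y),
               x0=np.ones(3), bounds=[(0.05, 10.0)] * 3)
print("alpha:", alpha, "per-class temperatures:", res.x.round(3))
```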

25 pages, 1453 KB  
Article
Application of Standard Machine Learning Models for Medicare Fraud Detection with Imbalanced Data
by Dorsa Farahmandazad, Kasra Danesh and Hossein Fazel Najaf Abadi
Risks 2025, 13(10), 198; https://doi.org/10.3390/risks13100198 - 13 Oct 2025
Abstract
Medicare fraud poses a substantial challenge to healthcare systems, resulting in significant financial losses and undermining the quality of care provided to legitimate beneficiaries. This study investigates the use of machine learning (ML) to enhance Medicare fraud detection, addressing key challenges such as class imbalance, high-dimensional data, and evolving fraud patterns. A dataset comprising inpatient claims, outpatient claims, and beneficiary details was used to train and evaluate five ML models: Random Forest, KNN, LDA, Decision Tree, and AdaBoost. Data preprocessing included the SMOTE resampling method to address class imbalance, feature selection for dimensionality reduction, and aggregation of diagnostic and procedural codes. Random Forest emerged as the best-performing model, achieving a training accuracy of 99.2%, a validation accuracy of 98.8%, and an F1-score of 98.4%. The Decision Tree also performed well, achieving a validation accuracy of 96.3%. KNN and AdaBoost demonstrated moderate performance, with validation accuracies of 79.2% and 81.1%, respectively, while LDA struggled with a validation accuracy of 63.3% and a low recall of 16.6%. The results highlight the importance of advanced resampling techniques, feature engineering, and adaptive learning in detecting Medicare fraud effectively. This study underscores the potential of machine learning in addressing the complexities of fraud detection. Future work should explore explainable AI and hybrid models to improve interpretability and performance, ensuring scalable and reliable fraud detection systems that protect healthcare resources and beneficiaries.
(This article belongs to the Special Issue Artificial Intelligence Risk Management)
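
A standard leakage-safe way to combine SMOTE with cross-validation is to resample inside each training fold via an imbalanced-learn pipeline; a minimal sketch with toy data and Random Forest, the study's best model:

```python
# SMOTE applied only within training folds (leakage-safe pattern; toy data).
import numpy as np
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 10))
y = (rng.random(2000) < 0.05).astype(int)   # ~5% positives: imbalanced

pipe = Pipeline([("smote", SMOTE(random_state=0)),
                 ("rf", RandomForestClassifier(random_state=0))])
print("CV F1:", cross_val_score(pipe, X, y, cv=5, scoring="f1").mean())
```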

34 pages, 1960 KB  
Article
Quantum-Inspired Hybrid Metaheuristic Feature Selection with SHAP for Optimized and Explainable Spam Detection
by Qusai Shambour, Mahran Al-Zyoud and Omar Almomani
Symmetry 2025, 17(10), 1716; https://doi.org/10.3390/sym17101716 - 13 Oct 2025
Abstract
The rapid growth of digital communication has intensified spam-related threats, including phishing and malware, which employ advanced evasion tactics. Traditional filtering methods struggle to keep pace, driving the need for sophisticated machine learning (ML) solutions. The effectiveness of ML models hinges on selecting high-quality input features, especially in high-dimensional datasets where irrelevant or redundant attributes impair performance and computational efficiency. Guided by principles of symmetry to achieve an optimal balance between model accuracy, complexity, and interpretability, this study proposes an Enhanced Hybrid Quantum-Inspired Firefly and Artificial Bee Colony (EHQ-FABC) algorithm for feature selection in spam detection. EHQ-FABC leverages the Firefly Algorithm’s local exploitation and the Artificial Bee Colony’s global exploration, augmented with quantum-inspired principles to maintain search space diversity and a symmetrical balance between exploration and exploitation. It eliminates redundant attributes while preserving predictive power. For interpretability, Shapley Additive Explanations (SHAP) are employed to ensure symmetry in explanation, meaning features with equal contributions are assigned equal importance, providing a fair and consistent interpretation of the model’s decisions. Evaluated on the ISCX-URL2016 dataset, EHQ-FABC reduces features by over 76%, retaining only 17 of 72 features, while matching or outperforming filter, wrapper, embedded, and metaheuristic methods. Tested across ML classifiers like CatBoost, XGBoost, Random Forest, Extra Trees, Decision Tree, K-Nearest Neighbors, Logistic Regression, and Multi-Layer Perceptron, EHQ-FABC achieves a peak accuracy of 99.97% with CatBoost and robust results across tree ensembles, neural, and linear models. SHAP analysis highlights features like domain_token_count and NumberOfDotsinURL as key for spam detection, offering actionable insights for practitioners. EHQ-FABC provides a reliable, transparent, and efficient symmetry-aware solution, advancing both accuracy and explainability in spam detection.
(This article belongs to the Section Computer)
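
The SHAP stage is straightforward to reproduce on any tree model once features are selected; a minimal sketch with invented stand-in features (the EHQ-FABC selector itself is the paper's own contribution and is not reproduced here):

```python
# SHAP attributions on a tree model over already-selected features (toy data).
import numpy as np
import shap
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))               # stand-ins for selected URL features
y = (X[:, 0] + X[:, 2] > 0).astype(int)     # toy spam/benign label

model = xgb.XGBClassifier(random_state=0).fit(X, y)
sv = shap.TreeExplainer(model).shap_values(X[:100])

# Global importance: mean absolute SHAP value per feature.
print(np.abs(sv).mean(axis=0).round(3))
```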

20 pages, 4096 KB  
Article
Transformer Core Loosening Diagnosis Based on Fusion Feature Extraction and CPO-Optimized CatBoost
by Yuanqi Xiao, Yipeng Yin, Jiaqi Xu and Yuxin Zhang
Processes 2025, 13(10), 3247; https://doi.org/10.3390/pr13103247 - 12 Oct 2025
Abstract
Transformer reliability is crucial to grid security, with core loosening a common fault. This paper proposes a transformer core loosening fault diagnosis method based on a fusion feature extraction approach and Categorical Boosting (CatBoost) optimized by the Crested Porcupine Optimizer (CPO) algorithm. Firstly, the audio signal is decomposed into six Intrinsic Mode Function (IMF) components through Variational Mode Decomposition (VMD). Gaussian membership functions are used to quantify the energy proportion, central frequency, and kurtosis of each IMF, and a fuzzy entropy discrimination function is constructed; the IMF noise components are then removed through an adaptive threshold. Subsequently, the denoised signal undergoes a wavelet packet transform instead of a short-time Fourier transform to optimize Mel-frequency cepstral coefficients (WPT-MFCC), combining time-domain statistical features and frequency-band energy distribution to form a 24-dimensional fusion feature. Finally, the CatBoost algorithm is employed to validate the effects of different feature schemes. The CPO is introduced to optimize its iteration number, learning rate, tree depth, and random strength parameters, thereby enhancing overall performance. The CPO-optimized CatBoost model achieved a fault recognition accuracy of 99.0196% in experimental testing, 15% better than standard CatBoost, and accuracy exceeded 90% even under extreme 0 dB noise. This method makes fault diagnosis more accurate and reliable.
(This article belongs to the Section AI-Enabled Process Engineering)
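
There is no standard Python package for the Crested Porcupine Optimizer, so the sketch below simply fits CatBoost with the four hyperparameters the paper says CPO tunes (iterations, learning rate, depth, random strength) set to placeholder values, over a synthetic 24-dimensional fused feature matrix:

```python
# CatBoost over fused audio features; hyperparameter values are placeholders
# standing in for a CPO search (illustrative only).
import numpy as np
from catboost import CatBoostClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 24))       # stand-in 24-D WPT-MFCC + stats features
y = rng.integers(0, 2, size=400)     # toy loosened-vs-normal core label

model = CatBoostClassifier(iterations=300, learning_rate=0.1,
                           depth=6, random_strength=1.0, verbose=False)
model.fit(X, y)
print(model.predict(X[:5]).ravel())
```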

23 pages, 1547 KB  
Article
Roles, Risks and Responsibility: Foundations of Pro-Environmental Culture in Everyday Choices
by Olena Pavlova, Oksana Liashenko, Olena Mykhailovska, Kostiantyn Pavlov, Krzysztof Posłuszny and Antoni Korcyl
Sustainability 2025, 17(20), 9019; https://doi.org/10.3390/su17209019 - 11 Oct 2025
Abstract
This study explores how contextual framings influence sustainable decision-making in everyday situations. Building on the literature about the intention–behaviour gap, we examine the combined effect of role activation and environmental risk on pro-environmental preferences. A scenario-based behavioural experiment, conducted via oTree, integrated within-subject role framing (citizen, consumer, neutral) with randomised environmental risk conditions. Participants completed repeated binary choice tasks, where Eco-Preference was defined as the frequency with which they chose the sustainable option. The results indicate that activating a citizen role significantly increased Eco-Preference compared to consumer or neutral framings, while high-risk contexts did not directly boost sustainable behaviour. Instead, risk cues had an indirect effect through motivational states, highlighting the mediating role of Eco-Preference. Theoretically, this study advances Eco-Preference as a latent behavioural construct linking identity-based theories of responsibility with decision-based models of sustainability. Practically, the findings underscore the potential of role-based communication strategies to enhance ecological responsibility, suggesting that both policy and organisational interventions can benefit from fostering civic identities. Ultimately, the framework is applicable across cultures by offering a behavioural measure less prone to survey bias, supporting future comparative research on environmental decision-making.
(This article belongs to the Special Issue Quality of Life in the Context of Sustainable Development)

24 pages, 5244 KB  
Article
Optimizing Spatial Scales for Evaluating High-Resolution CO2 Fossil Fuel Emissions: Multi-Source Data and Machine Learning Approach
by Yujun Fang, Rong Li and Jun Cao
Sustainability 2025, 17(20), 9009; https://doi.org/10.3390/su17209009 - 11 Oct 2025
Abstract
High-resolution CO2 fossil fuel emission data are critical for developing targeted mitigation policies. As a key approach for estimating spatial distributions of CO2 emissions, top-down methods typically rely upon spatial proxies to disaggregate administrative-level emissions to finer spatial scales. However, conventional linear regression models may fail to capture complex non-linear relationships between proxies and emissions. Furthermore, methods relying on nighttime light data are mostly inadequate in representing emissions for both industrial and rural zones. To address these limitations, this study developed a multiple proxy framework integrating nighttime light, points of interest (POIs), population, road networks, and impervious surface area data. Seven machine learning algorithms—Extra-Trees, Random Forest, XGBoost, CatBoost, Gradient Boosting Decision Trees, LightGBM, and Support Vector Regression—were comprehensively incorporated to estimate high-resolution CO2 fossil fuel emissions. Comprehensive evaluation revealed that the multiple proxy Extra-Trees model significantly outperformed the single-proxy nighttime light linear regression model at the county scale, achieving R2 = 0.96 (RMSE = 0.52 MtCO2) in cross-validation and R2 = 0.92 (RMSE = 0.54 MtCO2) on the independent test set. Feature importance analysis identified brightness of nighttime light (40.70%) and heavy industrial density (21.11%) as the most critical spatial proxies. The proposed approach also showed strong spatial consistency with the Multi-resolution Emission Inventory for China, exhibiting correlation coefficients of 0.82–0.84. This study demonstrates that integrating local multiple proxy data with machine learning corrects spatial biases inherent in traditional top-down approaches, establishing a transferable framework for high-resolution emissions mapping.
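
The Extra-Trees regression at the heart of the framework is a few lines in scikit-learn; a toy sketch with invented proxy features mirroring the ones named above (nighttime light, industrial POI density, population, roads, impervious surface):

```python
# Multi-proxy Extra-Trees emission regression sketch (synthetic proxies).
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
names = ["nighttime_light", "heavy_industry_poi", "population",
         "road_density", "impervious_surface"]
X = rng.gamma(2.0, size=(500, 5))
y = 0.4 * X[:, 0] + 0.2 * X[:, 1] + rng.normal(scale=0.1, size=500)

model = ExtraTreesRegressor(random_state=0)
print("CV R2:", cross_val_score(model, X, y, cv=5, scoring="r2").mean())
model.fit(X, y)
print(dict(zip(names, model.feature_importances_.round(3))))
```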

22 pages, 618 KB  
Article
Comparison of Ensemble and Meta-Ensemble Models for Early Risk Prediction of Acute Myocardial Infarction
by Daniel Cristóbal Andrade-Girón, Juana Sandivar-Rosas, William Joel Marin-Rodriguez, Marcelo Gumercindo Zúñiga-Rojas, Abrahán Cesar Neri-Ayala and Ernesto Díaz-Ronceros
Informatics 2025, 12(4), 109; https://doi.org/10.3390/informatics12040109 - 11 Oct 2025
Abstract
Cardiovascular disease (CVD) is a major cause of mortality around the world, underscoring the critical need for effective predictive tools to inform clinical decision-making. This study aimed to compare the predictive performance of ensemble learning algorithms, including Bagging, Random Forest, Extra Trees, Gradient Boosting, and AdaBoost, when applied to a clinical dataset comprising patients with CVD. The methodology entailed data preprocessing and cross-validation to assess generalization. Model performance was evaluated using a variety of metrics, including accuracy, F1 score, precision, recall, Cohen’s Kappa, and area under the curve (AUC). Among the models evaluated, Bagging demonstrated the best overall performance (accuracy ± SD: 93.36% ± 0.22; F1 score: 0.936; AUC: 0.9686). It also reached the lowest average rank (1.0) in the Friedman test and was placed, together with Extra Trees (accuracy ± SD: 90.76% ± 0.18; F1 score: 0.916; AUC: 0.9689), in the superior statistical group (group A) according to the Nemenyi post hoc test. The two models demonstrated a high degree of agreement with the actual labels (Kappa: 0.87 and 0.83, respectively), substantiating their reliability in authentic clinical contexts. The findings confirm the strength of aggregation-based ensemble methods in terms of accuracy, stability, and concordance, highlighting Bagging and Extra Trees as optimal candidates for cardiovascular diagnostic support systems, where reliability and generalization are paramount.
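
The comparison protocol (per-fold scores for each ensemble, then a Friedman test across models) sketches out as follows on synthetic data; the Nemenyi post hoc step would additionally need a package such as scikit-posthocs:

```python
# Cross-validated ensemble comparison with a Friedman test over fold scores.
import numpy as np
from scipy.stats import friedmanchisquare
from sklearn.datasets import make_classification
from sklearn.ensemble import (BaggingClassifier, ExtraTreesClassifier,
                              GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=600, random_state=0)
models = {"bagging": BaggingClassifier(random_state=0),
          "random_forest": RandomForestClassifier(random_state=0),
          "extra_trees": ExtraTreesClassifier(random_state=0),
          "gradient_boosting": GradientBoostingClassifier(random_state=0)}

scores = {name: cross_val_score(m, X, y, cv=10, scoring="f1")
          for name, m in models.items()}
for name, s in scores.items():
    print(f"{name}: {s.mean():.3f} +/- {s.std():.3f}")
print(friedmanchisquare(*scores.values()))
```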
