Search Results (1,057)

Search Parameters:
Keywords = feature selection ensemble

16 pages, 625 KB  
Article
Benchmarking Training Emissions of Regression Models for Vehicle CO2 Prediction
by Mahmut Turhan, Murat Emeç and Muzaffer Ertürk
Sustainability 2026, 18(6), 2830; https://doi.org/10.3390/su18062830 - 13 Mar 2026
Viewed by 82
Abstract
The urgency of climate action has intensified the use of machine learning (ML) to predict vehicular CO2 emissions; however, the training of machine learning models also generates computational emissions that are seldom reported. This study addresses a paradox central to Green AI: can carbon-intensive algorithms be justified for predicting carbon emissions? Using a public dataset of 7385 light-duty vehicles, we trained nine widely used regression models spanning simple linear baselines, polynomial and regularised linear methods, tree-based learners, ensembles, and a neural network. All experiments were instrumented with CodeCarbon to quantify real-time training footprints under a grid carbon intensity of 450 g CO2/kWh. Across models, test performance ranged from R2 = 0.72 to 0.99, yet training emissions varied by four orders of magnitude, from 0.001 g CO2 (simple linear regression) to 2.3 g CO2 (XGBoost). Although XGBoost achieved the highest accuracy (R2 = 0.9947), it emitted approximately 2300× more CO2 than regularised polynomial linear models for only a 0.39-point gain in R2. Pareto analysis identifies Lasso and Ridge regression with degree-4 polynomial features as sustainability-optimal, reaching R2 = 0.9908 at ~0.004 g CO2. To unify predictive and environmental efficiency, we introduce Accuracy-per-Gram (APG = R2/CO2) and Marginal Emissions Cost (MEC = ΔCO2/ΔR2), demonstrating a steep efficiency cliff beyond regularised linear models. At the fleet scale (100 million vehicles with daily retraining), algorithm choice implies ~84 t CO2/year for XGBoost versus ~0.15 t for Lasso, highlighting the potential climate cost of marginal accuracy gains. We provide a reproducible carbon-tracking pipeline, Green-AI evaluation metrics, and deployment guidance, arguing that computational sustainability must co-determine model selection for emissions-related ML systems. 
Most critically, we identify a clear accuracy–carbon emission Pareto frontier, demonstrating that regularised polynomial linear models lie on the sustainability-optimal boundary, while widely used ensemble methods such as XGBoost sit beyond an “efficiency cliff,” where marginal accuracy improvements incur disproportionately high carbon costs.
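
The two metrics named in this abstract, APG = R2/CO2 and MEC = ΔCO2/ΔR2, can be illustrated with the figures the abstract quotes. The sketch below is only an illustration: the function names and dictionary layout are hypothetical, and the inputs are the rounded values from the abstract (R2 = 0.9908 at about 0.004 g CO2 for the regularised polynomial models, R2 = 0.9947 at 2.3 g CO2 for XGBoost), not the paper's raw measurements.

```python
# Minimal sketch of Accuracy-per-Gram (APG) and Marginal Emissions Cost (MEC).
# Function names are hypothetical; inputs are rounded figures from the abstract.

def accuracy_per_gram(r2: float, co2_grams: float) -> float:
    """APG = R2 / CO2: predictive accuracy obtained per gram of training CO2."""
    return r2 / co2_grams


def marginal_emissions_cost(r2_base: float, co2_base: float,
                            r2_new: float, co2_new: float) -> float:
    """MEC = delta CO2 / delta R2: extra grams of CO2 per unit of R2 gained."""
    return (co2_new - co2_base) / (r2_new - r2_base)


lasso = {"r2": 0.9908, "co2_g": 0.004}   # regularised polynomial model
xgboost = {"r2": 0.9947, "co2_g": 2.3}   # most accurate model

apg_lasso = accuracy_per_gram(lasso["r2"], lasso["co2_g"])
apg_xgb = accuracy_per_gram(xgboost["r2"], xgboost["co2_g"])
mec = marginal_emissions_cost(lasso["r2"], lasso["co2_g"],
                              xgboost["r2"], xgboost["co2_g"])

print(f"APG (Lasso):   {apg_lasso:.1f} R2 per g CO2")
print(f"APG (XGBoost): {apg_xgb:.3f} R2 per g CO2")
print(f"MEC (Lasso -> XGBoost): {mec:.1f} g CO2 per unit R2")
```

With these rounded inputs, APG drops from roughly 248 to roughly 0.43 when moving from Lasso to XGBoost, which is one way to read the "efficiency cliff" the abstract describes.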

28 pages, 3210 KB  
Article
Employee Attrition Prediction: An Explanatory and Statistically Robust Ensemble Learning Model
by Ghalia Nassreddine, Jamil Hammoud, Obada Al-Khatib and Mohamad Al Majzoub
Computers 2026, 15(3), 185; https://doi.org/10.3390/computers15030185 - 12 Mar 2026
Viewed by 192
Abstract
Organizational productivity and workforce management are highly affected by employee attrition. Thus, an employee attrition prediction system may allow human resource management to enhance the workplace by minimizing attrition. This study proposes a new and interpretable ensemble learning framework for employee attrition prediction. The model integrates SHapley Additive exPlanations (SHAP)-based feature selection, Optuna hyperparameter optimization, and dual explainability using SHAP and Local Interpretable Model-agnostic Explanations (LIME). Random oversampling (ROS) is used to address class imbalance. The proposed framework allows for both global and local interpretability, enabling actionable insights into retention drivers. It was assessed using two benchmark datasets: the Kaggle HR Analytics dataset (14,999 records) and the IBM HR dataset (1470 records). The results revealed that the most impactful factors on employee attrition are promotion history, tenure, job satisfaction, workload, average monthly hours, overtime, and financial incentives. Furthermore, the proposed model achieved exceptional performance on both datasets. On the Kaggle dataset, it reached an accuracy of 98.72%, an F1-score of 97.29%, and an ROC–AUC of 0.994, while on the IBM dataset, it produced an accuracy of 97.72%, an F1-score of 97.74%, and an ROC–AUC of 0.995. Moreover, the proposed approach shows high computational efficiency, demonstrating that it is suitable for real-world deployment. These findings indicate that integrating explainable AI techniques, resampling tools, and automated hyperparameter tuning can achieve robust, accurate, and actionable employee attrition predictions, supporting HR managers’ decision-making.
(This article belongs to the Special Issue Machine Learning: Innovation, Implementation, and Impact)

28 pages, 9784 KB  
Article
Bayesian-Optimized Ensemble Learning for Music Popularity Prediction with Shapley-Based Interpretability
by Liang Qiu, Penghui Wang, Jing Zhao, Hong Zhang and Mujiangshan Wang
Mathematics 2026, 14(6), 946; https://doi.org/10.3390/math14060946 - 11 Mar 2026
Viewed by 1375
Abstract
Music popularity prediction is a fundamental problem in music information retrieval, with important implications for digital content dissemination and creative decision-making on streaming platforms. In this study, music popularity prediction is formulated as a supervised regression problem, and six widely used tree ensemble models (Random Forest, XGBoost, CatBoost, LightGBM, Extra Trees, and Decision Tree) are systematically evaluated using large-scale Spotify data. Among these models, Random Forest achieves the best predictive performance on this dataset (RMSE = 6.79, MAE = 5.10, and R2 = 0.6658), followed by Extra Trees (R2 = 0.6378) and Decision Tree (R2 = 0.6328). Bayesian hyperparameter optimization based on a Tree-structured Parzen Estimator with an Expected Improvement acquisition function is conducted over 50 trials with 5-fold cross-validation to ensure robust model selection. Shapley value decomposition via SHAP analysis reveals that temporal recency dominates feature importance, far surpassing traditional musical attributes, while acoustic intensity (loudness) exhibits a U-shaped contribution pattern with optimal values at moderate intensity levels. Further SHAP dependence analysis uncovers non-linear relationships, indicating substantial popularity advantages for recent releases and optimal loudness levels around 5 to 0 dB. These findings suggest that streaming popularity is primarily governed by temporal exposure dynamics and production-related characteristics rather than intrinsic musical structure, offering both theoretical insights for music information retrieval research and suggestive empirical patterns that may inform future investigations into digital music ecosystems.

19 pages, 2380 KB  
Article
DTBAffinity: A Multi-Modal Feature Engineering and Gradient-Boosting Framework for Drug–Target Binding Affinity on Davis and KIBA Benchmarks
by Meshari Alazmi
Computers 2026, 15(3), 182; https://doi.org/10.3390/computers15030182 - 10 Mar 2026
Viewed by 163
Abstract
An accurate prediction of how strongly a drug binds to its target (where the drug will have the desired effect) is very important for drug discovery. It helps select the most promising compounds and saves money by requiring fewer experiments. We present DTBAffinity, a multi-modal regression framework that integrates chemically meaningful ligand descriptors with diverse protein sequence features in a unified gradient-boosting model. The representation of ligands includes physicochemical and topological descriptors (RDKit and Mordred), structural keys (MACCS and FP4), circular fingerprints (ECFP/Morgan), and SMILES-derived features from iFeatureOmega. For proteins, thousands of sequence-derived descriptors (composition, autocorrelations, physicochemical profiles, and evolutionary indices) from iFeatureOmega are used, together with contextual embeddings from large protein language models (ESM-1b, ESM-2). The feature matrices are cleaned, variance-filtered, z-score scaled, and reduced by univariate selection before being concatenated and modeled with regularized XGBoost ensembles. We evaluate DTBAffinity on two commonly used kinase-centric datasets: Davis (30,056 interactions: pKd values) and KIBA (118,254 interactions: integrated affinity scores). Performance is measured with several metrics, including MSE, R2, Pearson/Spearman correlations, Concordance Index (CI), rm2, and AUPR. On Davis, DTBAffinity yields MSE = 0.1885, CI = 0.9102, and AUPR = 0.8112, and on KIBA, it gives MSE = 0.1540, CI = 0.8686, and AUPR = 0.8361, outperforming state-of-the-art baselines such as KronRLS, SimBoost, DeepDTA, and GraphDTA. These findings imply that combining interpretable descriptors and contextual embeddings in a robust boosting framework is an effective way to achieve accurate, interpretable, and generalizable DTBA prediction.
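
The Concordance Index (CI) reported above measures how often a model ranks pairs of interactions in the same order as their measured affinities. The following is a minimal pairwise sketch of that idea, not the paper's evaluation code: the function name and toy values are hypothetical, pairs with equal true affinity are skipped, and tied predictions are counted as half-concordant, which is a common convention assumed here.

```python
# Minimal O(n^2) sketch of a concordance index for regression outputs.
# Assumption: pairs with identical true values are not comparable, and
# tied predictions on a comparable pair score 0.5.

def concordance_index(y_true, y_pred):
    concordant = 0.0
    comparable = 0
    n = len(y_true)
    for i in range(n):
        for j in range(i + 1, n):
            if y_true[i] == y_true[j]:
                continue  # equal affinities carry no ordering information
            comparable += 1
            diff_true = y_true[i] - y_true[j]
            diff_pred = y_pred[i] - y_pred[j]
            if diff_pred == 0:
                concordant += 0.5  # tied prediction
            elif (diff_true > 0) == (diff_pred > 0):
                concordant += 1.0  # same ordering as the measurements
    if comparable == 0:
        raise ValueError("no comparable pairs: all true values are equal")
    return concordant / comparable


# Hypothetical affinities (e.g. pKd-like values) and model predictions.
measured = [5.0, 6.2, 7.1, 8.4]
predicted = [5.3, 6.0, 7.5, 7.9]
print(concordance_index(measured, predicted))
```

A value of 1.0 means every comparable pair is ranked correctly, 0.5 corresponds to random ordering, and 0.0 to a fully reversed ranking.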
(This article belongs to the Special Issue AI in Bioinformatics)

32 pages, 3089 KB  
Article
Systematic Evaluation of Machine Learning and Deep Learning Models for IoT Malware Detection Across Ransomware, Rootkit, Spyware, Trojan, Botnet, Worm, Virus, and Keylogger
by Mazdak Maghanaki, Soraya Keramati, F. Frank Chen and Mohammad Shahin
Sensors 2026, 26(6), 1750; https://doi.org/10.3390/s26061750 - 10 Mar 2026
Viewed by 269
Abstract
The rapid growth of Internet-of-Things (IoT) deployments has substantially expanded the attack surface of modern cyber–physical systems, making accurate and computationally feasible malware detection essential for enterprise and industrial environments. This study presents a large-scale, systematic comparison of 27 machine learning (ML) and 18 deep learning (DL) models for IoT malware detection across eight major malware categories: Trojan, Botnet, Ransomware, Rootkit, Worm, Spyware, Keylogger, and Virus. A realistic dataset was constructed using 50,000 executable samples collected from the Any.Run platform, including 8000 malware instances (1000 per class) and 42,000 benign samples. Each sample was executed in a sandbox to extract detailed static and behavioral telemetry. A targeted feature-selection pipeline reduced the feature space to 47 diagnostic features spanning static properties, behavioral indicators, process/file/registry activity, debug signals, and network telemetry, yielding a compact representation suitable for malware detection in IoT settings. Experimental results demonstrate that ensemble tree-based ML models consistently dominate performance on the engineered tabular feature set as 7 of the top 10 models are ML, with CatBoost and LightGBM achieving near-ceiling accuracy and low false-positive rates. Per-malware analysis further shows that optimal model choice depends on malware behavior. CatBoost is best for Trojan/Spyware, LightGBM for Botnet, XGBoost for Worm, Extra Trees for Rootkit, and Random Forest for Keylogger, while DL models are competitive only for specific categories, with TabNet performing best for Ransomware and FT-Transformer for Virus. In addition, an end-to-end computational time analysis across all 45 models reveals a clear efficiency advantage for boosted tree ensembles relative to most DL architectures, supporting deployment feasibility on commodity CPU hardware. 
Overall, the study provides actionable guidance for designing adaptive IoT malware detection frameworks, recommending gradient-boosted ensemble ML models as the primary deployment choice, with selective DL models only when category-specific gains justify additional computational cost.
(This article belongs to the Special Issue Intelligent Sensors for Security and Attack Detection)

20 pages, 3757 KB  
Article
Short-Term Photovoltaic Power Forecasting Using a Hybrid RF-ICEEMDAN-SE-RWCE-GRU Model
by Chuang Li, Xiaohuang Huang, Mang Su, Huanhuan Duan, Weile Cao and Guomin Cui
Energies 2026, 19(6), 1386; https://doi.org/10.3390/en19061386 - 10 Mar 2026
Viewed by 191
Abstract
To enhance the accuracy of short-term photovoltaic (PV) power forecasting, this study proposes a novel hybrid model that integrates Random Forest (RF), Improved Complete Ensemble Empirical Mode Decomposition with Adaptive Noise (ICEEMDAN), Sample Entropy (SE), the Random Walk with Compulsory Evolution (RWCE) algorithm, and the Gated Recurrent Unit (GRU) network. Initially, RF is applied to select relevant meteorological features, minimizing redundancy and improving both training efficiency and predictive robustness under complex operating conditions. ICEEMDAN is then employed to decompose the PV power series into multiple quasi-stationary components, mitigating the adverse effects of non-stationarity on forecasting accuracy. Following this, SE is applied to quantify the complexity of each component and reconstruct the decomposed signals into high-, mid-, and low-frequency bands, simplifying the inputs to the forecasting model. To further improve performance, the RWCE algorithm optimizes GRU network hyperparameters through global exploration, individual evolution, and enforced evolution strategies. The optimized GRU network then predicts each reconstructed component, and the component-wise forecasts are aggregated to yield the final PV power output. Simulation results from several representative months indicate that the proposed approach reduces RMSE by an average of 9.02% compared to the comparison model and by 43.41% relative to the baseline model, demonstrating its superior forecasting capability. Additionally, the model demonstrated scalability across varying climate conditions, confirming its applicability in real-world scenarios.
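
The Sample Entropy step, which scores how irregular each decomposed component is, can be sketched in a few lines. This is a generic textbook-style formulation rather than the authors' implementation: the embedding dimension `m = 2` and the absolute tolerance `r = 0.2` are assumed defaults (many applications instead scale `r` by the series' standard deviation), and the two example series are hypothetical.

```python
import math
import random

# Minimal sketch of Sample Entropy: SampEn = -ln(A / B), where B counts pairs
# of length-m templates within tolerance r (Chebyshev distance) and A counts
# the same for length m + 1. Self-matches are excluded.

def sample_entropy(series, m=2, r=0.2):
    n = len(series)
    if n <= m + 1:
        raise ValueError("series too short for the chosen embedding dimension")

    def count_matches(length):
        # Use the same number of templates (n - m) for both lengths.
        templates = [series[i:i + length] for i in range(n - m)]
        count = 0
        for i in range(len(templates)):
            for j in range(i + 1, len(templates)):
                dist = max(abs(a - b) for a, b in zip(templates[i], templates[j]))
                if dist <= r:
                    count += 1
        return count

    b = count_matches(m)
    a = count_matches(m + 1)
    if a == 0 or b == 0:
        return math.inf  # no matches: entropy is undefined / unbounded
    return -math.log(a / b)


# Hypothetical components: a perfectly regular one and a noisy one.
regular = [0.0, 1.0] * 50
rng = random.Random(0)
noisy = [rng.random() for _ in range(100)]
print(sample_entropy(regular), sample_entropy(noisy))
```

Lower values indicate more regular components, so grouping components by their entropy is one plausible way to form the high-, mid-, and low-frequency bands the abstract mentions; the grouping thresholds themselves are not given in the abstract.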

23 pages, 7301 KB  
Article
Estimation of Complex Heterogeneous Stand Canopy Height Using a Bi-Directional Stacking Model Framework with Multi-Forest-Type Feature Fusion Based on GEDI and Multi-Source Remote Sensing Data
by Zhiyong Wu, Jirong Ding, Juncheng Huang, Yehua Liang, Jianjun Chen and Haotian You
Forests 2026, 17(3), 337; https://doi.org/10.3390/f17030337 - 8 Mar 2026
Viewed by 141
Abstract
Forest canopy height (FCH) is a fundamental parameter for carbon assessment and ecological monitoring. The Global Ecosystem Dynamics Investigation (GEDI) mission provides full-waveform LiDAR for FCH estimation, yet discontinuous sampling and heterogeneous forests increase uncertainty. Conventional ensemble models rarely account for forest-type-specific structures and remote sensing responses, reducing accuracy and stability. We propose a regional framework combining MBF-Contrast feature selection and a bi-directional stacked model with multi-forest-type feature fusion (BS-MFTF). MBF-Contrast integrates model-based importance with feature-distribution diagnostics to remove redundant and multicollinear variables. BS-MFTF leverages complementarity and structural differences among forest types to improve modeling in heterogeneous canopies. MBF-Contrast reduces feature dimensionality by ~35% versus Mutual Information, Boruta, and RFECV, and improves performance across forest types. BS-MFTF overall achieves R2 = 0.68 (RMSE = 3.22 m; MAE = 2.34 m). Airborne LiDAR validation shows high consistency (R2 = 0.59; RMSE = 1.31 m; MAE = 1.00 m) and a 15%–25% R2 gain over conventional ensembles. The framework offers a scalable solution for large-scale FCH estimation in structurally diverse forests.
(This article belongs to the Special Issue Climate-Smart Forestry: Forest Monitoring in a Multi-Sensor Approach)

30 pages, 6906 KB  
Article
A Method for Predicting Alfalfa Biomass Based on Multimodal Data and Ensemble Learning Model
by Yuehua Zhang, Zhaoming Wang, Zhendong Tian, Haotian Deng, Jungang Gao, Chen Chen, Wei Zhao, Xiaoping Ma, Xueqin Ding, Haoran Yan, Liu Yang, Hui Xie, Qing Li and Fengling Shi
Plants 2026, 15(5), 815; https://doi.org/10.3390/plants15050815 - 6 Mar 2026
Viewed by 294
Abstract
Accurate alfalfa biomass prediction is crucial for pasture management and sustainable livestock production. However, traditional methods often perform poorly under complex field conditions. To address the limited prediction accuracy of traditional methods under complex planting environments, this study proposes an alfalfa biomass prediction method combining multispectral and LiDAR data with an ensemble learning model. First, based on multispectral images acquired by an unmanned aerial vehicle (UAV) and airborne LiDAR data, spectral features, three-dimensional structural features, and their interaction features are systematically extracted at the quadrat scale, and a high-quality modeling dataset is constructed through feature selection. Second, an ensemble model for alfalfa biomass prediction was constructed from random forest, extra trees, and histogram gradient boosting. After model training, the coefficient of determination (R2) of the ensemble model on the test set reached 0.813, and the root mean square error (RMSE) and mean absolute error (MAE) were 0.178 kg m−2 and 0.146 kg m−2, respectively, significantly better than those of comparable single models. Across feature combinations, the fusion model outperformed models using spectral indices only (R2 = 0.773) or LiDAR traits only (R2 = 0.576), and it achieved its highest accuracy from bud emergence to early flowering (R2 = 0.917). The overall prediction error was approximately normally distributed, and the absolute error of more than 65% of the samples was less than 0.2. However, the model still tended to underestimate in the high-biomass interval. This research shows that multimodal data fusion and ensemble learning can achieve high-precision prediction of alfalfa biomass, providing reliable technical support for pasture resource monitoring and precision agriculture management.
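
The abstract does not say how the three base learners are combined, so the sketch below assumes the simplest option, an unweighted average of their predictions, and then computes the three reported metrics (R2, RMSE, MAE). All names and numbers are hypothetical placeholders, not the study's data.

```python
import math

# Minimal sketch: average the predictions of several base regressors and
# score the result. Unweighted averaging is an assumption, not the paper's
# documented method.

def average_predictions(*model_predictions):
    """Element-wise mean of the prediction lists of several models."""
    return [sum(values) / len(values) for values in zip(*model_predictions)]


def regression_metrics(y_true, y_pred):
    """Return R2, RMSE and MAE for paired observations and predictions."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n,
    }


# Hypothetical biomass values (kg per square metre) and base-model outputs.
observed = [0.8, 1.2, 1.5, 2.1]
rf_pred = [0.9, 1.1, 1.6, 1.9]    # stands in for random forest
et_pred = [0.7, 1.3, 1.4, 2.0]    # stands in for extra trees
hgb_pred = [0.8, 1.2, 1.5, 1.8]   # stands in for histogram gradient boosting

ensemble_pred = average_predictions(rf_pred, et_pred, hgb_pred)
print(regression_metrics(observed, ensemble_pred))
```

In this toy example the only remaining error is on the largest observation, which the averaged model predicts too low, loosely echoing the underestimation at high biomass that the abstract notes.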
(This article belongs to the Section Plant Modeling)

28 pages, 5263 KB  
Article
Inversion of Soil Arsenic Concentration in Sanlisha’an Mining Area Based on ZY-02E Hyperspectral Satellite Images
by Yuqin Li, Dan Meng, Qi Yang, Mengru Zhang and Yue Zhao
Remote Sens. 2026, 18(5), 822; https://doi.org/10.3390/rs18050822 - 6 Mar 2026
Viewed by 305
Abstract
Soil heavy metal pollution caused by mineral resource extraction activities poses a serious threat to the ecological environment within and surrounding mining areas. As a highly concealed toxic heavy metal, arsenic (As) urgently requires the establishment of efficient pollution monitoring methods to achieve pollution prevention and control, as well as environmental remediation in mining areas. This study investigated the feasibility of hyperspectral remote sensing inversion for soil heavy metal arsenic based on ZY-1 02E hyperspectral satellite imagery, focusing on a mining area and its surrounding soils in Sanlisha’an, Wuxuan County, Guangxi. Fully Constrained Least Squares (FCLS) was employed to separate mixed pixels and enhance soil spectral contributions in ZY-1 02E imagery, thereby mitigating vegetation interference. Six mathematical transformations, including RT, AT, FD, RTFD, ATFD, and SD, were applied to both the original and enhanced spectra to enhance spectral features. The correlations between the transformed spectra, as well as the original image spectra (S), and soil As concentration were analyzed; then the spectra strongly correlated with soil As concentration were selected to construct Ratio Spectral Index (RSI) and Normalized Difference Spectral Index (NDSI). Correlation matrices were calculated between RSI/NDSI indices and As concentration. Sensitive features were screened using an improved Successive Projection Algorithm (SPA). As concentration inversion was then performed with four models: two traditional regression models (PLSR and MLR) and two ensemble learning models (RF and XGBoost). In the soil contribution-enhanced spectral modeling results, the optimal transformation–index combination is ATFD-NDSI. The performance indicators of each model are as follows: MLR test set R2 = 0.65, PLSR test set R2 = 0.62, RF test set R2 = 0.7, and XGBoost test set R2 = 0.64. The results indicate that the ATFD-NDSI-RF ensemble model provides the best performance. 
By integrating multiple decision trees, RF effectively handles complex nonlinear relationships, thus enhancing the accuracy and generalization ability of prediction. The analysis of ATFD-NDSI-RF inversion results based on sampling points indicates that model error correlates with the pollution intensity gradient, with greater errors in high-concentration areas, while still maintaining strong correlations (tailings reservoir: r = 0.92, forested areas: r = 0.96, and cropland: r = 0.83). The spatial distribution of the inversion results closely matches that of IDW interpolation. Areas with high As concentrations are concentrated in the tailings reservoir and in the southeastern part of the study area. The correlation coefficient between the inversion results and IDW interpolation is 0.6, which further verifies that the inversion results effectively reproduce the spatial distribution trend of highly polluted areas.
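
The two-band indices named in this abstract can be sketched as follows. The abstract does not give the formulas, so the sketch assumes the forms commonly used in the remote-sensing literature, RSI = Ri / Rj and NDSI = (Ri − Rj) / (Ri + Rj), and then searches all band pairs for the index most strongly correlated with the target concentration. Function names and the toy spectra are hypothetical.

```python
import math

# Minimal sketch of two-band spectral indices and a band-pair search.
# Assumed forms: RSI = Ri / Rj and NDSI = (Ri - Rj) / (Ri + Rj).

def ratio_index(r_i, r_j):
    return r_i / r_j


def normalized_difference_index(r_i, r_j):
    return (r_i - r_j) / (r_i + r_j)


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either series is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def best_band_pair(spectra, target, index_fn):
    """Return (i, j, r) for the band pair whose index has the largest |r|."""
    n_bands = len(spectra[0])
    best = (None, None, 0.0)
    for i in range(n_bands):
        for j in range(i + 1, n_bands):
            values = [index_fn(sample[i], sample[j]) for sample in spectra]
            r = pearson(values, target)
            if abs(r) > abs(best[2]):
                best = (i, j, r)
    return best


# Hypothetical 3-band spectra for three soil samples and their As values.
toy_spectra = [[2.0, 1.0, 3.0], [4.0, 1.0, 9.0], [6.0, 1.0, 4.0]]
toy_arsenic = [2.0, 4.0, 6.0]
print(best_band_pair(toy_spectra, toy_arsenic, ratio_index))
```

In the study this kind of search would run over the transformed spectra (for example ATFD) rather than raw values, and the selected indices would then feed the regression models; those steps are outside this sketch.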

20 pages, 3607 KB  
Article
Forest Aboveground Carbon Storage in the Three Parallel Rivers Region: A Remote Sensing and Machine Learning Perspective
by Qin Xiang, Rong Wei, Chaoguan Qin, Lianjin Fu, Zhengying Li, Hailin He and Qingtai Shu
Remote Sens. 2026, 18(5), 756; https://doi.org/10.3390/rs18050756 - 2 Mar 2026
Viewed by 234
Abstract
Accurate estimation of forest aboveground carbon (AGC) is crucial for understanding the carbon cycle and formulating climate policies, yet it remains challenging in complex mountainous regions. This study used a machine learning framework to estimate the spatiotemporal dynamics of AGC in the Three Parallel Rivers region of China from 2003 to 2024. By integrating China’s National Forest Continuous Inventory (NFCI) data with multispectral satellite imagery, we employed a two-stage feature selection strategy to identify key predictor variables. Among three ensemble algorithms tested, the Random Forest model achieved the optimal performance (R2 = 0.74). The results indicated a net increase of 67.05 Tg in total AGC over the two decades, with a spatial pattern characterized by higher densities in the west and north. Geographical Detector analysis revealed that the driving forces were synergistic, with the interaction between temperature and population density exhibiting the most prominent explanatory capacity. This study provides a high-resolution (30 m) benchmark for AGC in a global biodiversity hotspot and underscores the critical role of ecological protection policies in enhancing carbon sequestration, offering valuable insights for managing similar mountain ecosystems worldwide.

13 pages, 2177 KB  
Article
Urine-Based cfDNA Ensemble Modeling for Early Detection of Bladder Cancer Using Whole-Genome Methylation Sequencing
by Taehoon Kim, Dongju Shin, Hyun Kyu Ahn, Young Joon Moon, Duhee Bang and Kwang Hyun Kim
Cancers 2026, 18(5), 767; https://doi.org/10.3390/cancers18050767 - 27 Feb 2026
Viewed by 347
Abstract
Early detection of bladder cancer poses a major challenge for liquid biopsy due to limited tumor burden and low abundance of tumor-derived DNA. In such low-signal settings, detection sensitivity critically depends on both biofluid selection and effective integration of weak, distributed molecular signals. We analyzed Enzymatic Methyl-seq (EM-seq) data on 41 matched urine–plasma pairs, which demonstrated that urine samples exhibited significantly higher tumor fractions and greater concordance with tissue methylation profiles than plasma. Based on this observation, we developed a urine-based bladder cancer detection framework using EM-seq. We profiled 143 urine samples (68 bladder cancer and 75 healthy controls) and 14 bladder cancer tissues. Methylation markers (113,052 regions) were identified by comparing cancer tissues (n = 14) with urine from healthy individuals (n = 14). Using XGBoost, possible features and their combinations were evaluated, with the combination of methylation and copy number variations (CNV) yielding the best performance as the final ensemble model. When evaluated on an independent test set, the model achieved 91.9% sensitivity at 80% specificity, with an area under the curve (AUC) of 0.932 for bladder cancer detection and 0.928 for non-muscle invasive bladder cancer (NMIBC) detection. Notably, the model successfully detected four of seven mutation-negative cases, demonstrating complementary value to mutation-based approaches.
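
A figure such as "sensitivity at 80% specificity" is obtained by fixing a score threshold that keeps the required share of controls below it and then counting how many cancer cases score above it. The sketch below illustrates that procedure on made-up scores; the function name, the "score above threshold means positive" rule, and the data are assumptions, not details from the study.

```python
# Minimal sketch: find the lowest score threshold that reaches a target
# specificity, then report the sensitivity at that threshold.
# Convention assumed here: a sample is called positive if score > threshold.

def sensitivity_at_specificity(scores, labels, target_specificity=0.80):
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative sample")
    for threshold in sorted(set(scores)):
        specificity = sum(s <= threshold for s in negatives) / len(negatives)
        if specificity >= target_specificity:
            sensitivity = sum(s > threshold for s in positives) / len(positives)
            return threshold, specificity, sensitivity
    raise ValueError("target specificity cannot be reached")


# Hypothetical classifier scores: 5 controls (label 0) and 5 cases (label 1).
toy_scores = [0.1, 0.2, 0.3, 0.4, 0.9, 0.35, 0.6, 0.7, 0.8, 0.95]
toy_labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
print(sensitivity_at_specificity(toy_scores, toy_labels, 0.80))
```

Scanning thresholds from low to high and stopping at the first one that meets the target gives the highest sensitivity compatible with that specificity on the given data.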
(This article belongs to the Section Cancer Causes, Screening and Diagnosis)

42 pages, 1422 KB  
Article
Exploring Handwriting-Based Biomarkers for Alzheimer’s Disease: Identifying Discriminative Features and Tasks to Enhance Diagnostic Accuracy
by Cansu Akyürek Anacur, Asuman Günay Yılmaz and Bekir Dizdaroğlu
Diagnostics 2026, 16(5), 697; https://doi.org/10.3390/diagnostics16050697 - 26 Feb 2026
Viewed by 232
Abstract
Background/Objectives: This study proposes a comprehensive classification framework for the automatic detection of Alzheimer’s disease using handwriting data. An enriched feature space is constructed by combining 18 baseline features extracted from raw handwriting signals with 30 additional features derived from established handwriting analysis studies, resulting in a total of 48 features. To enhance clinical practicality, a task reduction analysis is conducted by comparing the full dataset containing 25 handwriting tasks with a reduced dataset comprising 14 selected tasks. Methods: The proposed framework employs a two-stage evaluation strategy involving four feature selection methods (Random Forest Feature Importance, Extreme Gradient Boosting Feature Importance, L1 Regularization and Recursive Feature Elimination), three normalization techniques (Unnormalized, Min–Max and Z-Score), and five baseline machine learning classifiers (Random Forest, Logistic Regression, Multilayer Perceptron, XGBoost and Support Vector Machines). In the second stage, a dynamic ensemble learning strategy is introduced, where the most effective classifiers are adaptively selected for each cross-validation fold and integrated using soft and hard voting schemes. Results: The experimental results demonstrate that reducing the number of tasks leads to an improvement in average classification accuracy from 79.47% to 81.03%, while simultaneously decreasing training time and memory consumption by approximately 40% and 35%, respectively. The highest classification performance, achieving an accuracy of 94.20%, is obtained using the Hard Ensemble combined with L1-based feature selection. Conclusions: These findings highlight that the joint use of enriched feature representations, task reduction, and dynamic ensemble learning provides an effective and computationally efficient solution for handwriting-based Alzheimer’s disease detection.
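
The difference between the hard and soft voting schemes mentioned above can be shown with a small binary example. This sketch covers only the voting step, not the per-fold classifier selection; the function names, the 0.5 decision threshold, and the probabilities are hypothetical.

```python
from collections import Counter

# Minimal sketch of hard vs. soft voting for binary classification.
# Each inner list holds one classifier's outputs for all samples.

def hard_vote(predicted_labels):
    """Majority vote over the class labels predicted by each classifier."""
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predicted_labels)]


def soft_vote(predicted_probs, threshold=0.5):
    """Average the predicted probabilities of class 1, then threshold."""
    averaged = [sum(p) / len(p) for p in zip(*predicted_probs)]
    return [1 if a >= threshold else 0 for a in averaged]


# Hypothetical P(class 1) from three classifiers for two samples.
probs = [
    [0.9, 0.2],  # classifier A
    [0.4, 0.1],  # classifier B
    [0.4, 0.3],  # classifier C
]
labels = [[1 if p >= 0.5 else 0 for p in clf] for clf in probs]

print("hard:", hard_vote(labels))
print("soft:", soft_vote(probs))
```

For the first sample two of three classifiers vote for class 0, so hard voting returns 0, while the one confident classifier lifts the mean probability above 0.5 and soft voting returns 1. This is why the two schemes can give different ensemble accuracies.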
(This article belongs to the Section Machine Learning and Artificial Intelligence in Diagnostics)
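The difference between the hard and soft voting schemes named in the abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `hard_vote` and `soft_vote` and the three sets of classifier outputs are hypothetical, chosen only to show a case where the two schemes disagree.

```python
from collections import Counter


def hard_vote(label_predictions):
    """Return the majority class label among the classifiers' predicted labels."""
    counts = Counter(label_predictions)
    # most_common orders by count; equal counts keep first-seen order.
    return counts.most_common(1)[0][0]


def soft_vote(probability_predictions):
    """Average the classifiers' class-probability vectors and return the argmax class."""
    n_models = len(probability_predictions)
    n_classes = len(probability_predictions[0])
    mean_probs = [
        sum(p[c] for p in probability_predictions) / n_models
        for c in range(n_classes)
    ]
    return max(range(n_classes), key=lambda c: mean_probs[c])


# Hypothetical outputs of three classifiers for one sample: [P(class 0), P(class 1)].
probs = [[0.55, 0.45], [0.60, 0.40], [0.10, 0.90]]
# Each classifier's hard label is the class with its highest probability.
labels = [max(range(2), key=lambda c: p[c]) for p in probs]

print("hard vote:", hard_vote(labels))
print("soft vote:", soft_vote(probs))
```

In this made-up case two weakly confident classifiers outvote one strongly confident classifier under hard voting, while soft voting lets the confident one dominate the average.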

30 pages, 2394 KB  
Article
Machine-Learning-Derived, Mechanistically Informed Transcriptomic Signature to Diagnose Active Tuberculosis and Guide Host-Directed Therapy
by Asif Hassan Syed, Nashwan Alromema, Hatem A. Almazarqi, Jasrah Irfan, Shakeel Ahmad, Altyeb A. Taha and Alhuseen Omar Alsayed
Diagnostics 2026, 16(5), 693; https://doi.org/10.3390/diagnostics16050693 - 26 Feb 2026
Viewed by 290
Abstract
Background/Objectives: Differentiating active tuberculosis (TB) from latent TB infection (LTBI) is an important diagnostic problem. Furthermore, current biomarkers offer minimal insight into disease pathogenesis to direct treatment. This motivated us to design a two-mode biomarker signature based on a multicohort transcriptomic analysis and a stringent machine learning pipeline. Methods: Active TB, latent TB, and healthy control samples were first passed through a rigorous filter (ANOVA, p < 0.001), and features were then selected using Boruta-XGBoost and LASSO regression. This yielded a compact four-gene signature (TAP2, SORT1, WARS, and ANKRD22), which was selectively and highly upregulated in the active TB clinical state (p < 0.001). An ensemble stacking classifier based on this signature (Random Forest and XGBoost) achieved very high diagnostic performance (ROC-AUC = 0.991 (95% CI: 0.983–0.997)) in the stratification of infection phases, which was strongly confirmed in another cohort (GSE19444). Results: Importantly, functional pathway analysis showed that all four genes map to core dysregulated host pathways in active TB: antigen presentation (TAP2), lipid trafficking (SORT1), interferon response (WARS), and inflammasome signaling (ANKRD22). The signature therefore has a dual advantage: (1) a high-specificity, non-sputum transcriptional diagnostic for active TB, and (2) a mechanistic map of key host pathways that identifies targets for intervention. Conclusions: The signature thus serves two purposes: it is a biomarker panel aligned with WHO performance targets for TB triage, and it offers a mechanistic basis for therapy, providing a straightforward path for translating transcriptomic discovery into clinical action against TB.
(This article belongs to the Section Machine Learning and Artificial Intelligence in Diagnostics)

21 pages, 6212 KB  
Article
Coastal Soil Salinity Inversion Using UAV Multispectral Imagery and an Interpretable Stacking Algorithm
by Xianfeng Hu, Dongfeng Han, Quan Qin, Yanhong Que, Han Wang, Donghan Feng, Rui Chen, Jinkui Duan, Yanpeng Li and Feng Li
Remote Sens. 2026, 18(5), 671; https://doi.org/10.3390/rs18050671 - 24 Feb 2026
Viewed by 300
Abstract
Accurate and timely monitoring of soil salinity is essential for the sustainable management and remediation of coastal salinization. This study utilized a UAV-based remote sensing platform to collect multispectral imagery and concurrent in situ soil salinity samples from an experimental zone within the Yellow River Delta National Nature Reserve in July 2024. We constructed multiple spectral indices and employed advanced feature selection methods (VIP, MultiSURF, and PSO-SFLA) to identify the most informative index combination. We established a soil salinity retrieval model utilizing a stacking ensemble framework. This architecture integrated TabPFN, SVM, and Ridge regression as the base learners, while employing XGBoost as the meta-learner to synthesize the final predictions. Model interpretability was assessed using SHAP (SHapley Additive exPlanations) values, while predictive performance was evaluated using the coefficient of determination (R²), Standardized Root Mean Square Error (SRMSE), and the Ratio of Performance to Deviation (RPD). Results indicate that the stacking model, when coupled with PSO-SFLA for feature selection, outperformed all other model configurations. It achieved the highest prediction accuracy on the test set, with an R² of 0.754, SRMSE of 0.310, and RPD of 1.941. The resulting soil salinity distribution map exhibited a high degree of spatial agreement with the ground-truth survey data. This study demonstrates that leveraging a stacking algorithm with UAV multispectral data provides an accurate and reliable method for monitoring soil salinity in coastal wetlands, offering valuable technical support for effective soil salinization management.
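Two of the evaluation metrics named in the abstract above can be sketched from their common textbook definitions. The abstract does not give formulas, so this sketch makes assumptions: RPD is taken as the sample standard deviation of the observed values divided by the RMSE, which is one common convention and may differ from the paper's. SRMSE is left out because its normalization is not stated. The data values are invented for illustration only.

```python
import math
import statistics


def rmse(observed, predicted):
    """Root mean square error between observed and predicted values."""
    squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def rpd(observed, predicted):
    """Ratio of performance to deviation, assumed here as sample SD / RMSE."""
    return statistics.stdev(observed) / rmse(observed, predicted)


# Invented salinity-like values, for illustration only.
observed = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.1, 1.9, 3.2, 3.8, 5.0]

print("R^2 :", round(r_squared(observed, predicted), 4))
print("RMSE:", round(rmse(observed, predicted), 4))
print("RPD :", round(rpd(observed, predicted), 4))
```

Under this convention a larger RPD means the prediction error is small relative to the natural spread of the observations.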

25 pages, 1245 KB  
Article
Machine Learning-Driven Intrusion Detection for Securing IoT-Based Wireless Sensor Networks
by Yirga Yayeh Munaye, Abebaw Demelash Gebeyehu, Li-Chia Tai, Zemenu Alem Abebe, Aeneas Bekele Workneh, Robel Berie Tarekegn, Yenework Belayneh Chekol and Getaneh Berie Tarekegn
Future Internet 2026, 18(2), 113; https://doi.org/10.3390/fi18020113 - 21 Feb 2026
Cited by 1 | Viewed by 321
Abstract
Wireless sensor networks (WSNs) have become a critical component of modern Internet of Things (IoT) infrastructures; however, their constrained resources and distributed deployment expose them to various cyber threats. In this work, we present a machine learning-driven intrusion detection framework optimized for WSN-based IoT environments. The proposed approach employs the WSN-DS benchmark dataset and integrates adaptive synthetic sampling (ADASYN) to address class imbalance, followed by a hybrid feature selection strategy combining Feature Importance Selection (FIS) and Recursive Feature Elimination (RFE) to reduce dimensionality and improve learning efficiency. An XGBoost classifier is then trained using five-fold cross-validation to ensure robust generalization. The experimental results demonstrate that the proposed framework significantly outperforms baseline methods, achieving an overall accuracy of 99.87%, with substantial gains in terms of F1-score, precision, and recall. Comparative analysis against recent WSN-DS studies confirms the effectiveness of combining imbalance correction, optimized feature selection, and ensemble learning. These findings highlight the potential of the proposed model as a lightweight and highly accurate intrusion detection solution for emerging WSN-IoT deployments.
(This article belongs to the Special Issue Machine Learning and Internet of Things in Industry 4.0)
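The precision, recall, and F1-score reported in the abstract above follow from confusion counts, as the minimal sketch below shows. The helper name `precision_recall_f1` and the labels are hypothetical and unrelated to the WSN-DS data; treating one class as "positive" in a binary setting is an assumption made here for brevity.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for one class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Guard against empty denominators when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Invented imbalanced labels: 3 attack samples (1) among 10, for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]

accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print("accuracy:", accuracy)
print("precision, recall, F1:", precision, recall, f1)
```

On imbalanced data such as this toy example, accuracy and the per-class metrics can diverge, which is one reason all of them are usually reported together.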
