Search Results (666)

Search Parameters:
Keywords = XGBoost classifiers

23 pages, 7266 KiB  
Article
Intelligent ESG Evaluation for Construction Enterprises in China: An LLM-Based Model
by Binqing Cai, Zhukai Ye and Shiwei Chen
Buildings 2025, 15(15), 2710; https://doi.org/10.3390/buildings15152710 - 31 Jul 2025
Abstract
Environmental, social, and governance (ESG) evaluation has become increasingly critical for company sustainability assessments, especially for enterprises in the construction industry with a high environmental burden. However, existing methods face limitations in subjective evaluation, inconsistent ratings across agencies, and a lack of industry specificity. To address these limitations, this study proposes a large language model (LLM)-based intelligent ESG evaluation model designed specifically for construction enterprises in China. The model integrates three modules: (1) an ESG report information extraction module utilizing natural language processing and Chinese pre-trained language models to identify and classify ESG-relevant statements; (2) an ESG rating prediction module employing XGBoost regression with SHAP analysis to predict company ratings and quantify individual statement contributions; and (3) an ESG intelligent evaluation module combining knowledge graph construction with Qwen2.5 language models fine-tuned using Chain-of-Thought (CoT) prompting. Empirical validation demonstrates that the model achieves 93.33% accuracy in ESG rating classification and an R2 score of 0.5312. SHAP analysis reveals that environmental factors contribute most to rating predictions (38.7%), followed by governance (32.0%) and social dimensions (29.3%). The fine-tuned LLM integrated with the knowledge graph shows improved evaluation consistency, achieving 65% accuracy compared to 53.33% for standalone LLM approaches, a relative improvement of 21.88%. This study contributes to ESG evaluation methodology by providing an objective, industry-specific, and interpretable framework that enhances rating consistency and yields actionable insights for enterprise sustainability improvement, offering guidance for automated and intelligent ESG evaluation of construction enterprises while addressing critical gaps in current ESG practices.
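The rating-prediction module described above pairs a boosted-tree regressor with per-feature attributions. The sketch below illustrates the idea on synthetic data, with scikit-learn's GradientBoostingRegressor standing in for XGBoost and permutation importance standing in as a rough proxy for SHAP; the three features and their weights are invented for illustration, not taken from the paper:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
# Synthetic stand-in for per-company ESG dimension scores (E, S, G).
X = rng.normal(size=(200, 3))
# Ratings driven mostly by the first (environmental) column, echoing the paper's finding.
y = 0.6 * X[:, 0] + 0.3 * X[:, 1] + 0.1 * X[:, 2] + rng.normal(scale=0.1, size=200)

model = GradientBoostingRegressor(random_state=0).fit(X, y)
imp = permutation_importance(model, X, y, n_repeats=10, random_state=0)
# Relative contribution of each dimension to the predicted rating.
shares = imp.importances_mean / imp.importances_mean.sum()
```

In the paper's pipeline the inputs would be statement-level features extracted by the NLP module, and SHAP would give signed per-statement contributions rather than global importances.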

32 pages, 17155 KiB  
Article
Machine Learning Ensemble Methods for Co-Seismic Landslide Susceptibility: Insights from the 2015 Nepal Earthquake
by Tulasi Ram Bhattarai and Netra Prakash Bhandary
Appl. Sci. 2025, 15(15), 8477; https://doi.org/10.3390/app15158477 - 30 Jul 2025
Abstract
The Mw 7.8 Gorkha Earthquake of 25 April 2015 triggered over 25,000 landslides across central Nepal, with 4775 events concentrated in Gorkha District alone. Despite substantial advances in landslide susceptibility mapping, existing studies often overlook the compound role of post-seismic rainfall and lack robust spatial validation. To address this gap, we validated an ensemble machine learning framework for co-seismic landslide susceptibility modeling by integrating seismic, geomorphological, hydrological, and anthropogenic variables, including cumulative post-seismic rainfall. Using a balanced dataset of 4775 landslide and non-landslide instances, we evaluated the performance of Logistic Regression (LR), Random Forest (RF), and eXtreme Gradient Boosting (XGBoost) models through spatial cross-validation, SHapley Additive exPlanations (SHAP) explainability, and ablation analysis. The RF model outperformed all others, achieving an accuracy of 87.9% and a Receiver Operating Characteristic (ROC) Area Under the Curve (AUC) value of 0.94, while XGBoost closely followed (AUC = 0.93). Ensemble models collectively classified over 95% of observed landslides into High and Very High susceptibility zones, demonstrating strong spatial reliability. SHAP analysis identified elevation, proximity to fault, peak ground acceleration (PGA), slope, and rainfall as dominant predictors. Notably, the inclusion of post-seismic rainfall substantially improved recall and F1 scores in ablation experiments. Spatial cross-validation revealed the superior generalizability of ensemble models under heterogeneous terrain conditions. The findings underscore the value of integrating post-seismic hydrometeorological factors and spatial validation into susceptibility assessments. We recommend adopting ensemble models, particularly RF, for operational hazard mapping in earthquake-prone mountainous regions. 
Future research should explore the integration of dynamic rainfall thresholds and physics-informed frameworks to enhance early warning systems and climate resilience.
(This article belongs to the Section Earth Sciences)
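The spatial cross-validation this study relies on can be sketched with scikit-learn's GroupKFold, which holds out whole spatial blocks so that train and test folds never share terrain; the block assignment and the five predictors below are synthetic stand-ins, not the Gorkha data:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold, cross_val_score

rng = np.random.default_rng(1)
# Synthetic predictors, e.g. elevation, slope, PGA, rainfall, fault distance.
X = rng.normal(size=(300, 5))
y = (X[:, 0] + X[:, 3] + rng.normal(scale=0.5, size=300) > 0).astype(int)
blocks = rng.integers(0, 6, size=300)  # hypothetical spatial block id per sample

# Each fold holds out entire spatial blocks, preventing spatial leakage.
cv = GroupKFold(n_splits=3)
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y,
                         cv=cv, groups=blocks, scoring="roc_auc")
```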

22 pages, 579 KiB  
Article
Automated Classification of Crime Narratives Using Machine Learning and Language Models in Official Statistics
by Klaus Lehmann, Elio Villaseñor, Alejandro Pimentel, Javiera Preuss, Nicolás Berhó, Oswaldo Diaz and Ignacio Agloni
Stats 2025, 8(3), 68; https://doi.org/10.3390/stats8030068 - 30 Jul 2025
Abstract
This paper presents the implementation of a language model–based strategy for the automatic codification of crime narratives in the production of official statistics. To address the high workload and inconsistencies associated with manual coding, we developed and evaluated three models: an XGBoost classifier with bag-of-words and word-embedding features, an LSTM network using pretrained Spanish word embeddings as a language model, and a fine-tuned BERT language model (BETO). The deep learning models outperformed the traditional baseline, with BETO achieving the highest accuracy. The new ENUSC (Encuesta Nacional Urbana de Seguridad Ciudadana) workflow integrates the selected model into an API for automated classification, incorporating a certainty threshold to distinguish between cases suitable for automation and those requiring expert review. This hybrid strategy led to a 68.4% reduction in manual review workload while preserving high quality standards. This study represents the first documented application of deep learning to the automated classification of victimization narratives in official statistics, demonstrating its feasibility and impact in a real-world production environment. Our results show that deep learning can significantly improve the efficiency and consistency of crime statistics coding, offering a scalable solution for other national statistical offices.
(This article belongs to the Section Applied Statistics and Machine Learning Methods)
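The certainty-threshold routing in the ENUSC workflow can be illustrated in a few lines; the per-narrative class probabilities and the 0.9 threshold below are hypothetical, not the values used in production:

```python
import numpy as np

def route_cases(probabilities, threshold=0.9):
    """Split predicted cases into auto-coded vs. expert-review queues
    based on the classifier's top-class certainty."""
    probabilities = np.asarray(probabilities)
    top = probabilities.max(axis=1)
    auto = np.where(top >= threshold)[0]
    review = np.where(top < threshold)[0]
    return auto, review

# Hypothetical per-narrative class probabilities from a crime-code classifier.
probs = [[0.97, 0.02, 0.01],   # confident -> automated coding
         [0.55, 0.30, 0.15],   # uncertain -> expert review
         [0.10, 0.85, 0.05]]   # below threshold -> expert review
auto_idx, review_idx = route_cases(probs, threshold=0.9)
```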

20 pages, 732 KiB  
Review
AI Methods Tailored to Influenza, RSV, HIV, and SARS-CoV-2: A Focused Review
by Achilleas Livieratos, George C. Kagadis, Charalambos Gogos and Karolina Akinosoglou
Pathogens 2025, 14(8), 748; https://doi.org/10.3390/pathogens14080748 - 30 Jul 2025
Abstract
Artificial intelligence (AI) techniques—ranging from hybrid mechanistic–machine learning (ML) ensembles to gradient-boosted decision trees, support-vector machines, and deep neural networks—are transforming the management of seasonal influenza, respiratory syncytial virus (RSV), human immunodeficiency virus (HIV), and severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). Symptom-based triage models using eXtreme Gradient Boosting (XGBoost) and Random Forests, as well as imaging classifiers built on convolutional neural networks (CNNs), have improved diagnostic accuracy across respiratory infections. Transformer-based architectures and social media surveillance pipelines have enabled real-time monitoring of COVID-19. In HIV research, support-vector machines (SVMs), logistic regression, and deep neural network (DNN) frameworks advance viral-protein classification and drug-resistance mapping, accelerating antiviral and vaccine discovery. Despite these successes, persistent challenges remain—data heterogeneity, limited model interpretability, hallucinations in large language models (LLMs), and infrastructure gaps in low-resource settings. We recommend standardized open-access data pipelines and integration of explainable-AI methodologies to ensure safe, equitable deployment of AI-driven interventions in future viral-outbreak responses.
(This article belongs to the Section Viral Pathogens)

33 pages, 7261 KiB  
Article
Comparative Analysis of Explainable AI Methods for Manufacturing Defect Prediction: A Mathematical Perspective
by Gabriel Marín Díaz
Mathematics 2025, 13(15), 2436; https://doi.org/10.3390/math13152436 - 29 Jul 2025
Abstract
The increasing complexity of manufacturing processes demands accurate defect prediction and interpretable insights into the causes of quality issues. This study proposes a methodology integrating machine learning, clustering, and Explainable Artificial Intelligence (XAI) to support defect analysis and quality control in industrial environments. Using a dataset based on empirical industrial distributions, we train an XGBoost model to classify high- and low-defect scenarios from multidimensional production and quality metrics. The model demonstrates high predictive performance and is analyzed using five XAI techniques (SHAP, LIME, ELI5, PDP, and ICE) to identify the variables most strongly linked to defective outcomes. In parallel, we apply Fuzzy C-Means and K-means to segment production data into latent operational profiles, which are also interpreted using XAI to uncover process-level patterns. This approach provides both global and local interpretability, revealing consistent variables across predictive and structural perspectives. To the best of our knowledge, no prior study has combined supervised learning, unsupervised clustering, and XAI within a unified framework for manufacturing defect analysis. The results demonstrate that this integration enables a transparent, data-driven understanding of production dynamics, and the proposed hybrid approach supports the development of intelligent, explainable Industry 4.0 systems.
(This article belongs to the Special Issue Artificial Intelligence and Data Science, 2nd Edition)
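Of the five XAI techniques named above, a partial-dependence curve (PDP) is the simplest to sketch by hand: fix one feature at a grid of values for every sample and average the model's predicted defect probability. The data and the stand-in GradientBoostingClassifier below are synthetic, not the paper's industrial dataset:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(7)
# Synthetic production metrics; defects driven by the first metric.
X = rng.normal(size=(300, 4))
y = (X[:, 0] + rng.normal(scale=0.5, size=300) > 0.5).astype(int)
model = GradientBoostingClassifier(random_state=0).fit(X, y)

# Manual partial dependence for feature 0: clamp it to each grid value
# across all samples and average the predicted defect probability.
grid = np.linspace(-2, 2, 9)
pdp = []
for v in grid:
    Xv = X.copy()
    Xv[:, 0] = v
    pdp.append(model.predict_proba(Xv)[:, 1].mean())
pdp = np.array(pdp)
```

ICE curves are the same computation without the final averaging: one curve per sample instead of one mean curve.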

27 pages, 8755 KiB  
Article
Mapping Wetlands with High-Resolution Planet SuperDove Satellite Imagery: An Assessment of Machine Learning Models Across the Diverse Waterscapes of New Zealand
by Md. Saiful Islam Khan, Maria C. Vega-Corredor and Matthew D. Wilson
Remote Sens. 2025, 17(15), 2626; https://doi.org/10.3390/rs17152626 - 29 Jul 2025
Abstract
(1) Background: Wetlands are ecologically significant ecosystems that support biodiversity and contribute to essential environmental functions such as water purification, carbon storage and flood regulation. However, these ecosystems face increasing pressures from land-use change and degradation, prompting the need for scalable and accurate classification methods to support conservation and policy efforts. In this research, our motivation was to test whether high-spatial-resolution PlanetScope imagery can be used with pixel-based machine learning to support the mapping and monitoring of wetlands at a national scale. (2) Methods: This study compared four machine learning classification models—Random Forest (RF), XGBoost (XGB), Histogram-Based Gradient Boosting (HGB) and a Multi-Layer Perceptron Classifier (MLPC)—to detect and map wetland areas across New Zealand. All models were trained using eight-band SuperDove satellite imagery from PlanetScope, with a spatial resolution of ~3 m, and ancillary geospatial datasets representing topography and soil drainage characteristics, each of which is available globally. (3) Results: All four machine learning models performed well in detecting wetlands from SuperDove imagery and environmental covariates, with varying strengths. The highest accuracy was achieved using all eight image bands alongside features created from supporting geospatial data. For binary wetland classification, the highest F1 scores were recorded by XGB (0.73) and RF/HGB (both 0.72) when including all covariates. MLPC also showed competitive performance (wetland F1 score of 0.71), despite its relatively lower spatial consistency. However, every model over-predicted total wetland area at the national level, an issue that could be reduced by raising the classification probability threshold and applying spatial filtering. (4) Conclusions: The comparative analysis highlights the strengths and trade-offs of RF, XGB, HGB and MLPC models for wetland classification. While all four methods are viable, RF offers key advantages, including ease of deployment and transferability, positioning it as a promising candidate for scalable, high-resolution wetland monitoring across diverse ecological settings. Further work is required to verify small-scale wetlands (<~0.5 ha) and to add fine-spatial-scale covariates.
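The threshold adjustment used to curb over-prediction is a one-liner: raising the probability cut-off shrinks the predicted wetland area. The per-pixel probabilities below are invented for illustration:

```python
import numpy as np

def wetland_mask(probabilities, threshold):
    """Binary wetland map from per-pixel wetland probabilities;
    a higher threshold flags fewer pixels as wetland."""
    return np.asarray(probabilities) >= threshold

# Hypothetical per-pixel wetland probabilities from a classifier.
probs = np.array([0.35, 0.55, 0.62, 0.80, 0.95])
area_default = wetland_mask(probs, 0.5).sum()  # pixels flagged at the usual 0.5 cut
area_strict = wetland_mask(probs, 0.7).sum()   # fewer pixels at a stricter cut
```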

18 pages, 1498 KiB  
Article
A Proactive Predictive Model for Machine Failure Forecasting
by Olusola O. Ajayi, Anish M. Kurien, Karim Djouani and Lamine Dieng
Machines 2025, 13(8), 663; https://doi.org/10.3390/machines13080663 - 29 Jul 2025
Abstract
Unexpected machine failures in industrial environments lead to high maintenance costs, unplanned downtime, and safety risks. This study proposes a proactive predictive model using a hybrid of eXtreme Gradient Boosting (XGBoost) and Neural Networks (NN) to forecast machine failures. A synthetic dataset capturing recent breakdown history and time since last failure was used to simulate industrial scenarios. To address class imbalance, SMOTE and class weighting were applied, alongside a focal loss function to emphasize difficult-to-classify failures. The XGBoost model was tuned via GridSearchCV, while the NN model utilized ReLU-activated hidden layers with dropout. Evaluation using stratified 5-fold cross-validation showed that the NN achieved an F1-score of 0.7199 and a recall of 0.9545 for the minority class. XGBoost attained a higher PR AUC of 0.7126 and a more balanced precision–recall trade-off. Sample predictions demonstrated strong recall (100%) for failures but also a high false positive rate, with most prediction probabilities clustered between 0.50 and 0.55. Additional benchmarking against Logistic Regression, Random Forest, and SVM further confirmed the superiority of the proposed hybrid model. Model interpretability was enhanced using SHAP and LIME, confirming that recent breakdowns and time since last failure were key predictors. While the model effectively detects failures, further improvements in feature engineering and threshold tuning are recommended to reduce false alarms and boost decision confidence.
(This article belongs to the Section Machines Testing and Maintenance)
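Class weighting for rare failures can be sketched as follows, with scikit-learn's GradientBoostingClassifier and balanced sample weights standing in for the paper's XGBoost/SMOTE/focal-loss combination; the data are synthetic, with roughly 10% failures:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.utils.class_weight import compute_sample_weight

rng = np.random.default_rng(2)
# Synthetic imbalanced data: failures (class 1) are rare.
X = rng.normal(size=(400, 4))
y = (X[:, 0] + rng.normal(scale=0.3, size=400) > 1.3).astype(int)

# Up-weight the rare failure class instead of resampling it.
weights = compute_sample_weight("balanced", y)
clf = GradientBoostingClassifier(random_state=0).fit(X, y, sample_weight=weights)
recall = ((clf.predict(X) == 1) & (y == 1)).sum() / max((y == 1).sum(), 1)
```

In XGBoost itself the same effect is usually obtained with the `scale_pos_weight` parameter.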

27 pages, 2617 KiB  
Article
Monte Carlo Gradient Boosted Trees for Cancer Staging: A Machine Learning Approach
by Audrey Eley, Thu Thu Hlaing, Daniel Breininger, Zarindokht Helforoush and Nezamoddin N. Kachouie
Cancers 2025, 17(15), 2452; https://doi.org/10.3390/cancers17152452 - 24 Jul 2025
Abstract
Machine learning algorithms are commonly employed for classification and interpretation of high-dimensional data. The classification task is often broken down into two separate procedures, with different methods applied to achieve accurate results and produce interpretable outcomes: first, an effective subset of the high-dimensional features must be extracted, and then the selected subset is used to train a classifier. Gradient Boosted Trees (GBT) are ensemble models that, owing to their robustness, ability to model complex nonlinear interactions, and feature interpretability, are well suited to complex applications. XGBoost (eXtreme Gradient Boosting) is a high-performance implementation of GBT that incorporates regularization, parallel computation, and efficient tree pruning, making it an efficient, interpretable, and scalable classifier with potential applications to medical data analysis. In this study, a Monte Carlo Gradient Boosted Trees (MCGBT) model is proposed for both feature reduction and classification. The proposed MCGBT method was applied to a lung cancer dataset for feature identification and classification. The dataset contains 107 radiomics, which are quantitative imaging biomarkers extracted from CT scans. A reduced set of 12 radiomics was identified, and patients were classified into different cancer stages. A cancer-staging accuracy of 90.3% across 100 independent runs was achieved, on par with that obtained using the full set of 107 radiomics, enabling lean and deployable classifiers.
(This article belongs to the Section Cancer Informatics and Big Data)
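The Monte Carlo idea, repeating train/test splits and keeping the features that rank highly across most runs, can be sketched like this; GradientBoostingClassifier stands in for XGBoost, 20 synthetic features stand in for the 107 radiomics, and the top-5/80% cut-offs are illustrative choices, not the paper's:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
# Synthetic stand-in for radiomics: 20 features, only the first 3 informative.
X = rng.normal(size=(250, 20))
y = (X[:, 0] + X[:, 1] - X[:, 2] + rng.normal(scale=0.5, size=250) > 0).astype(int)

counts = np.zeros(20)
n_runs = 10  # the paper uses 100 independent runs; fewer here for speed
for seed in range(n_runs):
    Xtr, _, ytr, _ = train_test_split(X, y, test_size=0.3, random_state=seed)
    model = GradientBoostingClassifier(random_state=seed).fit(Xtr, ytr)
    # Count a feature as "selected" when it lands in this run's top 5.
    counts[np.argsort(model.feature_importances_)[-5:]] += 1

stable = np.where(counts >= 0.8 * n_runs)[0]  # features kept in most runs
```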

18 pages, 1154 KiB  
Article
Predicting Major Adverse Cardiovascular Events After Cardiac Surgery Using Combined Clinical, Laboratory, and Echocardiographic Parameters: A Machine Learning Approach
by Mladjan Golubovic, Velimir Peric, Marija Stosic, Vladimir Stojiljkovic, Sasa Zivic, Aleksandar Kamenov, Dragan Milic, Vesna Dinic, Dalibor Stojanovic and Milan Lazarevic
Medicina 2025, 61(8), 1323; https://doi.org/10.3390/medicina61081323 - 23 Jul 2025
Abstract
Background and Objectives: Despite significant advances in surgical techniques and perioperative care, major adverse cardiovascular events (MACE) remain a leading cause of postoperative morbidity and mortality in patients undergoing coronary artery bypass grafting and/or aortic valve replacement. Accurate preoperative risk stratification is essential yet often limited by models that overlook atrial mechanics and underutilized biomarkers. Materials and Methods: This study aimed to develop an interpretable machine learning model for predicting perioperative MACE by integrating clinical, biochemical, and echocardiographic features, with a particular focus on novel physiological markers. A retrospective cohort of 131 patients was analyzed. An Extreme Gradient Boosting (XGBoost) classifier was trained on a comprehensive feature set, and SHapley Additive exPlanations (SHAP) values were used to quantify each variable's contribution to model predictions. Results: In a stratified 80:20 train–test split, the model initially achieved an AUC of 1.00. Acknowledging the potential for overfitting in small datasets, additional validation was performed using 10 independent random splits and 5-fold cross-validation. These analyses yielded an average AUC of 0.846 ± 0.092 and an F1-score of 0.807 ± 0.096, supporting the model's stability and generalizability. The most influential predictors included total atrial conduction time, mitral and tricuspid annular orifice areas, and high-density lipoprotein (HDL) cholesterol. These variables, spanning electrophysiological, structural, and metabolic domains, significantly enhanced discriminative performance, even in patients with preserved left ventricular function. The model's transparency provides clinically intuitive insights into individual risk profiles, emphasizing the significance of non-traditional parameters in perioperative assessments. Conclusions: This study demonstrates the feasibility and potential clinical value of combining advanced echocardiographic, biochemical, and machine learning tools for individualized cardiovascular risk prediction. While promising, these findings require prospective validation in larger, multicenter cohorts before being integrated into routine clinical decision-making.
(This article belongs to the Section Intensive Care / Anesthesiology)
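The repeated-random-splits validation used to guard against a deceptively perfect single-split AUC can be sketched as follows; the cohort size matches the study's 131 patients, but the features and outcome are synthetic and GradientBoostingClassifier stands in for XGBoost:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)
# Synthetic stand-in for a 131-patient cohort with six clinical/echo features.
X = rng.normal(size=(131, 6))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.8, size=131) > 0).astype(int)

# A single 80:20 split can look perfect on a small cohort, so average
# test AUC over several independent stratified splits.
aucs = []
for seed in range(10):
    Xtr, Xte, ytr, yte = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed)
    model = GradientBoostingClassifier(random_state=0).fit(Xtr, ytr)
    aucs.append(roc_auc_score(yte, model.predict_proba(Xte)[:, 1]))

auc_mean, auc_std = float(np.mean(aucs)), float(np.std(aucs))
```

Reporting mean ± standard deviation over splits, as the study does, exposes the variance a single lucky split hides.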

20 pages, 22580 KiB  
Article
Life-Threatening Ventricular Arrhythmia Identification Based on Multiple Complex Networks
by Zhipeng Cai, Menglin Yu, Jiawen Yu, Xintao Han, Jianqing Li and Yangyang Qu
Electronics 2025, 14(15), 2921; https://doi.org/10.3390/electronics14152921 - 22 Jul 2025
Abstract
Ventricular arrhythmias (VAs) are critical cardiovascular diseases that require rapid and accurate detection. Conventional approaches relying on multi-lead ECG or deep learning models have limitations in computational cost, interpretability, and real-time applicability on wearable devices. To address these issues, a lightweight and interpretable framework based on multiple complex networks was proposed for the detection of life-threatening VAs using short-term single-lead ECG signals. The input signals were decomposed using the fixed-frequency-range empirical wavelet transform, and sub-bands were subsequently analyzed through multiscale visibility graphs, recurrence networks, cross-recurrence networks, and joint recurrence networks. Eight topological features were extracted and input into an XGBoost classifier for VA identification. Ten-fold cross-validation results on the MIT-BIH VFDB and CUDB databases demonstrated that the proposed method achieved a sensitivity of 99.02 ± 0.53%, a specificity of 98.44 ± 0.43%, and an accuracy of 98.73 ± 0.02% for 10 s ECG segments. The model also maintained robust performance on shorter segments, with 97.23 ± 0.76% sensitivity, 98.85 ± 0.95% specificity, and 96.62 ± 0.02% accuracy on 2 s segments. The results outperformed existing feature-based and deep learning approaches while preserving model interpretability. Furthermore, the proposed method supports mobile deployment, facilitating real-time use in wearable healthcare applications.
(This article belongs to the Special Issue Smart Bioelectronics, Wearable Systems and E-Health)
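One of the network mappings above, the natural visibility graph, is easy to state: two samples are connected when the straight line between them passes above every intermediate sample. A minimal pure-Python version follows; the five-point "signal" is a toy, not ECG data, and mean degree is just one example of a topological feature that could feed a classifier:

```python
import numpy as np

def visibility_edges(x):
    """Natural visibility graph of a 1-D signal: samples i and j are
    connected when the straight line between them clears every sample
    strictly between them."""
    n = len(x)
    edges = set()
    for i in range(n):
        for j in range(i + 1, n):
            if all(x[k] < x[i] + (x[j] - x[i]) * (k - i) / (j - i)
                   for k in range(i + 1, j)):
                edges.add((i, j))
    return edges

# Toy signal; real inputs would be wavelet sub-bands of 2-10 s ECG leads.
signal = np.array([1.0, 3.0, 2.0, 4.0, 1.5])
edges = visibility_edges(signal)
mean_degree = 2 * len(edges) / len(signal)  # a simple topological feature
```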

28 pages, 2139 KiB  
Article
An Improved Approach to DNS Covert Channel Detection Based on DBM-ENSec
by Xinyu Li, Xiaoying Wang, Guoqing Yang, Jinsha Zhang, Chunhui Li, Fangfang Cui and Ruize Gu
Future Internet 2025, 17(7), 319; https://doi.org/10.3390/fi17070319 - 21 Jul 2025
Abstract
The covert nature of DNS covert channels makes them a widely utilized method for data exfiltration by malicious attackers. In response to this challenge, the present study proposes a detection methodology for DNS covert channels that employs a Deep Boltzmann Machine with Enhanced Security (DBM-ENSec). A dataset was created by collecting malicious traffic associated with various DNS covert channel attacks. Time-dependent grouping features are excluded, and feature optimization is conducted on individual traffic data through feature selection and normalization to minimize redundancy, enhancing the differentiation and stability of the features. This process yields 23-dimensional features for each DNS packet. The extracted features are converted to grayscale images to improve the interpretability of the model and then fed into an improved Deep Boltzmann Machine for further optimization. The optimized features are then processed by an ensemble of classifiers (Random Forest, XGBoost, LightGBM, and CatBoost) for detection. Experimental results show that the proposed method achieves 99.92% accuracy in detecting DNS covert channels, with a validation accuracy of up to 98.52% on publicly available datasets.
(This article belongs to the Section Cybersecurity)
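The ensemble stage can be sketched with scikit-learn's soft-voting classifier; RandomForest, GradientBoosting, and ExtraTrees stand in here for the paper's RF/XGBoost/LightGBM/CatBoost combination, and the 23-dimensional inputs are random stand-ins for the per-packet DNS features:

```python
import numpy as np
from sklearn.ensemble import (ExtraTreesClassifier, GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(5)
# Synthetic stand-in for 23-dimensional per-packet DNS features.
X = rng.normal(size=(400, 23))
y = (X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=400) > 0).astype(int)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.25, random_state=0)

# Soft voting averages the members' predicted probabilities.
ensemble = VotingClassifier(
    estimators=[("rf", RandomForestClassifier(random_state=0)),
                ("gb", GradientBoostingClassifier(random_state=0)),
                ("et", ExtraTreesClassifier(random_state=0))],
    voting="soft").fit(Xtr, ytr)
accuracy = ensemble.score(Xte, yte)
```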

32 pages, 2182 KiB  
Article
Detection of Biased Phrases in the Wiki Neutrality Corpus for Fairer Digital Content Management Using Artificial Intelligence
by Abdullah, Muhammad Ateeb Ather, Olga Kolesnikova and Grigori Sidorov
Big Data Cogn. Comput. 2025, 9(7), 190; https://doi.org/10.3390/bdcc9070190 - 21 Jul 2025
Abstract
Detecting biased language in large-scale corpora, such as the Wiki Neutrality Corpus, is essential for promoting neutrality in digital content. This study systematically evaluates a range of machine learning (ML) and deep learning (DL) models for the detection of biased and pre-conditioned phrases. Conventional classifiers, including Extreme Gradient Boosting (XGBoost), Light Gradient-Boosting Machine (LightGBM), and Categorical Boosting (CatBoost), are compared with advanced neural architectures such as Bidirectional Encoder Representations from Transformers (BERT), Long Short-Term Memory (LSTM) networks, and Generative Adversarial Networks (GANs). A novel hybrid architecture is proposed, integrating DistilBERT, LSTM, and GANs within a unified framework. Extensive experimentation with the intermediate variants DistilBERT + LSTM (without GAN) and DistilBERT + GAN (without LSTM) demonstrates that the fully integrated model consistently outperforms all alternatives. The proposed hybrid model achieves a cross-validation accuracy of 99.00%, significantly surpassing traditional baselines such as XGBoost (96.73%) and LightGBM (96.83%). It also exhibits superior stability, statistical significance (paired t-tests), and favorable trade-offs between performance and computational efficiency. The results underscore the potential of hybrid deep learning models for capturing subtle linguistic bias and advancing more objective and reliable automated content moderation systems.
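The paired t-test the authors use for significance compares per-fold scores of two models evaluated on the same cross-validation folds. The fold accuracies below are hypothetical numbers in the vicinity of the reported means, not the paper's data:

```python
import numpy as np
from scipy.stats import ttest_rel

# Hypothetical per-fold CV accuracies for two models on identical folds.
hybrid = np.array([0.990, 0.988, 0.992, 0.989, 0.991])
xgb_baseline = np.array([0.967, 0.969, 0.966, 0.968, 0.965])

# Paired test: folds are matched, so the test acts on per-fold differences.
stat, p_value = ttest_rel(hybrid, xgb_baseline)
```

A paired test is the right choice here because the two accuracy samples share the fold structure and are therefore not independent.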

21 pages, 6005 KiB  
Article
Archetype Identification and Energy Consumption Prediction for Old Residential Buildings Based on Multi-Source Datasets
by Chengliang Fan, Rude Liu and Yundan Liao
Buildings 2025, 15(14), 2573; https://doi.org/10.3390/buildings15142573 - 21 Jul 2025
Abstract
Assessing energy consumption in existing old residential buildings is key for urban energy conservation and decarbonization. Previous studies on old residential building energy assessment face challenges due to data limitations and inadequate prediction methods. This study develops a novel approach integrating building energy simulation and machine learning to predict large-scale old residential building energy use using multi-source datasets. Using Guangzhou as a case study, open-source building data was collected to identify 31,209 old residential buildings based on age thresholds and areas of interest (AOIs). Key building form parameters (i.e., long side, short side, number of floors) were then classified to identify residential archetypes. Building energy consumption data for each prototype was generated using EnergyPlus (V23.2.0) simulations. Furthermore, XGBoost and Random Forest machine learning algorithms were used to predict city-scale old residential building energy consumption. Results indicated that five representative prototypes exhibited cooling energy use ranging from 17.32 to 21.05 kWh/m2, while annual electricity consumption ranged from 60.10 to 66.53 kWh/m2. The XGBoost model demonstrated strong predictive performance (R2 = 0.667). SHAP (Shapley Additive Explanations) analysis identified the Building Shape Coefficient (BSC) as the most significant positive predictor of energy consumption (SHAP value = 0.79). This framework enables city-level energy assessment for old residential buildings, providing critical support for retrofitting strategies in sustainable urban renewal planning.
(This article belongs to the Special Issue Enhancing Building Resilience Under Climate Change)
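The surrogate-modelling step described in this abstract — fit a gradient-boosted regressor on building-form parameters, then attribute predictions to features — can be sketched as follows. This is a minimal illustration on synthetic data: the form parameters and target are invented, scikit-learn's `GradientBoostingRegressor` stands in for XGBoost, and permutation importance is used as a lightweight proxy for SHAP attribution; none of this reproduces the paper's EnergyPlus-derived dataset or exact pipeline.

```python
# Sketch: simulation-to-surrogate workflow with feature attribution.
# Synthetic data; GradientBoostingRegressor stands in for XGBoost and
# permutation importance for SHAP.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 500
# Hypothetical form parameters: long side, short side, floors, shape coefficient
long_side = rng.uniform(20, 80, n)
short_side = rng.uniform(8, 20, n)
floors = rng.integers(3, 12, n).astype(float)
bsc = rng.uniform(0.15, 0.45, n)  # Building Shape Coefficient (BSC)
X = np.column_stack([long_side, short_side, floors, bsc])
# Toy target: annual electricity use (kWh/m2), driven mostly by BSC
y = 55 + 40 * bsc + 0.05 * long_side - 0.3 * floors + rng.normal(0, 1, n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr)
r2 = model.score(X_te, y_te)

# Permutation importance: how much held-out R2 drops when a feature is shuffled
imp = permutation_importance(model, X_te, y_te, random_state=0)
top_feature = int(np.argmax(imp.importances_mean))  # index 3 = BSC by construction
```

Because the toy target is dominated by the BSC term, the attribution step singles out BSC, mirroring the paper's finding that BSC was the strongest positive predictor.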
20 pages, 5236 KiB  
Article
Leakage Detection in Subway Tunnels Using 3D Point Cloud Data: Integrating Intensity and Geometric Features with XGBoost Classifier
by Anyin Zhang, Junjun Huang, Zexin Sun, Juju Duan, Yuanai Zhang and Yueqian Shen
Sensors 2025, 25(14), 4475; https://doi.org/10.3390/s25144475 - 18 Jul 2025
Abstract
Detecting leakage using a point cloud acquired by mobile laser scanning (MLS) presents significant challenges, particularly from within three-dimensional space. These challenges primarily arise from the prevalence of noise in tunnel point clouds and the difficulty in accurately capturing the three-dimensional morphological characteristics of leakage patterns. To address these limitations, this study proposes a classification method based on an XGBoost classifier, integrating both intensity and geometric features. The proposed methodology comprises the following steps: First, a RANSAC algorithm is employed to filter out noise from tunnel objects, such as facilities, tracks, and bolt holes, which exhibit intensity values similar to leakage. Next, intensity features are extracted to facilitate the initial separation of leakage regions from the tunnel lining. Subsequently, geometric features derived from the k-neighborhood are incorporated to complement the intensity features, enabling more effective segmentation of leakage from the lining structures. The optimal neighborhood scale is determined by selecting the scale that yields the highest F1-score for leakage across the multiple evaluated scales. Finally, the XGBoost classifier is applied to the binary classification task of distinguishing leakage from tunnel lining. Experimental results demonstrate that the integration of geometric features significantly enhances leakage detection accuracy, achieving F1-scores of 91.18% and 97.84% on two evaluated datasets, respectively. The consistent performance across four heterogeneous datasets indicates the robust generalization capability of the proposed methodology. Comparative analysis further shows that XGBoost outperforms other classifiers, such as Random Forest, AdaBoost, LightGBM, and CatBoost, in terms of the balance between accuracy and computational efficiency.
Moreover, compared to deep learning models, including PointNet, PointNet++, and DGCNN, the proposed method demonstrates superior performance in both detection accuracy and computational efficiency.
(This article belongs to the Special Issue Application of LiDAR Remote Sensing and Mapping)
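The core classification step in this abstract — combine per-point intensity with a k-neighborhood geometric feature and feed both to a boosted classifier — can be sketched as follows. This is an illustration on synthetic points, not the paper's method: the point cloud, intensity distributions, and "roughness" feature are invented, and scikit-learn's `GradientBoostingClassifier` stands in for XGBoost; the RANSAC pre-filtering and neighborhood-scale selection steps are omitted.

```python
# Sketch: leakage-vs-lining binary classification from intensity plus a
# simple k-neighborhood geometric feature. Synthetic data;
# GradientBoostingClassifier stands in for XGBoost.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(1)
n = 600
labels = rng.integers(0, 2, n)  # 1 = leakage, 0 = lining
# Leakage points are assumed darker (water absorbs laser energy)
intensity = np.where(labels == 1,
                     rng.normal(0.3, 0.12, n),   # leakage
                     rng.normal(0.6, 0.12, n))   # lining
xyz = rng.uniform(0, 10, (n, 3))
xyz[:, 2] += np.where(labels == 1, rng.normal(0, 0.3, n), 0.0)  # leakage roughens surface

# Geometric feature from the k-neighborhood: local height roughness
k = 10
nn = NearestNeighbors(n_neighbors=k).fit(xyz)
_, idx = nn.kneighbors(xyz)
roughness = xyz[idx, 2].std(axis=1)

X = np.column_stack([intensity, roughness])
X_tr, X_te, y_tr, y_te = train_test_split(X, labels, random_state=0, stratify=labels)
clf = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
f1 = f1_score(y_te, clf.predict(X_te))
```

In the paper, such geometric features are computed at the neighborhood scale that maximizes the leakage F1-score; here a single fixed `k` is used for brevity.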
15 pages, 3326 KiB  
Article
Radiomics and Machine Learning Approaches for the Preoperative Classification of In Situ vs. Invasive Breast Cancer Using Dynamic Contrast-Enhanced Magnetic Resonance Imaging (DCE–MRI)
by Luana Conte, Rocco Rizzo, Alessandra Sallustio, Eleonora Maggiulli, Mariangela Capodieci, Francesco Tramacere, Alessandra Castelluccia, Giuseppe Raso, Ugo De Giorgi, Raffaella Massafra, Maurizio Portaluri, Donato Cascio and Giorgio De Nunzio
Appl. Sci. 2025, 15(14), 7999; https://doi.org/10.3390/app15147999 - 18 Jul 2025
Abstract
Accurate preoperative distinction between in situ and invasive Breast Cancer (BC) is critical for clinical decision-making and treatment planning. Radiomics and Machine Learning (ML) have shown promise in enhancing diagnostic performance from breast MRI, yet their application to this specific task remains underexplored. The aim of this study was to evaluate the performance of several ML classifiers, trained on radiomic features extracted from DCE–MRI and supported by basic clinical information, for the classification of in situ versus invasive BC lesions. In this study, we retrospectively analysed 71 post-contrast DCE–MRI scans (24 in situ, 47 invasive cases). Radiomic features were extracted from manually segmented tumour regions using the PyRadiomics library, and a limited set of basic clinical variables was also included. Several ML classifiers were evaluated in a Leave-One-Out Cross-Validation (LOOCV) scheme. Feature selection was performed using two strategies: Minimum Redundancy Maximum Relevance (MRMR) and mutual information. Axial 3D rotation was used for data augmentation. Support Vector Machine (SVM), K-Nearest Neighbors (KNN), Random Forest (RF), and Extreme Gradient Boosting (XGBoost) were the best-performing models, with an Area Under the Curve (AUC) ranging from 0.77 to 0.81. Notably, KNN achieved the best balance between sensitivity and specificity without the need for data augmentation. Our findings confirm that radiomic features extracted from DCE–MRI, combined with well-validated ML models, can effectively support the differentiation of in situ vs. invasive breast cancer. This approach is robust even on small datasets and may aid in improving preoperative planning. Further validation on larger cohorts and integration with additional imaging or clinical data are recommended.
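The evaluation scheme this abstract describes — feature selection by mutual information nested inside a Leave-One-Out loop, scored with AUC — can be sketched as follows. This is a toy reconstruction: the "radiomic" features are synthetic (the real ones come from PyRadiomics), the cohort size merely mimics the study's 71 cases with a roughly 24/47 class split, and only one of the reported classifiers (KNN) with one selection strategy (mutual information) is shown.

```python
# Sketch: mutual-information feature selection nested inside LOOCV,
# so the held-out case never influences which features are kept.
from functools import partial

import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import LeaveOneOut
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 71 lesions, 50 candidate features, ~34%/66% class split
X, y = make_classification(n_samples=71, n_features=50, n_informative=8,
                           weights=[0.34, 0.66], class_sep=1.5, random_state=0)

pipe = make_pipeline(
    StandardScaler(),
    SelectKBest(partial(mutual_info_classif, random_state=0), k=10),
    KNeighborsClassifier(n_neighbors=5),
)

# One prediction per case, each made by a model that never saw that case
scores = np.zeros(len(y))
for train_idx, test_idx in LeaveOneOut().split(X):
    pipe.fit(X[train_idx], y[train_idx])
    scores[test_idx] = pipe.predict_proba(X[test_idx])[:, 1]
auc = roc_auc_score(y, scores)
```

Fitting the selector inside each fold (rather than once on all 71 cases) is what keeps the LOOCV estimate honest on a dataset this small.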