
Search Results (1,045)

Search Parameters:
Keywords = oversampling

36 pages, 3614 KB  
Article
Sentiment Classification of Amazon Product Reviews Based on Machine and Deep Learning Techniques: A Comparative Study
by Eman Daraghmi and Noora Zyadeh
Future Internet 2026, 18(3), 138; https://doi.org/10.3390/fi18030138 - 7 Mar 2026
Abstract
Sentiment classification plays a crucial role in analyzing customer feedback to identify market trends, enhance product recommendations, and improve customer satisfaction. This study focuses on sentiment analysis of Amazon reviews using two major datasets—Fine Food Reviews and Unlocked Mobile Reviews—which exhibit label imbalance. To address this challenge, both oversampling and undersampling techniques were applied to balance the datasets. Various machine learning (ML) algorithms, including Random Forest (RF), Logistic Regression (LR), Support Vector Machine (SVM), Naïve Bayes (NB), and Gradient Boosting Machine (GBM), as well as deep learning (DL) models such as Convolutional Neural Network (CNN), Long Short-Term Memory (LSTM), and transformer-based models like RoBERTa, were implemented. After data cleaning and preprocessing, models were trained, and performance was evaluated. The results indicate that oversampling significantly enhances classification accuracy, particularly for the Fine Food dataset. Among ML models, Random Forest achieved the highest accuracy due to its ensemble approach and robustness in handling high-dimensional data. DL models, particularly RoBERTa, also demonstrated superior performance owing to their capacity to capture contextual dependencies. The findings emphasize the importance of data balancing for optimal sentiment analysis and contribute valuable insights toward advancing automated opinion classification in e-commerce applications. Full article
(This article belongs to the Section Big Data and Augmented Intelligence)
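Many of the results in this listing, including the study above, balance skewed class distributions with SMOTE-style oversampling before training. As a minimal sketch of the underlying interpolation idea — not any paper's actual implementation; the function name, toy data, and parameters are ours — synthetic minority samples can be generated with numpy alone:

```python
import numpy as np

def smote_oversample(X_min, n_new, k=5, rng=None):
    """Generate n_new synthetic minority samples by SMOTE-style
    interpolation: each new point lies on the segment between a
    minority sample and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(rng)
    n = len(X_min)
    # pairwise squared distances within the minority class
    d2 = ((X_min[:, None, :] - X_min[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)            # exclude self-matches
    nn = np.argsort(d2, axis=1)[:, :k]      # k nearest-neighbour indices
    base = rng.integers(0, n, size=n_new)           # anchor samples
    nbr = nn[base, rng.integers(0, k, size=n_new)]  # one random neighbour each
    gap = rng.random((n_new, 1))                    # interpolation factor in [0, 1)
    return X_min[base] + gap * (X_min[nbr] - X_min[base])

# toy minority class of 20 points in 2-D, oversampled to 200 total
rng = np.random.default_rng(0)
X_min = rng.normal(loc=2.0, size=(20, 2))
X_syn = smote_oversample(X_min, n_new=180, k=5, rng=1)
print(X_syn.shape)  # → (180, 2)
```

Because each synthetic point is a convex combination of two real minority points, the augmented set stays inside the minority class's bounding region.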
30 pages, 2628 KB  
Article
Predicting Bond Defaults in China: A Double-Ensemble Model Leveraging SMOTE for Class Imbalance
by Chongwen Tian and Rong Li
Big Data Cogn. Comput. 2026, 10(3), 81; https://doi.org/10.3390/bdcc10030081 - 6 Mar 2026
Abstract
This study proposes the Double-Ensemble Learning Classification with SMOTE (DELC-SMOTE), a novel hierarchical framework designed to address the critical challenge of severe class imbalance in financial bond default prediction. The model integrates the Synthetic Minority Over-sampling Technique (SMOTE) into a two-phase ensemble architecture. The first phase employs introspective stacking, where six heterogeneous base learners are individually enhanced through algorithm-specific balancing and meta-learning. The second phase fuses these optimized experts via performance-weighted voting. Empirical analysis utilizes a comprehensive dataset of 10,440 Chinese corporate bonds (522 defaults, ~5% default rate) sourced from Wind and CSMAR databases. Given the high cost of both false negatives and false positives in risk assessment, the Geometric Mean (G-mean) and Specificity are employed as primary evaluation metrics. Results demonstrate that the proposed DELC-SMOTE model significantly outperforms individual base classifiers and benchmark ensemble variants, achieving a G-mean of 0.9152 and a Specificity of 0.8715 under the primary experimental setting. The model exhibits robust performance across varying imbalance ratios (2%, 10%, 20%) and strong resilience against data noise, perturbations, and outliers. These findings indicate that the synergistic integration of data-level resampling within a diversified, two-tiered ensemble structure effectively mitigates class imbalance bias and enhances predictive reliability. The framework offers a robust and generalizable tool for actionable default risk assessment in imbalanced financial datasets. Full article
(This article belongs to the Section Data Mining and Machine Learning)
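The abstract above fuses base learners by performance-weighted voting and evaluates with the G-mean. The paper's full two-phase DELC-SMOTE architecture is not reproduced here; the sketch below illustrates only the metric and the weighted-fusion step, with hypothetical validation predictions of our own invention:

```python
import numpy as np

def g_mean(y_true, y_pred):
    """Geometric mean of sensitivity and specificity, a standard
    metric for imbalanced problems such as bond-default prediction."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    sens = tp / max(np.sum(y_true == 1), 1)
    spec = tn / max(np.sum(y_true == 0), 1)
    return np.sqrt(sens * spec)

def weighted_vote(probas, weights):
    """Fuse per-model positive-class probabilities, weighting each
    model in proportion to its validation score."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return (np.asarray(probas) * w[:, None]).sum(axis=0)

# three hypothetical base learners scored on a small validation fold
y_val = np.array([0, 0, 0, 0, 1, 1])
preds = [np.array([0, 0, 1, 0, 1, 1]),
         np.array([0, 0, 0, 0, 1, 0]),
         np.array([0, 1, 0, 0, 0, 1])]
weights = [g_mean(y_val, p) for p in preds]          # G-mean per model
fused = weighted_vote([p.astype(float) for p in preds], weights)
print(np.round(fused, 2))
```

Models that balance sensitivity and specificity well get more say in the fused score, which is the intuition behind the paper's second-phase fusion.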
18 pages, 308 KB  
Article
Can We Predict Adductor Strain? A Predictive Analysis of a Major League Soccer (MLS) Cohort Spanning from 2019 to 2022
by Rebecca Davis, Benjamin C. Brewer, Martha Hall and Jill S. Higginson
J. Funct. Morphol. Kinesiol. 2026, 11(1), 108; https://doi.org/10.3390/jfmk11010108 - 5 Mar 2026
Viewed by 86
Abstract
Background: Despite the high prevalence of adductor injury in soccer, there is limited injury-specific predictive modeling to identify common risk factors. The objective of this study was to create an adductor strain prediction model utilizing injury, game, and performance data collected from a cohort of professional Major League Soccer (MLS) players. Methods: We identified potential risk factors for soft tissue, non-contact adductor strain using a predictive machine learning model framework. Performance and injury data were collected between the 2019 and 2022 seasons of one professional MLS team. We utilized Random Forest (RF) machine learning models with Synthetic Minority Oversampling (SMOTE) to predict soft tissue, non-contact adductor strain injury amongst the cohort. Features implemented in the model included injury, game, and performance data. Results: Of the four models constructed in this study, the best-performing model, as determined by F1 score, included Catapult Global Positioning System (GPS)/Inertial Measurement Unit (IMU), strength, injury, and game data using a weekly structure. Multiple models indicated that not having a previous injury lowers the odds of a future injury in the following week or month. Forwards had greater odds of injury, whereas defenders had lower odds of injury. Greater hamstring max force lowered the odds of injury, whereas a greater number of change-of-direction efforts increased the odds of injury in the following week or month. The adductor-to-abductor max strength ratio showed conflicting results regarding the odds of future injury. Conclusions: Through the use of RF and SMOTE, we successfully predicted adductor injuries in an MLS cohort using injury, game, and performance metrics. Validation in a larger cohort is highly recommended before the findings of this study are used in the design of injury prevention protocols. Full article
21 pages, 938 KB  
Article
Beyond Linear Statistics: A Machine Learning Ecosystem for Early Screening of School Bullying
by Carlos Alberto Espinosa-Pinos, Paúl Bladimir Acosta-Pérez, Aitor Larzabal-Fernández and Francisco Sebastián Vaca-Pinto
Information 2026, 17(3), 260; https://doi.org/10.3390/info17030260 - 5 Mar 2026
Viewed by 125
Abstract
This study developed and validated a Machine Learning (ML) ecosystem for the early screening of school victimization among Ecuadorian adolescents, a phenomenon that poses a critical barrier to educational equity. Addressing previous methodological limitations, this research intentionally eliminated circular reasoning by excluding all internal psychometric items from the feature set, focusing strictly on sixteen socio-environmental and demographic predictors. A quantitative study was conducted with 1413 students in the province of Tungurahua, utilizing the Synthetic Minority Over-sampling Technique (SMOTE) to correct class imbalance. Supervised classification algorithms, including SVM, Random Forest, and XGBoost, were compared. The results demonstrated that the Random Forest model achieved the most balanced performance, reaching an Accuracy of 60.3% and a Macro F1-score of 0.382. Feature importance analysis identified household structure (Living_With_Monoparental) and Family_Coping_Capacity as the most significant predictors of high-risk profiles. These findings provided a statistically honest and ecologically valid tool for Student Counseling Departments (DECE), enabling a transition toward proactive risk identification grounded in observable social vulnerability rather than reactive symptom reporting. Full article
(This article belongs to the Section Artificial Intelligence)
28 pages, 953 KB  
Article
Proactive Proctoring: A Critical Analysis of Machine Learning Architectures and Custom Temporal Data Sets for Moodle Fraud Detection
by Andrei-Nicolae Vacariu, Marian Bucos, Marius Otesteanu and Bogdan Dragulescu
Appl. Sci. 2026, 16(5), 2381; https://doi.org/10.3390/app16052381 - 28 Feb 2026
Viewed by 163
Abstract
This paper examines the use of Machine Learning (ML) approaches in maintaining academic integrity using the information provided in the Moodle system logs. The paper focuses on data set construction, handling the issue of class imbalance, and the assessment of the performance of different ML models in uncovering academic fraud. Twelve different data sets were created by using the concept of temporal windows (e.g., one-day and three-day windows) during the feature extraction stage from the Moodle system logs. The manual labeling of the data sets was done based on a predefined set of rules that outline the fraudulent activities. The issue of class imbalance was treated using eleven different resampling approaches, such as SMOTE, ADASYN, Tomek Links, and NearMiss. We evaluated six classification algorithms, thus resulting in a total of 792 experiments based on the interactions between the data sets, resampling methods, and classification algorithms. The results from the experiment show that the Random Forest and AdaBoost models performed the best in the experiment. Furthermore, we observed a trade-off between fraud detection rates and model precision based on the temporal windows and resampling methods. The shortest temporal windows and hybrid undersampling approaches resulted in the maximum recall value in this study and could identify the greatest number of at-risk students. On the other hand, the longest temporal windows and hybrid oversampling approaches with data cleaning resulted in the best results in terms of F1-Score and Cohen’s Kappa statistics. The results provide conclusive evidence that the models can identify fraud; however, they should be used as predictive models for the improvement of proctoring approaches, such as random selection for verification or seating arrangement strategies, instead of judgment models. Full article
(This article belongs to the Section Computing and Artificial Intelligence)
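Among the eleven resampling approaches the abstract lists is Tomek Links, an undersampling step that removes majority-class points sitting directly against the minority class. A numpy-only sketch of the idea (function name and toy data are ours, not the paper's):

```python
import numpy as np

def tomek_links(X, y):
    """Indices of majority-class (label 0) points forming Tomek links:
    mutual nearest-neighbour pairs with opposite labels. Removing them
    cleans the class boundary before or after oversampling."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)
    nn = d2.argmin(axis=1)                  # each point's 1-nearest neighbour
    drop = [i for i in range(len(X))
            if y[i] == 0 and y[nn[i]] == 1 and nn[nn[i]] == i]
    return np.array(drop, dtype=int)

# toy data: one majority point sits right next to the lone minority point
X = np.array([[0.0], [0.1], [5.0], [6.0], [7.0]])
y = np.array([1, 0, 0, 0, 0])   # 1 = fraudulent session (minority)
print(tomek_links(X, y))        # → [1]
```

Only the boundary-hugging majority point (index 1) is flagged; the well-separated majority cluster is untouched, which is why Tomek Links is often paired with oversampling rather than used alone.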
34 pages, 7649 KB  
Article
SMOTE-Data-Augmented Machine Learning for Enhancing Individual Tree Biomass Estimation Using UAV LiDAR
by Sina Jarahizadeh and Bahram Salehi
Remote Sens. 2026, 18(5), 729; https://doi.org/10.3390/rs18050729 - 28 Feb 2026
Viewed by 200
Abstract
Estimating individual tree Above-Ground Biomass (AGB) is essential for assessing ecological functions and carbon storage in both forest and urban environments. Traditional field-based methods, such as plot measurements, are costly and impractical for large-scale applications, while satellite- and aerial-based techniques lack the spatial resolution for individual-tree-level analysis. Unmanned Aerial Vehicle (UAV) Light Detection and Ranging (LiDAR) data, combined with machine learning (ML), offers a powerful alternative for detailed tree structure measurement and AGB estimation. Leveraging advances in deep-learning-based individual tree detection and geometric structure estimation, including Height (H), Surface Area (SA), Volume (V), and Crown Width (CW), this study develops ML regression models for estimating individual tree AGB. We pursue three objectives: (1) evaluating four regression models, namely Random Forest (RF), Extreme Gradient Boosting (XGBoost), Support Vector Machine (SVM), and Feed-Forward Neural Network (FFNN); (2) assessing the sensitivity of model accuracy to different geometric feature combinations; and (3) improving model robustness using Synthetic Minority Over-sampling Technique (SMOTE) data augmentation to address imbalanced data. Results show that the RF model outperforms the others, achieving the lowest RMSE and the most balanced residual distribution. CW was the strongest single predictor of AGB and, in combination with H, yielded the most accurate results, improving RMSE and R2 by 14.2% and 89.3%, respectively, relative to single-variable models. Integrating SMOTE with RF further improved performance, lowering RMSE by 225.6 kg (~22.1%) and increasing R2 by 0.76 (~49.0%), particularly in the underrepresented low and high AGB ranges. The proposed RF-SMOTE approach is a cost-effective and scalable way to generate high-quality ground truth data for large-scale satellite-based biomass estimation, supporting forest carbon accounting and planning in cities and forests. Full article
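SMOTE is defined for classification, and the abstract does not state how the authors adapt it to a continuous AGB target. One common workaround, sketched here purely as an assumption (function name and toy data are ours), is to bin the target and oversample the under-populated bins so rare low/high-biomass trees are seen more often during training:

```python
import numpy as np

def oversample_rare_bins(X, y, n_bins=5, rng=0):
    """Balance a regression target by resampling under-populated
    equal-width target bins up to the largest bin's count. A simple
    stand-in for regression-oriented SMOTE augmentation; not the
    paper's actual method."""
    rng = np.random.default_rng(rng)
    edges = np.linspace(y.min(), y.max(), n_bins + 1)
    bins = np.digitize(y, edges[1:-1])          # bin index 0 .. n_bins-1
    target = max(np.bincount(bins, minlength=n_bins))
    idx = []
    for b in range(n_bins):
        members = np.where(bins == b)[0]
        if len(members) == 0:
            continue                            # skip empty bins
        extra = rng.choice(members, target - len(members), replace=True)
        idx.extend(members)
        idx.extend(extra)
    idx = np.array(idx)
    return X[idx], y[idx]

# toy AGB data: many mid-range trees, few very large ones
rng = np.random.default_rng(1)
y = np.concatenate([rng.uniform(100, 500, 90),     # common mid-range AGB (kg)
                    rng.uniform(2000, 3000, 10)])  # rare high-AGB trees
X = y[:, None] * 0.01                              # stand-in feature (e.g. CW)
Xb, yb = oversample_rare_bins(X, y, n_bins=5)
print(len(y), "->", len(yb))
```

Duplicating rather than interpolating keeps the sketch short; a SMOTE-style variant would interpolate within each bin as in the classification case.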
11 pages, 590 KB  
Article
Design and Performance Evaluation of Communication Systems Based on Non-Orthogonal Overlapped Chirp Modulation
by Guoping Liu, Jiaju Zhang, Qiusheng Gao, Wenjiang Pei, Junpeng Zhang and Sinuo Jiao
Symmetry 2026, 18(3), 412; https://doi.org/10.3390/sym18030412 - 27 Feb 2026
Viewed by 138
Abstract
With the evolution of smart grids, power communication networks are increasingly required to support high-bandwidth and diversified services such as high-definition video, real-time control, and positioning—services that impose dual challenges of communication capacity and spectrum constraints—under severe resource limitations. Conventional orthogonal modulation schemes exhibit significant limitations in spectral efficiency and concurrent access capabilities, particularly in supporting high-density user environments. To address this, we propose a communication system based on non-orthogonal overlapped chirp modulation, in which the intrinsic symmetry properties of chirp waveforms are utilized to enhance system design and performance. We first construct the system architecture with a multi-symbol concurrent transmission scheme and introduce continuous orthogonal phase modulation to improve symbol distinguishability and mitigate inter-symbol interference—an approach that effectively harnesses signal symmetry for interference suppression. At the receiver, a low-complexity demodulation algorithm based on correlation matrix computation is developed, further improved through oversampling techniques that exploit temporal and spectral symmetry in signal design. Monte Carlo simulations confirm that the proposed system outperforms traditional orthogonal chirp and orthogonal frequency division multiplexing systems in bit error rate performance and spectral efficiency across varying signal-to-noise ratios and modulation schemes. The proposed NOOC system achieves spectral efficiency scaling linearly with concurrency level K, reaching up to 16 bits/s/Hz for K = 16 with BPSK, compared to 1 bit/s/Hz in orthogonal systems. The study provides both a theoretical foundation and practical insights for developing symmetry-aware, efficient, and reliable air interface technologies suitable for future power-private networks. Full article
(This article belongs to the Section Engineering and Materials)
17 pages, 4515 KB  
Article
Lightweight, Compact, and High-Sensitivity Passive Fourier Transform Infrared Spectroscopy-Based Gas Detection System
by Xiangning Lu, Min Huang, Wenbin Ge, Lulu Qian, Zhanchao Wang, Yan Sun, Jinlin Chen and Wei Han
Sensors 2026, 26(5), 1493; https://doi.org/10.3390/s26051493 - 27 Feb 2026
Viewed by 125
Abstract
With the intensification of environmental pollution and the increasingly prominent problem of industrial harmful gas emissions, existing mainstream gas detection technologies still have obvious limitations in terms of real-time performance, non-contact capability, detection accuracy, and multi-component identification. To address this demand, this paper proposes a lightweight and compact gas detection system based on passive Fourier Transform Infrared Spectroscopy (FTIR). The system innovatively integrates an improved parallel pendulum mirror interferometer and a low-noise signal preprocessing module, and simultaneously presents a novel oversampling method fusing equal time, equal optical path difference, and digital filtering, which effectively enhances the operational stability and sampling accuracy of the spectrometer. The system features excellent platform adaptability and can be flexibly mounted on various operation carriers. Combined with a two-dimensional rotating platform and an inertial navigation module, its monitoring range and application scenarios can be further expanded. Indoor sensitivity test results show that the detection limit of the system for sulfur hexafluoride (SF6) is less than 20 ppm; flight tests under real-world scenarios have successfully achieved accurate detection of SF6 gas, fully verifying the practical application effectiveness of the system. Based on the comprehensive results of indoor and outdoor tests, the system demonstrates core technical advantages of high sensitivity, strong flexibility, and excellent real-time performance. It is expected to be widely applied in gas monitoring tasks across multiple fields such as industrial safety monitoring, ecological environment monitoring, and transportation support in the future. Full article
(This article belongs to the Section Physical Sensors)
17 pages, 876 KB  
Article
Transformer-Enhanced Localization via Adaptive PDP Representation Under Dynamic Bandwidths
by Lei Cao, Tianqi Xiang, Weiyan Chen, Yicheng Wang, Yuehong Gao and Xin Zhang
Sensors 2026, 26(5), 1486; https://doi.org/10.3390/s26051486 - 27 Feb 2026
Viewed by 161
Abstract
Accurate wireless positioning has remained challenging under the dynamic bandwidth conditions and outdoor multipath environments that are typical of Internet of Things (IoT) and autonomous aerial vehicle (AAV) applications. Conventional learning-based localization methods rely on bandwidth-specific channel state information (CSI) representations, which causes the trained models to be inapplicable or less adaptive when the signal bandwidth differs from that used during training. To overcome this limitation, a unified, neural-network-oriented framework is proposed that constructs bandwidth-adaptive power delay profile (PDP) representations for learning-based models. A PDP preprocessing scheme based on adaptive zero-padding and oversampled IFFT of heterogeneous CSI is introduced to generate dimension-consistent and delay-aligned neural network inputs. To enhance robustness, a sub-band-sliced PDP representation is developed, in which each bandwidth is divided into equal-width sub-bands whose PDPs are independently processed and organized as Transformer tokens. A dedicated Transformer is designed to obtain the location estimate from the PDPs of multiple access points. Simulation results demonstrate that the proposed preprocessing-PDP-plus-Transformer framework achieves superior cross-bandwidth generalization and localization accuracy compared to analytical and learning-based baselines. Full article
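The zero-padded, oversampled-IFFT step the abstract describes can be illustrated in a few lines of numpy (function name, output length, and the two-path toy channel are ours): padding heterogeneous CSI vectors to a common length before the IFFT yields fixed-dimension PDP inputs regardless of subcarrier count. Note the naive version changes delay resolution with bandwidth (a tap corresponds to delay × out_len / n_sc), which is presumably why the paper adds an explicit delay-alignment step:

```python
import numpy as np

def csi_to_pdp(csi, out_len=256):
    """Map a CSI vector of any subcarrier count to a fixed-length,
    oversampled power delay profile: zero-pad in frequency, IFFT,
    then take per-tap power."""
    padded = np.zeros(out_len, dtype=complex)
    padded[:len(csi)] = csi        # zero-pad the missing subcarriers
    taps = np.fft.ifft(padded)     # oversampled delay-domain response
    return np.abs(taps) ** 2       # power delay profile

# same two-path channel measured with 64 vs. 128 subcarriers:
# both produce identically shaped (256,) network inputs
for n_sc in (64, 128):
    f = np.arange(n_sc)
    csi = (np.exp(-2j * np.pi * f * 3 / n_sc)        # path at delay 3 samples
           + 0.5 * np.exp(-2j * np.pi * f * 10 / n_sc))  # weaker path at 10
    pdp = csi_to_pdp(csi)
    print(n_sc, pdp.shape, int(np.argmax(pdp)))
```

The dominant path's peak lands at tap 3 × 256 / n_sc (12 for 64 subcarriers, 6 for 128), making the resolution-scaling effect visible.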
28 pages, 2771 KB  
Article
Improving Tree-Based Lung Disease Classification from Chest X-Ray Images Using Deep Feature Representations
by Abdulaziz A. Alsulami, Qasem Abu Al-Haija, Rayed Alakhtar, Huda Alsobhi, Rayan A. Alsemmeari, Badraddin Alturki and Ahmad J. Tayeb
Bioengineering 2026, 13(3), 267; https://doi.org/10.3390/bioengineering13030267 - 25 Feb 2026
Viewed by 280
Abstract
Healthcare systems worldwide face increasing pressure to deliver accurate, affordable, and scalable diagnostic services while maintaining long-term sustainability. Chest X-ray screening is considered one of the most cost-effective methods for detecting lung disease. However, many deep learning approaches are computationally intensive and difficult to interpret, which limits their adoption in high-throughput, resource-constrained clinical settings. This study proposes a hybrid CNN–tree framework for automated lung disease classification from chest X-ray images, which targets COVID-19, pneumonia, tuberculosis, lung cancer, and normal cases. To ensure robustness and generalization, four publicly available chest X-ray datasets from different sources are merged into a unified five-class dataset, which introduces realistic variations in imaging conditions and patient populations. A ResNet-18 model is fine-tuned to extract domain-specific deep feature representations. Feature dimensionality and redundancy are reduced using Principal Component Analysis, while class imbalance is addressed through the Synthetic Minority Over-sampling Technique. The resulting compact feature vectors are used to train interpretable tree-based classifiers, which include Decision Tree, Random Forest, and XGBoost. Experiments conducted using five-fold stratified cross-validation demonstrate substantial and consistent performance gains. When trained on fine-tuned and preprocessed deep features, all evaluated tree-based classifiers achieve weighted F1-scores between 0.977 and 0.982 using five-fold cross-validation, with a significant reduction in inter-class confusion. In addition, the proposed framework maintains low per-sample inference latency, which supports energy-efficient and scalable deployment. 
These results indicate that combining deep feature learning with interpretable tree-based models provides a practical and reliable solution for sustainable chest X-ray screening in real-world clinical environments. Full article
(This article belongs to the Section Biosignal Processing)
15 pages, 1323 KB  
Article
Identification of Predictors of Adaptability in Older Adults Based on the Roy Adaptation Model Using Machine Learning
by Javier Gaviria Chavarro, Miguel Ángel Gómez García, Jose Manuel Alcaide Leyva, Alfonsina del Cristo Martínez Gutiérrez and Rosa Nury Zambrano Bermeo
J. Clin. Med. 2026, 15(5), 1709; https://doi.org/10.3390/jcm15051709 - 24 Feb 2026
Viewed by 190
Abstract
Background: The Callista Roy Adaptation Model posits that adaptation in later life emerges from the interaction among physical, psychological, and social dimensions. However, empirical evidence integrating these domains through predictive approaches remains limited. The aim of this study was to identify the main predictors of adaptive classification in older adult women using functional and subjective well-being measures. Methods: A predictive study was conducted in older adult women enrolled in community-based exercise programs. Assessments included the Senior Fitness Test and the SF-12 and WHO-5 questionnaires. Multiclass classification models were trained, with Random Forest selected due to superior performance. Model evaluation incorporated oversampling strategies and robustness analyses without oversampling, using metrics resilient to class imbalance (macro-F1 and balanced accuracy). Model interpretability was examined through variable importance analysis, partial dependence, and ICE plots. Results: Under the oversampling framework, the Random Forest model achieved an overall accuracy of 74% and a macro-F1 score of 0.73, with reduced performance observed in robustness analyses, particularly for the minority “High” class. The most influential predictors were the physical component of the SF-12, the 2 min step test, the mental component of the SF-12, and the chair sit-and-reach test. Conclusions: The findings highlight the joint contribution of physical and psychosocial factors to adaptive processes, in alignment with the Roy Adaptation Model. This study provides exploratory evidence supporting the integrated use of the SFT, SF-12, and WHO-5; however, external validation and longitudinal evaluation are required prior to clinical implementation. Full article
(This article belongs to the Section Epidemiology & Public Health)
16 pages, 1606 KB  
Article
GenReP: An Ensemble Model for Predicting TP53 in Response to Pharmaceutical Compounds
by Austin Spadaro, Alok Sharma and Iman Dehzangi
Molecules 2026, 31(4), 739; https://doi.org/10.3390/molecules31040739 - 21 Feb 2026
Viewed by 263
Abstract
TP53 is a tumor-suppressor gene involved in regulating apoptosis, DNA repair, and genomic stability. Mutations in TP53 are implicated in approximately half of all detected cancers, including breast, lung, colorectal, and ovarian cancers, making it a significant target for therapeutic interventions. Many pharmaceutical drugs aim to restore TP53 function, and there is a need for predictive tools to assess how compounds may affect TP53 expression. In this study, we propose a new ensemble machine-learning model to predict the direction of TP53 relative gene expression in response to pharmaceutical compounds. Our model utilizes molecular fingerprints, descriptors, and scaffold-based features extracted from SMILES representations of compounds, concatenated into a single feature vector. Trained on our newly generated benchmark dataset based on the Connectivity Map (CMap) database, with class imbalance addressed by the Synthetic Minority Over-sampling Technique (SMOTE), our model achieves 62.9% accuracy, 93.9% sensitivity, 40.3% specificity, and a Matthews Correlation Coefficient (MCC) of 0.39. As a first-of-its-kind predictor of TP53 gene regulation, our study serves as a convincing proof of concept that paves the way for future investigation. GenReP, its source code, and our newly generated benchmark dataset are publicly available. Full article
(This article belongs to the Special Issue Computational Insights into Protein Engineering and Molecular Design)
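The abstract's combination of high sensitivity (93.9%) with low specificity (40.3%) is exactly the pattern the MCC is designed to expose. A short numpy definition, with a toy confusion of our own making, shows how the four counts combine into a single balanced score:

```python
import numpy as np

def mcc(y_true, y_pred):
    """Matthews Correlation Coefficient: +1 for perfect prediction,
    0 for chance-level, -1 for total disagreement. Uses all four
    confusion-matrix cells, so it stays honest under class imbalance."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# toy predictions: tp=2, fn=1, fp=1, tn=4
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 0, 0, 0, 1, 0])
print(round(mcc(y_true, y_pred), 3))  # → 0.467
```

A moderate MCC alongside lopsided sensitivity/specificity, as reported above, signals a classifier biased toward the positive class.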
32 pages, 9123 KB  
Article
AI-Based Classification of IT Support Requests in Enterprise Service Management Systems
by Audrius Razma and Robertas Jurkus
Systems 2026, 14(2), 223; https://doi.org/10.3390/systems14020223 - 21 Feb 2026
Viewed by 271
Abstract
In modern organizations, IT Service Management (ITSM) relies on the efficient handling of large volumes of unstructured textual data, such as support tickets and incident reports. This study investigates the automated classification of IT support requests as a data-driven decision-support task within a [...] Read more.
In modern organizations, IT Service Management (ITSM) relies on the efficient handling of large volumes of unstructured textual data, such as support tickets and incident reports. This study investigates the automated classification of IT support requests as a data-driven decision-support task within a real-world enterprise ITSM context, addressing challenges posed by multilingual content and severe class imbalance. We propose an applied machine-learning and natural language processing (NLP) pipeline combining text cleaning, stratified data splitting, and supervised model training under realistic evaluation conditions. Multiple classification models were evaluated on historical enterprise ticket data, including a Logistic Regression baseline and transformer-based architectures (multilingual BERT and XLM-RoBERTa). Model validation distinguishes between deployment-oriented evaluation on naturally imbalanced data and diagnostic analysis using training-time class balancing to examine minority-class behavior. Results indicate that Logistic Regression performs reliably for high-frequency, well-defined request categories, while transformer-based models achieve consistently higher macro-averaged F1-scores and improved recognition of semantically complex and underrepresented classes. Training-time oversampling increases sensitivity to minority request types without improving overall accuracy on unbalanced test data, highlighting the importance of metric selection in ITSM evaluation. The findings provide an applied empirical comparison of established text-classification models in ITSM, incorporating both predictive performance and computational efficiency considerations, and offer practical guidance for supporting IT support agents during ticket triage and automated request classification. Full article
(This article belongs to the Section Artificial Intelligence and Digital Systems Engineering)
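The abstract above highlights why macro-averaged F1, rather than accuracy, is the right deployment metric for imbalanced ticket data. A minimal sketch of the idea (ticket category names are hypothetical, not from the paper): a classifier that always predicts the majority class looks accurate on an imbalanced test set, but its macro-F1 exposes the ignored minority class.

```python
import numpy as np

def macro_f1(y_true, y_pred):
    """Macro-averaged F1: unweighted mean of per-class F1 scores,
    so rare ticket categories count as much as frequent ones."""
    scores = []
    for c in np.unique(y_true):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        scores.append(2 * precision * recall / (precision + recall)
                      if precision + recall else 0.0)
    return float(np.mean(scores))

# Imbalanced ticket labels: 90 "password_reset", 10 "network_outage".
y_true = np.array(["password_reset"] * 90 + ["network_outage"] * 10)
# A classifier that always predicts the majority class looks accurate ...
y_pred = np.array(["password_reset"] * 100)
accuracy = float(np.mean(y_true == y_pred))   # 0.90
# ... but macro-F1 averages a perfect-looking majority F1 with a
# zero minority F1, revealing the failure.
score = macro_f1(y_true, y_pred)              # ~0.47
```

This is the same effect the study reports for training-time oversampling: minority-class sensitivity and macro-F1 can move without overall accuracy improving.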
23 pages, 527 KB  
Article
Time-Domain Oversampling-Enabled Multi-NS Reception for MoCDMA
by Weidong Gao, Yuanhui Wang and Jun Li
Symmetry 2026, 18(2), 380; https://doi.org/10.3390/sym18020380 - 20 Feb 2026
Viewed by 179
Abstract
In molecular communication via diffusion (MCvD) uplinks where multiple nano-sensors (NSs) report concurrently to a fusion center (FC), the long channel memory and the near–far imbalance jointly create strong multiple access interference (MAI) coupled with residual inter-symbol/inter-chip effects. This paper studies an oversampling-enabled time-domain reception scheme for an uplink molecular code-division multiple-access (MoCDMA) system employing bipolar molecular signalling. By exploiting intra-chip oversampling at the FC, three linear detectors following the principles of maximum ratio combining (MRC), zero-forcing (ZF), and minimum mean-square error (MMSE) are developed and further enhanced through a feedback-assisted interference subtraction (FAIS) scheme that combines single-tap ISI feedback equalization with near-to-far successive MAI subtraction. Owing to the complementary structure of bipolar molecular emissions, the signal-dependent counting noise corresponding to the two molecule types can be jointly modeled in a symmetric and information-independent manner to support unified linear detection and FAIS processing. Numerical results demonstrate that oversampling effectively improves detection reliability, while increasing the molecular emission budget alone is insufficient to mitigate near–far effects. Moreover, FAIS provides significant performance gains, particularly for far NSs. Full article
(This article belongs to the Section Computer)
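The MMSE linear detection named in the abstract can be illustrated with a toy numerical sketch. Everything here is an assumption for illustration: the channel matrix `H` is a hand-built decaying pulse, not a real MCvD diffusion model; the counting noise is approximated as additive with variance `sigma2`; the column scaling 2.0/1.0/0.4 merely mimics the near–far imbalance.

```python
import numpy as np

# Toy channel: 3 nano-sensors observed over 6 oversampled chip instants.
# Columns are hypothetical diffusion responses (shifted decaying pulses);
# the 2.0 / 1.0 / 0.4 scaling mimics near-far imbalance (sensor 0 nearest).
g = np.array([0.9, 1.0, 0.8, 0.5, 0.3, 0.2])          # decaying pulse shape
H = np.stack([2.0 * g,
              1.0 * np.roll(g, 1),
              0.4 * np.roll(g, 2)], axis=1)

s = np.array([1.0, -1.0, 1.0])    # bipolar symbols, one per sensor
sigma2 = 0.01                     # assumed noise variance (stand-in for counting noise)
r = H @ s                         # noiseless received samples, for clarity

# MMSE linear detector: s_hat = sign((H^T H + sigma2 * I)^-1 H^T r)
W = np.linalg.solve(H.T @ H + sigma2 * np.eye(3), H.T)
s_hat = np.sign(W @ r)
```

With `sigma2 = 0` this reduces to the ZF detector; the regularization term is what trades residual interference against noise enhancement, which matters most for the weak (far) sensor column.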

27 pages, 1628 KB  
Article
Synthetic Data Augmentation for Imbalanced Tabular Data: A Comparative Study of Generation Methods
by Dong-Hyun Won, Kwang-Seong Shin and Sungkwan Youm
Electronics 2026, 15(4), 883; https://doi.org/10.3390/electronics15040883 - 20 Feb 2026
Viewed by 335
Abstract
Class imbalance in tabular datasets poses a challenge for machine learning classification tasks, often leading to biased models that underperform in predicting minority class instances. This study presents a comparative analysis of synthetic data generation methods for addressing class imbalance in tabular data. We evaluate four augmentation approaches—Synthetic Minority Over-sampling Technique (SMOTE), Gaussian Copula, Tabular Variational Autoencoder (TVAE), and Conditional Tabular Generative Adversarial Network (CTGAN)—using the University of California Irvine (UCI) Bank Marketing dataset, which exhibits a class imbalance ratio of approximately 7.88:1. Our experimental framework assesses each method across three dimensions: statistical fidelity to the original data distribution evaluated through four complementary metrics (marginal numerical similarity, categorical distribution similarity, correlation structure preservation, and Kolmogorov–Smirnov test), machine learning utility measured through classification performance, and minority class detection capability. Results indicate that all augmentation methods achieved statistically significant improvements over the baseline (p < 0.05). SMOTE achieved the highest recall (54.2%, a 117.6% relative improvement over the baseline) and F1-Score (0.437, +22.4% over the baseline) for minority class detection, while Gaussian Copula provided the highest composite fidelity score (0.930) with competitive predictive performance. A weak negative correlation (ρ = −0.30) between composite fidelity and classification performance was observed, suggesting that higher statistical fidelity does not necessarily translate to better downstream task performance. Deep learning-based methods (TVAE, CTGAN) showed statistically significant improvements over the baseline (recall: +58% to +63%) but underperformed compared to simpler methods under default configurations, suggesting the need for larger training samples or more extensive hyperparameter tuning. These findings offer reference points for practitioners working with moderately imbalanced tabular data with limited minority class samples, supporting the selection of generation strategies based on specific requirements regarding data fidelity and classification objectives. Full article
(This article belongs to the Special Issue Data-Related Challenges in Machine Learning: Theory and Application)
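The core interpolation idea behind SMOTE, the strongest method for recall in the abstract above, can be sketched in a few lines. This is a minimal stand-in, not the full published algorithm or the `imbalanced-learn` implementation: each synthetic point is a random interpolation between a minority sample and one of its k nearest minority-class neighbors.

```python
import numpy as np

def smote_like(X_min, n_new, k=3, rng=None):
    """Minimal SMOTE-style oversampling sketch: synthesize n_new points
    by interpolating between minority samples and their k nearest
    minority-class neighbors."""
    if rng is None:
        rng = np.random.default_rng(42)
    # Pairwise Euclidean distances within the minority class.
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                # exclude self-matches
    nn = np.argsort(d, axis=1)[:, :k]          # k nearest neighbors per row
    synth = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        a = rng.integers(len(X_min))           # pick a random minority sample
        b = nn[a, rng.integers(k)]             # and one of its neighbors
        lam = rng.random()                     # interpolation factor in [0, 1)
        synth[i] = X_min[a] + lam * (X_min[b] - X_min[a])
    return synth

# Four minority samples at the corners of the unit square (toy data).
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X_new = smote_like(X_min, n_new=6)
# Synthetic points lie on segments between minority samples, so they
# stay inside the minority class's convex hull.
```

Because every synthetic point is a convex combination of two real minority samples, the method densifies the minority region rather than inventing points outside it, which is also why it can struggle when minority samples are noisy or overlap the majority class.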
