Machine Learning Approaches for Imbalanced Domains: Emerging Trends and Applications

A special issue of Information (ISSN 2078-2489). This special issue belongs to the section "Artificial Intelligence".

Deadline for manuscript submissions: 1 December 2024

Special Issue Editors


Guest Editor
Department of Mathematics and Computer Science, University of Cagliari, 09124 Cagliari, Italy
Interests: data mining and machine learning; high-dimensional data analysis; feature selection

Guest Editor
Department of Mathematics and Computer Science, University of Cagliari, 09124 Cagliari, Italy
Interests: computer vision; image processing; machine learning; deep learning; artificial intelligence; medical image analysis; biomedical image analysis

Special Issue Information

Dear Colleagues,

In many real-world domains, the data distribution is highly imbalanced since instances of some classes appear much more frequently than others. This poses a difficulty for machine learning algorithms as they tend to be biased towards the majority class. At the same time, the minority class is typically the most important from a data mining perspective as it may carry valuable knowledge.

Despite more than two decades of continuous research, several open issues remain in the field of imbalance learning, and recent trends increasingly focus on the interaction between class imbalance and other difficulties embedded in the nature of the data, such as the fast-growing data volume and dimensionality, the variability of concepts in time, or the presence of noise and data quality issues. New real-world problems continue to emerge that motivate researchers to focus on advanced learning strategies, which can involve data-level and algorithm-level approaches, to effectively deal with imbalanced datasets.
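To make the data-level/algorithm-level distinction concrete, here is a minimal sketch (illustrative only, not drawn from any paper in this issue) of the simplest data-level strategy, random oversampling, which duplicates minority-class instances until the classes balance; more sophisticated approaches such as SMOTE synthesize new instances instead:

```python
import random

def random_oversample(X, y, minority_label, seed=0):
    """Data-level rebalancing: duplicate minority-class instances
    until both classes have the same size (a toy stand-in for
    SMOTE-style synthetic oversampling)."""
    rng = random.Random(seed)
    minority = [(x, t) for x, t in zip(X, y) if t == minority_label]
    majority = [(x, t) for x, t in zip(X, y) if t != minority_label]
    # Draw duplicates at random until the class sizes match.
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    data = majority + minority + extra
    rng.shuffle(data)
    Xb = [x for x, _ in data]
    yb = [t for _, t in data]
    return Xb, yb
```

Algorithm-level counterparts leave the data untouched and instead modify the learner, for example by weighting misclassification costs in inverse proportion to class frequencies.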

The aim of this Special Issue is to bring together contributions that discuss problems and solutions in this area, both from a methodological and an application-oriented perspective. Topics of interest include but are not limited to:

  • Data-level, algorithm-level, and hybrid approaches;
  • Machine learning, ensemble learning, and deep learning methods;
  • Multi-label and multi-class imbalanced learning;
  • Learning strategies for high-dimensional imbalanced data;
  • Learning strategies for imbalanced data streams;
  • Learning strategies for imbalanced visual data;
  • Noise robustness of learning methods in imbalanced settings;
  • Metrics and methodologies for model evaluation in imbalanced settings;
  • Real-world applications: industrial monitoring systems, fraud detection, intrusion detection, software defect prediction, medical diagnosis, object detection and image classification, computer vision, text mining, sentiment analysis, anomaly detection, and behavior analysis in social media.
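On the evaluation topic above, plain accuracy can look deceptively high on imbalanced data, which is why metrics such as per-class recall and its macro average (balanced accuracy) are preferred. A minimal illustration (not tied to any specific submission):

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recalls: a classifier that only ever predicts
    the majority class scores 0.5 on a binary problem, however skewed
    the class distribution is."""
    classes = sorted(set(y_true))
    recalls = []
    for c in classes:
        idx = [i for i, t in enumerate(y_true) if t == c]
        hits = sum(1 for i in idx if y_pred[i] == c)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)

# A degenerate classifier that always predicts the majority class:
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100
accuracy = sum(p == t for p, t in zip(y_pred, y_true)) / len(y_true)  # 0.95
```

Here plain accuracy reports 0.95 while balanced accuracy reports 0.5, exposing that the minority class is never detected.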

Dr. Barbara Pes
Dr. Andrea Loddo
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Information is an international peer-reviewed open access monthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 1600 CHF (Swiss Francs). Submitted papers should be well formatted and written in good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • data mining and knowledge discovery
  • machine learning
  • deep learning
  • imbalance learning
  • case studies and real-world applications

Published Papers (2 papers)


Research

20 pages, 537 KiB  
Article
An Extensive Performance Comparison between Feature Reduction and Feature Selection Preprocessing Algorithms on Imbalanced Wide Data
by Ismael Ramos-Pérez, José Antonio Barbero-Aparicio, Antonio Canepa-Oneto, Álvar Arnaiz-González and Jesús Maudes-Raedo
Information 2024, 15(4), 223; https://doi.org/10.3390/info15040223 - 16 Apr 2024
Abstract
The most common preprocessing techniques used to deal with datasets having high dimensionality and a low number of instances—or wide data—are feature reduction (FR), feature selection (FS), and resampling. This study explores the use of FR and resampling techniques, expanding the limited comparisons between FR and filter FS methods in the existing literature, especially in the context of wide data. We compare the optimal outcomes from a previous comprehensive study of FS against new experiments conducted using FR methods. Two specific challenges associated with the use of FR are outlined in detail: finding FR methods that are compatible with wide data and the need for a reduction estimator of nonlinear approaches to process out-of-sample data. The experimental study compares 17 techniques, including supervised, unsupervised, linear, and nonlinear approaches, using 7 resampling strategies and 5 classifiers. The results demonstrate which configurations are optimal, according to their performance and computation time. Moreover, the best configuration—namely, k Nearest Neighbor (KNN) + the Maximal Margin Criterion (MMC) feature reducer with no resampling—is shown to outperform state-of-the-art algorithms.
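The distinction at the heart of this comparison can be sketched in a few lines (an illustrative toy, not the MMC reducer or the filters used in the study): feature selection keeps a subset of the original columns, while feature reduction replaces them with new axes built from all columns, which is why nonlinear reducers need a separate estimator to project out-of-sample data.

```python
def variance_select(X, k):
    """Feature selection: keep the k original columns with the
    highest variance (a simple univariate filter)."""
    n, d = len(X), len(X[0])
    variances = []
    for j in range(d):
        col = [row[j] for row in X]
        mu = sum(col) / n
        variances.append(sum((v - mu) ** 2 for v in col) / n)
    keep = sorted(range(d), key=lambda j: -variances[j])[:k]
    keep.sort()
    return [[row[j] for j in keep] for row in X], keep

def project(X, W):
    """Feature reduction: map each row onto new axes W (each axis a
    linear combination of all original features), as PCA or MMC do."""
    return [[sum(x * w for x, w in zip(row, axis)) for axis in W] for row in X]
```

Note that `variance_select` returns column indices that can be reused directly on new data, whereas projection-based reducers must store (or learn) a mapping `W` to transform out-of-sample rows.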
19 pages, 2175 KiB  
Article
An Evaluation of Feature Selection Robustness on Class Noisy Data
by Simone Pau, Alessandra Perniciano, Barbara Pes and Dario Rubattu
Information 2023, 14(8), 438; https://doi.org/10.3390/info14080438 - 03 Aug 2023
Abstract
With the increasing growth of data dimensionality, feature selection has become a crucial step in a variety of machine learning and data mining applications. In fact, it allows identifying the most important attributes of the task at hand, improving the efficiency, interpretability, and final performance of the induced models. In recent literature, several studies have examined the strengths and weaknesses of the available feature selection methods from different points of view. Still, little work has been performed to investigate how sensitive they are to the presence of noisy instances in the input data. This is the specific field in which our work wants to make a contribution. Indeed, since noise is arguably inevitable in several application scenarios, it would be important to understand the extent to which the different selection heuristics can be affected by noise, in particular class noise (which is more harmful in supervised learning tasks). Such an evaluation may be especially important in the context of class-imbalanced problems, where any perturbation in the set of training records can strongly affect the final selection outcome. In this regard, we provide here a two-fold contribution by presenting (i) a general methodology to evaluate feature selection robustness on class noisy data and (ii) an experimental study that involves different selection methods, both univariate and multivariate. The experiments have been conducted on eight high-dimensional datasets chosen to be representative of different real-world domains, with interesting insights into the intrinsic degree of robustness of the considered selection approaches.
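The general shape of such a robustness methodology can be sketched as follows, under loose assumptions (every function name below is hypothetical, not taken from the paper): rank features with a univariate score on the clean labels, re-rank after injecting class noise, and measure how much the two selected subsets overlap.

```python
import random

def top_k_by_mean_gap(X, y, k):
    """Univariate filter: score each feature by the absolute gap
    between its class-0 and class-1 means, keep the top k."""
    d = len(X[0])
    scores = []
    for j in range(d):
        c0 = [row[j] for row, t in zip(X, y) if t == 0]
        c1 = [row[j] for row, t in zip(X, y) if t == 1]
        scores.append(abs(sum(c0) / len(c0) - sum(c1) / len(c1)))
    return set(sorted(range(d), key=lambda j: -scores[j])[:k])

def flip_labels(y, rate, seed=0):
    """Inject class noise by flipping a fraction of the binary labels."""
    rng = random.Random(seed)
    idx = set(rng.sample(range(len(y)), int(rate * len(y))))
    return [1 - t if i in idx else t for i, t in enumerate(y)]

def jaccard(a, b):
    """Overlap between clean and noisy selections: 1.0 = fully robust."""
    return len(a & b) / len(a | b)
```

Averaging the Jaccard overlap over many noise injections and noise rates yields a robustness profile for each selection heuristic, which is the kind of comparison the experimental study carries out at scale.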