Search Results (34)

Search Parameters:
Keywords = SMOTE + Tomek Link

25 pages, 8863 KB  
Article
A Multi-Scale Residual Convolutional Neural Network for Fault Diagnosis of Progressive Cavity Pump Systems in Coalbed Methane Wells with Imbalanced and Differentiated Data
by Jiaojiao Yu, Yajie Ou, Ying Gao, Youwu Li, Feng Gu, Jinhuang You, Bin Liu, Xiaoyong Gao and Chaodong Tan
Processes 2026, 14(2), 383; https://doi.org/10.3390/pr14020383 - 22 Jan 2026
Viewed by 44
Abstract
Coalbed methane, an abundant clean energy resource in China, is gaining significant attention. Electric submersible progressive cavity pumps, ideal for downhole extraction with high solids content, are vital in coalbed methane operations. Current fault diagnosis research for these pumps mainly relies on machine learning algorithms to identify fault features, but complex working conditions and imbalanced sample distributions challenge these models’ ability to perceive multi-scale and multi-dimensional features. To enhance the model’s perception of deep abnormal data in complex multi-case industrial datasets, this study proposes a deep learning model based on a multi-scale extraction and residual module convolutional neural network. Innovatively, a cross-attention module using global autocorrelation and local cross-correlation is introduced to constrain the multi-scale feature extraction process, making the model better suited to specific and differentiated data environments. Post feature extraction, the model employs Borderline-SMOTE to augment minority class samples and uses Tomek Links for noise removal. These enhancements improve the comprehensive perception of fault types with significant differences in period, amplitude, and dimension, as well as the learning capability for rare faults. Based on field-collected fault data and using enhanced and cleaned features for classifier training, tests on a real industrial dataset show the proposed model achieves an F1 Measure of 90.7%—an improvement of 13.38% over the unimproved model and 9.15–31.64% over other common fault diagnosis models. Experimental results confirm the method’s effectiveness in adapting to extremely imbalanced sample distributions and complex, variable field data characteristics. Full article
(This article belongs to the Special Issue Coalbed Methane Development Process)

34 pages, 4013 KB  
Article
Machine Learning-Based Cyber Fraud Detection: A Comparative Study of Resampling Methods for Imbalanced Credit Card Data
by Eyad Btoush, Thaeer Kobbaey, Hatem Tamimi and Xujuan Zhou
Appl. Sci. 2026, 16(2), 850; https://doi.org/10.3390/app16020850 - 14 Jan 2026
Viewed by 170
Abstract
The prevalence of online transactions and extensive adoption of credit card payments have contributed to the escalation of credit card cyber fraud in modern society. These trends are propelled by technological advancements, which provide fraudulent actors with more opportunities. Fraudsters exploit victims’ financial vulnerabilities by obtaining illegal access to sensitive credit card information through deceptive means, such as phishing, fraudulent phone calls, and fraudulent SMS messages. This study predicts and detects potential instances of cyber fraud in credit card transactions by employing Machine Learning (ML) techniques, including Decision Tree (DT); Random Forest (RF); Logistic Regression (LR); Support Vector Machine (SVM); K-Nearest Neighbors (KNN); XGBoost; CatBoost; and sampling techniques such as Tomek Link, Synthetic Minority oversampling technique (SMOTE), Edited Nearest Neighbor (ENN), Tomek+ENN, and SMOTE+ENN. To determine the performance of the algorithms in terms of accuracy, precision, recall, F1 score, and ROC-AUC for credit card cyber fraud detection, we conducted a comparative analysis of the extant ML techniques. Full article

20 pages, 1504 KB  
Article
Early Prediction of Acute Respiratory Distress Syndrome in Critically Ill Polytrauma Patients Using Balanced Random Forest ML: A Retrospective Cohort Study
by Nesrine Ben El Hadj Hassine, Sabri Barbaria, Omayma Najah, Halil İbrahim Ceylan, Muhammad Bilal, Lotfi Rebai, Raul Ioan Muntean, Ismail Dergaa and Hanene Boussi Rahmouni
J. Clin. Med. 2025, 14(24), 8934; https://doi.org/10.3390/jcm14248934 - 17 Dec 2025
Viewed by 808
Abstract
Background/Objectives: Acute respiratory distress syndrome (ARDS) represents a critical complication in polytrauma patients, characterized by diffuse lung inflammation and bilateral pulmonary infiltrates with mortality rates reaching 45% in intensive care units (ICU). The heterogeneous nature of ARDS and complex clinical presentation in severely injured patients poses substantial diagnostic challenges, necessitating early prediction tools to guide timely interventions. Machine learning (ML) algorithms have emerged as promising approaches for clinical decision support, demonstrating superior performance compared to traditional scoring systems in capturing complex patterns within high-dimensional medical data. Based on the identified research gaps in early ARDS prediction for polytrauma populations, our study aimed to: (i) develop a balanced random forest (BRF) ML model for early ARDS prediction in critically ill polytrauma patients, (ii) identify the most predictive clinical features using ANOVA-based feature selection, and (iii) evaluate model performance using comprehensive metrics addressing class imbalance challenges. Methods: This retrospective cohort study analyzed 407 polytrauma patients admitted to the ICU of the Center of Traumatology and Major Burns of Ben Arous, Tunisia, between 2017 and 2021. We implemented a comprehensive ML pipeline that incorporates Tomek Links undersampling, ANOVA F-test feature selection for the top 10 predictive variables, and SMOTE oversampling with a conservative sampling rate of 0.3. The BRF classifier was trained with class weighting and evaluated using stratified 5-fold cross-validation. Performance metrics included AUROC, PR-AUC, sensitivity, specificity, F1-score, and Matthews correlation coefficient. Results: Among 407 patients, 43 developed ARDS according to the Berlin definition, representing a 10.57% incidence. 
The BRF model demonstrated exceptional predictive performance with an AUROC of 0.98, a sensitivity of 0.91, a specificity of 0.80, an F1-score of 0.84, and an MCC of 0.70. Precision–recall AUC reached 0.86, demonstrating robust performance despite class imbalance. During stratified cross-validation, AUROC values ranged from 0.93 to 0.99 across folds, indicating consistent model stability. The top 10 selected features included procalcitonin, PaO2 at ICU admission, 24-h pH, massive transfusion, total fluid resuscitation, presence of pneumothorax, alveolar hemorrhage, pulmonary contusion, hemothorax, and flail chest injury. Conclusions: Our BRF model provides a robust, clinically applicable tool for early prediction of ARDS in polytrauma patients using readily available clinical parameters. The comprehensive two-step resampling approach, combined with ANOVA-based feature selection, successfully addressed class imbalance while maintaining high predictive accuracy. These findings support integrating ML approaches into critical care decision-making to improve patient outcomes and resource allocation. External validation in diverse populations remains essential for confirming generalizability and clinical implementation. Full article
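The cohort figures in this abstract can be checked with a few lines of arithmetic. The sketch below is not the authors' code: `synthetic_needed` is a hypothetical helper, and it assumes the 0.3 sampling rate means a target minority-to-majority ratio applied to the whole cohort. The study resamples after Tomek Links removal and inside cross-validation folds, so the real counts would differ.

```python
# Hypothetical helper (an assumption, not from the paper): number of synthetic
# minority samples needed to reach a target minority-to-majority ratio.
def synthetic_needed(n_minority, n_majority, ratio):
    target = int(ratio * n_majority)      # desired minority count
    return max(0, target - n_minority)    # never a negative number of samples

n_total, n_ards = 407, 43                 # cohort size and ARDS cases from the abstract
incidence = 100 * n_ards / n_total
print(round(incidence, 2))                # matches the reported 10.57% incidence

# Illustrative only: whole-cohort count under the assumed meaning of 0.3.
print(synthetic_needed(n_ards, n_total - n_ards, 0.3))
```

Under these assumptions, a ratio of 0.3 adds far fewer synthetic cases than full balancing would, which fits the abstract's description of the rate as conservative.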
(This article belongs to the Section Respiratory Medicine)

22 pages, 1178 KB  
Article
Identification of Potential Biomarkers in Prostate Cancer Microarray Gene Expression Leveraging Explainable Machine Learning Classifiers
by Ahmed Al Marouf, Jon George Rokne and Reda Alhajj
Cancers 2025, 17(23), 3853; https://doi.org/10.3390/cancers17233853 - 30 Nov 2025
Cited by 1 | Viewed by 511
Abstract
Background and Objective: Prostate cancer remains one of the most prevalent and potentially lethal malignancies among men worldwide. Timely and accurate diagnosis, along with the stratification of patients by disease severity, is critical for personalized treatment and improved outcomes. One of the tools used for diagnosis is bioinformatics. However, traditional biomarker discovery methods often lack transparency and interpretability, which makes it difficult for clinicians to trust biomarkers in a clinical setting. Methods: This paper introduces a novel approach that leverages Explainable Machine Learning (XML) techniques to identify and prioritize biomarkers associated with different levels of severity of prostate cancer. The proposed XML approach combines traditional machine learning (ML) algorithms with transparent models to clarify the importance of individual features in the bioinformatics analysis, allowing for more informed clinical decisions. The method implements several ML classifiers, namely Naive Bayes (NB), Random Forest (RF), Decision Tree (DT), Support Vector Machine (SVM), Logistic Regression (LR), and Bagging (Bg), followed by SHAP (SHapley Additive exPlanations) values for the XML pipeline. Missing values were handled by imputation, and SMOTE (Synthetic Minority Oversampling Technique) and the Tomek link method were applied to handle the class imbalance problem. Stratified k-fold validation was used to evaluate the ML models, and SHAP values were used for explainability. Results: This study utilized a novel tissue microarray data set with data from 102 individuals, comprising prostate cancer patients and healthy subjects. The proposed model satisfactorily identifies genes as biomarkers, with the highest accuracy obtained being 81.01% using RF.
The top 10 potential biomarkers identified in this study are DEGS1, HPN, ERG, CFD, TMPRSS2, PDLIM5, XBP1, AJAP1, NPM1 and C7. Conclusions: As XML continues to unravel the complexities within prostate cancer datasets, the identification of severity-specific biomarkers is poised at the forefront of precision oncology. This integration paves the way for targeted interventions, improving patient outcomes, and heralding a new era of individualized care in the fight against prostate cancer. Full article

27 pages, 2953 KB  
Article
A Machine Learning Approach to Valve Plate Failure Prediction in Piston Pumps Under Imbalanced Data Conditions: Comparison of Data Balancing Methods
by Marcin Rojek and Marcin Blachnik
Appl. Sci. 2025, 15(21), 11542; https://doi.org/10.3390/app152111542 - 29 Oct 2025
Viewed by 591
Abstract
This article focuses on the problem of building a real-world predictive maintenance system for hydraulic piston pumps. Particular attention is given to the issue of limited data availability regarding the failure state of systems with a damaged valve plate. The main objective of this work was to analyze the impact of imbalanced data on the quality of the failure prediction system. Several data balancing techniques, including oversampling, undersampling, and combined methods, were evaluated to overcome the limitations. The dataset used for evaluation includes recordings from eleven sensors, such as pressure, flow, and temperature, registered at various points in the hydraulic system. It also includes data from three additional vibration sensors. The experiments were conducted with imbalance ratios ranging from 0.5% to a fully balanced dataset. The results indicate that two methods, Borderline SMOTE and SMOTE+Tomek Links, dominate. These methods allowed the system to achieve the highest performance on a completely new dataset with different levels of damaged valve plates, for the balance rate larger than three percent. Furthermore, for balance rates below one percent, the use of data balancing methods may adversely affect the model. Finally, our results indicate the limitations of the use of cross-validation procedures when assessing data balancing methods. Full article

23 pages, 2612 KB  
Article
Leveraging Machine Learning for Severity Level-Wise Biomarker Identification in Prostate Cancer Microarray Gene Expression Data
by Ahmed Al Marouf, Tarek A. Bismar, Sunita Ghosh, Jon G. Rokne and Reda Alhajj
Biomedicines 2025, 13(10), 2350; https://doi.org/10.3390/biomedicines13102350 - 25 Sep 2025
Viewed by 662
Abstract
Background: Prostate cancer is the most commonly occurring cancer amongst men. The detection and treatment of this cancer are therefore of great importance. The severity level of this cancer, which is established as a score in the Gleason Grading Group (GGG), guides the treatment of the cancer. Methods: In this paper, traditional machine learning (ML) classification methods such as Decision Tree (DT), Random Forest (RF), Support Vector Machine (SVM), and XGBoost (XGB), which have recently been shown to accurately identify biomarkers in computational biology, are leveraged to find potential biomarkers for the different GGG scores. An ML framework that maps the GGG into five severity levels (low, intermediate-low, intermediate, intermediate-high, and high) has been developed using the above methods. The microarray data for this ML method were derived from immunohistochemical tests. The study includes severity level-wise biomarker identification, incorporating missing value imputation, class imbalance handling using the SMOTE-Tomek link method, and stratified k-fold validation to ensure robust biomarker selection. Results: The framework is evaluated on prostate cancer tissue microarray gene expression data from 1119 samples. A combination of high-aggressive and low-aggressive signatures is used in four experimental setups. The results demonstrate the effectiveness of the approach in distinguishing between critical biomarkers with highly accurate models, obtaining 96.85% accuracy using the XGBoost method. Conclusions: Leveraging ML offers a basis for involving domain experts, and the satisfactory results support this. In the future, a physician-in-the-loop approach can be tested to further assess diagnostic impact.
(This article belongs to the Section Cancer Biology and Oncology)

21 pages, 3919 KB  
Article
Comparative Analysis of Resampling Techniques for Class Imbalance in Financial Distress Prediction Using XGBoost
by Guodong Hou, Dong Ling Tong, Soung Yue Liew and Peng Yin Choo
Mathematics 2025, 13(13), 2186; https://doi.org/10.3390/math13132186 - 4 Jul 2025
Cited by 4 | Viewed by 3045
Abstract
One of the key challenges in financial distress data is class imbalance, where the data are characterized by a highly imbalanced ratio between the number of distressed and non-distressed samples. This study examines eight resampling techniques for improving distress prediction using the XGBoost algorithm. The study was performed on a dataset acquired from the CSMAR database, containing 26,383 firm-quarter samples from 639 Chinese A-share listed companies (2007–2024), with only 12.1% of the cases being distressed. Results show that standard Synthetic Minority Oversampling Technique (SMOTE) enhanced F1-score (up to 0.73) and Matthews Correlation Coefficient (MCC, up to 0.70), while SMOTE-Tomek and Borderline-SMOTE further boosted recall, slightly sacrificing precision. These oversampling and hybrid methods also maintained reasonable computational efficiency. However, Random Undersampling (RUS), though yielding high recall (0.85), suffered from low precision (0.46) and weaker generalization, but was the fastest method. Among all techniques, Bagging-SMOTE achieved balanced performance (AUC 0.96, F1 0.72, PR-AUC 0.80, MCC 0.68) using a minority-to-majority ratio of 0.15, demonstrating that ensemble-based resampling can improve robustness with minimal impact on the original class distribution, albeit with higher computational cost. The compared findings highlight that no single approach fits all use cases, and technique selection should align with specific goals. Techniques favoring recall (e.g., Bagging-SMOTE, SMOTE-Tomek) are suited for early warning, while conservative techniques (e.g., Tomek Links) help reduce false positives in risk-sensitive applications, and efficient methods such as RUS are preferable when computational speed is a priority. Full article
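The precision and recall trade-off described in this abstract can be made concrete. The sketch below uses the standard definitions of F1 and the Matthews Correlation Coefficient; the function names are illustrative, and the confusion-matrix counts in the usage note are hypothetical rather than taken from the study. Only the RUS precision (0.46) and recall (0.85) come from the abstract.

```python
from math import sqrt

def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def mcc(tp, fp, fn, tn):
    """Matthews Correlation Coefficient from confusion-matrix counts."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Random Undersampling figures reported in the abstract.
rus_f1 = f1_score(precision=0.46, recall=0.85)
print(round(rus_f1, 2))  # about 0.60, below the 0.73 reported for standard SMOTE
```

With hypothetical counts such as `mcc(tp=85, fp=100, fn=15, tn=800)`, recall is 0.85 and precision is about 0.46, which shows how many false positives a high-recall, low-precision resampler can carry.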

22 pages, 3438 KB  
Article
A High-Accuracy Advanced Persistent Threat Detection Model: Integrating Convolutional Neural Networks with Kepler-Optimized Bidirectional Gated Recurrent Units
by Guangwu Hu, Maoqi Sun and Chaoqin Zhang
Electronics 2025, 14(9), 1772; https://doi.org/10.3390/electronics14091772 - 27 Apr 2025
Cited by 3 | Viewed by 2066
Abstract
Advanced Persistent Threat (APT) refers to a highly targeted, sophisticated, and prolonged form of cyberattack, typically directed at specific organizations or individuals. The primary objective of such attacks is the theft of sensitive information or the disruption of critical operations. APT attacks are characterized by their stealth and complexity, often resulting in significant economic losses. Furthermore, these attacks may lead to intelligence breaches, operational interruptions, and even jeopardize national security and political stability. Given the covert nature and extended durations of APT attacks, current detection solutions encounter challenges such as high detection difficulty and insufficient accuracy. To address these limitations, this paper proposes an innovative high-accuracy APT attack detection model, CNN-KOA-BiGRU, which integrates Convolutional Neural Networks (CNN), Bidirectional Gated Recurrent Units (BiGRU), and the Kepler optimization algorithm (KOA). The model first utilizes CNN to extract spatial features from network traffic data, followed by the application of BiGRU to capture temporal dependencies and long-term memory, thereby forming comprehensive temporal features. Simultaneously, the Kepler optimization algorithm is employed to optimize the BiGRU network structure, achieving globally optimal feature weights and enhancing detection accuracy. Additionally, this study employs a combination of sampling techniques, including Synthetic Minority Over-sampling Technique (SMOTE) and Tomek links, to mitigate classification bias caused by dataset imbalance. Evaluation results on the CSE-CIC-IDS2018 experimental dataset demonstrate that the CNN-KOA-BiGRU model achieves superior performance in detecting APT attacks, with an average accuracy of 98.68%. This surpasses existing methods, including CNN (93.01%), CNN-BiGRU (97.77%), and Graph Convolutional Network (GCN) (95.96%) on the same dataset. 
Specifically, the proposed model demonstrates an accuracy improvement of 5.67% over CNN, 0.91% over CNN-BiGRU, and 2.72% over GCN. Overall, the proposed model achieves an average improvement of 3.1% compared to existing methods. Full article
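The improvements quoted in this abstract are differences in percentage points, and they can be reproduced directly from the reported accuracies. The short check below uses only those figures; the variable names are illustrative.

```python
# Accuracies (%) reported in the abstract on CSE-CIC-IDS2018.
proposed = 98.68
baselines = {"CNN": 93.01, "CNN-BiGRU": 97.77, "GCN": 95.96}

# Gain of the proposed model over each baseline, in percentage points.
gains = {name: round(proposed - acc, 2) for name, acc in baselines.items()}
mean_gain = round(sum(gains.values()) / len(gains), 2)

print(gains)      # gains of 5.67, 0.91 and 2.72 points
print(mean_gain)  # 3.1, the average improvement quoted in the abstract
```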
(This article belongs to the Special Issue Advanced Technologies in Edge Computing and Applications)

9 pages, 470 KB  
Proceeding Paper
Applying a Parameterized Quantum Circuit to Anomaly Detection
by Jehn-Ruey Jiang and Jyun-Sian Li
Eng. Proc. 2025, 92(1), 3; https://doi.org/10.3390/engproc2025092003 - 10 Apr 2025
Viewed by 1818
Abstract
In this study, a parameterized quantum circuit (PQC) is applied to anomaly detection, a crucial process for identifying unusual patterns or outliers in data. A PQC is a quantum circuit with trainable parameters linked to quantum gates, which are iteratively optimized by classical optimizers so that the circuit’s output fulfills its objectives. This is analogous to the use of trainable parameters, such as weights, in classical machine learning and neural network models. We used the amplitude-embedding mechanism to encode classical data into quantum states of qubits. These states are fed into the PQC, which contains strongly entangled layers, and the circuit is trained to determine whether an anomaly exists. As anomaly detection datasets are often imbalanced, resampling techniques, such as random oversampling, the synthetic minority oversampling technique (SMOTE), random undersampling, and Tomek-Link undersampling, are applied to reduce the imbalance. The proposed PQC and various resampling techniques were compared using the public Musk dataset for anomaly detection. Their combination was also compared with the combination of the classical autoencoder and the classical isolation forest model in terms of the F1 score. By analyzing the comparison results, the advantages and disadvantages of PQC for future research studies were determined.
(This article belongs to the Proceedings of 2024 IEEE 6th Eurasia Conference on IoT, Communication and Engineering)

23 pages, 2539 KB  
Article
Ensemble Learning for Network Intrusion Detection Based on Correlation and Embedded Feature Selection Techniques
by Ghalia Nassreddine, Mohamad Nassereddine and Obada Al-Khatib
Computers 2025, 14(3), 82; https://doi.org/10.3390/computers14030082 - 25 Feb 2025
Cited by 16 | Viewed by 6553
Abstract
Recent advancements across various sectors have resulted in a significant increase in the use of smart gadgets. This growth has expanded the network and the number of devices linked to it. However, network growth has also brought a rise in policy infractions affecting information security. Detecting intruders promptly is a critical component of maintaining network security. An intrusion detection system is useful for network security because it can quickly identify threats and raise alarms. This paper proposes a new approach to network intrusion detection. The method combines the outputs of machine learning models such as random forest, decision tree, k-nearest neighbors, and XGBoost, using logistic regression as a meta-model. For feature selection, the approach combines correlation-based feature selection with an embedded technique based on XGBoost. To handle the challenge of an imbalanced dataset, a SMOTE-TOMEK technique is used. The suggested algorithm is tested on the NSL-KDD and CIC-IDS datasets. It achieves an accuracy of 99.99% on both datasets. These results indicate the effectiveness of the proposed approach.
(This article belongs to the Special Issue Using New Technologies in Cyber Security Solutions (2nd Edition))

17 pages, 3748 KB  
Article
Kick Risk Diagnosis Method Based on Ensemble Learning Models
by Liwei Wu, Detao Zhou, Gensheng Li, Ning Gong, Xianzhi Song, Qilong Zhang, Zhi Yan, Tao Pan and Ziyue Zhang
Processes 2024, 12(12), 2704; https://doi.org/10.3390/pr12122704 - 30 Nov 2024
Cited by 4 | Viewed by 1068
Abstract
As oil and gas exploration and development gradually advance into deeper and offshore fields, the geological environment and formation pressure conditions become increasingly complex, leading to a higher risk of drilling incidents such as kicks. Timely diagnosis of kick risk is crucial for ensuring safety and efficiency. This study proposes a kick risk diagnosis method based on ensemble learning models, which integrates various time-series analysis algorithms to construct and optimize multiple kick diagnosis models, accurately fitting the relationship between integrated logging parameters and kick events. By incorporating high-performance ensemble models such as Stacking and Bagging, the accuracy and F1 score of the models were significantly improved. Furthermore, the application of the Synthetic Minority Over-sampling Technique and Tomek Links (SMOTE-Tomek) data balancing technique effectively addressed the issue of data imbalance, contributing to a more robust and balanced model performance. The results demonstrate that integrating time-series analysis with ensemble learning methods significantly enhances the predictive reliability and stability of kick monitoring models. This approach provides a dependable solution for addressing complex kick monitoring tasks in offshore and deepwater drilling operations, ensuring greater safety and efficiency. The findings offer valuable insights that can guide future research and practical implementation in kick risk diagnosis. Full article
(This article belongs to the Special Issue Modeling, Control, and Optimization of Drilling Techniques)

33 pages, 5826 KB  
Article
Improving Churn Detection in the Banking Sector: A Machine Learning Approach with Probability Calibration Techniques
by Alin-Gabriel Văduva, Simona-Vasilica Oprea, Andreea-Mihaela Niculae, Adela Bâra and Anca-Ioana Andreescu
Electronics 2024, 13(22), 4527; https://doi.org/10.3390/electronics13224527 - 18 Nov 2024
Cited by 11 | Viewed by 8981
Abstract
Identifying and reducing customer churn have become a priority for financial institutions seeking to retain clients. Our research focuses on customer churn rate analysis using advanced machine learning (ML) techniques, leveraging a synthetic dataset sourced from the Kaggle platform. The dataset undergoes a preprocessing phase to select variables directly impacting customer churn behavior. SMOTETomek, a hybrid technique that combines oversampling of the minority class (churn) with SMOTE and the removal of noisy or borderline instances through Tomek links, is applied to balance the dataset and improve class separability. Two cutting-edge ML models are applied—random forest (RF) and the Light Gradient-Boosting Machine (LGBM) Classifier. To evaluate the effectiveness of these models, several key performance metrics are utilized, including precision, sensitivity, F1 score, accuracy, and Brier score, which helps assess the calibration of the predicted probabilities. A particular contribution of our research is on calibrating classification probabilities, as many ML models tend to produce uncalibrated probabilities due to the complexity of their internal mechanisms. Probability calibration techniques are employed to adjust the predicted probabilities, enhancing their reliability and interpretability. Furthermore, the Shapley Additive Explanations (SHAP) method, an explainable artificial intelligence (XAI) technique, is further implemented to increase the transparency and credibility of the model’s decision-making process. SHAP provides insights into the importance of individual features in predicting churn, providing knowledge to banking institutions for the development of personalized customer retention strategies. Full article
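The hybrid resampling described in this abstract (SMOTE oversampling followed by Tomek-link removal) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' pipeline or any library's implementation: all function names are hypothetical, distances are Euclidean, the minority class needs at least two points, and both members of each Tomek link are dropped (some variants drop only the majority member).

```python
import random
from math import dist

def smote(X_min, n_new, k=2, seed=0):
    """Create n_new synthetic points, each interpolated between a minority
    point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(X_min))
        neighbours = sorted((j for j in range(len(X_min)) if j != i),
                            key=lambda j: dist(X_min[i], X_min[j]))[:k]
        j = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from X_min[i] to X_min[j]
        synthetic.append(tuple(a + gap * (b - a)
                               for a, b in zip(X_min[i], X_min[j])))
    return synthetic

def tomek_links(X, y):
    """Pairs (i, j) that are mutual nearest neighbours with different labels."""
    n = len(X)
    nn = [min((j for j in range(n) if j != i), key=lambda j: dist(X[i], X[j]))
          for i in range(n)]
    return [(i, j) for i, j in enumerate(nn)
            if i < j and nn[j] == i and y[i] != y[j]]

def smote_tomek(X, y, minority=1, seed=0):
    """Oversample the minority class to parity, then drop Tomek-link members."""
    X_min = [x for x, label in zip(X, y) if label == minority]
    n_new = max(0, (len(X) - len(X_min)) - len(X_min))
    X_res = list(X) + smote(X_min, n_new, seed=seed)
    y_res = list(y) + [minority] * n_new
    drop = {i for pair in tomek_links(X_res, y_res) for i in pair}
    keep = [i for i in range(len(X_res)) if i not in drop]
    return [X_res[i] for i in keep], [y_res[i] for i in keep]
```

For example, with one-dimensional points `[(0.0,), (1.0,), (1.1,), (3.0,)]` labelled `[0, 0, 1, 1]`, the only Tomek link is the pair at 1.0 and 1.1, the two opposite-class points that are each other's nearest neighbour. In practice, resampling of this kind is usually fitted on training folds only, so that synthetic points do not leak into evaluation data.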
(This article belongs to the Special Issue Applied Machine Learning in Intelligent Systems)

32 pages, 5045 KB  
Article
Ensemble-Based Machine Learning Algorithm for Loan Default Risk Prediction
by Abisola Akinjole, Olamilekan Shobayo, Jumoke Popoola, Obinna Okoyeigbo and Bayode Ogunleye
Mathematics 2024, 12(21), 3423; https://doi.org/10.3390/math12213423 - 31 Oct 2024
Cited by 7 | Viewed by 11852
Abstract
Predicting credit default risk is important to financial institutions, as accurately predicting the likelihood of a borrower defaulting on their loans will help to reduce financial losses, thereby maintaining profitability and stability. Although machine learning models have been used in assessing large applications with complex attributes for these predictions, there is still a need to identify the most effective techniques for the model development process, including the technique to address the issue of data imbalance. In this research, we conducted a comparative analysis of random forest, decision tree, SVMs (Support Vector Machines), XGBoost (Extreme Gradient Boosting), ADABoost (Adaptive Boosting) and the multi-layered perceptron, to predict credit defaults using loan data from LendingClub. Additionally, XGBoost was used as a framework for testing and evaluating various techniques. Moreover, we applied this XGBoost framework to handle the issue of class imbalance observed, by testing various resampling methods such as Random Over-Sampling (ROS), the Synthetic Minority Over-Sampling Technique (SMOTE), Adaptive Synthetic Sampling (ADASYN), Random Under-Sampling (RUS), and hybrid approaches like the SMOTE with Tomek Links and the SMOTE with Edited Nearest Neighbours (SMOTE + ENNs). The results showed that balanced datasets significantly outperformed the imbalanced dataset, with the SMOTE + ENNs delivering the best overall performance, achieving an accuracy of 90.49%, a precision of 94.61% and a recall of 92.02%. Furthermore, ensemble methods such as voting and stacking were employed to enhance performance further. Our proposed model achieved an accuracy of 93.7%, a precision of 95.6% and a recall of 95.5%, which shows the potential of ensemble methods in improving credit default predictions and can provide lending platforms with the tool to reduce default rates and financial losses. 
In conclusion, the findings from this study have broader implications for financial institutions, offering a robust approach to risk assessment beyond the LendingClub dataset.
(This article belongs to the Special Issue Data-Driven Approaches in Revenue Management and Pricing Analytics)
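
As a rough illustration of the two simplest resampling methods named in the abstract above, Random Over-Sampling (ROS) and Random Under-Sampling (RUS), the following is a minimal sketch using only the Python standard library. It is not the authors' code; the function names `random_over_sample` and `random_under_sample`, the `seed` parameter, and the toy data are assumptions made for illustration.

```python
import random
from collections import Counter


def random_over_sample(X, y, seed=0):
    """ROS sketch: duplicate random rows of smaller classes until
    every class has as many rows as the largest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        rows = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - n):
            X_out.append(rng.choice(rows))  # exact copy of an existing row
            y_out.append(label)
    return X_out, y_out


def random_under_sample(X, y, seed=0):
    """RUS sketch: keep a random subset of each class so that every
    class has as many rows as the smallest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = min(counts.values())
    X_out, y_out = [], []
    for label in counts:
        rows = [x for x, lab in zip(X, y) if lab == label]
        for x in rng.sample(rows, target):  # sampling without replacement
            X_out.append(x)
            y_out.append(label)
    return X_out, y_out


# Hypothetical toy data: 8 rows of class 0 and 2 rows of class 1.
X = [(float(i),) for i in range(10)]
y = [0] * 8 + [1] * 2
X_ros, y_ros = random_over_sample(X, y)
X_rus, y_rus = random_under_sample(X, y)
```

ROS keeps all original rows but repeats minority rows verbatim, while RUS discards majority rows. In practice, resampling is normally applied to the training split only, so that evaluation data keeps its original class distribution.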

24 pages, 2452 KB  
Article
Cancer Diagnosis by Gene-Environment Interactions via Combination of SMOTE-Tomek and Overlapped Group Screening Approaches with Application to Imbalanced TCGA Clinical and Genomic Data
by Jie-Huei Wang, Cheng-Yu Liu, You-Ruei Min, Zih-Han Wu and Po-Lin Hou
Mathematics 2024, 12(14), 2209; https://doi.org/10.3390/math12142209 - 15 Jul 2024
Cited by 8 | Viewed by 2771
Abstract
The complexity of cancer development involves intricate interactions among multiple biomarkers, such as gene-environment interactions. Utilizing microarray gene expression profile data for cancer classification is anticipated to be effective, thus drawing considerable interest in the fields of bioinformatics and computational biology. Due to the characteristics of genomic data, problems of high-dimensional interactions and noise interference arise during the analysis process. When building cancer diagnosis models, we often face the dilemma of model adaptation errors due to an imbalance of data types. To mitigate these issues, we apply the SMOTE-Tomek procedure to rectify the imbalance problem. Following this, we utilize the overlapping group screening method alongside a binary logistic regression model to integrate gene pathway information, facilitating the identification of significant biomarkers associated with clinically imbalanced cancer or normal outcomes. Simulation studies across different imbalance rates and gene structures validate our proposed method's effectiveness, which surpasses common machine learning techniques in terms of classification prediction accuracy. We also demonstrate that prediction performance improves with SMOTE-Tomek treatment compared to no imbalance treatment and SMOTE treatment across various imbalance rates. In the real-world application, we integrate clinical and gene expression data with prior pathway information. We employ SMOTE-Tomek and our proposed methods to identify critical biomarkers and gene-environment interactions linked to the imbalanced binary outcomes (cancer or normal) in patients from the Cancer Genome Atlas datasets of lung adenocarcinoma and breast invasive carcinoma. Our proposed method consistently achieves satisfactory classification accuracy. Additionally, we have identified biomarkers indicative of gene-environment interactions relevant to cancer and have provided corresponding estimates of odds ratios.
Moreover, to achieve good prediction results on high-dimensional imbalanced data, we recommend considering the order of balancing processing and feature screening.
(This article belongs to the Special Issue Statistical Analysis and Data Science for Complex Data)
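
The SMOTE-Tomek procedure referred to in the abstract above combines two steps: SMOTE creates synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours, and Tomek-link cleaning then removes pairs of opposite-class points that are each other's nearest neighbour. The following is a minimal standard-library sketch of that idea, not the authors' implementation. All function names, the parameters `n_new`, `k` and `seed`, and the toy data are assumptions for illustration. This sketch drops both members of each Tomek link; some variants remove only the majority-class member.

```python
import math
import random


def nearest(i, X):
    """Index of the point closest to X[i], excluding i itself."""
    return min((j for j in range(len(X)) if j != i),
               key=lambda j: math.dist(X[i], X[j]))


def smote(X, y, minority, n_new, k=3, seed=0):
    """Append n_new synthetic minority points, each interpolated between a
    random minority point and one of its k nearest minority neighbours.
    Assumes the minority class has at least two points."""
    rng = random.Random(seed)
    idx = [i for i, lab in enumerate(y) if lab == minority]
    X_out, y_out = list(X), list(y)
    for _ in range(n_new):
        i = rng.choice(idx)
        neighbours = sorted((j for j in idx if j != i),
                            key=lambda j: math.dist(X[i], X[j]))[:k]
        j = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from X[i] to X[j]
        X_out.append(tuple(a + gap * (b - a) for a, b in zip(X[i], X[j])))
        y_out.append(minority)
    return X_out, y_out


def tomek_links(X, y):
    """Pairs (i, j), i < j, of mutual nearest neighbours with different labels."""
    links = []
    for i in range(len(X)):
        j = nearest(i, X)
        if i < j and y[i] != y[j] and nearest(j, X) == i:
            links.append((i, j))
    return links


def smote_tomek(X, y, minority, n_new, k=3, seed=0):
    """Oversample with SMOTE, then drop both points of every Tomek link."""
    X_s, y_s = smote(X, y, minority, n_new, k, seed)
    drop = {i for pair in tomek_links(X_s, y_s) for i in pair}
    keep = [i for i in range(len(X_s)) if i not in drop]
    return [X_s[i] for i in keep], [y_s[i] for i in keep]


# Hypothetical toy data: six majority points (0) and three minority points (1).
X = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 3.0),
     (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
y = [0, 0, 0, 0, 0, 0, 1, 1, 1]
X_bal, y_bal = smote_tomek(X, y, minority=1, n_new=3)
```

The brute-force neighbour search is quadratic in the number of samples, which is fine for a demonstration but not for high-dimensional genomic data, where a library implementation with an indexed nearest-neighbour search would be the usual choice.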

25 pages, 752 KB  
Article
A Machine Learning-Based Framework with Enhanced Feature Selection and Resampling for Improved Intrusion Detection
by Fazila Malik, Qazi Waqas Khan, Atif Rizwan, Rana Alnashwan and Ghada Atteia
Mathematics 2024, 12(12), 1799; https://doi.org/10.3390/math12121799 - 9 Jun 2024
Cited by 6 | Viewed by 2721
Abstract
Intrusion Detection Systems (IDSs) play a crucial role in safeguarding network infrastructures from cyber threats and ensuring the integrity of highly sensitive data. Conventional IDS technologies, although successful in achieving high levels of accuracy, frequently encounter substantial model bias. This bias is primarily caused by imbalances in the data and the lack of relevance of certain features. This study aims to tackle these challenges by proposing an advanced machine learning (ML) based IDS that minimizes misclassification errors and corrects model bias. As a result, the predictive accuracy and generalizability of the IDS are significantly improved. The proposed system employs advanced feature selection techniques, such as Recursive Feature Elimination (RFE), sequential feature selection (SFS), and statistical feature selection, to refine the input feature set and minimize the impact of non-predictive attributes. In addition, this work incorporates data resampling methods such as Synthetic Minority Oversampling Technique and Edited Nearest Neighbor (SMOTE_ENN), Adaptive Synthetic Sampling (ADASYN), and Synthetic Minority Oversampling Technique–Tomek Links (SMOTE_Tomek) to address class imbalance and improve the accuracy of the model. The experimental results indicate that our proposed model, especially when utilizing the random forest (RF) algorithm, surpasses existing models regarding accuracy, precision, recall, and F Score across different data resampling methods. Using the ADASYN resampling method, the RF model achieves an accuracy of 99.9985% for botnet attacks and 99.9777% for Man-in-the-Middle (MITM) attacks, demonstrating the effectiveness of our approach in dealing with imbalanced data distributions. This research not only improves the abilities of IDS to identify botnet and MITM attacks but also provides a scalable and efficient solution that can be used in other areas where data imbalance is a recurring problem. 
This work has implications beyond IDS, offering valuable insights into using ML techniques in complex real-world scenarios.
(This article belongs to the Special Issue Artificial Intelligence and Data Science)
