Article

SMOTE-ENC: A Novel SMOTE-Based Method to Generate Synthetic Data for Nominal and Continuous Features

School of Computer Science, University of Sydney, Sydney, NSW 2006, Australia
* Author to whom correspondence should be addressed.
Appl. Syst. Innov. 2021, 4(1), 18; https://doi.org/10.3390/asi4010018
Received: 25 December 2020 / Revised: 17 February 2021 / Accepted: 18 February 2021 / Published: 2 March 2021
(This article belongs to the Collection Feature Paper Collection in Applied System Innovation)
Real-world datasets are often heavily skewed, with some classes significantly outnumbered by others. In these situations, machine learning algorithms fail to predict the underrepresented instances effectively. To address this problem, many variations of the synthetic minority oversampling technique (SMOTE) have been proposed to balance datasets with continuous features. However, for datasets with both nominal and continuous features, SMOTE-NC is the only SMOTE-based oversampling technique available. In this paper, we present a novel minority oversampling method, SMOTE-ENC (SMOTE-Encoded Nominal and Continuous), in which nominal features are encoded as numeric values, and the difference between two such values reflects the change in association with the minority class. Our experiments show that classification models using SMOTE-ENC offer better prediction than models using SMOTE-NC when the dataset has a substantial number of nominal features and when there is some association between the categorical features and the target class. Additionally, our proposed method addresses one of the major limitations of the SMOTE-NC algorithm: SMOTE-NC can be applied only to mixed datasets that contain both continuous and nominal features, and cannot function if all the features are nominal. Our method has been generalized to apply to both mixed and nominal-only datasets.
Keywords: SMOTE; nominal feature; continuous feature; class imbalance; precision; recall; area under receiver operating characteristic curve (ROC-AUC); area under precision-recall curve (PR-AUC)
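
The encoding idea described in the abstract can be illustrated with a minimal sketch. The abstract does not state the formula, so the score used here, (observed - expected) / expected, where "expected" is the number of minority instances a label would have if it were unrelated to the class, is an assumption chosen for illustration. The function name `encode_nominal` and the toy data are likewise hypothetical, and this is not presented as the authors' exact method.

```python
from collections import Counter


def encode_nominal(labels, is_minority):
    """Map each nominal label to a numeric score of its association
    with the minority class (assumed form: (observed - expected) / expected).
    """
    total = len(labels)
    minority_total = sum(is_minority)
    if total == 0 or minority_total == 0:
        raise ValueError("need at least one minority instance")

    # Share of minority instances in the whole dataset.
    imbalance_ratio = minority_total / total

    label_counts = Counter(labels)
    minority_counts = Counter(
        label for label, minority in zip(labels, is_minority) if minority
    )

    encoding = {}
    for label, count in label_counts.items():
        # Minority count expected if the label carried no class information.
        expected = count * imbalance_ratio
        observed = minority_counts.get(label, 0)
        encoding[label] = (observed - expected) / expected
    return encoding


# Toy data: label "b" co-occurs with the minority class more often than "a".
labels = ["a", "a", "a", "a", "b", "b", "b", "b"]
is_minority = [True, False, False, False, True, True, True, False]
encoding = encode_nominal(labels, is_minority)
print(encoding)
```

Under this assumed score, a label that appears in the minority class more often than chance gets a positive value and one that appears less often gets a negative value, so the distance between two encoded labels reflects how differently they associate with the minority class.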
MDPI and ACS Style

Mukherjee, M.; Khushi, M. SMOTE-ENC: A Novel SMOTE-Based Method to Generate Synthetic Data for Nominal and Continuous Features. Appl. Syst. Innov. 2021, 4, 18. https://doi.org/10.3390/asi4010018

AMA Style

Mukherjee M, Khushi M. SMOTE-ENC: A Novel SMOTE-Based Method to Generate Synthetic Data for Nominal and Continuous Features. Applied System Innovation. 2021; 4(1):18. https://doi.org/10.3390/asi4010018

Chicago/Turabian Style

Mukherjee, Mimi, and Matloob Khushi. 2021. "SMOTE-ENC: A Novel SMOTE-Based Method to Generate Synthetic Data for Nominal and Continuous Features" Applied System Innovation 4, no. 1: 18. https://doi.org/10.3390/asi4010018

