Equally partitioned data are essential for prediction. However, in some important cases, the data distribution is severely unbalanced. In this study, several algorithms are utilized to maximize the learning accuracy when dealing with a highly unbalanced dataset. A linguistic algorithm is applied to evaluate the input and output relationship, namely Fuzzy c-Means (FCM), which is applied as a clustering algorithm for the majority class to balance the minority class data from about 3 million cases. Each cluster is used to train several artificial neural network (ANN) models. Different techniques are applied to generate an ensemble genetic fuzzy neuro model (EGFNM) in order to select the models. The first ensemble technique, the intra-cluster EGFNM, works by evaluating the best combination from all the models generated by each cluster. Another ensemble technique is the inter-cluster model EGFNM, which is based on selecting the best model from each cluster. The accuracy of these techniques is evaluated using the receiver operating characteristic (ROC) via its area under the curve (AUC
). Results show that the AUC
of the unbalanced data is 0.67974. The random cluster and best ANN single model have AUCs
of 0.7177 and 0.72806, respectively. For the ensemble evaluations, the intra-cluster and the inter-cluster EGFNMs produce 0.7293 and 0.73038, respectively. In conclusion, this study achieved improved results by performing the EGFNM method compared with the unbalanced training. This study concludes that selecting several best models will produce a better result compared with all models combined.
This is an open access article distributed under the Creative Commons Attribution License
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited