1. Introduction
With its rapid recent growth, machine learning has been applied in many domains, such as economics [1,2,3], medical diagnosis [4,5,6,7,8,9,10] and biomedical applications [11,12]. In medicine, many intelligent systems have been proposed to improve medical services. For disease diagnosis, Malmir, Amini & Chang [13] proposed a decision support system for medical services with uncertain input values. The main purpose of this system is to help physicians make faster and more accurate diagnoses of several specific diseases. Physicians only enter the values of the associated symptoms and wait for the system’s suggested conclusion; their task is then to check the accuracy of that suggestion and make the diagnosis. Next, Eshtay, Faris & Obeid [14] proposed an improved extreme learning machine based on competitive swarm optimization to classify a medical dataset with 15 medical classes. The experiments in that study demonstrate that their model achieves better generalization performance with fewer hidden neurons and higher stability. Moreover, to support hospital services, Turgeman, May & Sciulli [15] proposed a machine learning model based on the Cubist tree to predict the hospital length of stay (LOS) of patients at the time of admission. The model was constructed and validated on a de-identified administrative dataset collected from Veterans Health Administration (VHA) hospitals. This dataset consists of 20,321 inpatient admissions of 4,840 Congestive Heart Failure (CHF) patients. As another application, Liu, Chen & Tzeng [16] proposed a model to identify the main factors in consumers’ adoption behavior toward intelligent medical terminals in order to improve the quality of such systems. Detecting the key influential factors of consumer adoption behavior, and their interrelationships, is useful for improving and promoting intelligent medical terminals toward a set aspiration level in each dimension and criterion. Later, Mustaqeem et al. [17] introduced a dataset of heart disease patients from a local hospital and proposed a model for heart disease diagnosis. That system promises to be a useful contribution to the field of e-health and medical informatics. Next, Lucini et al. [18] proposed a text mining approach to predict future hospitalizations and discharges based on early emergency department (ED) patient records using the SOAP framework. The average F1-score of Nu-SVC reported in that study reached 77.70% with a standard deviation of 0.66%.
Disabilities, including physical and motor disabilities, are disorders that restrict an individual’s activities [19]. In practice, disability diagnosis is a complex process that requires expert occupational therapists. Therefore, the International Classification of Functioning, Disability and Health for Children and Youth (ICF-CY) was proposed to record the characteristics of children. Based on this framework, machine learning algorithms can learn to predict disability. ICF-CY is the most widely used framework for disability diagnosis [20,21,22]; it is a multipurpose conceptual classification framework and a branch of the International Classification of Functioning, Disability and Health (ICF) [23]. ICF and ICF-CY are frequently used as conceptual frameworks in disability evaluation and assessment, as well as in classification with machine learning. The ICF-CY framework considers a number of self-care activities, including drinking, eating, caring for body parts and washing oneself. Self-care activities are defined as a subset of the activities and participation component of ICF-CY and usually develop from birth to 7 years of age [23]. For instance, children with cerebral palsy usually have problems with self-care skills due to their limitations in thinking [24]. Detecting the self-care problems of children is an important task that supports the choice of treatment approach. Given the diversity and complexity of self-care problems and the shortage of professional therapists, an expert system for self-care problem classification can help therapists choose the best treatment approach for specific cases.
In practical datasets, samples are often unevenly distributed between classes; this is called the class imbalance problem (CIP). It makes prediction harder and reduces the performance of traditional classifiers. There are many approaches for dealing with the class imbalance problem, of which sampling techniques are the most advanced. Le et al. [2] utilized several oversampling techniques to improve the performance of bankruptcy prediction. Ijaz et al. [25] used DBSCAN-based outlier detection, the Synthetic Minority Over-sampling Technique (SMOTE), and Random Forest for type 2 diabetes and hypertension identification. Recently, Bang et al. [26] utilized an adaptive data-boosting technique for speech emotion detection in emotionally imbalanced small-sample environments.
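To make the class imbalance problem concrete, the following minimal Python sketch computes the imbalance ratio (majority count divided by minority count) of a label list and shows why plain accuracy can mislead: a trivial classifier that always predicts the majority class already scores the majority share. The toy labels and the helper names `imbalance_ratio` and `majority_baseline_accuracy` are hypothetical illustrations and do not reflect the SCADI class distribution.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Return the majority-class count divided by the minority-class count."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def majority_baseline_accuracy(labels):
    """Accuracy of a trivial classifier that always predicts the majority class."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


# Hypothetical toy labels (NOT the SCADI distribution): 90 negatives, 10 positives.
toy_labels = ["neg"] * 90 + ["pos"] * 10

print(imbalance_ratio(toy_labels))             # 9.0
print(majority_baseline_accuracy(toy_labels))  # 0.9
```

In this toy case, a model can reach 90% accuracy without ever identifying a minority sample, which is why imbalance handling and per-class metrics matter.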
SCADI (Self-Care Activities Dataset based on ICF-CY), the first standard dataset of self-care activities of children with physical and motor disabilities, was introduced by Zarchi, Bushehri & Dehghanizadeh [27] in 2018. In this dataset, the authors collected the self-care activities of 70 children in seven classes. Each child was carefully assessed on 29 self-care activities based on ICF-CY. In the same study, the authors also proposed an ANN-based system for classifying the self-care problems of children with physical and motor disabilities. However, this dataset has three issues that reduce the performance of the system. Therefore, this study proposes a robust framework that deals with these issues to obtain good prediction performance. Briefly, the main contributions of this study are as follows. (1) A preprocessing module is proposed to convert SCADI, with 205 features, into SCADI_2, with only 31 features. (2) An FSX framework is proposed that uses an oversampling technique to deal with the class imbalance problem in identifying the self-care problems of children with physical and motor disabilities. (3) A feature importance analysis is conducted to identify the important features. The proposed framework achieves an accuracy of 85.4%, which is expected to improve diagnosis as well as counseling and treatment.
The rest of this paper is structured as follows. Section 2 reviews related work on the class imbalance problem. Section 3 presents the materials and methods of this study: it summarizes the SCADI dataset and then proposes a robust framework with three modules, namely the preprocessing, oversampling and classification modules. The experiments are presented in Section 4 to show the effectiveness of the developed framework for self-care problem classification. Finally, Section 5 concludes this study.
2. Related Works
Because the SCADI dataset suffers from the class imbalance problem, the performance of a classifier trained on it is reduced. Therefore, a specific technique should be used to handle this problem. Many methods have been proposed to deal with the class imbalance problem; they are grouped into the following four categories [28]. (1) Algorithm-level approaches modify existing classifier learning algorithms to bias learning toward the minority class [29,30] without adjusting the training data. (2) Data-level approaches change the class distribution by resampling the data space [31,32] to improve predictive performance. They comprise three strategies: under-sampling, oversampling and hybrid techniques. In this way, modification of the learning algorithm is avoided, since the effect of the imbalance is reduced by the resampling step. (3) The cost-sensitive learning framework falls between the data-level and algorithm-level approaches. It combines data-level transformations (adding costs to instances) with algorithm-level modifications (modifying the learning process to accept costs) [33,34]. The classifier is biased toward the minority class by assuming higher misclassification costs for this class and seeking to minimize the total cost of errors over both classes. (4) Ensemble-based methods usually combine an ensemble learning algorithm with one of the techniques above, specifically the data-level and cost-sensitive ones [35]. When a data-level approach is combined with an ensemble learning algorithm, the resulting hybrid method usually preprocesses the data before training each classifier, whereas cost-sensitive ensembles, instead of modifying the base classifier to accept costs in the learning process, guide the cost minimization via the ensemble learning algorithm. The choice among these four categories depends on the dataset.
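The cost-sensitive idea in category (3) can be illustrated with a small sketch: given class probabilities from any classifier and a misclassification cost matrix, the prediction is the class with the lowest expected cost. The cost values, the class names and the function `min_expected_cost_class` are assumptions chosen for illustration; they are not taken from the cited works.

```python
def min_expected_cost_class(probs, cost):
    """Pick the class with the lowest expected misclassification cost.

    probs: dict mapping true class -> estimated probability.
    cost:  dict mapping (true class, predicted class) -> cost of that decision.
    """
    classes = list(probs)

    def expected_cost(predicted):
        # Average the cost of predicting `predicted` over all possible true classes.
        return sum(probs[true] * cost[(true, predicted)] for true in classes)

    return min(classes, key=expected_cost)


# Hypothetical costs: missing a minority ("pos") case is 10x worse than a false alarm.
cost = {
    ("neg", "neg"): 0.0, ("neg", "pos"): 1.0,
    ("pos", "neg"): 10.0, ("pos", "pos"): 0.0,
}
probs = {"neg": 0.8, "pos": 0.2}

# Expected cost of predicting "neg": 0.2 * 10 = 2.0; of predicting "pos": 0.8 * 1 = 0.8.
print(min_expected_cost_class(probs, cost))  # pos
```

With equal costs for both error types, the same rule reduces to choosing the most probable class, so the bias toward the minority class comes entirely from the cost matrix.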
Among the above approaches for handling the class imbalance problem, data-level approaches are quite popular. Well-known data-level approaches include SMOTE [31], SMOTE-ENN [36] and SMOTE-Tomek [36]. SMOTE generates synthetic minority-class samples by interpolating between a minority sample and one of its randomly chosen nearest minority-class neighbors. SMOTE-ENN and SMOTE-Tomek combine oversampling with undersampling: they use SMOTE to generate synthetic samples and then apply ENN (edited nearest neighbors) or Tomek links to remove noisy samples from the balanced dataset.
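The interpolation step behind SMOTE can be sketched in a few lines of standard-library Python. This is a simplified illustration under stated assumptions, not the reference implementation of [31]: the function name `smote_sketch`, its parameters, and the toy points are hypothetical, and the neighbor search is a brute-force sort rather than an optimized index.

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (Euclidean distance)."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot use more neighbours than other samples
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Sort the remaining minority samples by distance to the base sample.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Hypothetical 2-D minority samples, for illustration only.
minority = [(1.0, 1.0), (1.5, 1.2), (0.8, 1.4), (1.2, 0.9)]
new_points = smote_sketch(minority, n_new=6, k=2)
print(len(new_points))  # 6
```

Each synthetic point lies on the line segment between two existing minority samples. In SMOTE-ENN or SMOTE-Tomek, a cleaning step would then be applied to the combined dataset.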
5. Conclusions
This study proposed a robust framework, namely FSX, to improve the prediction performance on SCADI, a standard dataset for self-care problem classification in medical diagnosis. The framework consists of three modules: a preprocessing module, an oversampling module using SMOTE and a classification module using the extreme gradient boosting model. The preprocessing module converts the original dataset into a modified dataset with only 31 features. The oversampling module balances the training dataset to improve prediction performance. The classification module uses a popular classification model, namely XGBoost, to achieve the best performance. The first experiment compared several well-known oversampling techniques, including SMOTE, SMOTE-ENN and SMOTE-Tomek, with various balancing ratios to find the best oversampling method and the optimal balancing ratio for the oversampling module. The experimental results show that the proposed framework outperforms the state-of-the-art methods on the SCADI dataset. FSX reaches an accuracy of 85.4% for self-care problem classification in medical diagnosis.
In future work, we will first develop feature selection approaches to improve the performance of our framework. Second, we will study further sampling techniques, including over-sampling and under-sampling techniques, to enhance the performance as well.