A Robust Framework for Self-Care Problem Identification for Children with Disability

Le, Tuong; Baik, Sung Wook

doi:10.3390/sym11010089

by Tuong Le and Sung Wook Baik *

Digital Contents Research Institute, Sejong University, Seoul 05006, Korea

* Author to whom correspondence should be addressed.

Symmetry 2019, 11(1), 89; https://doi.org/10.3390/sym11010089

Submission received: 29 November 2018 / Revised: 7 January 2019 / Accepted: 11 January 2019 / Published: 15 January 2019

Abstract:

Recently, SCADI (Self-Care Activities Dataset), a standard dataset based on the International Classification of Functioning, Disability, and Health for Children and Youth (ICF-CY) framework, was introduced for identifying self-care problems of children with physical and motor disabilities. The topic is important and challenging because of its usefulness in medical diagnosis. This study proposes a robust framework using a sampling technique and extreme gradient boosting (FSX) to improve prediction performance on the SCADI dataset. The framework first converts the original dataset to a new dataset with fewer dimensions. It then balances the new dataset using oversampling techniques with different ratios. Finally, extreme gradient boosting is used to diagnose the problems. Experiments on prediction performance and feature importance were conducted to show the effectiveness of FSX and to analyse the results. The experimental results show that FSX with the Synthetic Minority Over-sampling Technique (SMOTE) in the oversampling module outperforms the Artificial Neural Network (ANN)-based approach, Support Vector Machine (SVM) and Random Forest on the SCADI dataset. The overall accuracy of the proposed framework reaches 85.4%, which makes it a promising tool for self-care problem classification in medical diagnosis.

Keywords:

self-care problem classification; class imbalance problem; sampling technique; extreme gradient boosting

1. Introduction

Machine learning has grown strongly in recent years and has been applied in many domains such as economics [1,2,3], medical diagnosis [4,5,6,7,8,9,10] and biomedical applications [11,12]. In medicine, many intelligent systems have been proposed to improve medical services. For disease diagnosis, Malmir, Amini & Chang [13] proposed a decision support system for medical services with uncertain input values. The main purpose of this system is to help physicians make faster and more accurate diagnoses of several specific diseases. Physicians only enter the values of the associated symptoms and wait for the conclusion suggested by the system; their task is then to check the accuracy of that suggestion and make the diagnosis. Next, Eshtay, Faris & Obeid [14] proposed an improvement of the extreme learning machine based on competitive swarm optimization to classify a medical dataset with 15 medical classes. Their experiments demonstrate that the model obtains better generalization performance with fewer hidden neurons and with higher stability. Moreover, to support hospital services, Turgeman, May & Sciulli [15] proposed a machine learning model based on a Cubist tree to predict the hospital length of stay (LOS) of patients at the time of admission. The model was constructed and validated on a de-identified administrative dataset collected from Veterans Health Administration (VHA) hospitals, consisting of 20,321 inpatient admissions of 4,840 Congestive Heart Failure (CHF) patients. As another application, Liu, Chen & Tzeng [16] proposed a model to identify the main factors in consumers’ adoption behavior of intelligent medical terminals in order to improve the quality of these terminals. Detecting the key influential factors (and their interrelationships) of consumer adoption behavior is useful for improving and promoting intelligent medical terminals toward a set aspiration level in each dimension and criterion. Later, Mustaqeem et al. [17] introduced a dataset of heart disease patients from a local hospital and proposed a prediction model for heart disease diagnosis. That system promises to be a useful contribution to e-health and medical informatics. Finally, Lucini et al. [18] proposed a text mining approach to predict future hospitalizations and discharges from early emergency department (ED) patient records using the SOAP framework. The average F1-score of Nu-SVC reported in that study reached 77.70% with a standard deviation of 0.66%.

Disabilities, including physical and motor disabilities, are disorders that restrict individual activities [19]. In practice, disability diagnosis is a complex process that requires expert occupational therapists. The International Classification of Functioning, Disability and Health for Children and Youth (ICF-CY) was therefore proposed to record the characteristics of children; based on this framework, machine learning algorithms can learn to predict disabilities. ICF-CY is the most widely used framework for disability diagnosis [20,21,22]. It is a multipurpose conceptual classification framework and a branch of the International Classification of Functioning, Disability and Health (ICF) [23]. ICF and ICF-CY are frequently used as conceptual frameworks in disability evaluation and assessment, as well as for classification in machine learning. The ICF-CY framework considers a number of self-care activities, including drinking, eating, caring for body parts and washing oneself. Self-care activities are defined as a subset of the activities and participation component of ICF-CY and usually develop from birth to 7 years of age [23]. For instance, children with cerebral palsy usually have problems with self-care skills due to their limitations in thinking [24]. Detecting self-care problems in children is important because it supports the choice of treatment approach. Given the diversity and complexity of self-care problems and the lack of professional therapists, an expert system for self-care problem classification can help therapists choose the best treatment approach for specific cases.

In practical datasets, the data are often unevenly distributed between classes; this is called the class imbalance problem (CIP). It makes prediction difficult and reduces the performance of traditional classifiers. Among the many approaches to the class imbalance problem, sampling techniques are among the most widely used. Le et al. [2] utilized several oversampling techniques to improve the performance of the bankruptcy prediction task. Ijaz et al. [25] used DBSCAN-Based Outlier Detection, Synthetic Minority Over Sampling Technique (SMOTE), and Random Forest for Type 2 Diabetes and Hypertension identification. Recently, Bang et al. [26] utilized an adaptive data-boosting technique for speech emotion detection in emotionally-imbalanced small-sample environments.

SCADI (Self-Care Activities Dataset based on ICF-CY), the first standard dataset for self-care activities of children with physical and motor disabilities, was introduced by Zarchi, Bushehri & Dehghanizadeh [27] in 2018. In this dataset, the authors collected the self-care activities of 70 children in seven classes. Each child was carefully assessed on 29 self-care activities based on ICF-CY. The authors also proposed an ANN-based system for the self-care problem classification of children with physical and motor disabilities. However, this dataset has three issues that reduce the performance of such a system. Therefore, this study proposes a robust framework for dealing with these issues to obtain good prediction performance. Briefly, the main contributions of this study can be summarized as follows. (1) A preprocessing module is proposed to convert SCADI with 205 features to SCADI_2 with only 31 features. (2) An FSX framework using an oversampling technique is proposed to deal with the class imbalance problem when identifying the self-care problems of children with physical and motor disabilities. (3) A feature importance analysis is conducted to identify the important features. The proposed framework achieves an accuracy of 85.4%, which is expected to improve diagnosis as well as counseling and treatment.

The rest of this paper is structured as follows. Section 2 reviews related work on the class imbalance problem. Section 3 presents the materials and methods of this study: it summarizes the SCADI dataset and then proposes a robust framework with three modules, namely the preprocessing, oversampling and classification modules. Section 4 presents the experiments that show the effectiveness of the developed framework for self-care problem classification. Finally, Section 5 concludes the study.

2. Related Works

The class imbalance problem in the SCADI dataset reduces classifier performance, so a specific technique is needed to handle it. Existing methods for dealing with class imbalance fall into the following four categories [28]. (1) Algorithm-level approaches modify existing classifier learning algorithms to bias learning toward the minority class [29,30] without adjusting the training data. (2) Data-level approaches change the class distribution by resampling the data space [31,32] to improve predictive performance. They comprise three strategies: under-sampling, oversampling and hybrid techniques. Because the resampling step reduces the effect of the imbalance, the learning algorithm does not need to be modified. (3) Cost-sensitive learning falls between the data-level and algorithm-level approaches. It combines data-level transformations (adding costs to instances) with algorithm-level modifications (modifying the learning process to accept costs) [33,34]. The classifier is biased toward the minority class by assuming higher misclassification costs for this class and seeking to minimize the total cost of errors over both classes. (4) Ensemble-based methods usually combine an ensemble learning algorithm with one of the techniques above, specifically data-level or cost-sensitive ones [35]. When a data-level approach is combined with an ensemble learning algorithm, the hybrid method usually preprocesses the data before training each classifier, whereas cost-sensitive ensembles, instead of modifying the base classifier to accept costs in the learning process, guide the cost minimization through the ensemble learning algorithm. Which of these four categories works best depends on the dataset.

Among these approaches, data-level approaches are quite popular. Well-known examples are SMOTE [31], SMOTE-ENN [36] and SMOTE-Tomek [36]. SMOTE generates synthetic minority-class samples by randomly interpolating between a sample and one of its nearest neighbors. SMOTE-ENN and SMOTE-Tomek combine oversampling with undersampling: they use SMOTE to generate synthetic samples and then apply Edited Nearest Neighbours (ENN) or Tomek links to remove noisy samples from the balanced dataset.

3. Materials and Methods

This section first summarizes the standard dataset, SCADI, for self-care problem classification in Section 3.1. Next, Section 3.2 and Section 3.3 analyse the problems of this dataset and give solutions for them. Then, Section 3.4 summarizes the extreme gradient boosting model. Finally, Section 3.5 presents the overall framework, which is the main contribution of this study.

3.1. SCADI Dataset

The experimental dataset, SCADI [27], was collected by Zarchi et al. in collaboration with two professional therapists. According to [27], SCADI can be considered the first standard dataset of self-care activities of children with physical and motor disabilities. Twenty-nine activities in eight categories (washing oneself, caring for body parts, toileting, dressing, eating, drinking, looking after one’s health and looking after one’s safety) are considered as self-care activities (see Table 1). Each self-care activity takes one of seven values representing the level of impairment: NO impairment, MILD impairment, MODERATE impairment, SEVERE impairment, COMPLETE impairment, NOT Specified and NOT Applicable. Each status is encoded as one binary feature in the SCADI dataset, so the activities contribute 203 (29 × 7) features. Additionally, the age and gender of the participants are added as features F1 and F2, respectively. In the gender feature, a male participant is represented by 0 and a female participant by 1. In summary, SCADI consists of 205 features for each child.

To collect the records of the SCADI dataset, 70 students studying at three educational and health centers in Yazd, Iran, were assessed between 2016 and 2017. The children were 6 to 18 years old. Professional therapists categorized the 70 children into seven groups based on their self-care activities, as shown in Table 2; these groups are used as the target classes in the medical diagnosis.

3.2. Preprocessing

The first problem of the SCADI dataset is that using seven binary features for each activity increases the dimensionality of the data and makes it more susceptible to noise. We therefore propose a modified dataset that uses only one feature per activity. In the modified dataset, seven levels from 0 to 6 represent the seven statuses NO impairment, MILD impairment, MODERATE impairment, SEVERE impairment, COMPLETE impairment, NOT Specified and NOT Applicable, respectively. The modified dataset, denoted SCADI_2, thus has 31 features: 29 activities, age and gender. Figure 1 shows a visualization of the SCADI_2 dataset using principal component analysis (PCA).
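The conversion described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `to_ordinal` is hypothetical, and the column layout (age and gender first, followed by 29 contiguous groups of seven binary flags in the order listed above) is an assumption about how the SCADI file is arranged.

```python
# Status levels in the order used by the text; the index is the new feature value.
LEVELS = [
    "NO impairment", "MILD impairment", "MODERATE impairment",
    "SEVERE impairment", "COMPLETE impairment",
    "NOT Specified", "NOT Applicable",
]
N_ACTIVITIES = 29


def to_ordinal(record):
    """Convert one 205-feature record into the 31-feature form.

    Assumed layout (hypothetical): [age, gender, 29 groups of 7 binary flags].
    """
    expected = 2 + N_ACTIVITIES * len(LEVELS)
    if len(record) != expected:
        raise ValueError(f"expected {expected} features, got {len(record)}")
    compact = list(record[:2])  # age and gender are kept unchanged
    for a in range(N_ACTIVITIES):
        start = 2 + a * len(LEVELS)
        flags = record[start:start + len(LEVELS)]
        if sum(flags) != 1:
            raise ValueError(f"activity {a + 1} must have exactly one flag set")
        compact.append(flags.index(1))  # position of the set flag, 0..6
    return compact
```

Rejecting records in which an activity has zero or several flags set keeps encoding errors from silently turning into a valid-looking level.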

In addition, the distribution of the seven classes is imbalanced, which reduces the performance of the prediction model. To handle this issue, we use the SMOTE algorithm, introduced in the next subsection, to balance the dataset.

3.3. SMOTE Algorithm

To deal with the class imbalance problem in the experimental dataset, we first resample SCADI_2 with the SMOTE algorithm. SMOTE [31] generates synthetic samples for the minority classes based on feature-space similarities between real samples. For each sample $x_{i}$ from which the algorithm wants to generate a new sample, SMOTE considers the set of k nearest neighbors under the Euclidean distance, denoted by $K_{x_{i}}$. Next, SMOTE randomly selects an element ${\hat{x}}_{i}$ in $K_{x_{i}}$ such that ${\hat{x}}_{i}$ is in the same class as $x_{i}$. The feature vector of the new synthetic sample, denoted $x_{new}$, is determined as follows.

$$x_{new} = x_{i} + ({\hat{x}}_{i} - x_{i}) \times \delta \tag{1}$$

where $\delta \in [0, 1]$ is a random number. According to Equation (1), the new synthetic sample is a point on the line segment joining $x_{i}$ and ${\hat{x}}_{i}$. The algorithm repeats this procedure until the target ratio is reached, at which point the resampled dataset is nearly balanced. Note that if SMOTE does not find any nearest neighbor of sample $x_{i}$, or if the number of nearest neighbors is smaller than k, our modified algorithm simply duplicates this data point.
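The generation step of Equation (1), together with the duplication fallback, can be sketched as follows. This is a simplified illustration under stated assumptions: the function name `smote_sample` is hypothetical, neighbors are searched only among samples of the same class (as in standard SMOTE), and δ is drawn from [0, 1) rather than the closed interval.

```python
import math
import random


def smote_sample(x_i, same_class, k, rng):
    """Generate one synthetic sample for x_i following Equation (1).

    same_class: other samples of x_i's class (x_i itself excluded).
    Falls back to duplicating x_i when fewer than k neighbours exist.
    """
    if k < 1 or len(same_class) < k:
        return list(x_i)  # duplication fallback described in the text
    # k nearest neighbours of x_i under the Euclidean distance
    neighbours = sorted(same_class, key=lambda x: math.dist(x_i, x))[:k]
    x_hat = rng.choice(neighbours)
    delta = rng.random()  # random number in [0, 1)
    # Point on the segment between x_i and x_hat
    return [a + (b - a) * delta for a, b in zip(x_i, x_hat)]
```

Passing the random generator explicitly, for example `random.Random(0)`, makes the resampling reproducible across cross-validation runs.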

Figure 2 presents the flowchart of the SMOTE algorithm with our modification for dealing with the problem of a very small number of nearest neighbors. Figure 3 shows the visualization of the SCADI_2 dataset using Principal component analysis (PCA) after resampling data using SMOTE with different balancing ratios including 0.4, 0.6, 0.8 and 1.0 respectively.

3.4. Extreme Gradient Boosting

Chen and Guestrin [37] proposed Extreme Gradient Boosting (XGBoost), which is summarized as follows. Let $D = \{(x_{i}, y_{i})\}_{i = 1}^{n}$ be the dataset, where $x_{i}$ is a vector of m features and $y_{i} \in \mathbb{R}$ is the label value. XGBoost is a tree ensemble model, i.e., a set of classification and regression trees (CARTs). The prediction function for a sample $x_{i}$ is as follows.

$${\hat{y}}_{i} = \sum_{k} f_{k} (x_{i}), \quad f_{k} \in \mathcal{F} \tag{2}$$

where $\mathcal{F}$ and ${\hat{y}}_{i}$ are the set of all possible CARTs and the predicted value, respectively. The objective function can be written as follows.

$$\mathcal{L} (\theta) = \sum_{i} l (y_{i}, {\hat{y}}_{i}) + \sum_{k} \Omega (f_{k}) \tag{3}$$

where $\theta$ represents the CARTs of the ensemble, $\sum_{i} l (y_{i}, {\hat{y}}_{i})$ is the sum of the loss between the true and predicted labels and $\sum_{k} \Omega (f_{k})$ is the regularization term that measures the complexity of the model.

At each iteration t, the model searches for a function $f_{t} \in \mathcal{F}$ that optimizes the objective and adds this function to the ensemble. Let ${\hat{y}}_{i}^{(t)}$ be the predicted label of vector $x_{i}$ at the t-th iteration. The model finds $f_{t}$ to optimize the following objective:

$$\begin{aligned} \mathcal{L}^{(t)} &= \sum_{i = 1}^{n} l (y_{i}, {\hat{y}}_{i}^{(t)}) + \sum_{k = 1}^{t} \Omega (f_{k}) \\ &= \sum_{i = 1}^{n} l (y_{i}, {\hat{y}}_{i}^{(t - 1)} + f_{t} (x_{i})) + \sum_{k = 1}^{t} \Omega (f_{k}) \end{aligned} \tag{4}$$

A second-order Taylor expansion is used to approximate the objective at step t. After dropping the terms that do not depend on $f_{t}$, this gives

$${\tilde{\mathcal{L}}}^{(t)} = \sum_{i = 1}^{n} [\rho_{i} f_{t} (x_{i}) + \frac{1}{2} \varrho_{i} f_{t}^{2} (x_{i})] + \Omega (f_{t}) \tag{5}$$

where $\rho_{i} = \partial_{{\hat{y}}_{i}^{(t - 1)}} l (y_{i}, {\hat{y}}_{i}^{(t - 1)})$ and $\varrho_{i} = \partial_{{\hat{y}}_{i}^{(t - 1)}}^{2} l (y_{i}, {\hat{y}}_{i}^{(t - 1)})$ are the first and second derivatives of the loss with respect to the previous prediction. The model iteratively adds functions that optimize Equation (4) for a user-specified number of iterations. In addition, the authors define the regularization as follows:

$$\Omega (f_{t}) = \gamma T + \frac{1}{2} \lambda \sum_{j = 1}^{T} w_{j}^{2} \tag{6}$$

where T is the number of leaves of the tree $f_{t}$, $w_{j}$ is the weight of the j-th leaf, and $\gamma$ and $\lambda$ are regularization parameters. Using Equations (5) and (6), XGBoost can evaluate the objective efficiently, which reduces the learning time.
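To make Equations (5) and (6) concrete, the sketch below evaluates the approximate objective for a tree and computes the leaf weight that minimizes it. It is an illustration only: the squared-error loss is chosen here as an assumption for simplicity, the function names are hypothetical, and the closed-form leaf weight (negative sum of first derivatives divided by the sum of second derivatives plus λ) comes from the XGBoost paper [37] rather than from the text above.

```python
def grad_hess_squared_error(y, y_prev):
    """Derivatives of l = 0.5 * (y - y_hat)^2 with respect to y_hat.

    Returns (rho, varrho): first and second derivatives per sample.
    """
    rho = [p - t for t, p in zip(y, y_prev)]
    varrho = [1.0] * len(y)
    return rho, varrho


def approx_objective(rho, varrho, f_t, n_leaves, leaf_weights, gamma, lam):
    """Evaluate Equation (5) using the regularization of Equation (6).

    f_t holds the output of the new tree for every sample.
    """
    data_term = sum(r * f + 0.5 * h * f * f
                    for r, h, f in zip(rho, varrho, f_t))
    omega = gamma * n_leaves + 0.5 * lam * sum(w * w for w in leaf_weights)
    return data_term + omega


def best_leaf_weight(rho, varrho, lam):
    """Weight minimizing the approximate objective for one leaf's samples."""
    return -sum(rho) / (sum(varrho) + lam)
```

With λ = 0 and squared-error loss, the best leaf weight reduces to the mean residual of the samples in the leaf; a positive λ shrinks it toward zero.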

3.5. The Robust Framework

Figure 4 shows the architecture of the FSX framework. First, the SCADI dataset is passed to the preprocessing module, which converts it into the SCADI_2 dataset as described in Section 3.2. This dataset is then divided into training and testing sets by 10-fold cross validation. The training set is resampled by the SMOTE algorithm summarized in Section 3.3, which helps improve prediction performance. The resulting balanced dataset is used to train the XGBoost classifier summarized in Section 3.4. In the testing phase, the trained classifier predicts the self-care problem for the new cases in the testing set. Based on this prediction, the doctor can give counseling and treatment to patients. The pseudocode of FSX is given in Algorithm 1.

Algorithm 1. FSX framework
Input: The SCADI dataset
Output: The best model for self-care problem identification and the best accuracy value
Let R = {0.4, 0.6, 0.8, 1.0}
Let ACC_best = 0 and M_best = empty
For each r in R:
    Split the SCADI dataset into training and testing sets using k-fold cross validation.
    Balance the training set using SMOTE with the balancing ratio equal to r.
    Train XGB_r on the balanced training set from the previous step.
    Evaluate XGB_r on the testing set to obtain ACC_r.
    If ACC_r > ACC_best then
        ACC_best = ACC_r
        M_best = XGB_r
    End if
End for
Return M_best and ACC_best
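The selection loop of Algorithm 1 can be sketched in a library-independent way as follows. This is an assumption-laden illustration: `fsx_select` and `k_fold_indices` are hypothetical names, the `resample`, `fit` and `accuracy` callables stand in for SMOTE, XGBoost training and evaluation, accuracy is averaged over the folds, and the model kept for each ratio is simply the one trained on the last fold, a detail Algorithm 1 does not specify.

```python
def k_fold_indices(n, k):
    """Split range(n) into k disjoint test folds (no shuffling, for brevity)."""
    return [list(range(i, n, k)) for i in range(k)]


def fsx_select(data, ratios, k, resample, fit, accuracy):
    """Return (best_model, best_accuracy, best_ratio) over candidate ratios.

    resample(train, r), fit(train) and accuracy(model, test) are supplied by
    the caller; they are placeholders for SMOTE, XGBoost and the metric.
    """
    best_model, best_acc, best_ratio = None, 0.0, None
    for r in ratios:
        fold_accs, model = [], None
        for test_idx in k_fold_indices(len(data), k):
            held_out = set(test_idx)
            train = [row for i, row in enumerate(data) if i not in held_out]
            test = [data[i] for i in test_idx]
            # Only the training fold is resampled; the test fold stays untouched.
            model = fit(resample(train, r))
            fold_accs.append(accuracy(model, test))
        mean_acc = sum(fold_accs) / len(fold_accs)
        if mean_acc > best_acc:
            best_model, best_acc, best_ratio = model, mean_acc, r
    return best_model, best_acc, best_ratio
```

Resampling inside the fold loop matters: oversampling before the split would let synthetic points derived from test samples leak into the training set and inflate the measured accuracy.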

4. Results

4.1. Oversampling Comparison

To choose the best oversampling technique for the SCADI dataset, we compare several oversampling techniques, namely SMOTE [31], SMOTE-ENN [36] and SMOTE-Tomek [36], with various balancing ratios in the oversampling module of the FSX framework. Based on the results shown in Table 3, SMOTE with a balancing ratio of 0.8 is the best oversampling technique for the SCADI dataset, yielding an accuracy of 85.4%.

4.2. Performance Evaluation

This subsection compares the performance of the proposed method with a balancing ratio of 0.8 against the ANN-based system in Reference [27], SVM and Random Forest (RF). According to Table 4, the proposed framework improves prediction performance over the state-of-the-art methods ANN, SVM and RF for self-care problem classification in medical diagnosis. It reaches the best accuracy at 85.4%, while ANN, SVM and RF achieve 83.1%, 78.0% and 83.4%, respectively. A paired t-test was used to compare the performance of ANN and FSX. We ran the two methods on the SCADI dataset 30 times and obtained a p-value of 0.0002 for the two sets of results, which is smaller than the threshold of 0.05. The accuracy improvement of the proposed approach over ANN is therefore statistically significant.
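The test statistic behind such a paired comparison can be computed as in the sketch below. The function name `paired_t_statistic` is hypothetical, and the sketch stops at the statistic and its degrees of freedom; converting them to a p-value requires the Student t distribution, for which a library routine such as SciPy's `scipy.stats.ttest_rel` could be used instead.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for paired samples of equal length."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equally long samples with at least 2 pairs")
    # The test works on the per-run differences between the two methods.
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    t = mean_d / (sd_d / math.sqrt(len(diffs)))
    return t, len(diffs) - 1
```

With 30 paired runs, as in the experiment above, the statistic has 29 degrees of freedom. The sketch raises a division error when all differences are identical, since the standard deviation is then zero.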

Table 5 shows the sensitivity and specificity of the compared approaches under 10-fold cross validation. Random Forest achieved the best sensitivity at 0.786, while FSX achieved the best specificity at 0.963.
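For a multi-class problem such as this one, sensitivity and specificity are usually derived per class in a one-vs-rest fashion from the confusion matrix. The sketch below shows one way to do this; the macro-averaging over classes is an assumption, since the text does not state how the values in Table 5 were aggregated, and the function name is hypothetical.

```python
def macro_sensitivity_specificity(conf):
    """Macro-averaged sensitivity and specificity.

    conf[i][j] = number of samples of true class i predicted as class j.
    """
    n = len(conf)
    total = sum(sum(row) for row in conf)
    sens, spec = [], []
    for c in range(n):
        tp = conf[c][c]
        fn = sum(conf[c]) - tp                        # class c predicted as another class
        fp = sum(conf[r][c] for r in range(n)) - tp   # other classes predicted as c
        tn = total - tp - fn - fp
        if tp + fn:  # skip classes without any true samples
            sens.append(tp / (tp + fn))
        if tn + fp:
            spec.append(tn / (tn + fp))
    return sum(sens) / len(sens), sum(spec) / len(spec)
```

In an imbalanced dataset the two measures can diverge considerably, which is why they are reported alongside accuracy.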

4.3. Feature Importance

This subsection reports the importance of each feature over the 10-fold cross validation, as shown in Figure 5. Based on these results, the features can be divided into three groups. The first group includes F2, F1, F4, F17 and F23, which are all very important features; if one of these values is missing, the model may not perform well. The second group consists of F28, F6, F25, F12, F3, F27, F31, F16, F29, F13, F5, F20, F9, F7 and F26, which are still important for improving classification performance, although their influence is smaller than that of the first group. The remaining features, F18, F14, F8, F21, F11, F24, F15, F30, F22, F19 and F10, can be removed, which means that the features in the third group do not need to be collected.

5. Conclusions

This study proposed a robust framework, FSX, to improve prediction performance on SCADI, a standard dataset for self-care problem classification in medical diagnosis. The framework consists of three modules: a preprocessing module, an oversampling module using SMOTE and a classification module using the extreme gradient boosting model. The preprocessing module converts the original dataset to a modified dataset with only 31 features. The oversampling module balances the training dataset to improve prediction performance. The classification module uses a popular classification model, XGBoost, to achieve the best performance. The first experiment applied several well-known oversampling techniques, namely SMOTE, SMOTE-ENN and SMOTE-Tomek, with various balancing ratios to find the best oversampling method and the optimal balancing ratio for the oversampling module. The experimental results show that the proposed framework outperforms the state-of-the-art methods on the SCADI dataset, reaching an accuracy of 85.4% for self-care problem classification in medical diagnosis.

In future work, we will first develop feature selection approaches to improve the performance of our framework. Second, we will study further sampling techniques, including over-sampling and under-sampling techniques, to enhance the performance as well.

Author Contributions

S.W.B. proposed the topic and obtained funding; T.L. proposed and implemented the framework. T.L. wrote the paper. S.W.B. improved the quality of the manuscript.

Acknowledgments

This research was supported by the Korean MSIT (Ministry of Science and ICT) under the National Program for Excellence in SW (2015-0-00938), supervised by the IITP (Institute for Information & communications Technology Promotion).

Conflicts of Interest

The authors declare no conflict of interest.

References

Le, T.; Le, H.S.; Vo, M.T.; Lee, M.Y.; Baik, S.W. A Cluster-Based Boosting Algorithm for Bankruptcy Prediction in a Highly Imbalanced Dataset. Symmetry 2018, 10, 250. [Google Scholar] [CrossRef]
Le, T.; Lee, M.Y.; Park, J.R.; Baik, S.W. Oversampling techniques for bankruptcy prediction: Novel features from a transaction dataset. Symmetry 2018, 10, 79. [Google Scholar] [CrossRef]
Le, T.; Vo, B.; Baik, S.W. Efficient algorithms for mining top-rank-k erasable patterns using pruning strategies and the subsume concept. Eng. Appl. Artif. Intell. 2018, 68, 1–9. [Google Scholar] [CrossRef]
Roan, T.N.; Ali, M.; Le, H.S. δ-equality of intuitionistic fuzzy sets: A new proximity measure and applications in medical diagnosis. Appl. Intell. 2018, 48, 499–525. [Google Scholar]
Le, H.S.; Tran, M.T.; Fujita, H.; Dey, N.; Ashour, A.S.; Vo, T.N.N.; Le, Q.A.; Chu, D.T. Dental diagnosis from X-Ray images: An expert system based on fuzzy computing. Biomed. Signal Process. Control 2018, 39, 64–73. [Google Scholar]
Ali, M.; Le, H.S.; Khan, M.; Nguyen, T.T. Segmentation of dental X-ray images in medical imaging using neutrosophic orthogonal matrices. Expert Syst. Appl. 2018, 91, 434–441. [Google Scholar] [CrossRef]
Vajda, S.; Karargyris, A.; Jäger, S.; Santosh, K.C.; Candemir, S.; Xue, Z.; Antani, S.K.; Thoma, G.R. Feature Selection for Automatic Tuberculosis Screening in Frontal Chest Radiographs. J. Med. Syst. 2018, 42, 146. [Google Scholar] [CrossRef]
Lan, K.; Wang, D.; Fong, S.; Liu, L.; Wong, K.; Dey, N. A Survey of Data Mining and Deep Learning in Bioinformatics. J. Med. Syst. 2018, 42, 139. [Google Scholar] [CrossRef]
Goshvarpour, A.; Goshvarpour, A. A Novel Feature Level Fusion for Heart Rate Variability Classification Using Correntropy and Cauchy-Schwarz Divergence. J. Med. Syst. 2018, 42, 109. [Google Scholar] [CrossRef]
Pham, N.T.; Lee, J.W.; Kwon, G.R.; Park, C.S. Efficient image splicing detection algorithm based on markov features. Multimed. Tools Appl. 2018. [Google Scholar] [CrossRef]
Le, D.H.; Pham, V.H. HGPEC: A Cytoscape app for prediction of novel disease-gene and disease-disease associations and evidence collection based on a random walk on heterogeneous network. BMC Syst. Biol. 2017, 11, 61. [Google Scholar] [CrossRef] [PubMed]
Le, D.H.; Dao, L.T.M. Annotating Diseases Using Human Phenotype Ontology Improves Prediction of Disease-Associated Long Non-coding RNAs. J. Mol. Biol. 2018, 430, 2219–2230. [Google Scholar] [CrossRef] [PubMed]
Malmir, B.; Amini, M.; Chang, S.I. A medical decision support system for disease diagnosis under uncertainty. Expert Syst. Appl. 2017, 88, 95–108. [Google Scholar] [CrossRef]
Eshtay, M.; Faris, H.; Obeid, N. Improving Extreme Learning Machine by Competitive Swarm Optimization and its application for medical diagnosis problems. Expert Syst. Appl. 2018, 104, 134–152. [Google Scholar] [CrossRef]
Turgeman, L.; May, J.; Sciulli, R. Insights from a machine learning model for predicting the hospital length of stay (los) at the time of admission. Expert Syst. Appl. 2017, 78, 376–385. [Google Scholar] [CrossRef]
Liu, Y.; Chen, Y.; Tzeng, G.H. Identification of key factors in consumers’ adoption behavior of intelligent medical terminals based on a hybrid modified MADM model for product improvement. Int. J. Med. Inform. 2017, 105, 68–82. [Google Scholar] [CrossRef]
Mustaqeem, A.; Anwar, S.M.; Khan, A.R.; Majid, M. A statistical analysis-based recommender model for heart disease patients. Int. J. Med. Inform. 2017, 108, 134–145. [Google Scholar] [CrossRef]
Lucini, F.R.; Fogliatto, F.S.; Silveira, G.J.C.; Neyeloff, J.; Anzanello, M.J.; Kuchenbecker, R.S.; Schaan, B.D. Text mining approach to predict hospital admissions using early medical records from the emergency department. Int. J. Med. Inform. 2017, 100, 1–8. [Google Scholar] [CrossRef] [PubMed]
Lewis Brown, R.; Turner, R.J. Physical disability and depression: Clarifying racial/ ethnic contrasts. J. Aging Health 2010, 22, 977–1000. [Google Scholar] [CrossRef]
Lollar, D.J.; Simeonsson, R.J. Diagnosis to function: Classification for children and youths. J. Dev. Behav. Pediatrics 2005, 26, 323–330. [Google Scholar] [CrossRef]
Lee, A.M. Using the ICF-CY to organise characteristics of children’s functioning. Disabil. Rehabil. 2011, 33, 605–616. [Google Scholar] [CrossRef] [PubMed]
Ståhl, Y.; Granlund, M.; Gäre-Andersson, B.; Enskär, K. Review article: Mapping of children’s health and development data on population level using the classification system ICF-CY. Scand. J. Public Health 2011, 39, 51–57. [Google Scholar] [CrossRef] [PubMed]
Organization, W.H. International Classification of Functioning, Disability, and Health: Children & Youth Version: ICF-CY; World Health Organization: Geneva, Switzerland, 2007. [Google Scholar]
Case-Smith, J. Self-care Strategies for Children with Developmental Disabilities. In Ways of Living: Self-Care Strategies for Special Needs, 2nd ed.; Christiansen, C., Ed.; American Occupational Therapy Association: Bethesda, MD, USA, 2000; pp. 83–121. [Google Scholar]
Ijaz, M.; Alfian, G.; Syafrudin, M.; Rhee, J. Hybrid Prediction Model for Type 2 Diabetes and Hypertension Using DBSCAN-Based Outlier Detection, Synthetic Minority Over Sampling Technique (SMOTE), and Random Forest. Appl. Sci. 2018, 8, 1325. [Google Scholar] [CrossRef]
Bang, J.; Hur, T.; Kim, D.; Lee, J.; Han, Y.; Banos, O.; Kim, J.I.; Lee, S. Adaptive Data Boosting Technique for Robust Personalized Speech Emotion in Emotionally-Imbalanced Small-Sample Environments. Sensors 2018, 18, 3744. [Google Scholar] [CrossRef]
Zarchi, M.S.; Fatemi Bushehri, S.M.M.; Dehghanizadeh, M. SCADI: A standard dataset for self-care problems classification of children with physical and motor disability. Int. J. Med. Inform. 2018, 114, 81–87. [Google Scholar] [CrossRef] [PubMed]
Fernández, A.; García, S.; Galar, M.; Prati, R.C.; Krawczyk, B.; Herrera, F. Learning from Imbalanced Data Sets; Springer: Berlin, Germany, 2018; pp. 1–377. ISBN 978-3-319-98073-7. [Google Scholar]
Lin, Y.; Lee, Y.; Wahba, G. Support vector machines for classification in nonstandard situations. Mach. Learn. 2002, 46, 191–202. [Google Scholar] [CrossRef]
Liu, B.; Ma, Y.; Wong, C. Improving an association rule-based classifier. In Proceedings of the European Conference on Principles of Data Mining and Knowledge Discovery, PKDD, Lyon, France, 13–16 September 2000; pp. 293–317. [Google Scholar]
Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic Minority Over-Sampling Technique. J. Artif. Intell. Res. 2002, 16, 321–357. [Google Scholar] [CrossRef]
Lemaitre, G.; Nogueira, F.; Aridas, C.K. Imbalanced-learn: A Python Toolbox to Tackle the Curse of Imbalanced Datasets in Machine Learning. J. Mach. Learn. Res. 2017, 18, 1–5. [Google Scholar]
Chawla, N.; Cieslak, D.; Hall, L.; Joshi, A. Automatically countering imbalance and its empirical relationship to cost. Data Min. Knowl. Discov. 2008, 17, 225–252. [Google Scholar] [CrossRef] [Green Version]
Ling, C.; Sheng, V.; Yang, Q. Test strategies for cost-sensitive decision trees. IEEE Trans. Knowl. Data Eng. 2006, 18, 1055–1067. [Google Scholar] [CrossRef] [Green Version]
Galar, M.; Fernández, A.; Barrenechea, E.; Bustince, H.; Herrera, F. A review on ensembles for class imbalance problem: Bagging, boosting and hybrid-based approaches. IEEE Trans. Syst. Man Cybern. Part C 2012, 42, 463–484. [Google Scholar] [CrossRef]
Batista, G.; Prati, R.C.; Monard, M.C. A Study of the Behavior of Several Methods for Balancing Machine Learning Training Data. ACM SIGKDD Explor. Newsl. 2004, 6, 20–29. [Google Scholar] [CrossRef]
Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794.

Figure 1. Visualization of SCADI dataset.

Figure 2. The flowchart of SMOTE algorithm.

Figure 3. Visualization of SCADI_2 dataset after resampling data using SMOTE with different balancing ratios.

Figure 4. The FSX framework.

Figure 5. Feature importance.

Table 1. The SCADI dataset of self-care activities.

Cat No.	Self-Care Category	Activity No.	Description	Feature Name
I	Washing oneself	1	Washing body parts	F3
		2	Washing whole body	F4
		3	Drying oneself	F5
II	Caring for body parts	4	Caring for skin	F6
		5	Caring for teeth	F7
		6	Caring for hair	F8
		7	Caring for fingernails	F9
		8	Caring for toenails	F10
		9	Caring for nose	F11
III	Toileting	10	Indicating need for urination	F12
		11	Carrying out urination appropriately	F13
		12	Indicating need for defecation	F14
		13	Carrying out defecation appropriately	F15
		14	Menstrual care	F16
IV	Dressing	15	Putting on clothes	F17
		16	Taking off clothes	F18
		17	Putting on footwear	F19
		18	Taking off footwear	F20
		19	Choosing appropriate clothing	F21
V	Eating	20	Indicating need for eating	F22
		21	Carrying out eating appropriately	F23
VI	Drinking	22	Indicating need for drinking	F24
		23	Indicating need for drinking	F25
VII	Looking after one’s health	24	Ensuring one’s physical comfort	F26
		25	Managing diet and fitness	F27
		26	Managing medications and following health advice	F28
		27	Seeking advice or assistance from caregivers or professionals	F29
		28	Avoiding risks of abuse of drugs or alcohol	F30
VIII	Looking after one’s safety	29	Looking after one’s safety	F31

Table 2. The target classes of the SCADI dataset.

No.	Description	Notation
1	Caring for body parts problem	Class #1
2	Toileting problem	Class #2
3	Dressing problem	Class #3
4	Washing oneself and caring for body parts and dressing problem	Class #4
5	Washing oneself, caring for body parts, toileting and dressing problem	Class #5
6	Eating, drinking, washing oneself, caring for body parts, toileting, dressing, looking after one’s health and looking after one’s safety problem	Class #6
7	No Problem	Class #7

Table 3. Performance results of several oversampling techniques with various balancing ratios.

Oversampling Technique	Balancing Ratio	Accuracy
SMOTE	0.4	0.837
	0.6	0.837
	0.8	0.854
	1.0	0.839
SMOTE-ENN	0.4	0.724
	0.6	0.812
	0.8	0.822
	1.0	0.807
SMOTE-Tomek	0.4	0.837
	0.6	0.837
	0.8	0.853
	1.0	0.839
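
The balancing ratios in Table 3 can be read as the target size of each minority class relative to the largest class. The sketch below is a minimal illustration under that assumption, not the authors' implementation; the helper name `target_counts` and the class sizes are hypothetical. It converts a ratio into a per-class dictionary of desired sample counts, which is the form that the SMOTE class in imbalanced-learn accepts through its `sampling_strategy` parameter.

```python
def target_counts(class_counts, ratio):
    """Return desired per-class sample counts for a given balancing ratio.

    Assumption: every class smaller than ratio * (largest class size) is
    oversampled up to that size; classes already at or above it are unchanged.
    """
    if not 0 < ratio <= 1:
        raise ValueError("ratio must be in (0, 1]")
    majority = max(class_counts.values())
    minimum_size = round(ratio * majority)
    return {label: max(n, minimum_size) for label, n in class_counts.items()}


# Hypothetical class sizes, for illustration only (not the SCADI counts).
counts = {"A": 30, "B": 12, "C": 3}
for ratio in (0.4, 0.6, 0.8, 1.0):
    print(ratio, target_counts(counts, ratio))
```

With a ratio of 1.0 every class is brought up to the size of the largest class; smaller ratios leave some imbalance in place and generate fewer synthetic samples.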

Table 4. Performance results in terms of the accuracy of the experimental approaches in 10-fold cross validation.

Fold	ANN	FSX	SVM	RF
Fold 1	0.727	0.727	0.818	0.636
Fold 2	1.000	0.800	0.750	1.000
Fold 3	0.857	0.857	0.857	1.000
Fold 4	1.000	1.000	0.833	0.833
Fold 5	0.857	0.857	0.857	0.857
Fold 6	0.571	0.771	0.714	0.714
Fold 7	0.800	1.000	0.800	1.000
Fold 8	0.667	0.889	0.667	0.667
Fold 9	1.000	0.800	0.800	1.000
Fold 10	0.667	0.833	0.833	0.667
Average	0.815	0.854	0.793	0.837
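
The Average row of Table 4 appears to be the unweighted mean of the ten per-fold accuracies. The short check below, which assumes that reading, recomputes the means from the values printed in the table. Because the per-fold values are already rounded to three decimals, the recomputed means can differ from the reported ones in the last digit.

```python
from statistics import mean

# Per-fold accuracies copied from Table 4 (rounded to three decimals).
fold_accuracy = {
    "ANN": [0.727, 1.000, 0.857, 1.000, 0.857, 0.571, 0.800, 0.667, 1.000, 0.667],
    "FSX": [0.727, 0.800, 0.857, 1.000, 0.857, 0.771, 1.000, 0.889, 0.800, 0.833],
    "SVM": [0.818, 0.750, 0.857, 0.833, 0.857, 0.714, 0.800, 0.667, 0.800, 0.833],
    "RF": [0.636, 1.000, 1.000, 0.833, 0.857, 0.714, 1.000, 0.667, 1.000, 0.667],
}
# Averages as reported in the last row of Table 4.
reported_average = {"ANN": 0.815, "FSX": 0.854, "SVM": 0.793, "RF": 0.837}

recomputed = {name: mean(values) for name, values in fold_accuracy.items()}
for name, value in recomputed.items():
    print(f"{name}: recomputed {value:.4f}, reported {reported_average[name]:.3f}")
```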

Table 5. Performance results in terms of sensitivity and specificity of the experimental approaches in 10-fold cross validation.

Fold	Sensitivity				Specificity			
	ANN	FSX	SVM	RF	ANN	FSX	SVM	RF
Fold 1	0.643	0.875	0.714	0.428	0.948	0.958	0.966	0.935
Fold 2	1.000	0.700	0.600	1.000	1.000	0.975	0.950	1.000
Fold 3	0.875	1.000	0.875	1.000	0.938	1.000	0.958	1.000
Fold 4	1.000	0.583	0.889	0.889	1.000	0.955	0.933	0.930
Fold 5	0.600	0.572	0.600	0.750	0.971	0.944	0.971	0.958
Fold 6	0.500	0.667	0.625	0.625	0.875	0.889	0.917	0.917
Fold 7	0.889	0.600	0.667	1.000	0.917	0.967	0.950	1.000
Fold 8	0.572	1.000	0.571	0.571	0.940	1.000	0.946	0.940
Fold 9	1.000	1.000	0.500	1.000	1.000	1.000	0.950	1.000
Fold 10	0.600	0.572	0.800	0.600	0.910	0.939	0.960	0.920
Average	0.768	0.757	0.684	0.786	0.950	0.963	0.950	0.960
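
For a multi-class problem such as this one, sensitivity and specificity are commonly computed per class in a one-vs-rest fashion and then averaged. The sketch below shows that macro-averaged computation from a confusion matrix. It is an assumption about how such figures can be obtained, not a statement of how the values in Table 5 were produced; the function name and the example matrix are hypothetical.

```python
def macro_sensitivity_specificity(confusion):
    """Macro-average one-vs-rest sensitivity and specificity.

    confusion[i][j] is the number of samples of true class i predicted as
    class j. Classes with no positive samples are skipped for sensitivity,
    and classes with no negative samples are skipped for specificity.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    sensitivities, specificities = [], []
    for k in range(n):
        tp = confusion[k][k]
        fn = sum(confusion[k]) - tp
        fp = sum(confusion[i][k] for i in range(n)) - tp
        tn = total - tp - fn - fp
        if tp + fn > 0:
            sensitivities.append(tp / (tp + fn))
        if tn + fp > 0:
            specificities.append(tn / (tn + fp))
    return (sum(sensitivities) / len(sensitivities),
            sum(specificities) / len(specificities))


# Hypothetical 3-class confusion matrix, for illustration only.
example = [
    [4, 1, 0],
    [0, 2, 0],
    [1, 0, 2],
]
sens, spec = macro_sensitivity_specificity(example)
print(f"sensitivity={sens:.3f}, specificity={spec:.3f}")
```

Skipping classes that are absent from a fold matters for small datasets, where a test fold may contain no samples of a rare class.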

© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

