A Robust Framework for Self-Care Problem Identification for Children with Disability

Recently, a standard dataset, namely SCADI (Self-Care Activities Dataset), based on the International Classification of Functioning, Disability, and Health for Children and Youth (ICF-CY) framework, was introduced for identifying self-care problems of children with physical and motor disabilities. This is an interesting, important and challenging topic because of its usefulness in medical diagnosis. This study proposes a robust framework using a sampling technique and extreme gradient boosting (FSX) to improve prediction performance on the SCADI dataset. The proposed framework first converts the original dataset into a new dataset with fewer dimensions. Then, it balances the new dataset using oversampling techniques with different ratios. Next, extreme gradient boosting is used to diagnose the problems. Experiments on prediction performance and feature importance were conducted to show the effectiveness of FSX and to analyse the results. The experimental results show that FSX using the Synthetic Minority Over-sampling Technique (SMOTE) in its oversampling module outperforms the ANN (Artificial Neural Network)-based approach, Support Vector Machine (SVM) and Random Forest on the SCADI dataset. The overall accuracy of the proposed framework reaches 85.4%, a fairly high performance that can be used for self-care problem classification in medical diagnosis.


Introduction
With its strong recent growth, machine learning has been applied in many domains such as economics [1][2][3], medical diagnosis [4][5][6][7][8][9][10] and biomedical applications [11,12]. In medical systems, many intelligent systems have been proposed to improve medical services. For disease diagnosis, Malmir, Amini & Chang [13] proposed a decision support system for medical services with uncertain input values. The main purpose of this system is to help physicians make faster and more accurate diagnoses of several specific diseases. Physicians only enter the associated symptom values and wait for the system's suggested conclusion; their task is then to check the accuracy of that suggestion and make the diagnosis. Next, Eshtay, Faris & Obeid [14] proposed an improvement of an extreme learning machine based on competitive swarm optimization to classify a medical dataset with 15 medical classes. Their experiments demonstrate that the model obtains better generalization performance with fewer hidden neurons as well as higher stability. Moreover, to support hospital services, Turgeman, May & Sciulli [15] proposed a machine learning model based on Cubist trees to predict the hospital length of stay (denoted by LOS) of patients at the time of admission. The model was constructed and validated on a de-identified administrative dataset collected from Veterans Health Administration (VHA) hospitals, consisting of 20,321 inpatient admissions of 4,840 Congestive Heart Failure (CHF) patients. As another application, Liu, Chen & Tzeng [16] proposed a model to identify the main factors in consumers' adoption of intelligent medical terminals in order to improve the quality of such systems. Detecting the key influential factors (and their interrelationships) of consumer adoption behavior is useful for improving and promoting intelligent medical terminals toward achieving a set aspiration level in each dimension and criterion. Later, Mustaqeem et al. [17] introduced a dataset of heart disease patients from a local hospital and proposed a model for heart disease diagnosis. That system promises to be a useful contribution to the field of e-health and medical informatics. Next, Lucini et al. [18] proposed a text mining approach to predict future hospitalizations and discharges from early ED patient records using the SOAP framework. The average F1-score of Nu-SVC reported in that study reached 77.70% with a standard deviation of 0.66%.
Disabilities, including physical and motor disabilities, are disorders that restrict individual activities [19]. In reality, disability diagnosis is a complex process that requires expert occupational therapists. Therefore, the International Classification of Functioning, Disability and Health for Children and Youth (ICF-CY) was proposed to collect the characteristics of children. Based on this framework, machine learning algorithms can learn and predict the disability. ICF-CY is the most widely used framework for disability diagnosis [20][21][22]; it is a multipurpose classification conceptual framework and a branch of the International Classification of Functioning, Disability and Health (denoted by ICF) [23]. ICF and ICF-CY are frequently used as conceptual frameworks in disability evaluation, assessment and classification in machine learning. The ICF-CY framework considers a number of self-care activities, including drinking, eating, caring for body parts, washing oneself, etc.; self-care activities are defined as a subset of the activities and participation component in ICF-CY and usually span from birth to 7 years old [23]. For instance, children with cerebral palsy usually have problems with self-care skills due to their limitations in thinking [24]. Detecting the self-care problems of children is an important task that supports the process of choosing treatment approaches. Due to the diversity and complexity of self-care problems and the lack of professional therapists, an expert system for self-care problem classification can help therapists choose the best treatment approach for specific cases.
In practical datasets, data often has an uneven distribution between classes; this is called the class imbalance problem (CIP). This problem makes prediction difficult and reduces the performance of traditional classifiers. There are many approaches for dealing with class imbalance, among which the most advanced is the sampling technique. Le et al. [2] utilized several oversampling techniques to improve the performance of the bankruptcy prediction task. Ijaz et al. [25] used DBSCAN-based outlier detection, the Synthetic Minority Over-sampling Technique (SMOTE) and Random Forest for Type 2 diabetes and hypertension identification. Recently, Bang et al. [26] utilized an adaptive data-boosting technique for speech emotion detection in emotionally-imbalanced small-sample environments. SCADI (Self-Care Activities Dataset based on ICF-CY), the first standard dataset of self-care activities of children with physical and motor disabilities based on the ICF-CY framework, was introduced by Zarchi, Bushehri & Dehghanizadeh [27] in 2018. In this dataset, the authors collected the self-care activities of 70 children in seven classes. Each child was carefully assessed through 29 self-care activities based on ICF-CY. In that study, the authors also proposed an ANN-based system for the self-care problem classification of children with physical and motor disabilities. However, this dataset has three issues that reduce the performance of the system. Therefore, this study proposes a robust framework for dealing with these issues to obtain good prediction performance. Briefly, the main contributions of this study can be summarized as follows. (1) A preprocessing module to convert SCADI with 205 features to SCADI_2 with only 31 features was proposed. (2) An FSX framework using an oversampling technique to deal with the class imbalance problem in identifying the self-care problems of children with physical and motor disabilities was proposed. (3) A feature importance analysis was conducted to identify the important features. The results of the study achieved 85.4% accuracy, which is expected to improve diagnosis as well as counseling and treatment. The rest of this paper is structured as follows. Section 2 gives related works on the class imbalance problem. Section 3 presents the materials and methods of this study, in which the SCADI dataset is summarized and we propose a robust framework with three modules: preprocessing, oversampling and classification. The experiment is conducted and presented in Section 4 to show the effectiveness of the developed framework for self-care problem classification. Finally, the conclusion of this study is presented in Section 5.

Related Works
Due to the class imbalance problem in the SCADI dataset, the performance of the classifier will be reduced. Therefore, a specific technique should be used for handling this problem. Currently, many methods exist to deal with the class imbalance problem, which can be grouped into the four following categories [28]. (1) Algorithm-level approaches modify existing classifier learning algorithms to bias the learning toward the minority class [29,30] without adjusting the training data. (2) Data-level approaches change the class distribution by resampling the data space [31,32] to improve predictive performance. Data-level approaches consist of three strategies: under-sampling, oversampling and hybrid techniques. This way, modification of the learning algorithm is avoided, since the effect caused by imbalance is decreased by the resampling step. (3) The cost-sensitive learning framework falls between the data- and algorithm-level approaches. It combines data-level transformations (by adding costs to instances) and algorithm-level modifications (by modifying the learning process to accept costs) [33,34]. The classifier is biased toward the minority class by assuming higher misclassification costs for this class and seeking to minimize the total cost errors of both classes. (4) Ensemble-based methods usually combine an ensemble learning algorithm with one of the techniques above, specifically data-level or cost-sensitive ones [35]. By combining the data-level approach with the ensemble learning algorithm, the resulting hybrid method usually preprocesses the data before training each classifier, whereas cost-sensitive ensembles, instead of modifying the base classifier to accept costs in the learning process, guide the cost minimization via the ensemble learning algorithm. These four kinds of methods are used, depending on the dataset, to improve performance.
Among the above approaches for handling the class imbalance problem, data-level approaches are quite popular. Some well-known data-level approaches are SMOTE [31], SMOTE-ENN [36] and SMOTE-Tomek [36]. SMOTE generates synthetic data randomly. SMOTE-ENN and SMOTE-Tomek combine an oversampling technique with an undersampling technique: they use SMOTE to generate synthetic data samples and then use ENN or Tomek links to remove noisy samples from the balanced dataset.

Materials and Methods
As per the discussion in the introduction, this section summarizes the standard dataset, namely SCADI, for self-care problem classification in Section 3.1. Next, we analyse three problems of this dataset and give three solutions for solving them in Sections 3.2 and 3.3. Then, the extreme gradient boosting model is summarized in Section 3.4. The overall framework, as the main contribution of this study, is presented in Section 3.5.

SCADI Dataset
The experimental dataset, namely SCADI [27], was collected by Zarchi et al. in collaboration with two professional therapists. SCADI can be considered the first standard dataset of self-care activities of children with physical and motor disabilities, as affirmed in that article [27]. Twenty-nine activities in eight categories, including washing oneself, caring for body parts, toileting, dressing, eating, drinking, looking after one's health and looking after one's safety, were considered as self-care activities (see Table 1). Each self-care activity listed above has one of seven values representing the level of impairment: NO impairment, MILD impairment, MODERATE impairment, SEVERE impairment, COMPLETE impairment, NOT Specified and NOT Applicable. Each of these statuses is one binary feature in the SCADI dataset. Hence, the total number of such features in the SCADI dataset is 203 (29 × 7). Additionally, the age and gender of participants are added to these features, namely F1 and F2 respectively. In the gender feature, a male participant is represented by 0 and a female participant by 1. In summary, SCADI consists of 205 features for each child. To collect the records of the SCADI dataset, 70 children studying in three educational and health centers in Yazd, Iran, during the period from 2016 to 2017 were investigated. The ages of these children ranged from 6 to 18 years. The seventy children were categorized into seven groups based on their self-care activities by professional therapists, as shown in Table 2; these groups are used as the target class in the medical diagnosis.

Preprocessing
In the SCADI dataset, we observe that the first problem is that using seven binary features for each activity gives the data more dimensions and makes it susceptible to interference. Therefore, we propose a modification of this dataset that uses only one feature per activity. In the modified dataset, we use seven levels from 0 to 6 to represent the seven statuses: NO impairment, MILD impairment, MODERATE impairment, SEVERE impairment, COMPLETE impairment, NOT Specified and NOT Applicable, respectively. Therefore, the modified dataset has 31 features: 29 activities, age and gender. Figure 1 shows the visualization of the SCADI_2 dataset using the PCA technique.
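As an illustration of this preprocessing step, the sketch below collapses the seven one-hot status columns of each activity into a single ordinal level in 0..6. The column layout assumed here (age and gender first, followed by 29 blocks of 7 status columns) and the function name are assumptions made for this sketch, not the paper's actual code.

```python
import numpy as np

# Hypothetical sketch: collapse the 7 one-hot status columns of each
# activity into one ordinal value in 0..6 (0 = NO impairment, ...,
# 6 = NOT Applicable). Assumed layout: columns 0-1 are age (F1) and
# gender (F2), then 29 blocks of 7 binary status columns.
def collapse_one_hot(X):
    X = np.asarray(X)
    age_gender = X[:, :2]                      # F1, F2 kept as-is
    status = X[:, 2:].reshape(len(X), 29, 7)   # 29 activities x 7 statuses
    levels = status.argmax(axis=2)             # index of the active status
    return np.hstack([age_gender, levels])     # 2 + 29 = 31 features

# Toy usage: one child with every activity at MODERATE impairment (level 2)
row = np.zeros((1, 2 + 29 * 7))
row[0, 0], row[0, 1] = 10, 1                   # age 10, female
for a in range(29):
    row[0, 2 + a * 7 + 2] = 1
out = collapse_one_hot(row)
print(out.shape)                               # (1, 31)
print(out[0, 2:5])                             # [2. 2. 2.]
```

This reduces the 205-column representation to the 31-column SCADI_2 form described above.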


In addition, the distribution of the seven classes suffers from class imbalance. This issue will reduce the performance of the prediction model. To handle this issue, we use the SMOTE algorithm, introduced in the next subsection, to balance the dataset.

SMOTE Algorithm
To deal with the class imbalance problem in the experimental dataset, we first use the SMOTE algorithm to resample SCADI_2. The SMOTE algorithm [31] generates synthetic samples for the minority classes based on feature-space similarities between the real samples. Fundamentally, for each sample x_i to be used for generation, SMOTE considers its set of k-nearest neighbors under the Euclidean distance, denoted K(x_i). Next, SMOTE randomly selects an element x̂_i in K(x_i) such that x̂_i is in the same class as x_i. The feature vector of the new synthetic sample, denoted x_new, is determined as follows:

x_new = x_i + δ × (x̂_i − x_i),    (1)

where δ ∈ [0, 1] is a random number. According to Equation (1), the new synthetic sample is a point along the line segment joining x_i and x̂_i. The algorithm repeats this strategy until the target ratio is reached, at which point the resampled dataset is nearly balanced. Note that if the SMOTE algorithm does not find any nearest neighbor of a sample x_i, or the number of nearest neighbors is smaller than k, we modify the algorithm to simply duplicate this data point. Figure 2 presents the flowchart of the SMOTE algorithm with our modification for dealing with a very small number of nearest neighbors. Figure 3 shows the visualization of the SCADI_2 dataset using Principal Component Analysis (PCA) after resampling with SMOTE at different balancing ratios: 0.4, 0.6, 0.8 and 1.0.
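The interpolation step of Equation (1), together with the duplication fallback, can be sketched in plain NumPy as follows; this is an illustrative re-implementation for clarity, not the imbalanced-learn library's SMOTE.

```python
import numpy as np

rng = np.random.default_rng(0)

def smote_samples(X_min, n_new, k=5):
    """Generate n_new synthetic minority samples (sketch of SMOTE).

    X_min holds minority-class samples only. If a sample has no other
    same-class point to interpolate with, fall back to duplicating it,
    mirroring the modification described in the text.
    """
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    out = []
    for _ in range(n_new):
        i = rng.integers(n)
        if n == 1:                      # no neighbour at all: duplicate
            out.append(X_min[i].copy())
            continue
        # Euclidean distances to the other minority samples
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        d[i] = np.inf
        neigh = np.argsort(d)[:min(k, n - 1)]
        j = rng.choice(neigh)
        delta = rng.random()            # delta in [0, 1]
        # x_new = x_i + delta * (x_j - x_i): a point on the segment
        out.append(X_min[i] + delta * (X_min[j] - X_min[i]))
    return np.array(out)

X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
synth = smote_samples(X_min, n_new=4, k=2)
print(synth.shape)   # (4, 2)
```

Every synthetic point lies on a segment between two real minority samples, which is what keeps SMOTE's new points inside the minority region rather than scattered arbitrarily.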




Extreme Gradient Boosting
Chen and Guestrin [37] proposed Extreme Gradient Boosting (XGBoost), which is summarized as follows. Let D = {(x_i, y_i)}, i = 1..n, where x_i is a vector of m features and y_i ∈ ℝ is the label value. XGBoost is a tree ensemble model, i.e., a set of classification and regression trees (CARTs). The prediction function for a sample x_i is as follows:

ŷ_i = Σ_{k=1}^{K} f_k(x_i),  f_k ∈ F,    (2)

where F is the set of all possible CARTs and ŷ_i is the predicted value. The objective function can be written as follows:

Obj(θ) = Σ_i l(y_i, ŷ_i) + Σ_k Ω(f_k),    (3)

where θ represents the parameters of the CARTs, Σ_i l(y_i, ŷ_i) is the summation of the loss function between true and predicted labels, and Ω(f) is a regularization term used to measure the complexity of the model. At each iteration t, the model searches for a function f_t ∈ F that optimizes the objective and adds this function to the ensemble. Let ŷ_i^(t) be the predicted label of vector x_i at the t-th iteration. The model finds f_t to optimize the following objective:

Obj^(t) = Σ_i l(y_i, ŷ_i^(t−1) + f_t(x_i)) + Ω(f_t).

A second-order Taylor expansion is used to approximate the objective at step t as follows:

Obj^(t) ≈ Σ_i [l(y_i, ŷ_i^(t−1)) + g_i f_t(x_i) + (1/2) h_i f_t²(x_i)] + Ω(f_t),    (4)

where g_i and h_i are the first- and second-order derivatives of the loss with respect to the prediction ŷ_i^(t−1). The model iteratively adds functions that optimize Equation (4) for a user-specified number of iterations. In addition, the authors defined the regularization as follows:

Ω(f) = γT + (1/2) λ ‖w‖²,    (6)

where T is the number of leaves and w is the vector of leaf weights. Using Equations (4) and (6), XGBoost effectively determines the objective, improving the learning time.
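To make the second-order statistics concrete, the sketch below computes g_i and h_i (the first and second derivatives of the loss with respect to the current prediction) for the squared-error loss, and the resulting optimal leaf weight w* = −Σg_i / (Σh_i + λ) from the XGBoost paper. The loss choice and the toy numbers are assumptions made for illustration only.

```python
import numpy as np

# Illustration of the first/second-order statistics g_i, h_i used by
# the second-order Taylor expansion, for the squared-error loss
# l(y, yhat) = 0.5 * (y - yhat)^2 (chosen for this sketch).
y = np.array([1.0, 0.0, 1.0])
yhat = np.array([0.8, 0.3, 0.4])    # predictions from the first t-1 trees

g = yhat - y                         # g_i = d l / d yhat
h = np.ones_like(y)                  # h_i = d^2 l / d yhat^2 = 1

# Optimal weight of a leaf containing all three samples,
# w* = -sum(g) / (sum(h) + lambda), with L2 regularisation lambda:
lam = 1.0
w_star = -g.sum() / (h.sum() + lam)
print(round(w_star, 3))              # 0.125
```

The new tree's leaves thus move each prediction a regularized step against the loss gradient, which is why only g_i and h_i, not the raw loss, are needed during tree construction.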

The Robust Framework
Figure 4 shows the architecture of the FSX framework. First, the SCADI dataset is passed through the preprocessing module, which modifies it to produce the SCADI_2 dataset as described in Section 3.2. This dataset is then divided into training and testing sets by 10-fold cross validation. The training set is resampled by the SMOTE algorithm summarized in Section 3.3; this step helps improve prediction performance. The balanced dataset resulting from the previous step is used to train the XGBoost classifier summarized in Section 3.4. In the testing phase, the testing set is used to predict the problems of new cases. Based on this prediction, the doctor can give counseling and treatment to patients. The pseudocode of FSX is introduced in Algorithm 1.
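A minimal sketch of this training loop is shown below. Two stand-ins are assumed so that the sketch runs with scikit-learn alone: naive duplication oversampling replaces SMOTE, and GradientBoostingClassifier replaces XGBoost; the dataset is synthetic rather than SCADI_2.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold

# Sketch of the FSX training loop. Stand-ins (assumptions): duplication
# oversampling instead of SMOTE, GradientBoostingClassifier instead of
# XGBoost, and a synthetic imbalanced 3-class dataset.
rng = np.random.default_rng(0)

def oversample(X, y):
    """Duplicate minority samples until classes are balanced (stand-in)."""
    Xs, ys = [X], [y]
    counts = np.bincount(y)
    for c, n in enumerate(counts):
        if n < counts.max():
            idx = rng.choice(np.where(y == c)[0], counts.max() - n)
            Xs.append(X[idx]); ys.append(y[idx])
    return np.vstack(Xs), np.concatenate(ys)

X, y = make_classification(n_samples=120, n_classes=3, n_informative=6,
                           weights=[0.6, 0.3, 0.1], random_state=0)
accs = []
for tr, te in StratifiedKFold(n_splits=5).split(X, y):
    Xb, yb = oversample(X[tr], y[tr])          # balance the training fold only
    clf = GradientBoostingClassifier(random_state=0).fit(Xb, yb)
    accs.append(clf.score(X[te], y[te]))
print(round(float(np.mean(accs)), 2))
```

Note that, as in FSX, only the training fold is oversampled; the test fold keeps its natural class distribution so the reported accuracy is not inflated by synthetic points.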

Oversampling Comparison
To choose the best oversampling technique for the SCADI dataset, we perform an oversampling comparison experiment that examines several oversampling techniques, including SMOTE [31], SMOTE-ENN [36] and SMOTE-Tomek [36], with various balancing ratios for the oversampling module in the FSX framework. Based on the results shown in Table 3, SMOTE with a balancing ratio of 0.8 is the best oversampling technique for the SCADI dataset, yielding 85.4% accuracy.

Algorithm 1. The FSX framework.
1  Input: the SCADI dataset and the set of balancing ratios R
2  Let ACC_best be 0 and M_best be empty
3  For each r in R:
4      Split the SCADI dataset using k-fold cross validation
5      Balance the training set using SMOTE with a balancing ratio equal to r
6      Use the balanced dataset from the previous step to train XGB_r
7      Evaluate XGB_r using the testing set and obtain ACC_r
8      If ACC_r > ACC_best then
9          ACC_best = ACC_r
10         M_best = XGB_r
11 End for
12 Return M_best and ACC_best


Performance Evaluation
This subsection reports the performance of the proposed method with a balancing ratio of 0.8, compared with the ANN-based system of Reference [27], SVM and Random Forest (denoted by RF). According to Table 4, our proposed framework improves prediction performance compared with these state-of-the-art methods for self-care problem classification in medical diagnosis. The proposed framework reaches the best performance at 85.4%, while ANN, SVM and RF achieve 83.1%, 78.0% and 83.4%, respectively. The paired t-test was used to compare the performances of ANN and FSX. We ran both methods on the SCADI dataset 30 times and obtained a p-value of 0.0002 for the two sets of performances, which is smaller than the threshold (0.05). Therefore, the proposed approach is better than ANN in terms of accuracy.
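The significance check described above can be sketched with SciPy's paired t-test. The accuracy values below are made up for illustration and are not the paper's measurements.

```python
import numpy as np
from scipy.stats import ttest_rel

# Paired t-test on per-run accuracies of two methods (illustrative
# numbers, not the paper's 30-run measurements).
acc_fsx = np.array([0.85, 0.86, 0.84, 0.85, 0.87, 0.85])
acc_ann = np.array([0.83, 0.84, 0.82, 0.83, 0.84, 0.83])

t_stat, p_value = ttest_rel(acc_fsx, acc_ann)
print(p_value < 0.05)   # significant at the usual 0.05 threshold?
```

A paired test is the right choice here because both methods are evaluated on the same repeated runs, so the per-run differences, not the raw scores, carry the evidence.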

Feature Importance
In this subsection, the feature importance of each feature over the 10-fold cross validation is reported in Figure 5. Based on these results, we can divide all features into three groups. The first group includes F2, F1, F4, F17 and F23, which are all very important features; if one of these values is missing, the model may not perform well. The second group consists of F28, F6, F25, F12, F3, F27, F31, F16, F29, F13, F5, F20, F9, F7 and F26, which are still important for improving classification performance, although their influence is not equal to that of the first group. The rest of the features, including F18, F14, F8, F21, F11, F24, F15, F30, F22, F19 and F10, can be removed. This means that during data collection, we do not need to record the features in the third group.
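The grouping procedure can be sketched as follows: train a tree ensemble, read off its feature importances and bucket the features by thresholds. The model, synthetic data and threshold values are illustrative assumptions, not the settings used in the paper.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Rank features by importance and split them into "very important",
# "helpful" and "removable" groups. Thresholds are illustrative.
X, y = make_classification(n_samples=200, n_features=10, n_informative=3,
                           random_state=0)
imp = RandomForestClassifier(random_state=0).fit(X, y).feature_importances_

order = np.argsort(imp)[::-1]              # features, most important first
top = order[imp[order] >= 0.10]            # very important
mid = order[(imp[order] >= 0.02) & (imp[order] < 0.10)]
low = order[imp[order] < 0.02]             # candidates for removal
print(len(top) + len(mid) + len(low))      # 10: every feature is grouped
```

Dropping the low-importance bucket before data collection, as suggested above, shortens the assessment without touching the features the model actually relies on.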

Conclusions
This study proposed a robust framework, namely FSX, to improve the prediction performance for SCADI, a standard dataset for self-care problem classification in medical diagnosis. This framework consists of three modules: a preprocessing module, an oversampling module using SMOTE and a classification module using the extreme gradient boosting model. The preprocessing module converts the original dataset into a modified dataset with only 31 features. The second module balances the training dataset to improve prediction performance. The last module uses a modern classification model, namely XGBoost, to achieve the best performance. The first experiment utilized several well-known oversampling techniques, including SMOTE, SMOTE-ENN and SMOTE-Tomek, with various balancing ratios to find the best oversampling method as well as the optimal balancing ratio for the oversampling module. The experimental results show that our proposed framework outperforms the state-of-the-art methods on the SCADI dataset. FSX reaches 85.4% accuracy for self-care problem classification in medical diagnosis.


Figure 3. Visualization of the SCADI_2 dataset after resampling data using SMOTE with different balancing ratios.


Table 1. The SCADI dataset of self-care activities.

Table 2. The seven classes of self-care problems in the SCADI dataset.


Table 3. Performance results of several oversampling techniques with various balancing ratios.

Table 4. Performance results in terms of the accuracy of the experimental approaches in 10-fold cross validation.

Table 5 shows the sensitivity and specificity of the experimental approaches in 10-fold cross validation. Random Forest achieved the best result at 0.786 in terms of sensitivity, while FSX achieved the best result at 0.963 in terms of specificity.

Table 5. Performance results in terms of sensitivity and specificity of the experimental approaches in 10-fold cross validation.