Classification Models of Action Research Arm Test Activities in Post-Stroke Patients Based on Human Hand Motion

The Action Research Arm Test (ARAT) presents a ceiling effect that prevents the detection of improvements produced with rehabilitation treatments in stroke patients with mild finger joint impairments. The aim of this study was to develop classification models to predict whether activities with similar ARAT scores were performed by a healthy subject or by a subject post-stroke using the extension and flexion angles of 11 finger joints as features. For this purpose, we used three algorithms: Support Vector Machine (SVM), Random Forest (RF), and K-Nearest Neighbors (KNN). The dataset presented class imbalance, and the classification models presented a low recall, especially in the stroke class. Therefore, we implemented class balance using Borderline-SMOTE. After data balancing the classification models showed significantly higher accuracy, recall, f1-score, and AUC. However, after data balancing, the SVM classifier showed a higher performance with a precision of 98%, a recall of 97.5%, and an AUC of 0.996. The results showed that classification models based on human hand motion features in combination with the oversampling algorithm Borderline-SMOTE achieve higher performance. Furthermore, our study suggests that there are differences in ARAT activities performed between healthy and post-stroke individuals that are not detected by the ARAT scoring process.


Introduction
Machine learning (ML) is considered a subfield of computer science where knowledge from artificial intelligence and statistics is applied to the generation of computational models. ML algorithms can learn automatically and generate a model from input data without being explicitly programmed to produce a particular output [1]. The learning process is performed through training and the dataset for training is known as training data [2]. The classification algorithms belong to supervised learning. In supervised learning, the algorithm learns through a labeled data set (for example, a set of images labeled as containing a cat or a lion), where each training data sample is presented in the form of an input value with an output label [3]. The algorithm trains a model that, from the input values, can predict the correct response based on the features defined in the process [2]. The classification problems are commonly categorized into binary classification and multi-class classification. The dataset is classified into two classes in binary classification, while in multi-class classification, the given dataset is classified into more than two classes [4]. In recent years classification models have been used in various clinical applications, for example, in medical diagnosis, disease classification, prediction of clinical outcomes, and treatment responses [5][6][7][8]. There are several ML classification algorithms, three of the most widely used in healthcare applications are Support Vector Machine (SVM), Random Forest (RF), and K-Nearest Neighbors (KNN) [9][10][11]. However, in most classification problems in real-life applications, the sample sizes between the different classes are unbalanced [12]. Therefore, the imbalanced data problem is commonly solved with oversampling and between the activities of healthy and post-stroke subjects that are not detected with the ARAT scoring method. On the other hand, this study has a second objective: (II) Evaluate if data class balancing using the Borderline-SMOTE method allows better performanceclassifiers to be obtained. This paper is structured as follows. The data collection and the machine learning architecture development is presented in Section 2. Section 3 presents the results of the performance and comparison of the classification models before and after data class balancing. The discussion is presented in Section 4. Finally, Section 5 concludes this paper.

Materials and Methods
In this work, we selected three of the most widely used and best-performing classification algorithms in machine learning (ML), which are Support Vector Machine (SVM), Random Forest (RF), and K-Nearest Neighbors (KNN). The development of each of the classification models is presented in this section.

Data Source
The data in this study were obtained from previous studies and comes from the publicly available Finger Joints Angles ARAT database [30]. This section briefly describes the experimental protocol for data acquisition. The extension and flexion angles of eleven finger joints in healthy subjects [31] and post-stroke patients [32] were measured during the performance of sixteen activities of the Action Research Arm Test (ARAT) corresponding to the subtests (grasp, grip, and pinch) using an instrumented glove (Cyberglove Systems LLC; San Jose, CA, USA). The eleven finger joints angles recorded were: thumb carpometacarpal (CMC) joint, thumb, index, middle, ring, and little metacarpophalangeal (MCP) joints, thumb interphalangeal (IP) joint, and index, middle, ring, and little proximal interphalangeal (PIP) joints. For more information about the study protocol, please refer to [31,32]. The information obtained allowed us to construct a dataset composed of the flexion and extension angles of 25 healthy subjects during the performance of 400 ARAT activities and 12 post-stroke patients during the execution of 144 ARAT activities. All the activities were completed with an ARAT score of 2 or 3.

Data Preprocessing
In the dataset each sample was labeled according to the class to which each activity belonged (Control = 0; Stroke = 1). There were 800 cases in class '0' and 288 cases in class '1'. Table 1 shows the structure of the dataset used in the model. There were no missing values in the dataset and therefore all 1088 samples were used.  In the dataset we identified the input and output variables, the input was known as feature, and the output was known as response. The dataset has 14 features; the first three correspond to "Activity", "Subtest", and "Motion (Extension or Flexion)", and the remaining eleven correspond to the angles of the finger joints. On the other hand, the response variable is the class to which the subject corresponds {0-Control, 1-Stroke}. Hence, the categorical features in the dataset (x1, x2, x3) were transformed into binary values using one-Hot encoding method. One-Hot encoding transforms a single categorical variable with p observations and d distinct values, to d binary variables with p observations each. Therefore, each distinct value is converted into a new column and assigned a binary value indicating the true (1) or false (0) value to the column. Importantly, other demographic characteristics were not considered in the ML classifiers because the objective of the study was focused on assessing hand motion information.

Validation Methods
In order to evaluate the overall performance of the classification models, two validation methods were used. The first method used was the hold-out, in which we split our dataset into two parts, a training set and a test set. In each classification model we used 75% of the data as the training set and the remaining 25% as the test set. In addition, the second method used to measure the classification model's performance was the 5x2cv test. Unlike the common hold-out method, in which we usually split the data set into two parts: a training set and a test set, in the 5x2cv test, we performed five replications of two-fold cross-validation. In each replication, the dataset is divided into two equal sets (50% training and 50% test data).

Tuning Hyperparameters
Hyperparameters are user-adjustable parameters that can vary in quantity from one model to another. There are several computational methods to find the optimal hyperparameters of the model. Two of the most commonly used are GridSearchCV and Randomized Search CV. We decided to use the GridSearchCV; this technique uses all possible permutations of the hyperparameters of a given model. The performance of each model was then evaluated, and the best hyperparameter values were selected. In addition, GridSearchCV has the advantage of performing a K-Fold cross validation, where the number of folds is specified by its cv parameter. If it is not specified, it applied a five-fold cross validation by default. Cross-validation is a technique to identify different problems during model training, such as the occurrence of overfitting. To do this, GridSearchCV will split the training data into training and test partitions to tune the hyperparameters on these data [33].

Classification Metrics
Evaluation metrics for classification models can be applied in two phases. Firstly, in the training phase, to produce a more accurate prediction result in the future evaluation of the classification model. Subsequently, in the testing phase, evaluation metrics are used to measure the efficacy of the classifier when tested on unseen data [34]. Therefore, knowing the different metrics and making the proper selection is crucial to improving the model's performance. Therefore, to evaluate the performance of the classification models, we used the following metrics in this work: confusion matrix (not a metric but fundamental to the others), precision, accuracy, recall, F1-score, area under the receiver operating characteristics (AU-ROC), and the classification report. The formulas corresponding to the evaluation metrics are shown in Table 2. Table 2. Formulas of evaluation metrics.

Metric Formula Description
Accuracy TP+TN TP+FP+TN+FN Accuracy measures the ratio of the number of correct predictions over the total number of predictions. Therefore, accuracy measures how often the classifier correctly predicts.

TP TP+FP
Precision measures the positive instances that are correctly predicted from the total predicted instances in a positive class.

TP TP+FN
Precision measures the positive instances that are correctly predicted from the total predicted instances in a positive class.
F1-score 2 × Recall×Precision Recall+Precision F-score or F1-score evaluates the recall and precision at the same time. Therefore, F-score is maximum if the recall is equal to the precision.

Confusion Matrix
A confusion matrix is a useful tool for analyzing the performance of classification models when tested on unseen data. A confusion matrix is a cross table of true labels versus model predictions. Each row of the confusion matrix represents instances of an actual class, and each column represents instances of a predicted class [34]. Typically, it is used for binary classification problems but can also be applied to multi-class classification problems. In Figure 1 a binary confusion matrix of 2 × 2 is shown. classifier correctly predicts.
Precision + Precision measures the positive instances that are correctly predicted from the total predicted instances in a positive class.
Recall + Precision measures the positive instances that are correctly predicted from the total predicted instances in a positive class.
F1-score 2 × × + F-score or F1-score evaluates the recall and precision at the same time. Therefore, F-score is maximum if the recall is equal to the precision.

Confusion Matrix
A confusion matrix is a useful tool for analyzing the performance of classification models when tested on unseen data. A confusion matrix is a cross table of true labels versus model predictions. Each row of the confusion matrix represents instances of an actual class, and each column represents instances of a predicted class [34]. Typically, it is used for binary classification problems but can also be applied to multi-class classification problems. In Figure 1 a binary confusion matrix of 2 × 2 is shown.

Over-Sampling Data
In this work, the classes in the data set are slightly unbalanced. Therefore, the imbalanced data problem is solved with oversampling or undersampling techniques. The advantage of oversampling over undersampling is that no samples are lost from the original training set, since all data from the minority and majority classes are preserved [14]. However, in a large dataset, the time and memory consumption could be very large and costly in oversampling. Since the dataset in our study is not huge and the imbalance is mild, we do not face this problem. Therefore, we selected an oversampling technique. We decided to use the Borderline-SMOTE algorithm motivated by the results in studies of arrhythmia detection [15], estimation of brain metastasis [16] and emotion recognition [4].

Over-Sampling Data
In this work, the classes in the data set are slightly unbalanced. Therefore, the imbalanced data problem is solved with oversampling or undersampling techniques. The advantage of oversampling over undersampling is that no samples are lost from the original training set, since all data from the minority and majority classes are preserved [14]. However, in a large dataset, the time and memory consumption could be very large and costly in oversampling. Since the dataset in our study is not huge and the imbalance is mild, we do not face this problem. Therefore, we selected an oversampling technique. We decided to use the Borderline-SMOTE algorithm motivated by the results in studies of arrhythmia detection [15], estimation of brain metastasis [16] and emotion recognition [4].
Borderline-SMOTE is an algorithm derived from SMOTE (Synthetic Minority Oversampling Technique). Borderline-SMOTE generates the synthetic data around the borderline between the two classes [18]. The procedure is as follows: First, we calculate the nearest neighbors in the minority class N in all the training set samples. Next, we identified the nearest neighbors; if the majority correspond to the majority class, the samples are put in a set called Danger. The samples in Danger correspond to the borderline data of the minority class. Then, we selected a random N nearest neighbors for each sample in Danger to create the synthetic data. Therefore, we calculate the distance between the sample and its N nearest neighbors and multiply by a random number between 0 and 1. Finally, the synthetic samples of the minority class are generated: where p j represents the samples in Danger, r j represents a random number between [0, 1], and dif j represents the distance between the samples and the N nearest neighbor.

Statistical Analysis
The three classification models (RF, SVM, KNN) were compared to determine which has the best performance. First, we presented the results obtained with the GridSearchcv technique on the three classification models in different evaluation metrics such as Accuracy, Precision, Recall, F1-score, and AUC-ROC using the hold-out method. Next, the overall performance of the models were obtained and compared with the 5x2cv combined F test [35] using the MLxtend library by Sebastian Raschka [36], which provided the f-statistic and p-value. Subsequently, we evaluated the performance of the three classification models using the hold-out method with several evaluation metrics after balancing the data classes with the technique Borderline-SMOTE. Then, the evaluation metrics of the three models after data balancing were obtained and compared using the 5x2cv combined F test. Finally, each classification model was compared before balancing and after balancing using the 5x2cv t paired test. The statistical analysis was conducted using the software Anaconda (Anaconda Inc, Austin, TX, USA) with Python 3.9. A p-value of less than 0.05 was considered statistically significant for all the statistical analyses.

Results
In this section, the results of the performance and the comparison of the classification models Random Forest (RF), K-nearest Neighbor (KNN), and Support Vector Machine (SVM) are presented.

Hyperparameters Selection
The GridSearchCV technique was applied to each of the classification algorithms (RF, KNN, SVM) developed in this work using a cross validation fold value of five. The best performing hyperparameter values obtained in each of the classification models were as follows.

Classification Models with GridSearchCV
The hyperparameter values obtained were used to evaluate each classifier in the prediction of results in the test set and the results were as follows.

Random Forest
The RF classifier showed an accuracy of 93% and a high precision of 96.5%. In contrast, the recall of 76.4% and the f1-score of 85.3% were low. On the other hand, the classification report presented in Table 3 shows that the recall and the f1-score values were higher in the control class but were lower in the stroke class. In contrast, the precision was higher in the stroke class, as is shown in the confusion matrix in Figure 2. The KNN classifier presented a high precision of 95.3% and an accuracy of 87.9%. In contrast, the recall of 56.9 % and the f1-score of 71% were low. Table 4 shows that the recall and f1-score were higher in the control class. While the precision was low in the control class as is shown in Figure 3. higher in the stroke class, as is shown in the confusion matrix in Figure 2.

K-nearest Neighbor
The KNN classifier presented a high precision of 95.3% and an accuracy of 87.9%. In contrast, the recall of 56.9 % and the f1-score of 71% were low. Table 4 shows that the recall and f1-score were higher in the control class. While the precision was low in the control class as is shown in Figure 3.

Support Vector Machine
The SVM classifier showed a high precision of 98.3% and a high accuracy of 94.5%. In contrast, the SVM classifier showed a recall of 80.5% and an f1-score of 88.5%. However, the classification report in Table 5 showed high values in precision, recall, and f1-score in the control class and in the precision of the stroke class, as is shown in the confusion matrix in Figure 4.

Support Vector Machine
The SVM classifier showed a high precision of 98.3% and a high accuracy of 94.5%. In contrast, the SVM classifier showed a recall of 80.5% and an f1-score of 88.5%. However, the classification report in Table 5 showed high values in precision, recall, and f1-score in the control class and in the precision of the stroke class, as is shown in the confusion matrix in Figure 4.

Support Vector Machine
The SVM classifier showed a high precision of 98.3% and a high accuracy of 94.5%. In contrast, the SVM classifier showed a recall of 80.5% and an f1-score of 88.5%. However, the classification report in Table 5 showed high values in precision, recall, and f1-score in the control class and in the precision of the stroke class, as is shown in the confusion matrix in Figure 4.   Table 6 shows the mean values and standard deviation of several evaluation metrics for the RF, SVM, and KNN classifiers. The results showed that the three models have  Table 6 shows the mean values and standard deviation of several evaluation metrics for the RF, SVM, and KNN classifiers. The results showed that the three models have similar precision, and no significant differences were found among the three classifiers (p > 0.05). The SVM classifier showed significantly higher accuracy and f1-score than the KNN classifier. In contrast, no significant differences (p > 0.05) were found in accuracy and f1-score between the SVM and RF classifiers. In addition, the SVM classifier showed a significantly higher recall and AUC than the RF and the KNN classifiers.

Borderline-SMOTE Data Balancing
The dataset in this work presented a mild case of imbalanced data between the two classes (control and stroke) as is shown in Figure 5. The results presented earlier in this work showed a high accuracy in the three classifiers. However, the accuracy metric is not a good indicator when there are imbalanced classes, as in this case. In contrast, the classifiers KNN and RF showed a lower recall especially in the classification of subjects of the control class. Therefore, to optimize the performance of the classifiers, we decided to use the Borderline-SMOTE algorithm for data oversampling. In contrast, Figure 6 shows the classes after the data balancing with Borderline-SMOTE. a good indicator when there are imbalanced classes, as in this case. In contrast, the classifiers KNN and RF showed a lower recall especially in the classification of subjects of the control class. Therefore, to optimize the performance of the classifiers, we decided to use the Borderline-SMOTE algorithm for data oversampling. In contrast, Figure 6 shows the classes after the data balancing with Borderline-SMOTE.

Classification Models with Borderline-SMOTE
The three classification models showed an improvement in the classification of both classes (control and stroke), as is shown in Table 7 after data balancing with Borderline-SMOTE. Previously, the classification models showed difficulty in classifying subjects with stroke due to the class imbalance. The improvement was remarkable, particularly in the metrics of recall and f1-score. The overall performance of the classification models in the test set was evaluated with the following evaluation metrics: accuracy, precision, recall, F1-score, and AUC-ROC. The results were as follows: The SVM classifier showed a precision of 98%, while the RF classifier showed a precision of 96.4%, and the KNN classifier showed a precision of 86.3%. On the other hand, the KNN classifier presented a recall value of 98%, but analyzing the classification report in Table 7 we observed a low recall value of 84% in the control class. In contrast, the SVM classifier presented a recall value of 97.5% and the RF presented a recall of 94% and both classifiers had a uniform recall value in the two classes. In addition, the SVM classifier showed the highest f1-score of 97.7 %, while the RF classifier showed a f1-score of 95.2% and the KNN classifier showed a f1-score of 91.8%. Finally, the AUC-ROC of the three classifiers is shown in Figure 7. Table 7. Classification Report of the three classification models after data balancing (Borderline-SMOTE).

Classification Models with Borderline-SMOTE
The three classification models showed an improvement in the classification of both classes (control and stroke), as is shown in Table 7 after data balancing with Borderline-SMOTE. Previously, the classification models showed difficulty in classifying subjects with stroke due to the class imbalance. The improvement was remarkable, particularly in the metrics of recall and f1-score. The overall performance of the classification models in the test set was evaluated with the following evaluation metrics: accuracy, precision, recall, F1score, and AUC-ROC. The results were as follows: The SVM classifier showed a precision of 98%, while the RF classifier showed a precision of 96.4%, and the KNN classifier showed a precision of 86.3%. On the other hand, the KNN classifier presented a recall value of 98%, but analyzing the classification report in Table 7 we observed a low recall value of 84% in the control class. In contrast, the SVM classifier presented a recall value of 97.5% and the RF presented a recall of 94% and both classifiers had a uniform recall value in the two classes. In addition, the SVM classifier showed the highest f1-score of 97.7 %, while the RF classifier showed a f1-score of 95.2% and the KNN classifier showed a f1-score of 91.8%. Finally, the AUC-ROC of the three classifiers is shown in Figure 7.   Table 8 shows the mean values and standard deviation of several evaluation metrics for the RF, SVM, and KNN classifiers using Borderline-SMOTE. As can be seen, the three classification models have a similar and consistent performance. The results showed no significant differences (p > 0.05) in accuracy, recall, and f1-score among the three classifiers. However, the RF classifier showed significantly higher precision than the KNN classifier. In contrast, the RF and the SVM classifiers showed a significantly higher AUC than the KNN classifier.

Performance Comparison of the Classifiers before and after Data Balancing
In general, the three classification models RF, KNN, SVM showed an improvement after the data balancing process using the Borderline-SMOTE technique in several metrics. In the statistical comparison of the classifiers with unbalanced data and with data balancing, the following results were obtained. In precision metrics, no significant differences were found in the RF and SVM classifiers before data balancing and after data balancing. In contrast, the KNN classifier after data balancing showed significantly lower precision  Table 8 shows the mean values and standard deviation of several evaluation metrics for the RF, SVM, and KNN classifiers using Borderline-SMOTE. As can be seen, the three classification models have a similar and consistent performance. The results showed no significant differences (p > 0.05) in accuracy, recall, and f1-score among the three classifiers. However, the RF classifier showed significantly higher precision than the KNN classifier. In contrast, the RF and the SVM classifiers showed a significantly higher AUC than the KNN classifier.

Performance Comparison of the Classifiers before and after Data Balancing
In general, the three classification models RF, KNN, SVM showed an improvement after the data balancing process using the Borderline-SMOTE technique in several metrics. In the statistical comparison of the classifiers with unbalanced data and with data balancing, the following results were obtained. In precision metrics, no significant differences were found in the RF and SVM classifiers before data balancing and after data balancing. In contrast, the KNN classifier after data balancing showed significantly lower precision than before balancing, as is shown in Figure 8. On the other hand, the RF, SVM, and KNN classifiers showed significantly higher accuracy, recall, f1-score, and AUC after data balancing, as is shown in Figures 8-12.

Discussion
In this work, three binary classification models were developed using the following algorithms: Support Vector Machine (SVM), Random Forest (RF), and K-nearest neighbor (KNN). We used as features, the angles of flexion and extension of 11 finger joints during the performance of the activities of the Action Research Arm Test (ARAT) to classify activities between two classes: healthy subject (0 = Control) and post-stroke patients (1 = Stroke). Importantly, the activities included in the dataset of the stroke group obtained a score of 2-3 on the ARAT. While in all the activities included in the dataset of the control group, a score of 3 was obtained. Therefore, based on the ARAT score all tasks were completed and there is not much difference between the tasks in one group and the other. In this way, using completed tasks based on the ARAT score, the performance of the classifier is evaluated.
The result showed that the SVM classifier had the best performance in the test set before data balancing with a precision of 98.3%, an accuracy of 94.5 %, a recall of 80.5%, f1-score of 88.5% and an AUC of 0.989. In addition, the recall and AUC were significantly higher than the RF and KNN classifiers. However, the recall values in the three classifiers were low especially in the stroke group. The lower recall showed that the model was classifying stroke patients as healthy subjects as is shown in the confusion matrices in Figures  2-4. Besides, we detected a mild imbalance between classes, so we decided to use the oversampling technique Borderline-SMOTE. The results showed that after Borderline-SMOTE the three classifiers showed significantly higher accuracy, recall, f1-score, and AUC. In the recall, there was an increase in the RF (17.6%), SVM (17%), and KNN (41.1%). However, the KNN classifier showed a recall of 84% in the control class while the RF and SVM classifiers had a balanced recall in both classes. In addition, the precision of the KNN was significantly lower after data balancing. Finally, all the classification models showed an AUC > 0.95. In fact, the RF and SVM showed the best performance after data balancing and no significant differences were found in any metric between the two classifiers. However, the SVM showed a high accuracy, recall and f1-score and therefore a more balanced

Discussion
In this work, three binary classification models were developed using the following algorithms: Support Vector Machine (SVM), Random Forest (RF), and K-nearest neighbor (KNN). We used as features, the angles of flexion and extension of 11 finger joints during the performance of the activities of the Action Research Arm Test (ARAT) to classify activities between two classes: healthy subject (0 = Control) and post-stroke patients (1 = Stroke). Importantly, the activities included in the dataset of the stroke group obtained a score of 2-3 on the ARAT. While in all the activities included in the dataset of the control group, a score of 3 was obtained. Therefore, based on the ARAT score all tasks were completed and there is not much difference between the tasks in one group and the other. In this way, using completed tasks based on the ARAT score, the performance of the classifier is evaluated.
The result showed that the SVM classifier had the best performance in the test set before data balancing with a precision of 98.3%, an accuracy of 94.5 %, a recall of 80.5%, f1-score of 88.5% and an AUC of 0.989. In addition, the recall and AUC were significantly higher than the RF and KNN classifiers. However, the recall values in the three classifiers were low especially in the stroke group. The lower recall showed that the model was classifying stroke patients as healthy subjects as is shown in the confusion matrices in Figures 2-4. Besides, we detected a mild imbalance between classes, so we decided to use the oversampling technique Borderline-SMOTE. The results showed that after Borderline-SMOTE the three classifiers showed significantly higher accuracy, recall, f1-score, and AUC. In the recall, there was an increase in the RF (17.6%), SVM (17%), and KNN (41.1%). However, the KNN classifier showed a recall of 84% in the control class while the RF and SVM classifiers had a balanced recall in both classes. In addition, the precision of the KNN was significantly lower after data balancing. Finally, all the classification models showed an AUC > 0.95. In fact, the RF and SVM showed the best performance after data balancing and no significant differences were found in any metric between the two classifiers. However, the SVM showed a high accuracy, recall and f1-score and therefore a more balanced performance. Hence, The SVM was the model with the best performance after data balancing with Borderline-SMOTE and before data balancing.
The results in our study showed that after data balancing using Borderline-SMOTE the classification models showed an improved performance as in other research. Reddy et al. applied the Borderline-SMOTE algorithm on convolutional neural networks (CNN) to detect arrhythmias using electrocardiogram signals. Their results showed a significantly higher f1-score and accuracy after the use of Borderline-SMOTE [15]. The results of our work were similar, where the three classification models obtained significantly higher accuracy, recall and f1-score after using the Borderline-SMOTE technique. On the other hand, Chang et al. presented a study for emotion recognition with electroencephalogram (EEG) signals using data augmentation with the Borderline-SMOTE method [4]. They compared traditional machine learning methods, and their results showed that SVM and XGBoost had better performance in average accuracy and average macro f1-score than decision tree and KNN models. Our work obtained similar results where the SVM algorithm showed higher accuracy and f1-score than the KNN and RF models. However, our study used a binary classification analysis, and in the study of Chang et al., a multiclass classification method was used.
In addition, there are few studies that have developed machine learning models with data obtained from the ARAT test. Dutta et al. developed an SVM classifier with an accuracy of 92% to predict ARAT scores in patients with different degrees of disability. They used a glove with six flex sensor and three force sensors and a motion processing unit [26]. In contrast, Dutta et al. predicted ARAT scores (0, 1, 2, 3) using a multiclass classification model and presented problems in classifying classes 0 and 2. On the other hand, Rheme et al. used an SVM model to predict good and poor motor outcomes of stroke patients based on the ARAT score, grip force index, and magnetic resonance imaging (fMRI) using a SVM classifier with an accuracy of 76% [37]. However, the outcome of patients with initially moderate impairment could not be predicted with the information of the ARAT score and the grip force test. The results of the above studies demonstrate that it is difficult to predict in post-stroke subjects with moderate impairments based on the ARAT scoring process which confirms what was established in our study. Therefore, the difference with our work is that our study was not based on the ARAT score, but on the information of eleven finger joints angles obtained with the human hand motion system composed by a data glove with 18 flexion sensors during the performance of the ARAT. In addition, in our study we obtained an accuracy of 97.8% with the SVM model, 97.1% with the RF model, and 94.8% with the KNN model classifying ARAT activities. Therefore, our classification models presented better results than those presented in the studies by Dutta et al. and Rheme et al. Furthermore, to our knowledge, there are no previous studies that have developed machine learning models to classify ARAT activities using the range of motion of the finger joints as features and, therefore, a direct comparison with other studies will be biased as most of these use demographic characteristics and scores on several outcome measures as features.
For these reasons, our results demonstrate that using human hand motion information allows the development of high-performance Machine Learning models (SVM, KNN, RF) for classifying ARAT activities. However, to achieve these results it was necessary to balance the classes using the Borderline-SMOTE algorithm. Importantly, we demonstrate that there are differences not detected by the ARAT scoring process that are limited only to evaluate the quality performance. Therefore, these results are of clinical relevance for physiotherapists and other health care professionals who can use a classification model for the detection of finger joint impairments not only in people post-stroke but also after surgical procedures, hand injuries and other hand disorders.
Nevertheless, the present study had some limitations. We were limited to the use of traditional machine learning classifiers, but the use of Ensemble Machine Learning methods has shown very good results in clinical studies [38,39], so that in future work we could implement an ensemble classifier and compare the results. In addition, deep learning has also been used for classification problems in the area of healthcare, especially Convolutional neural networks (CNNs), showing good results in combination with Borderline-SMOTE [4,15]. Moreover, Fraiwan et al. used deep transfer learning to detect scoliosis and spondylolisthesis [40]. Therefore, the use of deep learning should also be considered for future studies. On the other hand, other research to improve the performance of disease predictions uses the development of a classification algorithm based on a multi-level iterative influence measure that would be interesting to use in future research [41]. Another limitation we had is that it was not possible to access patient demographic data. This would have allowed us to evaluate the impact of these features on the classification algorithm. Finally, we limited ourselves to evaluating patients with good upper extremity function according to the ARAT score. Therefore, it would be interesting to evaluate subjects with different degrees of upper extremity impairment to perform a classification model to predict ARAT scores based on information from the eleven finger joints. On the other hand, it is essential to consider the recall results since the activities of post-stroke patients identified as the control group could have been performed in the same way as a healthy person. Therefore, the bias would not be in the classifier model, but in the fact that there were no significant differences in certain activities between post-stroke patients and healthy subjects.

Conclusions
In this study, we present the novel development of classification models based on human hand motion features in combination with the oversampling algorithm Borderline-SMOTE. The classifiers with unbalanced data showed a low recall and f1-score especially in the stroke class, after the implementation of Borderline-SMOTE the three classifiers showed a significantly higher accuracy, recall, f1-score, and AUC. However, the SVM classifier showed the higher performance with a precision of 98%, a recall of 97.5% and an AUC of 0.996 after data balancing. Therefore, the results showed that the classification models using Borderline-SMOTE achieve a higher performance. In addition, the high performance of the classifiers showed that there are differences between the activities performed between healthy and post-stroke individuals that are not detected by the ARAT scoring process. Regardless, the recall results can show activities in which people from both classes performed equally well. Furthermore, the classification model based on hand motion information can be used in future work for the detection of finger joint impairments not only in people post-stroke but also after surgical procedures, hand injuries, and other hand disorders.