The two tests were selected because they are relevant to a variety of activities of daily living, can be performed without medical supervision in the home environment, and have been used to evaluate the progress and condition of post-stroke patients, in line with the criteria presented in [6]. Both tests evaluate lower limb strength, mobility, static and dynamic balance, functionality, and endurance, all of which are relevant to rehabilitation outcomes. Improvement in performing the tests over time translates to a better ability to perform daily tasks (such as sitting and standing, or walking short distances), self-efficacy, and engagement with rehabilitation, as discussed in [1]. Additionally, the use of medical tests provides the benefit of evaluating the system against medical standards and clinical specifications. However, to achieve high accuracy, large datasets representative of real patient measurements are needed, which, to the best of our knowledge, are not publicly available for the problem mentioned above.
3.1.1. Experimental Dataset
In [1], data on the completion of TUG and FTSTS tests were collected through the presented low-cost home rehabilitation system for eight participants. The participants simulated difficulties in the various stages of the TUG test, mimicking elderly individuals performing the same test. We consider each difficulty level as a label for the purposes of this dataset. The features recorded for each participant are:
1. Test completion time (seconds)
2. Age (years)
3. Height (meters)
4. Weight (kg)
5. Body mass index (BMI) (kg/m²), calculated using Equation (1)
6. Sex (male/female)
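Equation (1) is the standard BMI formula, weight divided by height squared; a quick illustration (the function name is ours, not the paper's):

```python
# Equation (1): body mass index from weight (kg) and height (m).
def bmi(weight_kg, height_m):
    return weight_kg / height_m ** 2

print(round(bmi(70.0, 1.75), 1))  # 22.9
```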
The use of BMI, in addition to height and weight, could improve the prediction accuracy. Indeed, according to [47], the inclusion of features which are generated from other features (using linear or polynomial equations) can improve the performance of more complex models such as NNs. We will investigate the influence of features such as BMI in Section 5.
This dataset had no missing, malicious, erroneous, inconsistent, or irrelevant data points. Data formatting was necessary to map sex and category data entries from String to Integer values. Some outliers were present in the originally recorded data, as evident in [1]. These outliers were removed by discarding points lying more than a fixed multiple of the standard deviation σ away from the mean [48]. Finally, all measurements were normalised to a common range.
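The cleaning steps above can be sketched as follows; the cut-off multiple k, the male/female coding, and the min-max normalisation are illustrative assumptions, since the exact values used are not restated here:

```python
import numpy as np

# Sketch of the data-cleaning pipeline (illustrative: the exact outlier
# threshold k and the normalisation range used in the paper may differ).
def preprocess(times, sexes, k=3.0):
    # Map String entries to Integer codes (assumed coding).
    sex_codes = np.array([{"male": 0, "female": 1}[s] for s in sexes])
    times = np.asarray(times, dtype=float)
    # Remove outliers lying more than k standard deviations from the mean.
    keep = np.abs(times - times.mean()) <= k * times.std()
    times, sex_codes = times[keep], sex_codes[keep]
    # Min-max normalise the continuous measurement.
    times = (times - times.min()) / (times.max() - times.min())
    return times, sex_codes
```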
The categories (classes) that we recorded for TUG were: (a) difficulty to walk, (b) difficulty to turn, (c) difficulty to stand/sit, (d) normal, and (e) fast. For FTSTS, the classes are: (a) difficulty to stand/sit, annotated as slow, and (b) fast.
There are two major issues with this experimental dataset. First, it is small: only eight participants are included, which can lead to inaccurate outcomes. Second, the dataset is unbalanced, potentially leading to bias and affecting the objectivity requirement; in particular, the class normal in TUG (and fast in FTSTS) has fewer collected data points.
Class unbalance is addressed with the use of the synthetic minority oversampling technique (SMOTE) [49]. With the use of SMOTE, all classes were balanced, with each class having 40 datapoints in both TUG and FTSTS. In total, 17 samples were generated by SMOTE for TUG, of which one was for the fast class and 16 for the normal class. Moreover, 7 were generated for FTSTS for the difficulty/slow class.
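For intuition, SMOTE-style oversampling interpolates between a minority-class sample and one of its nearest minority neighbours; below is a minimal numpy sketch (not the SMOTE implementation of [49], and with illustrative sizes):

```python
import numpy as np

rng = np.random.default_rng(0)

# Minimal SMOTE-style oversampling: each synthetic point lies on the segment
# between a minority-class sample and one of its k nearest minority neighbours.
def smote_like(X_min, n_new, k=3):
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        d = np.linalg.norm(X_min - X_min[i], axis=1)  # distances to all samples
        neighbours = np.argsort(d)[1:k + 1]           # skip the point itself
        j = rng.choice(neighbours)
        gap = rng.random()                            # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)

minority = rng.normal(size=(24, 6))           # e.g. 24 minority datapoints, 6 features
new_points = smote_like(minority, n_new=16)   # grow the class to 40 datapoints
```

Because each synthetic point is a convex combination of two existing minority samples, the generated points always stay within the feature ranges of the minority class.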
Besides class unbalance, the dataset suffers from data imbalance: seven of the eight participants were male, and all participants belonged to the 20–45 years old age group, clearly leading to biased outcomes. To identify the effect of this bias, we will discuss the feature importance for the proposed method in Section 4.
3.1.2. Synthetic Dataset
ML models have demonstrated success in many application fields. However, since advanced ML models are data-driven, translating these successes to medical fields requires large datasets on which to develop and train the models. Generating an open-access dataset based on collected data is a significant obstacle due to patient data privacy and cost [50]. Hence, the rehabilitation field suffers from a lack of appropriate datasets for developing and testing advanced ML models. In many other fields, where real data collection is expensive or impractical, synthetic datasets are generated and employed. However, the medical domain has been reluctant to embrace synthetic datasets.
Most recently, the community has recognised this adverse effect [50,51]. In [51,52], the main arguments against synthetic datasets are rebutted through a comparative analysis of actual and synthetic datasets used to train and test the model. It is demonstrated that a synthetic dataset can produce equal performance if synthesised from a statistical representation of the actual cohort of patients. In [50], it is argued that if the model performs accurately when tested with unseen real patient data, then the synthetic dataset can be accepted as representative of reality. In this paper, we follow this line of work; we synthesise a dataset, ensuring statistical agreement with real measurements, and test the trained model using the experimental dataset in Section 5.
The reasons for synthesising a dataset are:
1. To meet the requirements of [6] for transferability and co-morbidity diagnosis, the underlying reason being the development of co-morbidities for stroke survivors and early diagnosis/warning for carers;
2. To improve the variance and bias of our experimental dataset;
3. To improve the accuracy of our model;
4. To include a wider range of height, weight, age, and BMI that is representative of the wider population discharged to home-based rehabilitation, directly impacting bias;
5. To improve believability by sourcing information from medical journals, where larger cohorts of geriatric subjects and patients have participated in experiments;
6. To improve objectivity by including information from experiments with participants diagnosed with medical conditions that can develop as co-morbidities, such as Parkinson's and dementia.
It is important to highlight here that the synthetic dataset will be used for the initial training of the model. However, as one of the criteria in [6] is to provide individualised solutions, our model would be constantly re-trained and corrected using real data acquired by the user of the device presented in [1]. Thus, throughout the system lifetime, the synthetic data will eventually become a proportionally small contributor to the model.
Furthermore, the synthetic dataset could not be generated from the experimental dataset using an oversampling method. SMOTE could not be used in this case because (1) the experimental dataset is sex- and age-unbalanced, and (2) SMOTE requires an input dataset, and the original dataset, described in the previous subsection, does not link the six features with specific conditions. Thus, an alternative approach for dataset synthesis is proposed next.
We consult a database of medical/clinical research publications that present statistical results of experiments with patient cohorts for TUG and FTSTS. Each publication states the medical condition of the cohort, the number of subjects, and the statistical characteristics of the cohort. To the best of our knowledge, this database presents a comprehensive review of all medical conditions for which TUG and FTSTS are used to evaluate patients. Every condition reviewed in [45,46] was included for TUG and FTSTS, respectively. Publications cited in [45,46] were included as inputs to the dataset synthesis algorithm if they presented statistical descriptors for all of the features used in the experimental dataset above, or if the features could be extrapolated or calculated (e.g., BMI). Conversely, publications were excluded if any of the features was neither reported nor extrapolatable from the information presented. Because the selection is not made using a specific feature as the criterion, this selection method does not introduce a particular bias to the pool of included publications. Additionally, publications were not excluded based on the race, sex, or ethnicity of the participants.
In addition to the difficulty classes mentioned in the experimental dataset for TUG, the following condition classes were finally considered: Healthy, Geriatric, Parkinson’s, Parkinson’s non-fallers—medication, Parkinson’s non-fallers—no medication, Parkinson’s fallers, Dementia mild/moderate, Dementia severe, Arthritis improvement, Arthritis knee arthroplasty, Arthritis, Stroke, Brain injury, Bilateral vestibular hypofunction, Unilateral vestibular hypofunction, Spinal injury, Paraplegia, Tetraplegia. In total, 16 publications remained for TUG based on the inclusion criteria, namely [53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69]. These publications represent a total cohort of subjects whose ages, heights, and weights span wide ranges. Each condition class has 280 datapoints and the overall set has a total of 5040 datapoints.
Similarly, for FTSTS, the condition classes are: Healthy, Geriatric, Geriatric fallers, Parkinson’s stage 1, Parkinson’s stage 2, Parkinson’s stage 2.5, Parkinson’s stage 3, Parkinson’s stage 4, Parkinson’s, Arthritis, Arthritis knee arthroplasty, Stroke, Vestibular disorder. In total, 12 publications remained for FTSTS based on the inclusion criteria, namely [56,70,71,72,73,74,75,76,77,78,79,80], while 3 were excluded. These publications represent a total cohort of subjects whose ages, heights, and weights span wide ranges. Each class has 240 datapoints and the overall set has a total of 3120 datapoints.
According to [73], BMI has a low correlation to the FTSTS test completion time, so higher variability was introduced compared to the TUG synthetic dataset, as reflected in Figure 1. The same is true for height [75,80]. On the contrary, hand positioning has a significant correlation to completion time. All participants in the experiments and the included publications followed the same hand positioning, with hands crossed over the chest. This parameter was therefore ignored [81], given that it would take the same value for all datapoints and hence introduce no difference in completion time.
Algorithm 1 is proposed to generate synthetic datapoints from the statistical properties presented in the considered publications, where μ and σ denote the class mean and standard deviation. The algorithm is based on observations about the linearity and variability of the features, which are further supported by the experimental dataset. Algorithm 1 presents the relevant implementation.
To introduce linearity, e.g., to ensure a linear relationship between time of completion and age, we generate pseudo-random numbers from the Gaussian distribution of each feature and produce a sorted sequence (the sorting function in Algorithm 1). Because the lists of all features are sorted from lower to higher, the linear relationship is generated; for example, the lowest completion time will be aligned with the lowest age. Then, to introduce variability, we swap the order of several datapoints in one of the two feature columns (the swapping function in Algorithm 1); for example, keeping completion time sorted and swapping values in the age column. If variability is required between one feature and a set of other features, then the same swaps are applied to the full set. For example, if we swap datapoint 3 with datapoint 7 in BMI to introduce further variability compared to age, we apply the same swap to height and sex as well, thus maintaining the correlation between the BMI, height, and sex sequences.
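The core of this generation scheme can be sketched as follows; the function names, class statistics, and number of swaps are illustrative assumptions, not values taken from Algorithm 1:

```python
import numpy as np

rng = np.random.default_rng(42)

def sorted_gaussian(mu, sigma, n):
    """Draw n pseudo-random Gaussian values and sort them (introduces linearity)."""
    return np.sort(rng.normal(mu, sigma, n))

def swap_entries(column, n_swaps):
    """Swap random pairs of entries (introduces variability)."""
    column = column.copy()
    for _ in range(n_swaps):
        i, j = rng.integers(len(column), size=2)
        column[i], column[j] = column[j], column[i]
    return column

n = 280                                    # datapoints per TUG condition class
time_s = sorted_gaussian(12.0, 2.5, n)     # completion time (assumed mu, sigma)
age = swap_entries(sorted_gaussian(70.0, 8.0, n), n_swaps=20)
```

Because both columns start out sorted, they are almost perfectly correlated; a modest number of swaps then weakens that correlation to the desired level.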
Algorithm 1: Generate features for n datapoints of Class x.
Finally, to verify the validity of the generated datasets for FTSTS and TUG, we compare the correlation matrix of the features of the synthetic datasets to that of the originally recorded experimental datasets. We used the correlation matrix as a guide for the amount of variability to be introduced. The final correlation matrices are presented in Figure 1.
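This validity check amounts to comparing two feature correlation matrices; a small numpy sketch on toy data (not the paper's measurements):

```python
import numpy as np

def correlation_gap(real, synthetic):
    """Largest absolute entrywise difference between feature correlation matrices."""
    return np.abs(np.corrcoef(real, rowvar=False)
                  - np.corrcoef(synthetic, rowvar=False)).max()

rng = np.random.default_rng(0)
real = rng.normal(size=(40, 6))                             # 40 datapoints x 6 features
synthetic = real + rng.normal(scale=0.05, size=real.shape)  # near-identical toy set
print(correlation_gap(real, synthetic))
```

A small gap indicates that the synthetic data reproduce the pairwise feature relationships of the recorded data.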
It is evident that a similar correlation matrix emerges in both the experimental and the synthetic datasets, with one significant difference: sex is highly correlated with both height and weight in the experiment. This is because the only female participant in the experiments was also the shortest and lightest. This generated a bias between height and sex, and the BMI results also exhibit an unrealistic correlation to height. Thus, the sex bias is the most important bias in the case of the experiment. Addressing it by introducing more female subjects with different height, weight, and BMI measurements would alleviate all of the biases resulting from the predominance of male participants. The synthetic dataset, on the other hand, demonstrates a weaker relationship between sex and height, between sex and weight, and between height and all other features, including BMI, which is closer to reality and thus compensates for the biases in the experimental dataset.
Figure 2 presents both datasets after projecting all the features to two dimensions using PCA. Visualising the data also reveals a potential polynomial relationship, as displayed in Figure 2a,c, where all the points seem to follow a polynomial curve with a different offset for each of the classes. This relationship is further investigated in the following section.
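The two-dimensional projection used for Figure 2 can be reproduced with a standard PCA; below is a numpy-only sketch on placeholder data:

```python
import numpy as np

def project_2d(X):
    """Project datapoints onto the top-2 principal components (PCA scores)."""
    Xc = X - X.mean(axis=0)                    # centre each feature
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T                       # scores along the two leading axes

rng = np.random.default_rng(1)
X = rng.normal(size=(120, 6))                  # e.g. 120 datapoints x 6 features
Z = project_2d(X)                              # Z has shape (120, 2)
```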