A Machine Learning Approach for Walking Classification in Elderly People with Gait Disorders

The limited walking ability of elderly individuals who suffer from walking difficulties restricts their mobility and independence. The physical health and well-being of the elderly population are affected by their level of physical activity; therefore, monitoring daily activities can help improve their quality of life. This is especially challenging for those who suffer from dementia and Alzheimer's disease. It is thus of great importance for personnel in care homes/rehabilitation centers to monitor these patients' daily activities and progress. Unlike with healthy subjects, the sensor must be placed on the back of this group of patients, which makes it even more challenging to distinguish walking from other activities. With the latest advances in health sensing and sensor technology, large amounts of accelerometer data can easily be collected. In this study, a Machine Learning (ML) based algorithm was developed to analyze accelerometer data collected from patients with walking difficulties who live in one of the municipalities in Denmark. The ML algorithm is capable of accurately classifying the walking activity of these individuals despite their different walking abnormalities. Various statistical, temporal, and spectral features were extracted from the time series data collected using an accelerometer sensor placed on the back of the participants. The back placement is desirable in patients with dementia and Alzheimer's disease since, due to the nature of their diseases, they may remove sensors that are visible to them. An evolutionary optimization algorithm, Particle Swarm Optimization (PSO), was then used to select a subset of features to be used in the classification step. Four ML classifiers, namely k-Nearest Neighbors (kNN), Random Forest (RF), Stacking Classifier (Stack), and Extreme Gradient Boosting (XGB), were trained and compared on an accelerometry dataset consisting of 20 participants.
These models were evaluated using the leave-one-group-out cross-validation (LOGO-CV) technique. The Stack model achieved the best performance with average sensitivity, positive predictive values (precision), F1-score, and accuracy of 86.85%, 93.25%, 88.81%, and 93.32%, respectively, to classify walking episodes. In general, the empirical results confirmed that the proposed models are capable of classifying the walking episodes despite the challenging sensor placement on the back of the patients, who suffer from walking disabilities.


Introduction
Physical activity plays a major role in mental and physical health and well-being, and this correlation is stronger in the older population. Inadequate physical activity is linked with mobility disorders, loss of independence, and lower muscle strength [1]. Chodzko-Zajko et al. [2] showed that the development of age-related disabilities and other diseases can be prevented or delayed by an active lifestyle. The World Health Organization (WHO) reports that the level of fitness and functional health is generally higher in the physically active elderly population [3].
One of the suitable methods for measuring physical activity is accelerometry, which does not constrain the subjects. Furthermore, accelerometry is considered a reliable and cost-effective method for monitoring ambulatory motion under free-living conditions [4,5]. However, in order to be able to evaluate physical activities using accelerometry, an accurate classification of different activity types is required [4].
The latest technological advancements in wearable devices have made it feasible to monitor daily activities; the new sensors are more cost-effective and last longer in terms of battery life. In recent years, various studies have been conducted to classify daily activities using accelerometer data [6][7][8][9][10][11][12][13]. However, most of the developed models were tested on datasets collected from younger populations [7,9,10,[14][15][16][17]. Other studies focus on data acquired from older individuals [12,[18][19][20][21][22]. It has been shown in the literature that models trained on data from younger, healthy individuals do not generalize well when tested on data acquired from elderly people who suffer from diseases and gait disorders. The resulting lower performance on data from older adults prevents using such models to classify the activities of older individuals in free-living conditions [23]. In most previous studies, it has also been common practice to use multiple sensor placements in order to achieve accurate activity classification [24]. Although most of these studies have obtained interesting results on physical activity classification using younger adults' data and various sensor placements, some questions still need to be investigated. For this purpose, in collaboration with a municipality in Denmark, we investigated the possibility of improving the quality of daily life of older citizens and patients who suffer from dementia and walking difficulties. Analyzing walking activity is of great importance for personnel in the municipality's care home, both to monitor physical health and for early detection of the development of dementia and/or other diseases in the individuals.
However, due to the nature of dementia and its effect on walking ability, it has been challenging to detect the walking activity of the individuals in the care home. Dementia usually affects the walking abilities of patients and forces them to use walking aids, which subsequently alters their walking patterns. Therefore, we investigate the following questions in this study:
• Is it feasible to develop a model that can classify the walking activity of patients with walking abnormalities who suffer from dementia and Alzheimer's disease?
• Is it possible to classify walking activities using only one sensor placed on the back of the participants?
• How do different Machine Learning (ML) algorithms perform at classifying walking (one of the most effective and popular forms of activity) versus non-walking activities in older adults?
To investigate the above points, in this paper, different ML algorithms are presented and compared that are capable of classifying the walking activity of patients with walking disabilities using only one sensor placed on the back. It has been reported in the literature that gait disorders are correlated with Alzheimer's disease [25,26]. The back sensor placement is desirable in patients with dementia and Alzheimer's disease because, due to the nature of their diseases, they may remove sensors placed on parts of the body that are visible to them. In addition, using only one sensor placed on the back makes such a model easy to implement in practice, especially when it is challenging to place multiple sensors, e.g., on the thigh and hip. This is a substantial advantage that eases the adoption of such models in care homes/rehabilitation centers.
Two single and two ensemble classification algorithms were developed and evaluated using a dataset collected from older adults, who live in one of the municipalities in Denmark. The dataset contains 20 elderly patients, most of whom suffer from dementia and Alzheimer's disease. To the best of our knowledge, there have not been any studies applying sensor placement on only the upper/mid back of patients with various walking difficulties as presented in this study. It has been proven that the performance of single classifiers can be enhanced using an ensemble learning framework. This is due to the fact that, in an ensemble approach, a collection of classifiers contributes to the final decision-making instead of only using a single weak learner. Thus, the performance of ensemble learning models is generally higher than single classification algorithms [27]. There are various applications in which ensemble learning methods are utilized such as cyber security [28][29][30][31][32][33], energy [34][35][36][37], and health informatics [38][39][40][41][42][43][44][45][46][47].
A wide range of features (statistical, temporal, and spectral) was extracted from the collected accelerometer time series, and the best subset of them was selected using the Particle Swarm Optimization (PSO) algorithm [48]. The selected features are used as inputs for the ML models to classify walking from non-walking activities. The four ML classifiers used are k-Nearest Neighbors (kNN), Random Forest (RF), Extreme Gradient Boosting (XGB), and a Stacking Ensemble (Stack).
The remainder of this paper consists of three sections. Section 2 describes the methodology of the used approaches in this study. In Section 3, the obtained results in this study are presented and discussed, and lastly, Section 4 concludes the paper.

Dataset
In this study, a dataset of 20 elderly patients, who live in one of Denmark's municipalities and have different walking abnormalities, is used to train and evaluate the proposed models. It should be mentioned that the Regional Committee on Health Research Ethics for the Region of Southern Denmark was contacted regarding ethical approval of the study; they responded that, according to Danish law on ethics in health research, ethical committee approval was not required for this study. The dataset contains 20 time series collected at a sampling frequency of 11 Hz using accelerometer sensors developed by the Danish company SENS Innovation ApS [49]. This commercial sensor measures only acceleration and temperature, which helps extend battery lifetime: it can record accelerometer data for up to two weeks without recharging, giving healthcare professionals the opportunity to monitor subjects for longer periods. The sensor is easy to place on the back of the subjects; the manufacturer recommends placing it on the mid back, slightly to the left or right of the spine, as shown in Figure 1. The length of the collected time series varies between around 278 s and 527 s per subject. The subjects were asked to perform free-living activities such as sitting and standing, sitting active, standing active, and finally walking around the care home for more than 10 m. The collected data is accompanied by a recorded video for each participant, which was used afterward for labeling the different activities performed by the participants. The participants had a history of dementia and Alzheimer's disease and used different walking aids. There are 6 females and 14 males with average ages of 79.1 ± 6.9 and 76.4 ± 9.4 years, respectively. A summary of the dataset used in this study is given in Table 1.
In general, the gender imbalance may lead to some classification biases. However, since the population size and the gender imbalance are not very large here, the investigation of gender imbalance is out of the scope of this paper. It should be noted that all the elderly patients in the municipality's care home were asked to participate in this study without any specific criteria and the current population represents the ones who accepted the invitation.

Data Preprocessing
First, the triaxial accelerometer data (x-, y-, and z-axes) were segmented into smaller chunks of 3, 6, and 9 s with 50% overlap between adjacent segments. The 6 s segment size was chosen because the clinically validated window length for human walking analysis is around 5-6 s; we also evaluated half- and double-sized segments (i.e., 3 and 9 s) to show that the 6 s segment size is indeed suitable for the purpose of our study. Each chunk (i.e., 3, 6, or 9 s) was then labeled as walking if more than half of its samples are walking, or as non-walking (other activities) if the majority of its samples correspond to other activities.
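The segmentation and majority-vote labeling described above can be sketched as follows; the function signature, array shapes, and default parameters are illustrative assumptions rather than the study's exact implementation:

```python
import numpy as np

def segment_signal(data, labels, fs=11, win_s=6, overlap=0.5):
    """Split a triaxial signal (n_samples, 3) into fixed-length windows
    with the given overlap, labeling each window by majority vote."""
    win = int(win_s * fs)
    step = int(win * (1 - overlap))
    segments, seg_labels = [], []
    for start in range(0, len(data) - win + 1, step):
        chunk = data[start:start + win]
        chunk_labels = labels[start:start + win]
        # A window counts as "walking" only if more than half of its
        # samples are walking; otherwise it is non-walking.
        is_walking = int(np.mean(chunk_labels) > 0.5)
        segments.append(chunk)
        seg_labels.append(is_walking)
    return np.array(segments), np.array(seg_labels)
```

With fs = 11 Hz and win_s = 6, each window holds 66 samples and consecutive windows share 33 samples, matching the 50% overlap described above.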
As an example, the triaxial accelerometer data for one of the subjects (#18) is illustrated in Figure 2. As can be seen, the walking episode is relatively short compared to the non-walking (other activities) part, which results in a very imbalanced dataset. Therefore, a synthetic over-sampling method was used to alleviate this problem by increasing the size of the walking class to match the non-walking class. The oversampling process is explained in detail in Section 2.7.

Feature Extraction
In this study, a Python package called Time Series Feature Extraction Library (TSFEL) [50] was used to extract various types of features. TSFEL is able to efficiently extract statistical, temporal, and spectral domain features. The computational complexities of the statistical, temporal, and most of the spectral features calculated by TSFEL are linear, which makes this library an efficient tool for the time series feature extraction task [50]. Table 2 provides a list of extracted features using TSFEL. The extracted features are briefly explained in Appendix A.
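As a hand-rolled illustration of the three feature families TSFEL computes, a sketch of a few representative features for a single axis might look like the following; the study itself relies on TSFEL's implementations, and the specific feature choices below are assumptions for illustration only:

```python
import numpy as np

def extract_features(segment, fs=11):
    """Illustrative statistical, temporal, and spectral features for one
    axis of an accelerometer segment."""
    feats = {
        # Statistical domain
        "mean": float(np.mean(segment)),
        "std": float(np.std(segment)),
        "iqr": float(np.subtract(*np.percentile(segment, [75, 25]))),
        # Temporal domain
        "zero_crossings": int(np.sum(np.diff(np.sign(segment)) != 0)),
        "abs_energy": float(np.sum(np.square(segment))),
    }
    # Spectral domain: dominant frequency from the magnitude spectrum
    spectrum = np.abs(np.fft.rfft(segment))
    freqs = np.fft.rfftfreq(len(segment), d=1 / fs)
    feats["max_freq"] = float(freqs[np.argmax(spectrum[1:]) + 1])  # skip DC
    return feats
```

For a 2 Hz sinusoid sampled at 11 Hz, the dominant spectral peak recovered by `max_freq` lands at 2 Hz, as expected.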

Feature Subset Selection
In order to improve classification performance, a subset of the defined features is usually selected to be used as inputs for the ML algorithms [51,52]. In this study, PSO [48] was used to select the best-performing subset of features from all the features introduced in Section 2.3. PSO is generally categorized as a swarm intelligence method. Swarm intelligence methods are based on the idea that single agents cannot solve a problem individually, so many agents (particles) pursue a common goal in a swarm. In the PSO algorithm, each particle represents a potential solution to the problem at hand, and every movement of the particles results in a new solution. The PSO algorithm works based on three rules: (1) the particles continue their movement in the same direction as the last movement (inertia term); (2) each particle moves towards its best-found solution (nostalgia term); and (3) each particle also moves towards the best solution found among all the particles (global term) [53,54]. These three rules for updating the positions of the particles can be mathematically expressed as follows [55]:

v_i(t+1) = ω v_i(t) + c_1 r_1 (x_i^pbest − x_i(t)) + c_2 r_2 (x^gbest − x_i(t)), (1)

x_i(t+1) = x_i(t) + v_i(t+1), (2)

where ω is the inertia weight, and x_i^pbest and x^gbest are the best-found position (solution) of particle x_i and the global best solution of the swarm, respectively. The parameters r_1 and r_2 are random numbers in the range [0,1], and c_1 and c_2 are constants that control the nostalgia and global terms, respectively. In this paper, c_1 = c_2 = 2 and ω = 0.9 [56]. The steps of the PSO algorithm are as follows:
• Initialize the positions and velocities of the particles.
• Find and select the best particle (x^gbest) among the particles as the leader.
• Repeat the following steps until the termination criterion is reached:
  - Update the velocity and position of each particle using (1) and (2).
  - Find the new x^pbest for each particle and update x^gbest accordingly.
• Return the best particle as the optimal solution.
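The PSO loop above can be sketched as a binary PSO for feature selection; the sigmoid velocity-to-bit transfer, swarm size, iteration count, and toy fitness are implementation assumptions not specified in the text, while ω = 0.9 and c1 = c2 = 2 follow the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def pso_feature_select(fitness, n_features, n_particles=10, n_iter=30,
                       w=0.9, c1=2.0, c2=2.0):
    """Minimal binary PSO: each particle is a 0/1 mask over the features,
    and fitness(mask) returns a value to be minimized."""
    pos = (rng.random((n_particles, n_features)) < 0.5).astype(float)
    vel = rng.standard_normal((n_particles, n_features))
    pbest = pos.copy()
    pbest_fit = np.array([fitness(p) for p in pos])
    g = np.argmin(pbest_fit)
    gbest, gbest_fit = pbest[g].copy(), pbest_fit[g]
    for _ in range(n_iter):
        r1 = rng.random(vel.shape)
        r2 = rng.random(vel.shape)
        vel = (w * vel                       # inertia term
               + c1 * r1 * (pbest - pos)    # nostalgia term
               + c2 * r2 * (gbest - pos))   # global term
        # Sigmoid transfer: a large positive velocity makes the bit
        # likely to be set, turning real velocities into 0/1 positions.
        pos = (rng.random(vel.shape) < 1.0 / (1.0 + np.exp(-vel))).astype(float)
        fit = np.array([fitness(p) for p in pos])
        improved = fit < pbest_fit
        pbest[improved] = pos[improved]
        pbest_fit[improved] = fit[improved]
        g = np.argmin(pbest_fit)
        if pbest_fit[g] < gbest_fit:
            gbest, gbest_fit = pbest[g].copy(), pbest_fit[g]
    return gbest, gbest_fit
```

A fitness function weighting the classification error against the subset size (as in Section 2.6, step 4) can be plugged in as `fitness`.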

Classification Models
For the classification part, four different algorithms were applied, which are introduced in the following subsections.

k-Nearest Neighbors
kNN is a non-parametric classifier first introduced in [57] and further developed by Thomas Cover [58]. The output of the kNN algorithm is a class membership determined by majority voting among the sample's k nearest neighbors: if more than half of the neighbors vote for a class, the algorithm predicts that class. If k = 1, the sample is simply assigned the class of its single nearest neighbor.

Random Forest
RF is an ensemble learning method. During training, many single decision trees (DTs) are built, and the final decision is made by majority voting [59][60][61]. The single DTs in the RF classifier are created by random sub-sampling of the training data, and a random subset of variables/features is used to train each single DT. Using only a subset of the available training data and features for building the different DTs also alleviates the over-fitting problem of complex single DTs.

Extreme Gradient Boosting
Gradient-boosted DTs are currently among the most widely used algorithms in applied machine learning, and XGB is an implementation of this approach that is faster and generally achieves higher performance than its predecessors [62]. XGB uses a gradient boosting algorithm and is categorized as an ensemble method [62]. Unlike standard boosting algorithms, in which new trees are built to lower the errors of the previous ones by re-weighting samples, gradient boosting adds new trees/models that predict the errors (residuals) of the prior trees; all the individual trees are then combined to give the final prediction. It should be noted that gradient descent is used in XGB to minimize the loss when adding new trees/models to the ensemble [63].

Stacking Ensemble
Stack is an ensemble learning method that learns how to combine single classification algorithms to achieve the best overall performance. Compared to other ensemble learning algorithms, such as RF and XGB, which use only DTs as base learners, it can combine different types of classifiers. Each member of the ensemble generates a prediction for a given sample, and all the predictions from the single classifiers are then used as inputs for a meta estimator that makes the final prediction [64][65][66]. Therefore, Stack is capable of combining different well-performing single classifiers, each good in different ways, to improve classification performance. Unlike RF and XGB, the whole training set is used to train the single learners, and another model (the meta estimator) is trained on top to learn the best combination of them [64]. Note, however, that the meta estimator, also called the level-1 model, is trained on the predictions made by the individual classifiers (level-0 models) on data that was not seen during their training. In this paper, three single classifiers are used as level-0 models: kNN, RF, and XGB.
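A sketch of such a stacking ensemble with scikit-learn follows; scikit-learn's GradientBoostingClassifier stands in for XGB to keep the example dependency-free, and the synthetic data and logistic-regression meta estimator are assumptions for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the extracted feature matrix.
X, y = make_classification(n_samples=400, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

stack = StackingClassifier(
    estimators=[("knn", KNeighborsClassifier()),
                ("rf", RandomForestClassifier(random_state=0)),
                ("gb", GradientBoostingClassifier(random_state=0))],
    final_estimator=LogisticRegression(),  # level-1 meta estimator
    cv=5,  # level-0 predictions come from internal cross-validation
)
stack.fit(X_tr, y_tr)
print(f"hold-out accuracy: {stack.score(X_te, y_te):.2f}")
```

The `cv=5` argument makes the level-1 model train on out-of-fold level-0 predictions, matching the requirement above that the meta estimator sees predictions on data unseen during level-0 training.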

ML Based Walking Classification Framework
In the proposed method, the capabilities of different ML classifiers are investigated and compared for walking classification using accelerometer data. The flowchart of the proposed approach is illustrated in Figure 3, which is described step by step as follows:

1. Segmentation: As mentioned in Section 2.2, the accelerometer time series are segmented into smaller chunks of 3, 6, and 9 s with 50% overlaps.

2. Cross validation: Leave-one-group-out cross-validation (LOGO-CV) is applied to split the dataset into training and validation sets. In each iteration, the accelerometer data of all individuals except one are used to train the classification algorithms, and the remaining subject is used to validate the models. This process is repeated twenty times so that every subject is used for validation once.

3. Feature extraction: Three different types of features (statistical, temporal, and spectral), as listed in Table 2, are extracted from the segmented time series on the x-, y-, and z-axes. In total, 60 different features are extracted for each axis.
4. Feature selection: A subset of the extracted features is selected using the PSO algorithm, as explained in Section 2.4. The fitness (objective) function applied in the PSO algorithm combines the error value and the size of the selected feature subset, as given in (3):

fitness = α · Error + β · (N_selected / N_total), (3)

where α and β are weight parameters equal to 0.9 and 0.1, respectively, and N_selected and N_total are the numbers of selected and total features. Figure 4 shows the convergence of the PSO algorithm, with the fitness value decreasing as the number of iterations increases.

5. Synthetic dataset oversampling: Since the walking episodes in the accelerometer data are shorter than the other activities (Figure 2), the number of extracted walking segments is smaller than the number of non-walking segments. The dataset is therefore relatively imbalanced, and the number of walking segments for training the classification algorithms is not sufficient. This increases the risk of a biased classification, which in turn leads to a higher error rate on the minority class (walking) [67]. To overcome this problem, the adaptive synthetic over-sampling technique (ADASYN) is applied to generate more samples for the minority class and enable the classifiers to achieve their desired performance [67]. The ADASYN method consists of three main steps: (1) estimate the class imbalance degree to calculate the number of synthetic samples required for the minority class; (2) find the K nearest neighbors of each minority-class sample using the well-known Euclidean distance; and (3) generate the synthetic samples for the minority class as follows:

s_i = x_i + λ (x_ki − x_i),

where x_i represents a sample from the minority class, x_ki is one of its nearest neighbors chosen randomly, and λ ∈ [0, 1] is a random value. As illustrated in Figure 3, the oversampling is only applied to the training set to avoid an unrealistic evaluation on the validation set.
6. Classifiers training: In this step, the preprocessed accelerometer data from nineteen subjects/individuals is used to train all the classification algorithms.

7. Classifiers evaluation: Finally, the four trained classifiers are evaluated on the validation set (Figure 3) to determine their performance.
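The oversampling idea from step 5 can be sketched as follows. This is a simplified version that draws minority samples uniformly, whereas real ADASYN weights each minority sample by how hard its neighborhood is to learn; in practice, the imbalanced-learn library provides a full implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def adasyn_sketch(X_min, X_maj, k=5):
    """Toy version of the ADASYN steps: oversample the minority class
    until both classes have the same number of samples."""
    # Step 1: imbalance degree -> number of synthetic samples needed.
    n_new = len(X_maj) - len(X_min)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        x_i = X_min[i]
        # Step 2: k nearest minority neighbors (Euclidean distance),
        # skipping index 0 because the closest point is x_i itself.
        d = np.linalg.norm(X_min - x_i, axis=1)
        neighbors = np.argsort(d)[1:k + 1]
        x_k = X_min[rng.choice(neighbors)]
        # Step 3: interpolate between the sample and a random neighbor.
        lam = rng.random()
        synthetic.append(x_i + lam * (x_k - x_i))
    return np.vstack([X_min, synthetic])
```

Each synthetic point lies on the line segment between a minority sample and one of its minority-class neighbors, mirroring the interpolation formula in step 5.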

Results and Discussion
As stated in Section 2.1, a dataset of 20 elderly subjects is used to train and evaluate the performance of the different classifiers. In Section 2.7, we described the process of balancing the dataset using the ADASYN method. Figure 5a,b illustrate the distribution of the imbalanced and balanced data for two arbitrary features. It can be seen that the number of walking segments has been increased after applying ADASYN in Figure 5b. The number of segments for each class before and after data balancing are given in Table 3.

Evaluation Metrics
Various classification metrics can be used to compare and report the performance of the ML classifiers. A confusion matrix, as given in Table 4, can be used to calculate the different metrics; the rows and columns in Table 4 represent the actual labels and the model predictions, respectively. We use four classification metrics in this paper: accuracy (Acc), Sensitivity (Se), Precision or Positive Predictive Value (PPV), and F-score. Using the notation of Table 4, these metrics are defined as:

Acc = (TP + TN) / (TP + TN + FP + FN),

Se = TP / (TP + FN),

PPV = TP / (TP + FP),

F_β = (1 + β²) · (PPV · Se) / (β² · PPV + Se),

where TP, TN, FP, and FN are the numbers of true positives, true negatives, false positives, and false negatives, respectively. The F-score is a weighted harmonic mean of Se and PPV; if β = 1, it is called the balanced F-score (F1-score), which takes Se and PPV into account equally.
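The four metrics can be computed directly from the confusion-matrix counts; the helper below is a small illustration (the function name is ours, not from the paper):

```python
def classification_metrics(tp, tn, fp, fn, beta=1.0):
    """Acc, Se, PPV, and F-beta from confusion-matrix counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    se = tp / (tp + fn)    # sensitivity (recall)
    ppv = tp / (tp + fp)   # precision
    # F-beta: weighted harmonic mean of Se and PPV (beta=1 gives F1).
    f = (1 + beta ** 2) * ppv * se / (beta ** 2 * ppv + se)
    return acc, se, ppv, f
```

For instance, with TP = 80, TN = 90, FP = 10, and FN = 20, the helper yields Acc = 0.85, Se = 0.80, PPV ≈ 0.889, and F1 ≈ 0.842.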

LOGO-CV Classification Performance
The average performance of the different classification algorithms on the 20 subjects for S = 3, 6, and 9 s is plotted in Figure 6. As can be seen, the kNN model has the lowest performance among all the classifiers for almost all segment lengths, while the Stack model achieves the highest Acc, Se, and F1-score. In terms of PPV (precision), the XGB classifier outperforms the other models for S = 3 s (Figure 6a); however, the precision of the Stack model is comparable with XGB for S = 6 and 9 s. From Figure 6, it can be concluded that S = 6 s is the optimal segment length for the detection of walking episodes; therefore, we present some of the obtained results for S = 6 s in the rest of this section. A window size of 5-6 s is also considered clinically valid for human walking analysis [12]. Table 5 summarizes the average performance over all subjects for the four classifiers (S = 6 s). These results confirm that the Stack method outperforms the other classifiers on Se, F1-score, and Acc with 86.85%, 88.81%, and 93.32%, respectively. In addition, the PPV (precision) values of XGB and Stack are comparable, at 94.02% and 93.25%, respectively. Table 6 reports the performance of the Stack classifier, as the best method, on all subjects; the features selected by the PSO algorithm are given in the last column, and the corresponding feature names can be found in Table 2. The total number of selected features for different subjects varies between 23 and 34 out of the 60 extracted features. As given in Table 6, almost all subjects achieved F1-score, PPV, and Acc above 90%, except subjects 2, 5, 9, 10, and 14. Subjects 9, 10, and 14 are the most challenging ones for the detection of walking episodes, with Se of 75%, 73%, and 78% and PPV of 81%, 80%, and 74%, respectively.
From Table 1, all three of these subjects use a walker and suffer from conditions such as spinal stenosis, hip fracture, osteoporosis, osteoarthritis, and knee problems, which make it more difficult for them to walk normally.

Inter-Subjects Analysis
In this section, we evaluate the performance of the developed classifiers on five carefully chosen subjects with different walking aids. The classifiers were first trained on 15 subjects and then tested on the 5 remaining individuals. From Table 1, subjects 3, 9, 13, 15, and 20 were selected as the test set. Subject 3 is the only one who uses a crutch, subjects 9 and 13 use a walker, and subjects 15 and 20 use no walking aids. This selection ensures that almost all groups are represented in the test set, which makes the model evaluation more robust and reduces classification bias on future unseen data. In addition, the selected subjects are balanced in terms of gender, with three males and two females.
The performance of the four classifiers for S = 3, 6, and 9 s is given in Table 7. In general, the Stack method performs best for almost all segment lengths, while XGB achieves slightly higher PPV than the others. It is interesting to note that Stack and XGB outperform the other two classifiers, kNN and RF. As can be seen from Table 7, the classification performance is higher for S = 6 s than for the other segment lengths. Stack is the most sensitive and accurate classifier, with Se and Acc equal to 86.13% and 91.50% for S = 6 s, respectively. In addition, Stack achieves the highest F1-score among all classification algorithms at 88.50%. The PPV of the XGB method is the maximum at 92.42%, slightly higher than Stack's PPV of 92.03%.
Table 7. Performance comparison of different classifiers for S = 3, 6, and 9 s. The numbers are in percentage and the highest performances of the different models are given in bold.
The confusion matrices for the four classifiers on the five test subjects (3, 9, 13, 15, and 20) are given in Figure 7. The reported results are for S = 6 s and O = 3 s, which was shown to be the optimal segment length as per the results in Table 7. For example, the Stack method detects most of the walking and non-walking segments correctly (Figure 7d), at 91.1% and 92.95% of the cases, respectively. In other words, only 8.9% of the walking segments are incorrectly classified as other activities (non-walking), compared to 10.26% for the XGB method. This makes the Stack classifier a very suitable method for the detection of walking episodes from accelerometer data. On the other hand, the proportion of false positives for the XGB method (4.9%) is the lowest among all the classifiers, which explains its higher PPV as reported in Table 7.

The ROC curves of the proposed models for the five test subjects (3, 9, 13, 15, and 20) are also plotted in Figure 8. As shown in the figure, the Stack model achieves the highest area under the curve (AUC) of 0.97, followed by XGB and RF with 0.96 and 0.95, respectively. As expected, there is a considerable gap between kNN (AUC = 0.85) and the other algorithms. This shows the promising classification performance of ensemble learning methods, especially the Stack classifier as a combination of two ensemble-based methods (RF and XGB) and a single classifier (kNN).
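AUC values like those reported can be obtained with scikit-learn's `roc_curve` and `roc_auc_score`; the labels and classifier scores below are fabricated purely for illustration:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical ground-truth labels and classifier scores.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_score = np.array([0.1, 0.3, 0.35, 0.6, 0.4, 0.7, 0.8, 0.9])

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(f"AUC = {roc_auc_score(y_true, y_score):.2f}")
```

Here the AUC equals the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one, which is why it is insensitive to the choice of decision threshold.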

Computational Resources
All the experiments presented in this paper were run on the University of Southern Denmark's internal cloud infrastructure with 64 vCPUs and 376 GB of RAM. Overall, one iteration of the LOGO-CV procedure (i.e., training on 19 subjects and validating on the 1 remaining patient) takes around 178.37 s. Several open-source libraries were used to conduct the experiments, such as scikit-learn [68], XGBoost [62], NumPy [69], Pandas [70], and Matplotlib [71]. It should also be noted that the size of the collected triaxial accelerometer data for the 20 patients is around 10.5 MB.

Conclusions, Limitations, and Future Works
In this paper, different ML models have been developed to classify walking episodes from other activity types (non-walking). For this purpose, 60 different features (statistical, temporal, and spectral) were first extracted from the accelerometer data. Then, the PSO algorithm was applied to select the best subset of features, which were used as inputs for the ML classifiers. This work made three main contributions, addressing the three questions stated in Section 1. First, the performances of different ML methods were compared for classifying walking segments in older adults. Second, we investigated the possibility of placing the sensor only on the back, unlike the conventional hip, thigh, or multi-sensor placements. Although placing the sensor on the back makes it more challenging to detect walking activity, it is the most desirable option when the subjects suffer from dementia and Alzheimer's disease. Third, the proposed models were evaluated to determine whether they are suitable for classifying the walking activity of older subjects with walking abnormalities. The experimental results showed that single classifiers such as kNN were outperformed by the ensemble-based models (RF and XGB). Moreover, the Stack model, which combines all three classifiers (kNN, RF, and XGB), outperformed the others; for example, Stack improved Se, Acc, and F1-score by around 1%. From this, we can also conclude that the classification results improve as the diversity of the classifiers included in the ensemble increases: even though the kNN classifier is inferior to the ensemble classifiers, it helps the ensemble model (Stack) detect walking episodes that are challenging for the other methods such as RF and XGB. Therefore, the combined Stack model achieves the highest performance.
The obtained results confirm that ML methods can be efficiently applied in clinical settings to classify walking segments using accelerometer data collected from elderly dementia patients, which paves the way to be used in house by personnel in care homes/rehabilitation centers to better monitor patients' daily activities and progress.
There are also some limitations to our proposed algorithm, which can be studied in future works. First, the model's performance highly depends on the feature extraction and selection steps and might degrade with less effective feature extraction and selection; the impact of different feature extraction and selection techniques can therefore be further investigated. Second, the population size in this study is limited. We may collect data from more municipalities' care homes in the future, which would enable us to train more advanced Deep Learning (DL) models that can handle raw accelerometer data and thereby bypass the feature engineering step. Finally, algorithmic bias was not investigated, i.e., the effectiveness of the proposed model on a population highly imbalanced in terms of gender and age.

Institutional Review Board Statement: The Regional Committee on Health Research Ethics for the Region of Southern Denmark was contacted regarding ethical approval of the study. They responded that according to Danish law about ethics related to health research, ethical committee approval was not required for this study.
Informed Consent Statement: Informed consent was obtained from all subjects involved in the study.
Data Availability Statement: According to Danish data protection laws, the collected data set cannot be shared outside the project.

Acknowledgments:
The authors would like to thank Michelle Lykke Larsen for helping with the data collection process. We would also like to sincerely thank the volunteers and Gitte Friis, who participated in and organised this study. In addition, we would like to thank our collaborators, Brane ApS, SENS Innovation ApS, and Kerteminde Municipality, who actively helped us during the course of this study.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A.1. Empirical Cumulative Distribution Function
The empirical cumulative distribution function (ECDF) is calculated by first ordering all the available unique values in the time series and then computing the cumulative probability for each value/observation as follows:

$$\hat{F}(x) = \frac{\left|\{x_i : x_i \le x\}\right|}{n},$$

where the numerator counts the observations/values that are less than or equal to a given value $x$ and $n$ is the total number of values in the time series data. A histogram represents the probability of certain values occurring: a number of bins of a given length is defined, and the probability of the values falling into each bin is calculated.
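A minimal NumPy sketch of the ECDF and the histogram-based probabilities (the sample values and bin count are illustrative):

```python
import numpy as np

def ecdf(signal, x):
    """Empirical CDF at x: fraction of samples less than or equal to x."""
    signal = np.asarray(signal)
    return np.count_nonzero(signal <= x) / signal.size

samples = [3, 1, 4, 1, 5, 9, 2, 6]

# Histogram probabilities: counts per bin divided by the number of samples.
counts, edges = np.histogram(samples, bins=4)
bin_probs = counts / len(samples)
```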

Appendix A.4. Interquartile Range
The interquartile range (IQR) is simply the difference between the 0.75 and 0.25 percentiles of the data. The minimum and maximum are the smallest and greatest values of the time series (signal). The mean and median of a signal can be calculated as

$$\text{Mean}(X) = \frac{1}{n}\sum_{i=1}^{n} x_i \quad \text{(A3)}$$

and

$$\text{Median}(X) = \begin{cases} x_{(n+1)/2}, & n \text{ odd},\\[2pt] \tfrac{1}{2}\left(x_{n/2} + x_{n/2+1}\right), & n \text{ even}, \end{cases} \quad \text{(A4)}$$

where $X$ represents the signal with its size equal to $n$, i.e., $X = \{x_1, x_2, \dots, x_n\}$, sorted in ascending order for the median.
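These quantities can be computed directly with NumPy (the example array is arbitrary):

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 5.0, 7.0, 9.0])

# IQR: difference between the 75th and 25th percentiles.
iqr = np.percentile(x, 75) - np.percentile(x, 25)
mean = x.mean()        # (1/n) * sum of the samples
median = np.median(x)  # average of the two middle values for even n
```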

Appendix A.5. Mean Absolute Deviation
The mean absolute deviation is the average absolute deviation of the data points from the mean of the time series:

$$\text{Mean absolute deviation} = \frac{1}{n}\sum_{i=1}^{n}\left|x_i - \text{Mean}(X)\right|.$$

Appendix A.6. Median Absolute Deviation
Similarly, the median absolute deviation is the median of the absolute deviations from the median:

$$\text{Median absolute deviation} = \text{Median}\left(\left|x_i - \text{Median}(X)\right|\right), \quad i = 1, 2, \dots, n,$$

where Mean and Median are the mean and median of the time series defined in (A3) and (A4).
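A short NumPy sketch of both deviations (the example array is arbitrary):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 10.0])

# Mean absolute deviation: average distance from the mean.
mean_ad = np.mean(np.abs(x - x.mean()))

# Median absolute deviation: median distance from the median (robust to the outlier 10).
median_ad = np.median(np.abs(x - np.median(x)))
```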

Appendix A.7. Root Mean Square
The root mean square of a signal (time series) is the square root of the mean of the squared data points, which can be calculated as

$$\text{RMS}(X) = \sqrt{\frac{1}{n}\sum_{i=1}^{n} x_i^2},$$

where $n$ is the size of the time series.
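For example, in NumPy (arbitrary sample values):

```python
import numpy as np

x = np.array([3.0, 4.0])
rms = np.sqrt(np.mean(x ** 2))  # sqrt((9 + 16) / 2) = sqrt(12.5)
```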

Appendix A.8. Variance and Standard Deviation
Variance measures the variability of the data points around the mean, and the standard deviation is the square root of the variance. The variance and standard deviation are computed as follows:

$$\text{Variance}(X) = \frac{1}{n}\sum_{i=1}^{n}\left(x_i - \text{Mean}(X)\right)^2 \quad \text{(A8)}$$

and

$$\text{Std}(X) = \sqrt{\text{Variance}(X)},$$

where $n$ and Mean are the size and average of the time series $X$.
Appendix A.9. Kurtosis and Skewness
Both kurtosis and skewness describe the shape of a distribution and are computed as

$$\text{Skewness}(X) = \frac{\frac{1}{n}\sum_{i=1}^{n}\left(x_i - \text{Mean}(X)\right)^3}{\text{Variance}(X)^{3/2}}$$

and

$$\text{Kurtosis}(X) = \frac{\frac{1}{n}\sum_{i=1}^{n}\left(x_i - \text{Mean}(X)\right)^4}{\text{Variance}(X)^{2}},$$

where Mean and Variance are given by (A3) and (A8).
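The four moment-based features can be sketched in NumPy directly from these definitions (population, i.e., $1/n$, normalization is assumed throughout):

```python
import numpy as np

def moments(x):
    """Variance, standard deviation, skewness, and kurtosis of a time series."""
    x = np.asarray(x, dtype=float)
    mean = x.mean()
    var = np.mean((x - mean) ** 2)            # population variance (1/n)
    std = np.sqrt(var)
    skew = np.mean((x - mean) ** 3) / var ** 1.5
    kurt = np.mean((x - mean) ** 4) / var ** 2
    return var, std, skew, kurt
```

For a symmetric sample such as [1, 2, 3, 4, 5], the skewness is zero, as expected.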
Appendix A.10. Absolute and Total Energy, Centroid, and Area under the Curve
Absolute energy is the sum of the squares of the data points:

$$E_{\text{abs}} = \sum_{i=0}^{n} x_i^2. \quad \text{(A12)}$$

Total energy is the absolute energy divided by the time range of the time series:

$$E_{\text{total}} = \frac{E_{\text{abs}}}{t_n - t_0}. \quad \text{(A13)}$$

Similarly, the centroid of a time series along the time axis is calculated as

$$\text{Centroid} = \frac{T \cdot E}{\sum_{i=0}^{n} x_i^2}, \quad \text{(A14)}$$

where $n$ is the size of the signal, $t_0$ and $t_n$ are the initial and last time stamps of the signal (time series), and $(\cdot)$ represents the dot product of the two vectors $T = \{t_0, t_1, \dots, t_n\}$ and $E = \{x_0^2, x_1^2, \dots, x_n^2\}$. The area under the curve of a time series is simply computed using the well-known trapezoid rule.
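A sketch of these four features in NumPy (the toy signal and time stamps are arbitrary; the trapezoid rule is written out explicitly):

```python
import numpy as np

t = np.linspace(0.0, 1.0, 5)                  # time stamps t_0 ... t_n
x = np.array([0.0, 1.0, 2.0, 1.0, 0.0])       # toy signal

abs_energy = np.sum(x ** 2)                   # (A12)
total_energy = abs_energy / (t[-1] - t[0])    # (A13)
centroid = np.dot(t, x ** 2) / abs_energy     # (A14)

# Area under the curve via the trapezoid rule.
auc = np.sum((x[1:] + x[:-1]) / 2.0 * np.diff(t))
```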
Appendix A.11. Autocorrelation Autocorrelation measures the similarity between a signal and a time-delayed (shifted) version of itself. For periodic signals, shifts by integer multiples of the period are perfectly correlated with the signal itself [72].
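A minimal sketch that uses the Pearson correlation between the signal and its shifted copy as the normalized autocorrelation (one of several common normalizations, chosen here as an assumption; lag is assumed to satisfy 1 <= lag < len(x)):

```python
import numpy as np

def autocorr(x, lag):
    """Pearson correlation between x and a copy of x shifted by `lag` samples."""
    x = np.asarray(x, dtype=float)
    return float(np.corrcoef(x[:-lag], x[lag:])[0, 1])

# A period-4 signal: shifts by multiples of the period correlate perfectly.
periodic = np.array([0.0, 1.0, 0.0, -1.0] * 8)
```

At half the period the signal is its own negation, so the autocorrelation is -1.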

Appendix A.12. Shannon Entropy
Shannon entropy is a well-known metric for measuring the uncertainty of a random process and can be used with non-linear and non-stationary signals [73]. In other words, Shannon entropy quantifies the degree of complexity of a signal. Shannon entropy is formulated as [74]

$$H(X) = -\sum_{i=0}^{n} p(x_i)\log_2 p(x_i),$$

where $p(x_i)$ is the probability of the data point $x_i$ such that $\sum_{i=0}^{n} p(x_i) = 1$.
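A small sketch in NumPy, assuming the probabilities p(x_i) are already estimated (e.g., from a histogram) and using base-2 logarithms:

```python
import numpy as np

def shannon_entropy(p):
    """Shannon entropy of a discrete probability distribution, in bits."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # by convention, 0 * log(0) contributes 0
    return float(-np.sum(p * np.log2(p)))

uniform = [0.25, 0.25, 0.25, 0.25]  # maximal uncertainty over 4 outcomes
```

A uniform distribution over four outcomes yields 2 bits, while a deterministic outcome yields zero entropy.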
The spectral centroid is the barycenter of the spectrum and can be computed as [85]

$$\mu_1(t_m) = \frac{\sum_k f_k\,\left|X(k, t_m)\right|^2}{\sum_k \left|X(k, t_m)\right|^2}, \quad \text{(A23)}$$

where $f_k$ denotes the frequency of bin $k$ and $t_m$ is the center of an analysis window in seconds. Spectral spread, which is also called the spectral standard deviation, calculates the spread of the spectrum around its mean value. Spectral spread can be computed as [85]

$$\mu_2(t_m) = \sqrt{\frac{\sum_k \left(f_k - \mu_1(t_m)\right)^2 \left|X(k, t_m)\right|^2}{\sum_k \left|X(k, t_m)\right|^2}}, \quad \text{(A24)}$$

where $\mu_1(t_m)$ is the spectral centroid at $t_m$ as given in (A23). Spectral skewness measures the asymmetry of the spectrum of a signal (time series) around its mean value, which can be calculated as [85]

$$\mu_3(t_m) = \frac{\sum_k \left(f_k - \mu_1(t_m)\right)^3 \left|X(k, t_m)\right|^2}{\mu_2(t_m)^3 \sum_k \left|X(k, t_m)\right|^2},$$

where $\mu_1(t_m)$ and $\mu_2(t_m)$ are the spectral centroid and spread at $t_m$ as given in (A23) and (A24). $\mu_3 = 0$ corresponds to a symmetric distribution, while $\mu_3 > 0$ and $\mu_3 < 0$ indicate more energy at lower and higher frequencies with respect to the mean value, respectively. Spectral kurtosis measures the flatness of the spectrum of a signal (time series) around its mean value, which can be formulated as [85]

$$\mu_4(t_m) = \frac{\sum_k \left(f_k - \mu_1(t_m)\right)^4 \left|X(k, t_m)\right|^2}{\mu_2(t_m)^4 \sum_k \left|X(k, t_m)\right|^2},$$

where $\mu_4 = 0$ corresponds to a normal distribution, while $\mu_4 > 0$ and $\mu_4 < 0$ indicate a wider and narrower distribution, respectively.
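The centroid and spread can be sketched with NumPy's FFT. Weighting by the power spectrum |X(k, t_m)|^2 is an assumption here, as the exact weighting of [85] is not reproduced in this appendix:

```python
import numpy as np

def spectral_centroid_spread(frame, fs):
    """Spectral centroid (A23) and spread (A24) of one analysis window."""
    mag2 = np.abs(np.fft.rfft(frame)) ** 2           # power spectrum |X(k)|^2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)  # bin frequencies f_k in Hz
    centroid = np.sum(freqs * mag2) / np.sum(mag2)
    spread = np.sqrt(np.sum((freqs - centroid) ** 2 * mag2) / np.sum(mag2))
    return centroid, spread

fs = 100.0
t = np.arange(0, 1.0, 1.0 / fs)
tone = np.sin(2 * np.pi * 10.0 * t)  # pure 10 Hz tone
```

For a pure tone, almost all spectral energy sits in one bin, so the centroid lands at the tone frequency and the spread is close to zero.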
Appendix A.26. Spectral Slope, Decrease, Variation, and Distance Spectral slope is calculated by applying a linear regression over the spectral amplitude values. There is a linear correlation between spectral slope and spectral centroid. Spectral decrease was proposed by [86]; it averages the set of spectral slopes between frequencies f_k and f_1 [86]. Spectral variation quantifies the spectral shape changes over time and is defined as one minus the normalized correlation between successive time frames [86]. Spectral distance calculates the cumulative sum of distances at different frequencies (f_k) with respect to the linear regression.

Appendix A.27. Wavelet Entropy and Energy
Wavelet entropy calculates the Shannon entropy of the continuous wavelet transform (CWT) [87]. Suppose that a set of wavelet coefficients is given as $W(a_i, t)$, $i = 1, 2, \dots, M$, where $a$ is the scale parameter of the wavelet coefficients. The wavelet entropy is computed as [88]

$$S_{WT} = -\sum_{i=1}^{M} p_i \ln p_i, \qquad p_i = \frac{\sum_t \left|W(a_i, t)\right|^2}{\sum_{j=1}^{M} \sum_t \left|W(a_j, t)\right|^2},$$

where $p_i$ is the relative energy of scale $a_i$. Furthermore, the wavelet energy is simply the sum of the squares of the absolute values of the wavelet coefficients $W(a_i, t)$.
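Given a precomputed CWT coefficient matrix (e.g., from pywt.cwt, not invoked here), the wavelet energy and entropy can be sketched as follows; normalizing by the relative energy per scale is an assumption following common practice:

```python
import numpy as np

def wavelet_entropy_energy(coeffs):
    """Wavelet energy and Shannon entropy of the relative energy distribution
    across scales, given CWT coefficients W(a_i, t) of shape (scales, time)."""
    coeffs = np.asarray(coeffs)
    energy_per_scale = np.sum(np.abs(coeffs) ** 2, axis=1)
    total_energy = float(np.sum(energy_per_scale))
    p = energy_per_scale / total_energy            # relative energy p_i per scale
    entropy = float(-np.sum(p * np.log(p + 1e-300)))  # guard against log(0)
    return entropy, total_energy

# Toy coefficients: all energy concentrated in one scale -> entropy near 0.
W = np.zeros((4, 8))
W[0, :] = 1.0
```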