Sensor Data Acquisition and Processing Parameters for Human Activity Classification

It is known that the selection of parameters for data sampling frequency and segmentation technique (including different methods and window sizes) has an impact on classification accuracy. For Ambient Assisted Living (AAL), no clear guidance for selecting these parameters exists; hence a wide variety and inconsistency across today's literature is observed. This paper presents an empirical investigation of different data sampling rates, segmentation techniques and segmentation window sizes and their effect on the accuracy of Activity of Daily Living (ADL) event classification and on computational load for two different accelerometer sensor datasets. The study is conducted using an ANalysis Of VAriance (ANOVA) based on 32 different window sizes, three different segmentation algorithms (with and without overlap, totaling six different parameter levels) and six sampling frequencies for nine common classification algorithms. The classification accuracy is based on a feature vector consisting of Root Mean Square (RMS), Mean, Signal Magnitude Area (SMA), Signal Vector Magnitude (here SMV), Energy, Entropy, FFTPeak, and Standard Deviation (STD). The results are presented alongside recommendations for parameter selection on the basis of the best performing parameter combinations, identified by means of the corresponding Pareto curve.


Introduction
Ambient Assisted Living (AAL) is currently on the research agenda of many stakeholders worldwide, especially in Western countries, driven mainly by the needs of an aging population and in an attempt to address the demands of care and intervention for the elderly and those who require care. The main areas of interest in Assisted Living (AL) include fall prevention, promotion of independence, as well as ambulation and Activity of Daily Living (ADL) monitoring (for fall detection, activity recognition and classification). The timeliness and accuracy of ADL classification are essential to provide the elderly with a sense of security and confidence, as inadequate classification could have severe consequences, especially in the case of an emergency event such as a fall [1,2]. Furthermore, reasonable levels of ADL facilitate the promotion of independence, hence the need for ambulation and ADL monitoring. Consequently, automated monitoring of subjects living independently in their homes, using wearable and off-body sensor-based devices, has been the subject of numerous research studies. While the literature highlights a great number of research areas for assisted living, such as sensor designs, placement of monitoring devices, novel monitoring techniques, fall detection and ADL data collection and classification methods, it fails to clarify some of the underlying and fundamental aspects of data collection in this field, such as data acquisition and pre-processing (outlined in Figure 1, presenting the standard prerequisites before ADL classification can take place). Falls and ADL events are generally classified based on the features extracted from segments of the monitoring sensor data; these segments therefore play a significant role in the accuracy of event classification [3].
Even though researchers are aware of the importance of sampling frequency, segmentation method, and window size with respect to feature extraction, the issue is not addressed in the reviewed studies: no clear explanation or justification is given for the parameter selection. Furthermore, researchers tend to ignore the required Computational Load (CL) for data classification, which is of particular interest once classification takes place on an embedded system for real-time ADL recognition.
The literature review showed that there is no consensus in the selection of parameter combinations, which, once chosen, are seldom varied by researchers to improve classification results. Therefore, the work described in this paper empirically investigates the influence of sampling frequency (SF), segmentation method (SM), and window size (WS) on the classification accuracy (CA) and computational load (CL) using two independent datasets (from Bao et al. and Roggen et al.). The work presented here tests eight commonly used features that are obtained from the accelerometer sensor data to determine CA and CL. The inputs to the classifiers are Root Mean Square (RMS), Mean, Signal Magnitude Area (SMA), Signal Vector Magnitude (here SMV), Energy, Entropy, FFTPeak, and Standard Deviation (STD). The results have been analysed using an ANalysis Of VAriance (ANOVA) to reveal the influence of the parameter combinations on the CA and CL. This is followed by an approach to recommend the parameter combinations that achieve the best CA disregarding CL and vice versa. Other parameter combinations may represent interesting trade-off points between these two preferences. This may be required in situations where time and hardware resources are limited. The authors aim to provide a more informed approach to parameter selection for event classification (with respect to the investigated ADLs) in the area of AAL. Section 2 will highlight existing literature to outline the inconsistency and insufficient justification for parameter selection in ADL classification. This section also presents the process of data acquisition and introduces different segmentation techniques. Section 3 describes the investigation procedure. Section 4 presents the experimental results with a recommendation for parameter combinations, and Sections 5 and 6 present the discussion of results and conclusion.

Sampling Rate
The acquisition of data is one of the most critical steps in event classification, as re-running experiments with test subjects is not always possible. Undersampling leads to loss of information, while oversampling can result in information buried in unwanted noise. In the latter case, longer computational time is needed for analysis, as more data needs to be processed. The minimum sampling rate f_sampling is dependent on the maximum frequency contained in the data signal, f_max: according to the sampling theorem, f_sampling must be at least 2 · f_max [4]. In the area of AAL, a review of the literature has not uncovered a typical sampling frequency.
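The sampling-theorem lower bound can be stated compactly. The following sketch (the function name is ours, not from the study) encodes it in Python:

```python
def min_sampling_rate(f_max_hz: float) -> float:
    """Minimum sampling rate per the sampling theorem: the sampling
    frequency must be at least twice the highest frequency f_max
    contained in the signal."""
    return 2.0 * f_max_hz

# For a signal whose spectrum is essentially contained below 15 Hz,
# this gives a minimum sampling rate of 30 Hz.
```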
The highest sampling rate for AAL that the authors found during their research is 512 Hz by [5], followed by the works of [6], where the authors use a sampling rate of 256 Hz to collect accelerometer data. [7] use a two-axis accelerometer and a sampling frequency of 76.25 Hz, which is less than one third of the sampling rate in [6]. [8] choose f_sampling to be 64 Hz. The authors acknowledge the high sampling rate used by [6]; however, they reduced the sampling frequency on the basis that lower values are more feasible with off-the-shelf activity monitors. They further mention the work of [9], who sample accelerometer data at 50 Hz, and therefore resample their own data at the same frequency.
Overall, the literature highlights that values around 50 Hz are among the more common sampling rates. [10] use 52 Hz and [11] use 50 Hz to sample their tri-axial accelerometers, while [12] and [13] also report a 50 Hz sampling rate for an eWatch with a two-axis accelerometer and a light sensor. To the authors' best knowledge, [13] are the only ones who tested different sampling frequencies (from 1 to 30 Hz) for the sensor data. The outcome highlights that the recognition of ADLs improves with higher sampling rates, but only marginally for sampling rates above 20 Hz. In [14] the authors demonstrate that 98% of the FFT spectrum amplitude is contained below 10 Hz, and 99% below 15 Hz. This corresponds to the findings of [15], who state that a sampling frequency of 20 Hz is sufficient to successfully classify ADLs. The lowest sampling rate that the authors found in the literature is 5 Hz by [16].

Segmentation Method
One of the challenges of data pre-processing following acquisition is deciding which points of the live data stream to actually use. Several different segmentation methods exist to divide a larger data stream into smaller, fit-for-processing chunks. The selection of the right segmentation technique is crucial, as it directly impacts the extracted features used for the ADL classification and the resulting classification accuracy: even the best classifier will perform weakly when the extracted features are non-differentiable [3]. Furthermore, the segmentation technique can also have an impact on real-time capabilities, as complex segmentation methods can increase CL but might result in improved classification accuracy. Moreover, the segmentation method also dictates how often features need to be extracted and classification algorithms need to be executed. The literature highlights several different segmentation techniques used in various research projects, such as: Fixed-size Non-overlapping Sliding Window (FNSW) [3,17], Fixed-size Overlapping Sliding Window (FOSW) [3,17], Top-Down (ToD) [17], Bottom-Up (BUp) [17], Sliding Window And Bottom-up (SWAB) [17], Symbolic Aggregate approXimation (SAX) [3], String Matching (SM) [3], Reference-based Windowing (RbW) [18], Dynamic Windowing (DWin) [19] and Variable-size Sliding Window (VSW) [20]. The significant difference between these techniques resides in their online and offline capabilities: an online technique can segment the data before the complete dataset is collected, while offline methods require the entire dataset first. For real-time applications, only online techniques are of interest. [17] note that online algorithms can produce very poor approximations of data under certain conditions but have a relatively good performance on noisy data.
However, the authors also highlight that the FOSW segmentation algorithm is of particular interest in medical research, e.g., patient monitoring, as the algorithm is simple and intuitive for researchers to understand. The algorithms investigated in this paper were therefore chosen to be fairly simple to understand and online capable (FNSW, FOSW, and SWAB).

Window Size
Researchers who use fixed-size window segmentation methods apply inconsistent window sizes. [10] use especially short windows of 1 s. [8] report using a 2 s window, based on the short ADLs in their research and because they achieved only a minimal gain in classification accuracy with features from a 3 s window. Further examples of short windows are [21] and [9], with 2 s and 2.56 s respectively. [13] extract features from a 4 s buffer, [12] use 5 s in their research, and [7] report a window length of 6.7 s. While these researchers use short window sizes, [20] describe the usage of a 60 s window in the work of [22] and 74 s in [23]. Furthermore, [20] introduce possible modifications to the fixed-size window methods. The authors suggest dynamically adjusting the window size based on special events in the sensor data, as different ADLs have different time frames. They raise the point that longer window sizes can cover more than one ADL while a small window could split an activity, both of which lead to suboptimal information for an activity classification algorithm. The work of [16] makes a similar point, indicating that to achieve good classification accuracy, different sensor features should be extracted using varying window sizes. These methods lead to complex monitoring systems if several ADLs need to be classified: each feature window could yield different ADL classification results, which would then require a voting system to predict the correct ADL from the list of possible activities. Section 2 highlighted the divergence in parameter selection in the literature covering ADL event classification. Table 1 represents a summary of the different combinations discovered. The section above pointed out the problems that are introduced when the wrong sampling frequency (over/undersampling) is used for data acquisition. It also showed that researchers in the field are not in agreement over which sampling rate to use.
The section also showed the use of various window sizes covering a wide range of values. Most studies base their parameter combinations on past experiments or hardware limitations, or do not state a specific reason. It was also found that possible CA or CL improvements based on different combinations are neither investigated nor mentioned. This inconsistency in parameter selection is the foundation for the study presented here, which aims to enable a more informed decision on parameter selection.

Investigation Procedure
The work presented here is based on two different datasets from the literature. The first dataset contains two-axis accelerometer data collected by [7]. The data covers 20 participants (13 males and seven females with a mean age of 21.8 years (±6.59 years SD)), who were recruited at MIT with the help of posters. The experiment required the test subjects to execute several different ADLs under laboratory conditions without any supervision or guidance. Sensor data was collected simultaneously at five different positions (ankle, thigh, wrist, hip, upper arm), with a sampling frequency of 76.25 Hz. From the five sensor positions, the data of the hip sensor from all twenty participants was used, with the focus on ADLs such as walking, sitting, walking and carrying an item, standing still, lying down, and climbing stairs. The second dataset is the Opportunity dataset, collected as part of a European funded project by [24]. The dataset is not limited to body-worn accelerometer data: the complete set includes a total of 10 different sensor types, such as microphone, magnetometer, UWB localization, RFID, etc., totaling a collection of 72 sensors. Data was recorded with 12 test subjects, which are not further specified. Of these 12 subjects, only three are labeled and available in the UCI Machine Learning Database. The labeled locomotion activities used from this dataset are stand, walk, sit, and lie. As both datasets should be similar to allow for comparable results, the Opportunity dataset was limited to the accelerometer (sampled at 64 Hz) placed at the subject's hip and to the x and y axes. Using Matlab [25], the sensor data was resampled and segmented and the data features were extracted, while the Weka software package [26] provided the implementation of the classification algorithms.

Resampling
Section 2 showed that sampling rates vary greatly throughout the literature; it also indicated a frequent use of sampling rates around 50 Hz, even though the work of Maurer et al. argues that sampling rates above 20 Hz only marginally improve classification accuracy [13]. Therefore, the complete data set was resampled using Matlab at six different sampling rates in the range of 10 to 60 Hz in 10 Hz steps. Intermediate steps were ignored for the benefit of faster experiments, as well as the authors' belief that their omission would not cause a loss of generality of the results. Additionally, sampling rates above 60 Hz were excluded, as the authors concur with [8], who stated that higher sampling rates are harder to achieve with off-the-shelf components.
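To illustrate the resampling step: the study used Matlab's resample function; the sketch below is a simplified linear-interpolation stand-in (without Matlab's anti-aliasing filter), and all names are ours:

```python
def resample_linear(samples, f_in, f_out):
    """Resample a uniformly sampled signal from rate f_in to f_out (Hz)
    using linear interpolation between neighbouring input samples.
    A simplified stand-in for Matlab's resample, which additionally
    applies an anti-aliasing low-pass filter before decimation."""
    if len(samples) < 2:
        return list(samples)
    duration = (len(samples) - 1) / f_in      # signal duration in seconds
    n_out = int(duration * f_out) + 1         # number of output samples
    out = []
    for k in range(n_out):
        pos = (k / f_out) * f_in              # fractional input index
        i = min(int(pos), len(samples) - 2)
        frac = pos - i
        out.append(samples[i] * (1 - frac) + samples[i + 1] * frac)
    return out
```

For example, resampling a 76.25 Hz stream down to 50 Hz (or a 64 Hz stream down to 10 Hz) only requires changing `f_in` and `f_out`.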

Data Segmentation
The work presented here focuses on three online segmentation techniques: FNSW, FOSW (with four different overlap percentages), and SWAB that were introduced in Section 2.2. As described above, the advantages of these algorithms are that they are online capable, therefore can be used while the data collection is in progress and are simple and intuitive so that they are easily understood.
FNSW is a simple segmentation technique without any data overlap (see Figure 2a). The end point of segmentation window N is the starting point for window N + 1. It is therefore possible to calculate the exact number of windows generated for a given dataset size with Equation (1):

N_FNSW = floor(S / w), with w = f_s · t_w (1)

where S is the total number of signal samples, f_s is the data resampling rate used (in the range of 10 to 60 Hz), t_w is the selected window size (in the range of 0.5 to 24 s), and w is therefore the window length in samples. One disadvantage of this technique is that data associated with a particular feature (e.g., a fall) can be split between windows. A FNSW variant that is not covered in this paper leaves a gap between adjacent windows; this would result in uncovered sensor data and could therefore miss important information. The FOSW segmentation technique is based on FNSW but includes data overlap (see Figure 2b, showing FOSW with an overlap of 50%). Depending on the percentage overlap, more or less data overlaps from window N into window N + 1. This is also referred to as a window shift. A 0% overlap corresponds to the FNSW segmentation method, while an overlap of 100% would yield a static window, as it would not be shifted and the data would always be segmented at the exact same point. Therefore, FOSW is required to shift by at least one data point per segmentation. The number of segmentation windows generated can be calculated using Equation (2):

N_FOSW = floor((S − w) / (w · (1 − P/100))) + 1 (2)

with P being one of the following percentage overlap values used for this research: 25, 50, 75, and 90.
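The window counts of Equations (1) and (2) can be sketched as follows (the rounding conventions for fractional window lengths and shifts are our assumptions, as the source does not specify them):

```python
def fnsw_window_count(total_samples: int, f_s: float, window_s: float) -> int:
    """Equation (1): number of Fixed-size Non-overlapping Sliding Windows,
    floor(S / w), where w = f_s * window length in seconds."""
    w = round(f_s * window_s)                 # window length in samples
    return total_samples // w

def fosw_window_count(total_samples: int, f_s: float, window_s: float,
                      overlap_pct: float) -> int:
    """Equation (2): number of Fixed-size Overlapping Sliding Windows,
    floor((S - w) / shift) + 1, with shift = w * (1 - P/100)."""
    w = round(f_s * window_s)
    # FOSW must shift by at least one sample (100% overlap is disallowed)
    shift = max(1, round(w * (1 - overlap_pct / 100)))
    return (total_samples - w) // shift + 1
```

With a 0% overlap the FOSW count reduces to the FNSW count, matching the statement that FNSW is the 0%-overlap special case.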
SWAB is the third segmentation technique used as part of the study presented here and was designed by [17]. It is a combination of the Sliding Window and Bottom-Up approaches. The process is visualized in Figure 3. The algorithm has a fixed-size data buffer that is used for the Bottom-Up approximation, which joins the smallest approximation segments until a stopping condition is met. Once the approximation for the window is complete, the data buffer shifts by the first segment (identified as Segment #1 in the illustration) and the process is repeated for the new buffer window. Each segment is used for feature extraction. As the data shift is dependent on the dataset and its approximation, it is not possible to predict the number of segmentation windows generated by the algorithm. The implementation is more complex compared to the FNSW and FOSW methods described above; an increased CL is therefore expected, while the CA gain is uncertain even though the literature suggests improved results.
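A minimal sketch of the Bottom-Up core that SWAB applies to its buffer may clarify the merging step. This is a simplification, not the implementation used in the study: it fits each segment with the straight line through its endpoints and uses a single fixed error threshold (both simplifying assumptions; all names are ours):

```python
def _sse(points, lo, hi):
    """Squared error of approximating points[lo..hi] by the straight
    line through its two endpoints (as in fast Bottom-Up variants)."""
    if hi == lo:
        return 0.0
    err = 0.0
    for x in range(lo, hi + 1):
        y_fit = points[lo] + (points[hi] - points[lo]) * (x - lo) / (hi - lo)
        err += (points[x] - y_fit) ** 2
    return err

def bottom_up(points, max_error):
    """Simplified Bottom-Up segmentation: start from the finest segments
    and greedily merge the adjacent pair whose joint linear fit has the
    lowest error, stopping once the cheapest merge exceeds max_error.
    Returns the segments as (start, end) index pairs."""
    segs = [(i, min(i + 1, len(points) - 1))
            for i in range(0, len(points), 2)]
    while len(segs) > 1:
        costs = [_sse(points, segs[i][0], segs[i + 1][1])
                 for i in range(len(segs) - 1)]
        i = min(range(len(costs)), key=costs.__getitem__)
        if costs[i] > max_error:
            break                       # stopping condition met
        segs[i] = (segs[i][0], segs[i + 1][1])
        del segs[i + 1]
    return segs
```

On a piecewise-linear signal the merges stop at the breakpoints, which is why the number of segments (and hence feature windows) depends on the data itself.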
As highlighted earlier in Section 2.2, there is no clear recommendation in the published literature on the selection of the window size used for data segmentation. The authors therefore tested 32 different sizes in the range of 0.5 to 24 s. From 0.5 to 8 s, the size is increased in 0.5 s steps; thereafter the step size is increased to 1 s. The step size was increased after 8 s because the ADLs under investigation have only a short time frame, and this reduced the computational load of the experiment. Even though the literature showed the use of longer window sizes, the aim is to include only single ADLs in each window to achieve the best classification results. The authors' initial research [27] supports this idea, as it indicates a decrease in accuracy for window sizes above 8 s. Furthermore, it is the authors' belief that most ADLs will only take a short amount of time and that a maximum of 24 s should be sufficient to include at least two ADLs.

Data Feature Selection
The following eight metrics are quite common in the area of ADL classification and therefore used to retrieve the different features of the accelerometer sensor data in this research: Root Mean Square (RMS), Mean, Signal Magnitude Area (SMA), Signal Vector Magnitude (here SMV), Energy, Entropy, FFTPeak, Standard Deviation (STD). These metrics and their significance are discussed below as each individual metric has its own influence in the research field.
RMS has been used to distinguish walking patterns [6] as well as being an input to classifiers for activity recognition [16,28]. The RMS value is calculated using Equation (3):

RMS = sqrt((1/N) · Σ_{i=1..N} x_i²) (3)

The Mean metric (Equation (4)) has been used to recognize sitting and standing [28,29], to discriminate between periods of activity and rest [30], and as an input to classifiers such as Decision Trees:

Mean = (1/N) · Σ_{i=1..N} x_i (4)

The next metric, SMA, is used to distinguish between periods of activity and rest in order to identify when the subject is mobilizing and undertaking activities, and when they are immobile [15,33,34]. Equation (5) implements SMA over the accelerometer axes:

SMA = (1/N) · Σ_{i=1..N} (|x_i| + |y_i|) (5)

SMV, normally referred to as Signal Vector Magnitude (SVM) but renamed here to avoid confusion with the SVM classifier used, indicates the degree of movement intensity and is an essential metric in fall detection [33,34]. The SMV value is calculated using Equation (6):

SMV = (1/N) · Σ_{i=1..N} sqrt(x_i² + y_i²) (6)

Two additional metrics used in this research are Energy and Entropy, which discriminate between types of ADL such as walking, standing still, running, sitting and relaxing [7,35]. The calculation of the Energy value is based on Equation (7), where F_i are the FFT components of the window, and the Entropy is calculated using the Matlab function from [36]:

Energy = (1/N) · Σ_{i=1..N} |F_i|² (7)

Another feature extracted from the accelerometer data stream is the FFTPeak for each axis. The metric has been used for activity recognition [5,12,35]. The FFTPeak algorithm was based on the Matlab example found at [37].
The last metric used is Standard Deviation (STD), which has been extensively used for activity recognition [29] and as an input to classifiers such as J48, Random Forest and Artificial Neural Networks (ANN) [16,38]. Equation (8) describes the calculation:

STD = sqrt((1/N) · Σ_{i=1..N} (x_i − Mean)²) (8)
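The feature metrics above can be sketched in Python as follows. The normalization factors are assumptions where the source equations are not reproduced; Entropy and FFTPeak, which rely on Matlab functions, are omitted:

```python
import math

def mean(xs):
    """Mean, Equation (4)."""
    return sum(xs) / len(xs)

def rms(xs):
    """Root Mean Square, Equation (3)."""
    return math.sqrt(sum(x * x for x in xs) / len(xs))

def sma(xs, ys):
    """Signal Magnitude Area over the two accelerometer axes, Eq. (5)."""
    return sum(abs(x) + abs(y) for x, y in zip(xs, ys)) / len(xs)

def smv(xs, ys):
    """Mean Signal Vector Magnitude over the window, Equation (6)."""
    return sum(math.sqrt(x * x + y * y) for x, y in zip(xs, ys)) / len(xs)

def energy(xs):
    """Energy, Equation (7): normalized sum of squared DFT magnitudes
    (naive O(N^2) DFT for clarity; an FFT would be used in practice)."""
    n = len(xs)
    total = 0.0
    for k in range(n):
        re = sum(x * math.cos(2 * math.pi * k * i / n) for i, x in enumerate(xs))
        im = -sum(x * math.sin(2 * math.pi * k * i / n) for i, x in enumerate(xs))
        total += re * re + im * im
    return total / n

def std(xs):
    """Standard Deviation, Equation (8)."""
    m = mean(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))
```

Each function operates on one segmentation window, so the eight-element feature vector is obtained by evaluating these metrics once per window.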

Classifier Selection
The software tool Weka implements several different classifiers, from which nine were selected based on the literature to verify the effects of changes in the parameters described above. [12] point out that common algorithms for activity classification are Support Vector Machines (SVM), Decision Trees, and Bayesian classifiers. The work of [13] included the use of Decision Trees and the Naïve Bayes classifier. [10] based their research on Decision Trees, Bagging of 10 Decision Trees, AdaBoost using Decision Trees as base classifiers, and a Random Forest of 10 Decision Trees. Additionally, in the work of [16] the authors compare J48 (Decision Trees) and Random Forest. This investigation therefore included the following classifiers: Naïve Bayes, SMO (based on SVM), KNN, KStar, MultiClassClassifier, Bagging, Decision Table, J48, and Random Forest. All algorithms were tested using Weka's standard configuration and 10-fold cross validation. Classifier fine-tuning is a research field in its own right and is therefore not discussed here. The use of different classification methods has enabled the authors to verify the impact of the sampling rate, segmentation method, and window size on the classification accuracy over a wide range of algorithms used in AAL.
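The 10-fold cross-validation protocol applied to every classifier can be sketched as follows. The study used Weka's implementations; here a minimal 1-nearest-neighbour classifier stands in for them purely to make the sketch self-contained (all names and the fold-assignment scheme are ours):

```python
def knn1_predict(train, query):
    """Predict the label of the nearest training point (1-NN with
    squared Euclidean distance on the feature vector)."""
    best = min(train, key=lambda sample: sum((a - b) ** 2
               for a, b in zip(sample[0], query)))
    return best[1]

def cross_validate(dataset, k=10):
    """k-fold cross-validation accuracy: split the (features, label)
    pairs into k folds, train on k-1 folds, test on the held-out fold,
    and average the accuracy over all folds."""
    folds = [dataset[i::k] for i in range(k)]   # round-robin fold split
    correct = total = 0
    for i, test_fold in enumerate(folds):
        train = [s for j, f in enumerate(folds) if j != i for s in f]
        for features, label in test_fold:
            correct += knn1_predict(train, features) == label
            total += 1
    return correct / total
```

Swapping `knn1_predict` for any other classifier leaves the evaluation loop unchanged, which is how a single protocol covers all nine algorithms.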

Experimental Results
This section is split into three subsections to focus on different aspects of the parameter selection problem introduced by the variation in classification methods (CM), sampling frequency (SF), segmentation methods (SM) and window size (WS). The first two analyses use an ANalysis Of VAriance (ANOVA) to investigate the impact of the identified parameters on the two dependent variables classification accuracy (CA, as described in Section 4.1) and computational load (CL, as described in Section 4.2). The analyses have been conducted in SPSS [39]. In the third subsection, the authors present the results of an investigation into the effects of various input parameter combinations on CA and CL, with a view to enabling optimum parameter selection based on the identification of the corresponding Pareto curve (see Section 4.3). Figure 4 below shows the different levels of the four parameters CM, SF, SM, and WS: 32 different window sizes, three segmentation methods with different parameters (resulting in six SM levels) and six sampling frequencies for each of the nine different classification algorithms. This results in 10,368 different parameter combinations for each of the 23 test subjects for the analysis of variance (ANOVA) in SPSS. Next to classification accuracy, the literature also highlights the use of precision, recall, and f-measure. In [40] the author argues that precision, recall, and f-measure are especially useful for highly imbalanced datasets. For example, when faced with a two-class problem with a split of 98% (majority) and 2% (minority), just guessing the majority class will achieve an accuracy of 98%. If the detection of the minority class, say, representing rare and infrequent events (e.g., falls), is important, an accuracy of 98% would be misleading in terms of the performance of the classifier.
The datasets used for this research include six and four different activity classes, respectively, which are roughly equally distributed and equally important to classify. It is therefore adequate to use the accuracy of the classifier instead of the f-measure, computed as the ratio of correctly classified instances to the total number of instances (Equation (9)):

Accuracy = N_correct / N_total (9)
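The imbalance caveat from the paragraph above can be made concrete with a short sketch (names ours):

```python
def accuracy(predictions, labels):
    """Classification accuracy: correctly classified instances over all
    instances (the performance measure used throughout this study)."""
    return sum(p == l for p, l in zip(predictions, labels)) / len(labels)

# The 98%/2% two-class example from the text: always guessing the
# majority class scores 0.98 accuracy while never detecting the rare
# (e.g., fall) events -- which is why accuracy is only adequate for the
# roughly balanced classes used here.
labels = ['adl'] * 98 + ['fall'] * 2
majority_guess = ['adl'] * 100
```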

Statistical Analysis of Accuracy
This section reports on the impact of variations of the four input parameters (CM, SF, SM, and WS) on classification accuracy (the output). During the initial analysis of the ANOVA output, the tested accuracy for the KStar algorithm showed a strong sensitivity to changes in the input parameters (SF, SM, and WS). Its influence on the analysis was significant and superimposed itself on the results: certain input parameters appeared to be significant for the classification accuracy when, overall, the impact resulted from the sensitivity of the KStar classifier. The authors therefore decided to exclude KStar from the analysis in order to avoid a misinterpretation of the overall impact of the parameters on accuracy.

Dataset Bao et al.
The ANOVA results (presented in Table 2), excluding KStar, showed that 49% of the variations in the dependent variable (accuracy) are described by the four input parameters. This means that other input parameters that were not tested in the scope of this experiment may have further influence on the accuracy. Such a result is not surprising, as the investigated problem is highly complex, and it is understandable that factors such as the test subject itself and the recorded movement also have an impact on the resulting accuracy. The table shows the Sums of Squares (a measure of the variability in the data), Degrees of Freedom (df; the number of scores that are free to vary once the mean of the set of scores is known), Mean Square (used to estimate the variance), F (the F-ratio, indicating whether the effect on performance is caused by the independent variables rather than by chance), and Sig. (the significance level: main/two-way interaction effects are significant at <0.05 and non-significant at >0.05) for all main and two-way interaction effects. The two-way interaction effects outline a changing main effect of one factor for different levels of a second factor and are therefore of higher interest than the main effect alone, if they are identified as being significant. The Sig. column in Table 2 shows that each main effect and the six two-way interaction effects (SM and CM, WS and CM, SM and WS, SF and WS, SF and CM, SF and SM) have a significant impact on the accuracy. Furthermore, the Type III Sum of Squares can be used as an indication of the importance of the main and two-way interaction effects. For an easier overview, the rows are already sorted, and it is observed that CM is the most influential factor, followed by SM, WS, and SF in decreasing order. The significance of the two-way interaction effects starts with SM and CM and is followed in decreasing order by WS and CM, SM and WS, SF and WS, SF and CM, and SF and SM.
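To make the reported quantities concrete, the F-ratio of a single factor can be computed from the sums of squares and degrees of freedom as below. This is a one-way sketch for illustration only; the study ran a full factorial ANOVA with interaction terms in SPSS:

```python
def one_way_anova_f(groups):
    """F-ratio for a one-way ANOVA: between-group mean square divided by
    within-group mean square (Mean Square = Sum of Squares / df, the
    quantities reported per factor in the ANOVA tables)."""
    all_vals = [v for g in groups for v in g]
    grand_mean = sum(all_vals) / len(all_vals)
    # between-group sum of squares and its degrees of freedom
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                     for g in groups)
    df_between = len(groups) - 1
    # within-group sum of squares and its degrees of freedom
    ss_within = sum(sum((v - sum(g) / len(g)) ** 2 for v in g)
                    for g in groups)
    df_within = len(all_vals) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)
```

A large F-ratio (relative to the F-distribution for the given degrees of freedom) yields the small Sig. values (<0.05) that mark an effect as significant.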
Figure 5 shows the interaction effect of changes in the segmentation methods on the classifier methods. The Naïve Bayes classifier shows only a minor improvement in accuracy, whilst the remaining seven classifiers show a substantial increase with an increased segmentation overlap for FOSW. Another visible effect is the good performance of classifiers with the SWAB segmentation method, mostly outperforming FOSW with 75% overlap. For Naïve Bayes, SWAB showed a significant decrease in performance, which results in an even lower CA when compared to FNSW. Figure 6 presents the effect of an increased window size on the different classifiers. Besides Naïve Bayes, all classifiers show lower accuracy for window sizes below 7 s, stagnate thereafter, and start to decrease once the window size increases above 9 s. Naïve Bayes is the only CM that actually improves CA with an increased window size.
The next effect investigated is the interaction between WS and SM in Figure 7. It is noticeable that an increased window size decreases the accuracy for each segmentation method. Furthermore, the effect reduces with an increased segmentation overlap, showing a less significant impact on FOSW with 90% overlap compared to FOSW with 25% overlap. SWAB follows the behavior of FOSW with 75% overlap, with a decreased overall accuracy. The figure also shows that a window size of 6.5 to 11 s results in the best accuracy. Figure 8 shows the interaction effect between WS and SF. The effect of an increased window size is similar for all six sampling frequencies, although it is reduced for a sampling frequency of 10 Hz at longer window sizes. The graph also highlights that higher frequencies achieve the best accuracy for shorter window sizes, while the 10 Hz sampling frequency requires a slightly larger window. The last effect under investigation is SF and CM. The graph is not presented, as there is no interaction effect between the different classifiers. The only effect that exists is a minor improvement of accuracy for a change of sampling frequency from 10 to 20 Hz, with a nearly constant accuracy thereafter for all classifiers. This correlates with the statement in [13] that sampling frequencies above 20 Hz have only a marginal effect on the classification accuracy.

Dataset Opportunity
The ANOVA results (presented in Table 3), which exclude KStar for the reason mentioned in Section 4.1.1, showed that 53% of the variations in the dependent variable (CA) are described by the four input parameters. As before, other input parameters that were not tested in the scope of this experiment may have further influence. Compared to the earlier table, the Sig. column this time shows that three two-way interaction effects (SF and SM, SF and CM, SF and WS) are non-significant for this dataset. The data is again sorted based on the Type III Sum of Squares for an easier overview of the importance of the main and two-way interaction effects. The influential factors CM, SM, WS, and SF appear in the same decreasing order as in the earlier Table 2, but the order of the two-way interaction effects changed. The most significant effect is now SM and WS, followed in decreasing order by SM and CM, and WS and CM, with the rest non-significant. Figure 9 investigates the interaction between SM and WS. It is noticeable that an increased window size decreases the accuracy for each segmentation method. While SWAB shows a near linear decrease in CA, the FOSW and FNSW segmentation methods show higher variation in CA for longer WS. The next effect investigated is the interaction between WS and CM in Figure 11. All CMs show a decrease in CA for longer WS. The effect is less significant for Naïve Bayes compared to the other classifiers. The graph shows that shorter WS result in better CA.

Statistical Analysis of Computational Load
The selection of different SF, SM, WS and CM does not only have an impact on the classification accuracy but also on the CL of the system. The CL for the classification of ADL events is based on two main factors: the first is the data pre-processing and feature extraction step, and the second is the actual event classification (both indicated in Figure 12). The former depends on the selected SF, SM and WS (excluding any other pre-processing steps such as filtering, which is not of interest in this study), while the latter depends purely on the selected CM. For real-time applications, the combination of SM and WS introduces a limitation for certain parameter combinations, leading to the requirement of Equation (10): the combined pre-processing and classification time must not exceed the time between two consecutive segmentation windows (the window shift). The authors therefore conducted an analysis with the CL as the dependent variable to investigate the influence of the four input parameters SF, SM, WS and CM. In the preliminary analysis, one of the levels of the SM input showed a high influence on the dependent variable: as before with KStar superimposing on the parameters for the accuracy, the SWAB segmentation method noticeably increases the CL compared to the other methods. Hence, effects that were non-significant before become significant once SWAB is removed as a SM level. Therefore, the analysis outlines the overall input variables without the SWAB segmentation method.
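The real-time requirement behind Equation (10) can be encoded as a small feasibility check. Since the equation body is not reproduced in the source, the formulation below is our reading of the stated constraint (all names are ours):

```python
def realtime_feasible(t_preprocess_s, t_classify_s, window_s, overlap_pct):
    """Real-time constraint in the spirit of Equation (10): the time to
    pre-process/extract features plus the time to classify one window
    must not exceed the time until the next window is due, i.e. the
    window shift implied by the window size and overlap percentage."""
    shift_s = window_s * (1 - overlap_pct / 100)  # seconds between windows
    return t_preprocess_s + t_classify_s <= shift_s
```

For example, a 2 s window with 50% overlap leaves a 1 s budget per window; a parameter combination whose combined processing time exceeds that budget cannot run in real time on the target hardware.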

Dataset Bao et al.
The result of the ANOVA is represented in Table 4, outlining that 48% of the variation in the dependent variable CL is described by the variation in the input parameters. The Source column is sorted based on the Sum of Squares to allow for an easier recognition of the importance of an input parameter. The data highlights that the most important factor for the CL is CM (with SWAB included, this effect was actually non-significant). This is followed by WS, SF and, as the least significant parameter, SM. For the two-way interaction effects, the five significant combinations are WS and CM, SM and CM, SF and WS, SM and WS, followed by SF and SM. Figure 13 shows the interaction effect between WS and CM. All classifiers incur a higher CL for longer WS, with a small visible drop in CL for short window sizes of about 2 s. The graph also shows that the rate at which the CL increases is higher for MCC and SMO. The interaction effect for segmentation method and classifier method in Figure 14 shows that there is a significant improvement in CL for MCC and SMO for an increased overlap. The remaining classifiers show only minor changes. Figure 15 shows the interaction effect between WS and SF. Longer window sizes result in a higher CL for all SF. Moreover, the graph shows that for higher SF the rate of increase in CL also increases. The last interaction effect under investigation is WS and SM. The graph in Figure 16 shows that the segmentation method follows patterns similar to the classifier. All segmentation methods show a significant increase in CL for longer window sizes. An interesting observation is that segmentation methods with higher overlap result in lower CL for higher window sizes.

Dataset Opportunity
The result of the ANOVA is represented in Table 5, outlining that 72% of the variation in the dependent variable CL is described by the variation in the input parameters. The Source column is sorted based on the Sum of Squares to allow for an easier recognition of the importance of an input parameter. The data highlights that the most important factor for the CL is CM. This is followed by WS, SF and, as the least significant parameter, SM. For the two-way interaction effects, the four significant combinations are WS and CM, SM and CM, SF and WS, followed by SM and WS. The interaction effect SF and SM, which was significant earlier, is non-significant for this dataset. Figure 17 shows the interaction effect between window size and classifiers. All classifiers incur a higher CL for longer WS. The graph also shows that the rate at which the CL increases is higher for SMO. MCC, which had an increased rate in the earlier dataset, now follows the behavior of the other classifiers. The interaction effect for segmentation method and classifier in Figure 18 shows that there is a significant improvement in CL for SMO for an increased overlap. The remaining classifiers show only minor changes for the different segmentation methods. Figure 19 shows the interaction effect between WS and SF. Longer window sizes result in a higher CL for all SF. Moreover, the graph shows that higher sampling frequencies result in an increased rate of CL as well. The last interaction effect under investigation is window size and segmentation method. The graph in Figure 20 shows that the segmentation method follows patterns similar to the classifier. All segmentation methods show a significant increase in CL for longer window sizes. An interesting observation, as with the other dataset, is that segmentation methods with higher overlap result in lower CL for higher window sizes.

Parameter Selection
Based on the parameter influence described in Sections 4.1 and 4.2, the inevitable question still stands: what is the best parameter selection for a given requirement? The answer, however, depends strongly on the preference with respect to classification performance, e.g., is the best accuracy required or are there limitations on CL? Therefore, a set of well-performing parameter sets based on the trade-off between accuracy and CL was identified. For a given dataset, certain parameter combinations will achieve a similar CA but will require different CL and vice versa. When plotted in a graph, such as presented in Figure 21, the best accuracy for a given CL follows the black line (called the Pareto frontier), with dominated parameter sets lying on the left hand side of the curve. Hence, a parameter set is dominated if there exists a combination of parameter values that results in the same level of accuracy with less CL or achieves better accuracy with the same CL. The Pareto frontier, also referred to as the Pareto curve, outlines the set of non-dominated solutions, herein represented by a set of parameter combinations. One set of parameter values may achieve the best CA at the cost of a high CL (Point 1) and another combination will achieve the lowest CL at the cost of a lower accuracy (Point 2). Parameter sets between Points 1 and 2 on the Pareto frontier are subject to a trade-off (Point 3), hence accepting the sacrifice of either accuracy or CL, depending on the context of the application or potential corresponding limitations, e.g., hardware constraints.
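The dominance test described above can be sketched directly in code. The parameter labels and the CL/CA values below are hypothetical placeholders, not results from the paper:

```python
# Sketch of identifying the Pareto frontier over (CL, CA) results.
# A parameter set is dominated if another achieves at least the same
# accuracy at no more load, and is strictly better in one of the two.
# All tuples below are hypothetical illustrations.

def pareto_frontier(results):
    """results: list of (computational_load, accuracy, label) tuples.
    Returns the non-dominated subset, sorted by increasing load."""
    frontier = []
    # Sort by ascending load; for equal load, higher accuracy first,
    # so an equal-load/lower-accuracy point is correctly rejected.
    for cl, ca, label in sorted(results, key=lambda r: (r[0], -r[1])):
        if not frontier or ca > frontier[-1][1]:
            frontier.append((cl, ca, label))
    return frontier

candidates = [
    (1.0, 0.95, "KNN, FOSW-90%, 2 s, 30 Hz"),        # best CA, high CL
    (0.2, 0.80, "Naive Bayes, FNSW, 1 s, 10 Hz"),    # lowest CL
    (0.5, 0.90, "KNN, FOSW-50%, 1 s, 10 Hz"),        # trade-off point
    (0.7, 0.85, "SMO, FNSW, 4 s, 50 Hz"),            # dominated
]
front = pareto_frontier(candidates)
for cl, ca, label in front:
    print(f"CL={cl:.1f}  CA={ca:.2f}  {label}")
```

The fourth candidate is dominated because the third achieves higher accuracy at lower load, so only three points remain on the frontier; these correspond to the roles of Points 1 to 3 in Figure 21.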

Summary and Discussion of Results
One of the main problems in AAL is the availability (or the lack thereof) of test subjects, as compared to clinical trials, where subjects can reach into the thousands. In [41] the authors highlight that research in AAL starts out as a demonstration of feasibility under laboratory conditions, which in a further step needs an increased number of participants and ethical considerations. In [42], the authors argue that the use of either of two activity classification methods, uniform (where the training data comes from all test subjects) and individual (training data representing separate test subjects), can lead to problems: poor generalization (arising from the uniform method) and a small training data set (individual method) can both result in poor performance. The research and associated experiments presented here fall in the individual category, as performance measures (CA and CL) are generated for each of the test subjects involved. The authors believe that, despite the pitfalls described above, this was the better method to adopt; this is in line with Elbert et al.'s approach. Moreover, as the tested activities are nearly equally represented in the dataset, the accuracy measure can be used without loss of validity. This is in contrast with the differentiation between, say, normal and abnormal conditions, where the latter occurs rarely, resulting in an imbalanced set of data; there, the use of precision, recall and F-measure would be a more appropriate performance indicator [20].
In summary, the outputs of the work presented here are listed below:
- The importance of parameters for CA, ranked in order of decreasing influence, is CM, SM, WS and SF;
- The impact of WS is different for the two datasets;
- Increased segmentation overlap improves CA;
- The influence of SWAB on CA is different in the two datasets;
- SF above 10 Hz yields only a minor improvement in CA;
- CL behaves the same for both datasets;
- The importance of parameters for CL, ranked in order of decreasing influence, is CM, WS, SF and SM;
- Some dominant parameter combinations of the Pareto curve are similar for both datasets;
- Higher CL does not automatically result in higher CA.
The following discussion will look into the results of the ANOVA for CA and CL and finish with the dominant parameter points of the Pareto curves. The two-way interaction effect between SM and CM highlights for both datasets that FOSW with 90% overlap results in the best CA. From FNSW (no overlap) to FOSW with 90% overlap, both datasets show that more overlap improves the CA. A possible reason for this is that the increase in overlap allows for a bigger training set and has the lowest loss of information in the range of investigated SM. The results for SWAB are mixed. For the Bao et al. dataset, the CA is just below that of FOSW with 90% overlap, while the Opportunity dataset showed SWAB to be the worst segmentation method tested. Further research needs to look into the actual benefit of a dynamically sliding window, which incidentally was reported in [20] as giving good results, as the results reported here (in terms of classification accuracy) are inconsistent between the two datasets. Another difference between the datasets is observed for the WS and CM two-way interaction effect. While for the first dataset (Bao et al.) the CA improves for window sizes between 1 and 8 s and only decreases for WS values above 8 s, the second dataset (Opportunity) achieves the best CA for 0.5 s and starts to decrease immediately after that. A similar behavior can be seen for the two-way interaction effects of WS and CM and of WS and SM in both datasets (compare Figure 6 and Figure 7 in Section 4.1.1 and Figure 11 and Figure 9 in Section 4.1.2). Researchers should therefore choose smaller window sizes if possible. Another difference between the two datasets is the significance of the two-way interaction effect WS and SF; namely, significant interaction (but not the most significant) for Bao et al. and non-significant interaction for the Opportunity dataset.
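The segmentation methods compared above can be sketched as follows; FNSW is the fixed-size non-overlapping and FOSW the fixed-size overlapping sliding window (the indices and window lengths here are illustrative, not the paper's settings). Note how increasing the overlap multiplies the number of windows, which is the enlarged-training-set effect discussed above:

```python
# Sketch of fixed-size window segmentation over a stream of samples:
# FNSW (no overlap) vs. FOSW (fixed overlap). Works on sample indices;
# window size and overlap values below are illustrative only.

def segment(n_samples, window_len, overlap_fraction=0.0):
    """Return (start, end) index pairs of fixed-size windows.
    overlap_fraction=0.0 reproduces FNSW; e.g. 0.9 gives FOSW-90%."""
    stride = max(1, int(window_len * (1.0 - overlap_fraction)))
    windows = []
    start = 0
    while start + window_len <= n_samples:
        windows.append((start, start + window_len))
        start += stride
    return windows

print(segment(10, 4))        # FNSW:     [(0, 4), (4, 8)]
print(segment(10, 4, 0.5))   # FOSW-50%: [(0, 4), (2, 6), (4, 8), (6, 10)]
```

SWAB, by contrast, adapts its segment boundaries to the signal (a dynamically sliding window) and cannot be expressed by a fixed stride like this.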
The graph in Figure 8 (see Section 4.1.1) shows that sampling frequencies above 10 Hz achieve nearly the same CA, while the 10 Hz sampling frequency is marginally lower, endorsing the finding in [13] that sampling frequencies above 20 Hz result in only minor accuracy gains.
For CL, both datasets show the same behavior for the two-way interaction effects. The three interaction effects involving WS (WS and CM, WS and SM, WS and SF) show similar behavior: a shorter WS results in a lower CL, while a longer WS increases the CL. This effect is lowest for WS and CM and highest for WS and SM. The interaction effect between SM and CM highlights no significant change for any classifier besides SMO; SMO is the only classifier whose CL is reduced with an increased segmentation overlap.
The authors used ANOVA to quantify the influence of the different parameters on the CA and CL. They have also used a Pareto curve based approach to highlight dominant parameter combinations for “optimum” achievable performance (optimality being decided by the user in a given context/application). Figure 24 presents the four Pareto curves based on the dominant combinations. The illustration shows that all graphs have a similar outline and that it is possible to achieve similar results irrespective of the dataset. This is highlighted by only a 4.3% difference in CA between the two top performing parameter combinations. However, the dominant parameter combinations are different for each dataset. Therefore, it is not possible to present a single combination that will work best for all datasets. Having said that, some dominant points have similar parameter combinations. In both datasets, high CA is achieved with the KNN classifier and FOSW with 90% overlap as the SM. Furthermore, the Pareto points show that a sampling frequency above 30 Hz is not necessary and that only minor improvements in CA are achieved with a sampling frequency above 10 Hz. As a consequence, the authors recommend adjusting parameters individually for each dataset and test subject to achieve optimal results, especially with regard to WS. The Pareto curves also reveal that a higher computational load does not necessarily result in better classification accuracy, as the algorithms under investigation are not recursive. The Pareto curve is also a good tool to investigate the influence of hardware limitations such as a low sampling rate, storage space and battery runtime. Superimposing the hardware-limited Pareto curve on the non-limited curve allows a simple comparison of achievable CA and CL.
The results presented in Section 4.3 in combination with the ANOVA in Sections 4.1 and 4.2 can be used for future research as a tool to select parameter combinations for AAL event classifications with the sound understanding of how each parameter influences the outcome of event classification accuracy and computational load.

Conclusions and Future Work
This paper has presented a new instrument to help select data capture and processing parameters for the recognition of Activities of Daily Living (ADL). A review of the literature uncovered a lack of consensus in terms of the selection of sampling frequency, segmentation method and window size, and classifier method for the recognition of ADL. The impact of the sampling frequency (six levels), segmentation method (three segmentation algorithms with different parameters, resulting in six levels) and segmentation window size (32 levels) on the classification accuracy and computational load of a set of commonly used classifiers (nine levels) has been investigated. This has involved experimenting with two datasets, containing 20 and three test subjects, respectively, and analysis of the resulting data using ANalysis Of VAriance (ANOVA). The analysis showed that the choice of classifier method is the most important parameter, followed by the segmentation method, window size and finally sampling frequency. It also showed that in the case of computational load the parameters ranked in order of decreasing influence are classifier method, window size, sampling frequency, and segmentation method. The results have been presented graphically using a Pareto curve, which highlighted two dominant classifiers for both datasets (KNN, Naïve Bayes). The Pareto curve did not show matching dominant points in both datasets; however, it showed that combinations of three out of the four factors (CM, SM, SF) are likely to result in dominant points. The authors have suggested that the Pareto curve is a good instrument which can be used to select sets of parameters based on their impact on classification accuracy and computational load and to resolve trade-off issues.
As part of their future work in the general area of AAL, the authors plan to investigate a number of issues specific to the findings presented in this paper. An important point of interest is the identification of the reasons behind the inconsistency between the two datasets used in terms of the influence of WS on the classification accuracy. A possible influential factor, not considered in the present work, is the nature of the ADL itself. It might be necessary to adjust the WS parameter with regard to the expected ADLs in the dataset; [16] suggested using different WS parameter combinations per activity. The authors also intend to investigate the influence of the extracted features and the position of sensors on classification accuracy. Different feature combinations (and a reduction in the number of required features) may improve the classification accuracy of different ADLs as well as reduce the CL. Moreover, the authors propose to couple the results obtained so far with a Decision Support System (DSS). Having the option to learn and adjust from past experience, and to include new ADLs, would allow for more informed decisions in parameter selection over time. Additionally, hardware limitations, such as battery runtime and communication bandwidth, should be included in the selection process. Another direction that the authors want to pursue is the investigation of how to improve the Pareto curve by replacing the computational load with a measure of training time and training samples, as this could highlight classifiers that achieve good accuracy within a short start-up time.