#### 4.2. Comparison of MANN Results to Other Studies on the SBHAR Dataset

In this section, the performance of MANN is evaluated with respect to classifiers proposed by other studies on the SBHAR dataset. It should be stressed that no research studies related to TAs have been performed on the SBHAR dataset, and thus, only classification performances on NTAs could be compared. Of course, since MANN was mainly built to improve the classification of highly time-correlated items in time-dependent classification tasks, a particular improvement on NTAs was not expected.

In order to compare the MANN results to those in the literature, the standard user-independent train-test split proposed by the authors of the SBHAR dataset was employed: out of the 30 volunteers, 21 were used for training and the remaining 9 for testing. This user-independent split gave a good estimate of the generalization accuracy, since no data from the same user appeared in both the training and the test sets, and consequently, there was no risk of the classifier learning patterns specific to a certain user (hence the name “user-independent”).
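
This subject-wise split can be sketched in a few lines. The example below uses illustrative stand-in arrays (`X`, `y`, and a per-window `subject` ID vector are assumptions for the sketch, not the actual SBHAR preprocessing pipeline):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative stand-in data: 300 windows, 10 features, 30 volunteers.
X = rng.normal(size=(300, 10))
y = rng.integers(0, 12, size=300)          # 12 activity labels
subject = np.repeat(np.arange(1, 31), 10)  # volunteer ID of each window

# User-independent split: volunteers 1-21 for training, 22-30 for testing.
train_ids = list(range(1, 22))
train_mask = np.isin(subject, train_ids)

X_train, y_train = X[train_mask], y[train_mask]
X_test, y_test = X[~train_mask], y[~train_mask]

# No volunteer contributes windows to both sets.
assert set(subject[train_mask]).isdisjoint(subject[~train_mask])
```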

The performance of the architecture was evaluated by the accuracy, which is defined as:

$$\mathrm{accuracy} = \frac{\text{number of correctly classified test items}}{\text{total number of test items}} \qquad (1)$$

The result for MANN’s accuracy was 96.24%, which was comparable to the highest accuracy values obtained by other studies. The complete comparison between the MANN results and those of other works is shown in Table 2: MANN was third in terms of overall accuracy, outperforming, for instance, several convolutional neural network implementations (e.g., 95.18% for [10], 94.79% for [9], and 90.89% for [38]). The architectures which proved to have a superior accuracy to MANN were an SVM classifier [11], with a 96.37% accuracy, and a convolutional neural network [39], with a 97.63% accuracy. It should be stressed again that these results relate only to NTAs, which were the activities least likely to be improved by the MANN architecture; nevertheless, MANN obtained a solid result compared to the other studies.

In order to benchmark MANN on the SBHAR dataset including TAs, the confusion matrix was computed on the entire dataset (including both TAs and NTAs) employing the same 21 + 9 user-independent train-test split: the results are shown in Table 3. The overall accuracy was 95.48% (in the bottom-right of the matrix), which was lower than that measured when considering only NTAs, since TAs are usually tougher to classify. The matrix also showed two additional performance parameters specific to each activity: recall and precision. The former is defined as:

$${r}_{i} = \frac{{TP}_{i}}{{TP}_{i} + {FN}_{i}} \qquad (2)$$

where ${r}_{i}$ is the recall for the activity $i$, ${TP}_{i}$ is the number of “true positives” (i.e., the number of test items belonging to class $i$ and correctly identified), and ${FN}_{i}$ is the number of “false negatives” (i.e., the number of test items belonging to class $i$ and incorrectly classified), whereas the latter is defined as:

$${p}_{i} = \frac{{TP}_{i}}{{TP}_{i} + {FP}_{i}} \qquad (3)$$

where ${p}_{i}$ is the precision for the activity $i$ and ${FP}_{i}$ is the number of “false positives” (i.e., the number of test items not belonging to class $i$ and incorrectly classified to class $i$). It was clear that NTAs were usually better classified by MANN than TAs: the average recall for NTAs was 96.38%, while that for TAs was 77.74%. It is interesting to note that, although some TAs were extremely difficult to recognize (e.g., “lie-to-sit” had a recall of 64.00% and “lie-to-stand” one of 59.26%), others were successfully classified, for instance “sit-to-lie” and “stand-to-lie”, with recalls of 96.88% and 83.67%, respectively. Indeed, as will be shown in the next section, these two activities were those improved the most by the use of MANN’s memory buffer.
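
Both recall and precision can be read directly off a confusion matrix. The sketch below uses an illustrative 3-class matrix (not the actual values of Table 3) to show how ${TP}_{i}$, ${FN}_{i}$, and ${FP}_{i}$ are obtained:

```python
import numpy as np

# Rows = true class, columns = predicted class (illustrative 3-class example).
cm = np.array([[50,  2,  3],
               [ 4, 40,  6],
               [ 1,  5, 44]])

TP = np.diag(cm)              # items of class i correctly identified
FN = cm.sum(axis=1) - TP      # items of class i classified elsewhere
FP = cm.sum(axis=0) - TP      # items of other classes classified as i

recall = TP / (TP + FN)       # r_i = TP_i / (TP_i + FN_i)
precision = TP / (TP + FP)    # p_i = TP_i / (TP_i + FP_i)
accuracy = TP.sum() / cm.sum()
```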

#### 4.3. Comparison of MANN to Standard Methodologies in HAR

In the previous section, MANN was compared to architectures proposed in other studies. However, since TAs in the SBHAR dataset have not been treated elsewhere, it was not clear how much MANN improved the classification accuracy for these activities. To test this, MANN was here compared to other architectures commonly used in HAR applications [19,31], considering both TAs and NTAs. In particular, the following were considered: a Logistic Regression (LR) classifier with the L2 regularization strength set to 1, a Support Vector Machine (SVM) with the penalty parameter set to 1, a Random Forest (RF) with 300 trees, a K-Nearest Neighbors (KNN) classifier with $k=5$, and an Artificial Neural Network (ANN) with one hidden layer containing 82 neurons.
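
Assuming a scikit-learn implementation (the library choice, solvers, and any hyperparameters not listed above are assumptions, not details from the original study), the five baselines could be instantiated as:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

classifiers = {
    "LR": LogisticRegression(C=1.0, max_iter=1000),  # L2 strength set to 1
    "SVM": SVC(C=1.0),                               # penalty parameter set to 1
    "RF": RandomForestClassifier(n_estimators=300),  # 300 trees
    "KNN": KNeighborsClassifier(n_neighbors=5),      # k = 5
    "ANN": MLPClassifier(hidden_layer_sizes=(82,)),  # one hidden layer, 82 neurons
}
```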

To make the analysis statistically more robust, a personalized train-test split was employed in this section, since the standard 21 + 9 train-test split of the previous section only provided a single confusion matrix. In particular, a stratified 3-fold cross-validation with 10 random sub-samplings was used: data were divided into 3 folds, making sure that the labels (i.e., the activities to be recognized) were equally distributed across these 3 portions of the dataset; in turn, two of the three folds were used as the training set and the remaining one as the test set, on which a confusion matrix was computed. By doing so, the performance of the architecture was evaluated without bias, since no data from the training set were included in the test set. This procedure was repeated 10 times, randomly shuffling the data before partitioning, and thus, 30 independent confusion matrices, i.e., performance measurements, were obtained. Finally, these 30 matrices were averaged to produce the mean and standard deviation of each value of interest.
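
This resampling scheme corresponds to a repeated stratified k-fold. A minimal sketch with scikit-learn's `RepeatedStratifiedKFold` (the synthetic `X` and `y` are illustrative stand-ins, not the SBHAR features):

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 6))
y = np.tile(np.arange(4), 30)  # 4 perfectly balanced activity labels

# 3 folds x 10 random shuffles -> 30 independent train-test pairs.
rskf = RepeatedStratifiedKFold(n_splits=3, n_repeats=10, random_state=0)
splits = list(rskf.split(X, y))

# No test item ever appears in the corresponding training set.
for train_idx, test_idx in splits:
    assert set(train_idx).isdisjoint(test_idx)
```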

It is important to point out that the train-test split used here was not user-independent: when randomly shuffling the dataset, data from each user were included in both the training and the test sets. Consequently, the performance scores presented here should not be directly compared to those discussed in Section 4.2.

Since the analyses of this section required knowledge of the classification accuracy for each activity, the performance parameter taken into consideration was the recall ${r}_{i}$ (Equation (2)), which is the ratio of correctly classified test items for the activity $i$. For each of the 30 pairs of training and test sets, the recall of each activity was measured and then averaged to produce the mean and the standard deviation of the mean as the final evaluation:

$${\overline{r}}_{i} = \frac{1}{30}\sum_{\alpha =1}^{30}{r}_{i}^{\alpha}, \qquad {\overline{\sigma}}_{i} = \sqrt{\frac{1}{30 \cdot 29}\sum_{\alpha =1}^{30}{\left({r}_{i}^{\alpha} - {\overline{r}}_{i}\right)}^{2}}$$

where ${r}_{i}^{\alpha}$ is the recall for the activity $i$ measured in the train-test split with index $\alpha$ ($\alpha$ runs from 1 to 30, the number of independent train-test splits), ${\overline{r}}_{i}$ is the mean recall for the activity $i$, and ${\overline{\sigma}}_{i}$ is the standard deviation of the mean for the activity $i$, i.e., the statistical error of ${\overline{r}}_{i}$. As an overall performance score, the mean recall over all activities was then computed:

$${\overline{r}}_{m} = \frac{1}{12}\sum_{i=1}^{12}{\overline{r}}_{i}$$
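
Given a matrix of the 30 per-split recall measurements, these quantities reduce to a mean and a standard error along the split axis. A minimal numpy sketch (the synthetic recall values stand in for the measured ones):

```python
import numpy as np

rng = np.random.default_rng(0)
n_splits, n_activities = 30, 12

# recalls[a, i] = recall of activity i in train-test split a (synthetic here).
recalls = rng.uniform(0.6, 1.0, size=(n_splits, n_activities))

r_bar = recalls.mean(axis=0)                                  # mean recall per activity
sigma_bar = recalls.std(axis=0, ddof=1) / np.sqrt(n_splits)   # std of the mean
r_m = r_bar.mean()                                            # overall mean recall
```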

Figure 4 shows the comparison between the performances of all classifiers, plotting the mean recall ${\overline{r}}_{i}$ of each activity and the average mean recall ${\overline{r}}_{m}$ (where the errors ${\overline{\sigma}}_{i}$ and ${\overline{\sigma}}_{m}$ are indicated as error bars).

MANN soundly outperformed the other classifiers: its average recall was 90.18%, considerably higher than both ANN (87.93%) and LR (87.85%), which were the most effective of the other classifiers. Most of MANN's improvement could be traced back to TAs, and in particular to the activities “sit-to-lie” and “stand-to-lie”, where MANN reached recalls of 89.78% and 90.57%, respectively, whereas ANN only obtained 77.10% and 76.83%, and LR 74.28% and 77.91%. These two TAs were difficult to classify because they share a similar pattern, since in both cases the user lay down, either from a sitting position or from a standing one. MANN improved the classification performance for these TAs thanks to the memory buffer: by accessing information from previous times, it was able to discriminate between sitting and standing and, consequently, to correctly identify from which position the user was lying down. On the other hand, no particular improvement was observed for NTAs, although it is interesting to note a slight improvement in classifying the “sitting” and “standing” activities, which have proven hard for machine learning algorithms to distinguish [22]: MANN obtained a recall of 96.34% for the former and 97.09% for the latter, while ANN obtained 95.93% and 96.62%, and LR 94.90% and 95.78%.

Subsequently, it was investigated how well MANN handled noise compared to the other architectures. Indeed, a classification algorithm in an HAR application should be resilient to noise contamination, since in real-life applications data are not as clean as in laboratory conditions (e.g., data from the accelerometer and the gyroscope are affected by errors due to an uneven surface on which the user is walking). To perform this analysis in a controlled way, data were contaminated with noise after the feature extraction procedure explained in Section 3.2. In particular, after data were processed and separated into training and test sets, the test data were contaminated with a noise source, simulated as a Gaussian distribution with zero mean and the standard deviation set as a parameter $\alpha$ (the bigger $\alpha$, the stronger the noise source). Since data were previously standardized (i.e., values for each feature were linearly transformed so that items had mean 0 and standard deviation 1), this approach weighted the noise equally for every extracted feature. The parameter used to quantify noise resilience was the loss $L$, defined as:

$$L\left(\alpha\right) = {\overline{r}}_{m}\left(\alpha =0\right) - {\overline{r}}_{m}\left(\alpha\right)$$

where ${\overline{r}}_{m}(\alpha =0)$ is the average of the 12 recalls for each activity (measured, as described previously, using the 30 independent train-test pairs) with no noise source (i.e., $\alpha =0$) and ${\overline{r}}_{m}\left(\alpha \right)$ is the average recall when the noise source has strength $\alpha$. Thus, the loss quantifies how much the noise reduced the overall recall; consequently, if a classifier had strong noise resilience, its loss should be small (ideally zero) even for high values of $\alpha$. The results are shown in Figure 5, where $\alpha$ ranges from 0.1 to 1.5.

Overall, MANN had good noise resilience: for small $\alpha $ values, the loss was negligible (0.1% for $\alpha =0.1$), and it increased up to 10% for $\alpha =1.5$; this meant that the performance degradation due to random noise in the test set was always below 10%. It is particularly interesting that MANN increased the noise robustness with respect to the standard ANN, whose loss was almost double for a strong noise source (20% for $\alpha =1.5$). Thus, the introduction of the memory buffer into the neural network not only improved the classification accuracy, but also made it more robust to random noise sources. The classifier which showed the best behavior under noise contamination was KNN ($L=-1\%$ for $\alpha =0.1$ and $L=4\%$ for $\alpha =1.5$), which was understandable given that its classification approach relies on the neighborhood of each item. On the other hand, both RF and SVM showed poor noise robustness, the former being the worst for weak noise sources and the latter for strong ones: for $\alpha =0.4$, $L=8\%$ for RF and $L=0.1\%$ for SVM, while, for $\alpha =1.5$, $L=40\%$ for RF and $L=80\%$ for SVM, which meant that these classifiers lost almost all predictive capability when a strong noise source contaminated the data.
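
The noise test can be sketched as follows, assuming the loss is measured as the absolute drop in mean recall; only the classifier-independent part (noise injection on standardized features) is shown, with illustrative stand-in data:

```python
import numpy as np

def add_noise(X_test, alpha, rng):
    """Contaminate standardized test features with Gaussian noise N(0, alpha)."""
    return X_test + rng.normal(0.0, alpha, size=X_test.shape)

def loss(r_m_clean, r_m_noisy):
    """L(alpha) = r_m(0) - r_m(alpha): drop in mean recall due to noise."""
    return r_m_clean - r_m_noisy

rng = np.random.default_rng(0)
X_test = rng.normal(size=(90, 10))  # standardized features: mean 0, std 1

for alpha in (0.1, 0.4, 1.5):
    X_noisy = add_noise(X_test, alpha, rng)
    # Contamination preserves the shape; only the feature values change.
    assert X_noisy.shape == X_test.shape
```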

Lastly, the learning times of the classifiers were compared, to test whether the memory buffer increased them dramatically for MANN. For each training set (composed of approximately 3000 items), the time required by each classifier to learn its weights was measured, and these times were then averaged over all sets. The results are shown in Figure 6.

MANN was comparatively quick at learning, requiring about 7 seconds per training set. ANN and LR were the quickest, requiring only 4 seconds; this was understandable, since the memory buffer in MANN increases the number of input and hidden neurons, leaving more weights to optimize. However, the increase in learning time observed for MANN was modest, making MANN feasible for real-time classification problems. Finally, the other classifiers had higher learning times (11 seconds for RF, 18 seconds for SVM, and 20 seconds for KNN).