Improving Performance and Quantifying Uncertainty of Body-Rocking Detection Using Bayesian Neural Networks

Abstract: Body-rocking is an undesired stereotypical motor movement performed by some individuals, and its detection is essential for self-awareness and habit change. We envision a pipeline that includes inertial wearable sensors and a real-time detection system for notifying the user so that they are aware of their body-rocking behavior. For this task, similarities of body rocking to other non-related repetitive activities may cause false detections which prevent continuous engagement, leading to alarm fatigue. We present a pipeline using Bayesian Neural Networks with uncertainty quantification for jointly reducing false positives and providing accurate detection. We show that increasing model capacity does not consistently yield higher performance by itself, while pairing it with the Bayesian approach does yield significant improvements. Disparities in uncertainty quantification are better quantified by calibrating them using deep neural networks. We show that the calibrated probabilities are effective quality indicators of reliable predictions. Altogether, we show that our approach provides additional insights on the role of Bayesian techniques in deep learning as well as aids in accurate body-rocking detection, improving our prior work on this subject.


Introduction
Body rocking is one type of Stereotypical Motor Movement (SMM) observed in normal children (in medical literature referred to as "primary" cases) and in children presenting symptoms of distinct mental disorders ("secondary" cases). Such movements are normally involuntary and recurrent, sometimes nonrhythmic and purposeless. For secondary cases, SMM has partial overlap with developing disorders such as autism spectrum disorder (ASD), obsessive compulsive disorder (OCD), obsessive compulsive behavior (OCB), as well as self-destructive behavior (SDB), Tourette syndrome (TS), and attention deficit hyperactivity disorder (ADHD) [1,2]. Such repetitive behavior could be, but is not limited to, hand flapping, body rocking, or a combination of the two. Several possible triggers have been investigated for this behavior and some indications showed a link between such patterns and excitement and anxiety [3]. Body rocking has been shown to be common in blind infants [4] and may make maintaining social relationships more difficult [5]. A survey of college undergraduates has shown that self-reported body rocking is connected to General Anxiety Disorder [6]. For primary cases of body rocking, the authors in [7] showed that habit reversal and differential reinforcement are beneficial for self-awareness and stopping of the movement, which is also summarized in Table 1 of [8].
Since medical research has shown potential for improvements through re-education, multiple efforts have been made to identify and reliably characterize the occurrence of body rocking for both of the aforementioned cases, envisioning early diagnosis and self-awareness for behavior reversal.
Wearable sensors such as inertial measurement units (IMU), equipped with 3-axis accelerometers and gyroscopes, are appropriate for this purpose since they are non-invasive, [...]
1. With enough model capacity, our Bayesian framework provided better performance and was less sensitive to overfitting;
2. Higher capacity alone did not consistently result in higher performance for a given model when compared to the Bayesian framework;
3. Although transfer learning did not significantly impact the performance, it prevented the degradation of the calibrated probability as model complexity increased;
4. The calibrated probability obtained from our Bayesian framework is an interpretable quantity that accurately represents the likelihood of correctness of a prediction for the specific dataset;
5. Using the calibrated probability as a criterion for selecting reliable detections, we observe a clear improvement in precision with a relatively low trade-off in other metrics (e.g., F1-score).
With that, we argue that reliable detection would be possible for real-time notification systems. The paper is organized as follows. Section 2 provides a literature overview of the detection of rocking motion. Section 3.1 presents the data collection system. The BNN framework, models used, pre-processing and evaluation strategies are included in Sections 3.2, 3.4 and 3.5, respectively. Finally, the results and discussion are presented in Sections 4 and 5, respectively.

Related Work
There exist three main methodologies in the literature for the detection of rocking motion: (1) using handcrafted features, (2) learning features from the data, or (3) a mix of the two.
The first methodology, handcrafted features, is the most popular. The study in [19] places wearable accelerometers on the chest as well as in front of a t-shirt collar for individuals with ASD presenting body-rocking behavior. A zero crossing method is applied on the time domain signals based on typical interval ranges and signal amplitudes between body-rocking motions, where the best performance (84%) was obtained from the chest sensor. The study reported in [20] made use of recurrence plots, which identify similarities by means of the Euclidean distance between accelerometer signals representing similar trajectories in a 3D phase space. The authors showed that this methodology is orientation invariant and claim resilience to disturbances caused by differences in amplitudes. Recurrence quantification analysis (RQA) was employed as a feature extraction method where characteristics such as amount of recurrences, determinism, and entropy were paired with a random forest (RF) algorithm to obtain classification accuracy of up to 86% on average. They also conducted an analysis to find the best body location for the sensors. As a result, they found that the sensor placed on the torso contributed the most, based on the output of the RF algorithm. Additionally, using the signals from the torso provided the highest classification accuracy. This result corroborates the findings of [21]. Another approach relying on handcrafted features is the seminal work by [22] where the Stockwell transform, variance, mean difference between axes of entropy, and correlation coefficients were used as features to be classified for body-rocking detection with a support vector machine (SVM). Earlier approaches that are similar to the ones mentioned and based on handcrafted features can also be found in [21,23-25].
The second methodology in [26] is motivated by learning features from raw data in an end-to-end fashion. This work uses a similar deep learning approach to [27] by applying a convolutional neural network (CNN) to raw signals for feature extraction. This approach is applied to the dataset provided by [22] and to their own dataset, with simulated SMM and non-SMM activities. The extracted features were classified by long short-term memory (LSTM) recurrent neural networks (RNN), in a combination of knowledge transfer and ensemble learning to find the best performing model. In an earlier study [28], the same author developed a simpler method where instead of an LSTM for classification, an SVM was used and ensemble learning was not employed. Another very similar study by the same author tried the same configuration but with an LSTM [29], motivated by the argument that handcrafted features do not capture signal dependencies well. However, the author in [27] showed that handcrafted features still have value for SMM detection.
The third methodology is present in the work by [27]. It is a framework based on transfer learning with the support of CNNs. The CNNs were trained on time domain and frequency domain representations of the collected signals, where the frequency representations were extracted using the Stockwell transform as initially introduced to represent body-rocking signals in [22]. The transfer learning approach is justified by [27,30] due to the fact that data domains that share similar characteristics can enhance the ability of algorithms to learn and perform predictions on unknown data. Therefore, the CNNs are used to learn the time-frequency related information on SMM (body rocking, hand flapping, or simultaneous body rocking and hand flapping) extracted from the dataset created and used by [22] and non-SMM related activities (such as walking, sitting, sitting down, standing and standing up) extracted from the PUC dataset [31]. The parameters learned with such training are later used as the transferred knowledge to an SVM classifier. They show this approach outperforming all contemporaneous state-of-the-art methods [20,22,26] with accuracy and F1-score values capping on average at 98.29% and 93.66%, respectively.
Uncertainty quantification can be obtained by means of sampling or direct estimation, recovering measures of uncertainty such as entropy, variance, mutual information, etc. Methods based on such approaches have been used for improving classification in activity recognition tasks [32-34], as well as eliminating predicted samples with high uncertainty [35]. In the context of multi-instance learning, uncertainty quantification has been used to improve instance level classifiers [36], and to aid active learning schemes by providing different levels of confidence about predicted samples for weak labelers (for the models under training) or strong ones (for the samples that had available labels) [37]. Model ensembling is one way of enabling uncertainty quantification, in which the ensembled models could even be neural networks [38]. Modeling from a Bayesian approach allows one to estimate the distribution of the model's predicted samples, or posterior, based on an assumed prior distribution of the data. Although the prior assumptions may be misleading [39,40], there are works modeling an evolving prior for better estimation of the posterior [41].
As observed thus far, none of the main methods for body-rocking detection focused on reducing false positives, a factor that impacts a user's ability to engage with a wearable system. To the best of our knowledge, using uncertainty quantification to eliminate bad predictions has not been proposed for body-rocking detection. Therefore, the BNN framework combines both the ability to screen bad detections and the ability to take advantage of the discriminative power of deep learning models (as discussed above).

Datasets
In this work, the public SMM dataset Electrodermal Activity Automated Quality Assessment (EDAQA) (preprocessed version can be found here: https://github.com/lsadouk/code_SMMs, accessed on 6 July 2022) [25] is used. This dataset was made available initially by [22], and the preprocessing and models by [27] are used as a benchmark. This dataset is split into two trials, Study 1 and Study 2. The sampling frequency of the IMU sensors differs between the trials: Study 1 is sampled at 60 Hz and Study 2 at 90 Hz. Additionally, the subjects are the same and the two trials are spaced out by two years. This public dataset will be referred to as the EDAQA dataset.
The second dataset used in this study was collected by our group jointly with the Education Services for the Deaf and Blind (ESDB) of the North Carolina Department of Public Instruction under IRB 14046 [9], hence this dataset will be referred to as the ESDB dataset (https://zenodo.org/record/5559169#.YqvpxtLMIUE, accessed on 6 July 2022). For the ESDB data, the 14 sessions were grouped into pairs to ease a granular analysis, hence ending with seven sessions. The characteristics of each dataset can be found in Table 1. In this work, the data coming from the wrist was not used since the performance obtained with it was not promising, as observed in our previous work [9]. Please note that since this dataset has only one subject, it can be used to analyze the performance of a personalized model rather than a population-level model. The ESDB dataset was collected using a Raspberry Pi Model 3, equipped with a touchscreen display to aid labeling the collected data in real time. Mbientlab MetaMotionR IMUs (https://mbientlab.com/tutorials/MetaMotionR.html, accessed on 6 July 2022) were used. The IMU device contains an accelerometer, gyroscope, LED and a piezoelectric vibration generator. The software application has a UI for real-time data labelling implemented in Python 3.6. The operating system is Raspbian Stretch. The IMU streams data to the embedded system over Bluetooth. The IMU sampling rate is 100 Hz. If the system were to be used for notifications, once a detection occurs, the vibration generator in the MetaMotionR device could be activated. A picture of the data collection procedure is shown in Figure 1.

Bayesian Neural Networks
Deep learning (DL) has shown incredible performance in different applications, but it is still hard to analytically understand the internal nature of such models, which paradoxically may prevent further advancement of this technology. Model uncertainty has been used as a way to evaluate such models, offering a probabilistic interpretation of the intrinsic factors driving a model's performance. In particular, techniques such as dropout have been used to capture variability in deep learning models in a similar way to ensemble learning by randomly removing some network connections during training time [42].
For a while, it has been known that an infinitely wide neural network (NN) with a distribution established over its weights converges to a Gaussian process [12,42,43], while finite approximations to weight distributions have been attempted under the framework of Bayesian Neural Networks [12,13]. The authors in [14] showed that a neural network with arbitrary depth and non-linearities with dropout before its weights, as normally used in NNs, is a Bayesian approximation of a Gaussian process marginalized over its covariance parameters. This allows the characterization of the uncertainty due to parameters intrinsic to the model and due to its input data. Next, a brief introduction of the method used by [14] is presented, which shows how model uncertainty can be characterized while enabling model interpretability.
Let us consider the estimated output $\hat{y}_i$ of a NN and the ground truth $y_i$ for an input $x_i$ with $i = 1, \dots, N$, where each data point $(x_i, y_i)$ comes from the dataset $(X, Y)$, i.e., the sets of inputs and outputs, respectively. For our discussion, we will consider a NN with $L$ layers of the form:
$$\hat{y}_i = W_L\, a\big(\cdots a(W_1 x_i + b_1) \cdots\big) + b_L, \tag{1}$$
where $a(\cdot)$ is some activation function, $W_l \in \mathbb{R}^{K_l \times K_{l-1}}$ are the NN weights and $b_l \in \mathbb{R}^{K_l}$ is the vector of biases for each layer $l = 1, \dots, L$. A standard cost function often used for training these networks (even when considering dropout) has the form:
$$\mathcal{L} = \frac{1}{N} \sum_{i=1}^{N} E(y_i, \hat{y}_i) + \sum_{l=1}^{L} \left( \lambda^{W}_{l} \|W_l\|_2^2 + \lambda^{b}_{l} \|b_l\|_2^2 \right), \tag{2}$$
where $E(\cdot, \cdot)$ is a loss function, and the $\lambda$'s are weight factors for $L_2$ regularization. A deep Gaussian process (GP) is a model resulting from the hierarchical composition of GPs. A Gaussian process models a finite collection of its variables using a multivariate normal distribution with a defined covariance matrix function. Hence for a GP, the covariance matrix can be approximated using a variational distribution over each component of its spectral decomposition [14]. It is known that each hidden layer in a NN can be represented by one of the layers of a deep GP [14].
In this context, the predictive distribution of the deep GP model can be represented as:
$$P(y \mid x, X, Y) = \int P(y \mid x, w)\, P(w \mid X, Y)\, dw, \qquad P(y \mid x, w) = \mathcal{N}\big(y;\, \hat{y}(x, w),\, \tau^{-1} I\big), \tag{3}$$
for some precision hyper-parameter $\tau > 0$, where $\mathcal{N}(y; \mu, \Sigma)$ represents the Gaussian distribution with mean $\mu$ and covariance $\Sigma$, and $(X, Y)$ is the training set. Since the posterior $P(w \mid X, Y)$ is intractable, the authors in [14] show how to approximate it using variational inference. Such approximation is made by Monte Carlo Dropout sampling [14] and by minimizing the KL divergence between an approximating distribution $q(w)$ and $P(w \mid X, Y)$, as will be shown next.
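Following [14], the weights of each layer are defined as:
$$W_l = M_l \cdot \mathrm{diag}\big([z_{l,j}]_{j=1}^{K_{l-1}}\big), \tag{4}$$
with $z_{l,j} \sim \mathrm{Bernoulli}(p_l)$ for $l = 1, \dots, L$ and $j = 1, \dots, K_{l-1}$, given some matrices $M_l$ and probabilities $p_l$ as variational parameters. Note that by using this argument we are in practice sampling the elements of $M_l$. Let $q(w)$ be the distribution over the matrices $M_l$. We use the KL divergence between $q(w)$ and the posterior $P(w \mid X, Y)$ as our objective for minimization, which (after some mathematical manipulation) can be expressed as:
$$\mathcal{L}_{\mathrm{MC}} \propto \frac{1}{N} \sum_{i=1}^{N} \frac{-\log P(y_i \mid x_i, w_i)}{\tau} + \sum_{l=1}^{L} \left( \lambda^{W}_{l} \|M_l\|_2^2 + \lambda^{b}_{l} \|b_l\|_2^2 \right), \tag{5}$$
where $w_i$ are sampled from the distribution specified by $q(w)$ by obtaining realizations of the Bernoulli variables $z_{l,j}$, as is done during the dropout process. By setting $E(y_i, \hat{y}_i) = -\log P(y_i \mid x_i, w_i)/\tau$, we obtain an expression with similar form to Equation (2). For the case of regression, and given enough training data (so that the terms due to regularization of the weights and biases are negligible), we can approximate Equation (5) by [14]:
$$\mathcal{L} \approx \frac{1}{N} \sum_{i=1}^{N} \left( \frac{1}{2\hat{\sigma}_i^2} \|y_i - \hat{y}_i\|^2 + \frac{1}{2} \log \hat{\sigma}_i^2 \right), \tag{6}$$
where $\hat{\sigma}_i$ is a variable that captures the observation noise for sample $(x_i, y_i)$, which is treated as another output of the NN.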
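The dropout-based sampling of the weights described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function name `sample_dropout_weights`, the toy matrix `M`, and the choice of interpreting the Bernoulli parameter as a keep probability are ours and are not taken from the original implementation.

```python
import random

def sample_dropout_weights(M, keep_prob, rng):
    """Return one realization W = M * diag(z) with z_j ~ Bernoulli(keep_prob).

    M is a list of rows (K_l x K_{l-1}); one Bernoulli variable is drawn
    per input unit, i.e., per column of M (assumption: keep_prob is the
    probability that a column is kept).
    """
    n_rows, n_cols = len(M), len(M[0])
    z = [1.0 if rng.random() < keep_prob else 0.0 for _ in range(n_cols)]
    # Multiplying by diag(z) zeroes out whole columns of M.
    W = [[M[r][j] * z[j] for j in range(n_cols)] for r in range(n_rows)]
    return W, z

rng = random.Random(0)
M = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]  # toy variational parameters
W, z = sample_dropout_weights(M, keep_prob=0.95, rng=rng)
```

Each call produces a different realization of the weights, which is what makes repeated forward passes through the same trained network stochastic.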
Next, we discuss how to approximate the predictive distribution given a new sample point $x^*$ once the model has been trained. The distribution for the predicted value $y^*$ is [46]:
$$q(y^* \mid x^*) = \int P(y^* \mid x^*, w)\, q(w)\, dw. \tag{7}$$
This distribution is approximated using a moment-matching technique by finding an estimate for the first two moments with the help of Monte Carlo integration. The first moment approximation is obtained by the following:
$$\mathbb{E}_{q(y^* \mid x^*)}[y^*] \approx \frac{1}{T} \sum_{t=1}^{T} \hat{y}^*(x^*, w_t), \tag{8}$$
where $\{w_t\}_{t=1}^{T}$ are obtained by drawing $T$ samples from the distribution specified by $q(w)$. This expression is simply the average of $T$ stochastic forward passes through the NN.
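The second moment approximation is obtained by:
$$\mathbb{E}_{q(y^* \mid x^*)}\big[(y^*)^{\top} y^*\big] \approx \tau^{-1} I + \frac{1}{T} \sum_{t=1}^{T} \hat{y}^*(x^*, w_t)^{\top}\, \hat{y}^*(x^*, w_t). \tag{9}$$
Hence the model's predictive variance is obtained by:
$$\mathrm{Var}_{q(y^* \mid x^*)}[y^*] \approx \tau^{-1} I + \frac{1}{T} \sum_{t=1}^{T} \hat{y}^*(x^*, w_t)^{\top}\, \hat{y}^*(x^*, w_t) - \mathbb{E}_{q(y^* \mid x^*)}[y^*]^{\top}\, \mathbb{E}_{q(y^* \mid x^*)}[y^*], \tag{10}$$
which is the same as the sample variance of $T$ forward passes through the NN plus the inverse of the model's precision.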
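As a minimal sketch of the moment-matching estimates described above, the following plain-Python function computes the predictive mean and variance from the outputs of T stochastic forward passes. For simplicity it assumes a scalar output; the function name `mc_moments` and the toy numbers are hypothetical.

```python
def mc_moments(samples, tau):
    """Predictive mean and variance from T stochastic forward passes.

    samples: list of T scalar network outputs obtained with dropout active.
    tau: model precision (its inverse is added to the sample variance).
    """
    T = len(samples)
    mean = sum(samples) / T                      # first moment
    second = 1.0 / tau + sum(s * s for s in samples) / T  # second moment
    variance = second - mean * mean              # predictive variance
    return mean, variance

# Toy outputs of three stochastic passes (illustrative values only).
mean, variance = mc_moments([1.0, 2.0, 3.0], tau=4.0)
```

In this toy case the variance is the sample variance of the three passes (2/3) plus the inverse precision (1/4).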
There are two types of uncertainties considered for quantification [45]: (1) aleatoric uncertainty, which is the uncertainty associated with the data, and (2) epistemic uncertainty, which is the uncertainty associated with the model [47] and can usually be explained away given enough data [48]. For aleatoric uncertainty, there are two subtypes: heteroscedastic and homoscedastic. The former depends on the input data, while the latter assumes identical noise for all input samples.
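Predictive entropy [49] measures the amount of uncertainty associated with a measurement. With the Monte Carlo dropout sampling, it is approximated as
$$\mathbb{H}[y^* \mid x^*, X, Y] \approx -\sum_{c} \bar{p}_c \log \bar{p}_c, \qquad \bar{p}_c = \frac{1}{T} \sum_{t=1}^{T} P(y^* = c \mid x^*, w_t). \tag{11}$$
Mutual information (MI) between the posterior over the weights and the prediction $y^*$ quantifies the uncertainty in the BNN's output [50]. This measure is larger when the stochastic predictions are less stable, and it is calculated via:
$$\mathbb{I}[y^*, w \mid x^*, X, Y] \approx \mathbb{H}[y^* \mid x^*, X, Y] + \frac{1}{T} \sum_{c} \sum_{t=1}^{T} P(y^* = c \mid x^*, w_t) \log P(y^* = c \mid x^*, w_t). \tag{12}$$
Predictive entropy represents the combined effect of epistemic and aleatoric uncertainties. On the other hand, mutual information is a representation of the epistemic model uncertainty [47].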
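The two uncertainty measures discussed above can be estimated from the class probabilities of T stochastic forward passes. The sketch below uses the standard Monte Carlo estimators (entropy of the mean prediction, and that entropy minus the mean per-pass entropy); the function name `entropy_and_mi`, the small constant `eps`, and the toy inputs are illustrative assumptions.

```python
import math

def entropy_and_mi(probs, eps=1e-12):
    """Predictive entropy and mutual information from T stochastic passes.

    probs: list of T lists, each holding the class probabilities of one pass.
    eps: small constant to avoid log(0) (numerical assumption).
    """
    T, C = len(probs), len(probs[0])
    mean_p = [sum(p[c] for p in probs) / T for c in range(C)]
    # Entropy of the averaged prediction (epistemic + aleatoric).
    entropy = -sum(m * math.log(m + eps) for m in mean_p)
    # Average entropy of the individual passes.
    expected_entropy = -sum(
        p[c] * math.log(p[c] + eps) for p in probs for c in range(C)
    ) / T
    # Mutual information: large when the passes disagree with each other.
    return entropy, entropy - expected_entropy

stable = entropy_and_mi([[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]])
unstable = entropy_and_mi([[1.0, 0.0], [0.0, 1.0]])
```

Both toy cases have the same predictive entropy, but only the second one, where the passes contradict each other, has a large mutual information.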
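The regression aleatoric uncertainty is now extended to a classification task by modeling the regression uncertainty of the logit vector, i.e., the output of the last layer before the Softmax activation function. A Gaussian distribution is placed over the logit vector as $\hat{z} \sim \mathcal{N}(\hat{y}, \sigma^2)$, where $[\hat{y}, \sigma^2] = f^{W}(x)$ with $f^{W}$ as the NN. The expected log likelihood for each training sample is described as [15]:
$$\log \mathbb{E}_{\mathcal{N}(\hat{z};\, \hat{y},\, \sigma^2)}\big[ P(y = c \mid x, \hat{z}) \big], \qquad P(y = c \mid x, \hat{z}) = \mathrm{Softmax}(\hat{z})_c, \tag{13}$$
where $y$ is the ground truth label of $x$ and $c$ is the index for the ground truth label. Since Equation (13) is analytically intractable, it is approximated by Monte Carlo integration. Denote $\hat{z}_t = \hat{y} + \sigma \cdot \epsilon_t$, where $\epsilon_t$ follows a standard Gaussian distribution. The loss function becomes
$$\mathcal{L}(x) = -\log \frac{1}{T} \sum_{t=1}^{T} \exp\Big( \hat{z}_{t,c} - \log \sum_{c'} \exp \hat{z}_{t,c'} \Big), \tag{14}$$
where $T$ is the number of Monte Carlo sampling iterations and $c'$ is the class index of the logit vector $\hat{z}_t$ [15]. The loss function in Equation (14) is the one that is going to be used later on in the experiments. The uncertainty quantification relies on the estimation of the approximated predictive probability, which for the classification case is shown by [48] as the output of the softmax vector:
$$P(y^* = c \mid x^*, X, Y) \approx \frac{1}{T} \sum_{t=1}^{T} \mathrm{Softmax}\big( \hat{y}^*(x^*, w_t) \big)_c. \tag{15}$$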
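To make the Monte Carlo approximation of the classification loss more concrete, here is a hedged plain-Python sketch for a single training sample. It assumes one noise standard deviation per logit and draws the noise with the standard library; the function names and the toy logits are hypothetical and do not come from the released implementation.

```python
import math
import random

def log_softmax(z):
    """Numerically stable log-softmax of a list of logits."""
    m = max(z)
    lse = m + math.log(sum(math.exp(v - m) for v in z))
    return [v - lse for v in z]

def aleatoric_classification_loss(logits, sigma, c, T, rng):
    """Monte Carlo loss for one sample with Gaussian noise on the logits.

    logits: predicted logit vector; sigma: per-logit noise standard deviation;
    c: index of the ground truth class; T: number of Monte Carlo draws.
    """
    total = 0.0
    for _ in range(T):
        # Corrupt the logits with Gaussian noise scaled by sigma.
        z_t = [y + s * rng.gauss(0.0, 1.0) for y, s in zip(logits, sigma)]
        total += math.exp(log_softmax(z_t)[c])
    # Negative log of the averaged probability of the true class.
    return -math.log(total / T)

rng = random.Random(0)
noiseless = aleatoric_classification_loss([2.0, 0.0], [0.0, 0.0], c=0, T=5, rng=rng)
noisy = aleatoric_classification_loss([2.0, 0.0], [1.0, 1.0], c=0, T=50, rng=rng)
```

With zero noise the expression reduces to the usual cross-entropy of the true class, which is a quick sanity check for any implementation of this loss.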

Probability Calibration
Assume a multi-class classifier with a prediction and corresponding predicted probability $[\hat{y}, \hat{\rho}] = H(x)$ for an input $x$, with $\hat{\rho}$ representing the probability that the class label is correct. In this case, we would expect $\hat{\rho}$ to match the empirical probability of this event. That is, a sufficient condition for calibration can be defined as
$$P(\hat{y} = y \mid \hat{\rho} = \rho) = \rho, \quad \forall \rho \in [0, 1]. \tag{16}$$
It is known that the predicted probability is not calibrated for neural networks, especially in the case of BNNs [45]. In this work, we employ the approach of Zhong et al. [15,51] of using the three uncertainties obtained with the BNN framework (i.e., the variance estimate of the prediction, the entropy and the mutual information) to find a map, say $\Psi$, from the uncertainties domain to a calibrated probability domain. Hence, a calibration function $\Psi : \mathbb{R}^3 \to [0, 1]$ is desired such that $\hat{\rho} = \Psi \circ U$ produces calibrated probabilities, where $U$ represents the three uncertainty measures from the BNN. In our framework, we used a neural network to approximate $\Psi$ with an architecture composed of three fully connected layers (FCN) with 32 (activation tanh) and 64 (activation tanh) neurons in the hidden layers and one neuron (activation sigmoid) for the output layer. Table 2 gives a summary of the equations so far.
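The shape of the calibration network can be illustrated with a small forward pass in plain Python. The layer sizes and activations follow the description above, whereas the random, untrained weights, the helper names, and the example uncertainty values are assumptions made purely for illustration.

```python
import math
import random

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def dense(x, W, b, act):
    """Fully connected layer: act(W x + b)."""
    return [act(sum(w * v for w, v in zip(row, x)) + bi) for row, bi in zip(W, b)]

def make_psi(rng):
    """Build an untrained 3 -> 32 (tanh) -> 64 (tanh) -> 1 (sigmoid) network."""
    layers = []
    for n_in, n_out, act in [(3, 32, math.tanh), (32, 64, math.tanh), (64, 1, sigmoid)]:
        W = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
        layers.append((W, [0.0] * n_out, act))

    def psi(u):
        h = list(u)
        for W, b, act in layers:
            h = dense(h, W, b, act)
        return h[0]  # single sigmoid output in (0, 1)

    return psi

psi = make_psi(random.Random(0))
# u = (variance, entropy, mutual information); values are made up.
rho = psi([0.02, 0.45, 0.05])
```

In practice the weights would be fitted so that the output matches the empirical correctness of the predictions; the sketch only shows the mapping from three uncertainty values to one probability.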

Table 2. Summary of the equations (columns: Equation, Title).

The Models
The Bayesian DL uncertainty quantification approach is applied to an end-to-end model, which together with Sadouk et al. [27] represents the state of the art for body-rocking detection in our assessment. We consider Rad et al. [26] as our baseline, who used a fully end-to-end deep learning approach for the same goal. For this implementation, we re-used most of the CNN pipeline developed by them (https://gitlab.fbk.eu/MPBA/smmdetection, accessed on 6 July 2022). This model is deeper than the one from [27], with three CNN layers of 4, 4, and 8 kernels, respectively. Each of these layers has a filter size of 10 and a stride of 1. The output of the CNN is flattened and passed through a batch normalization layer followed by an FCN with eight nodes; dropout with p = 0.2 is then applied to the logits to generate a one-hot encoded output for the body-rocking and not-body-rocking classes. For training this network, a cross-entropy loss function is used. A diagram of Rad's architecture is shown in Figure 2a. This model will be referred to as Rad's model or Rad's approach and it represents the state of the art for end-to-end detection of body rocking in our assessment. For a fair use of the Bayesian DL framework, we evaluate an additional model that is wider than the one aforementioned. The model will be referred to as WiderNet, which is basically an upscaled version of Rad's approach in terms of the number of filters per layer.
For instance, the model WiderNet 2× would be a three-layer CNN model, exactly as in Rad's approach, but with 8, 8, and 16 filters per layer of size 10 each. Two variants of the models are created, differing only in the number of nodes in their FCN. Therefore, the model "WiderNet 2×, FCN 128" has 128 nodes in its FCN, as one can see in Figure 2b; thus, the abbreviation "FCN" will be followed by the quantity of nodes. The use of a WiderNet is justified since Rad's model is a narrow model (a few filters per layer) and may not display the advantages of the Bayesian approach, because its limited capacity does not fit the premise of the Bayesian formulation, namely, that the deeper/wider the model, the closer it is to a Gaussian process. To show that wider models gain more benefits when using a Bayesian approach, the WiderNet will be upscaled and evaluated at 2×, 4×, 8×, and 16× for comparison. Each upscaled version is going to be evaluated with Rad's original FCN (i.e., 8 nodes) and with 128 nodes. The implementation is publicly available (https://github.com/rafa-coding-projects/Body-Rocking, accessed on 6 July 2022).
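The scaling rule behind the WiderNet variants can be written down as a small configuration helper. The base filter counts, kernel size, stride, and FCN sizes come from the description above; the function name `widernet_config` and the dictionary layout are hypothetical and are not taken from the released code.

```python
def widernet_config(scale, fcn_nodes=8, base_filters=(4, 4, 8)):
    """Describe a WiderNet variant as a plain dictionary.

    scale: multiplier applied to the number of filters of each CNN layer
           (scale=1 corresponds to Rad's original filter counts).
    fcn_nodes: number of nodes in the fully connected layer (8 or 128).
    """
    return {
        "filters": tuple(scale * f for f in base_filters),
        "kernel_size": 10,
        "stride": 1,
        "fcn_nodes": fcn_nodes,
    }

# All variants evaluated in the text: 2x, 4x, 8x, 16x with FCN 8 and FCN 128.
variants = [widernet_config(s, n) for s in (2, 4, 8, 16) for n in (8, 128)]
```

For example, `widernet_config(2, 128)` describes the "WiderNet 2×, FCN 128" model with 8, 8, and 16 filters.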

Dataset Pre-Processing and Evaluation Strategy
As mentioned earlier, we used the public EDAQA dataset and our ESDB dataset for analysis. The detailed characteristics of each dataset can be found in the preprint version of this manuscript (https://github.com/rafa-coding-projects/Body-Rocking, accessed on 6 July 2022). The models are evaluated in both datasets with the leave-one-subject-out (for the EDAQA dataset) or leave-one-session-out (for the ESDB dataset) strategies for training and testing. The training was performed for 45 epochs. For each subject left out, the procedure is repeated 10 times to account for additional variability in the training process.
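The leave-one-subject-out and leave-one-session-out strategies share the same splitting logic, sketched below in plain Python. The generator name `leave_one_group_out` and the toy group labels are assumptions for illustration; the repetition count of 10 is taken from the text.

```python
def leave_one_group_out(groups):
    """Yield (held_out, train_indices, test_indices) for each group.

    groups: list with one group id (subject or session) per sample.
    """
    for held_out in sorted(set(groups)):
        train = [i for i, g in enumerate(groups) if g != held_out]
        test = [i for i, g in enumerate(groups) if g == held_out]
        yield held_out, train, test

# Toy example: four windows coming from three subjects.
groups = ["s1", "s1", "s2", "s3"]
splits = list(leave_one_group_out(groups))

# Each split would then be trained and tested 10 times.
runs = [(held_out, rep) for held_out, _, _ in splits for rep in range(10)]
```

Keeping all windows of the held-out subject or session out of training avoids leakage between highly correlated neighboring windows.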
For all models, the data were filtered by a Butterworth band-pass filter, segmented into windows of N samples for each axis (the three axes of either gyroscope or accelerometer measurements), and trained end-to-end as described in the original work [26]. The windows are extracted with a moving window of N = 10 samples.
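The segmentation step can be sketched as a simple moving window over an already filtered signal. The step size of one sample and the toy three-axis signal are assumptions, since the text does not state them; the band-pass filtering itself is omitted here.

```python
def sliding_windows(signal, n, step=1):
    """Split a sequence of samples into overlapping windows of length n.

    signal: list of samples, each a tuple with one value per axis.
    step: hop between consecutive windows (assumed, not given in the text).
    """
    return [signal[i:i + n] for i in range(0, len(signal) - n + 1, step)]

# Toy three-axis signal with 12 samples (values are placeholders).
signal = [(float(t), 0.0, -float(t)) for t in range(12)]
windows = sliding_windows(signal, n=10)
```

Each window keeps the per-sample axis tuples, so it can be fed to a model expecting an input of N samples by three axes.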
For the Bayesian approach, the models had their original loss function replaced by the Bayesian loss (see Equation (14)), and dropout is added for each layer with probability 0.05. A diagram showing the additions of the Bayesian components to Rad's model can be found in Figure 2c.

Transfer Learning for Model Improvement
As a comprehensively explored topic in statistics and DL literature [26,27], transfer learning (TF) is useful to take advantage of data coming from similar domains, consequently increasing model generalization capabilities. In the application of wearable sensors, transfer learning is especially useful to allow a model to work with sensors placed on different limbs.
In our particular scenario, we initially train the best performing model using the EDAQA data and later on we re-train the model on the ESDB dataset, conserving most of its parameters. As explained before, the torso is the optimal sensor location for body-rocking detection, since it has minimal coupling with other repetitive activities that could be performed by other limbs. On the other hand, data coming from the arm has significant mechanical coupling with other activities, which makes it much more challenging to work with. For example, in a classroom environment, repeatedly taking folders from the student's backpack to their desk would trigger a detector on the wrist but might not trigger one on the chest. The transfer learning technique has been chosen as an aid to improve the model's performance on the challenging (ESDB) arm dataset.

Uncertainty Quantification as a Criterion for Choosing Reliable Predictions
We make use of the uncertainty quantification metrics as well as the calibrated probabilities generated by the Bayesian DL framework to develop a criterion to establish whether a prediction made by the model should generate a notification or not. More specifically, we will make use of a threshold on these quantities as a criterion for selection of a reliable detection. Entropy and MI were chosen to be reported since they were shown to be the most effective. We have not found any clear relationship between the predictions and other dispersion measures, such as estimated variance, inter-quantile ranges, coefficient of variation, etc. Finally, as mentioned in the introduction, excessive false positives can lead to alert fatigue. The trade-off between the distributions of correct and incorrect predictions will be analyzed as the threshold is changed.
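The thresholding criterion described above can be illustrated with a toy example in plain Python. The predictions, labels, calibrated probabilities, and the threshold of 0.6 are all made up for illustration, and the helper names are hypothetical.

```python
def precision(pairs):
    """Precision over (prediction, label) pairs with 1 = body rocking."""
    tp = sum(1 for p, y in pairs if p == 1 and y == 1)
    fp = sum(1 for p, y in pairs if p == 1 and y == 0)
    return tp / (tp + fp) if tp + fp else 0.0

def filter_by_confidence(pairs, rho, threshold):
    """Keep only predictions whose calibrated probability reaches the threshold."""
    return [pair for pair, r in zip(pairs, rho) if r >= threshold]

# Made-up (prediction, label) pairs and calibrated probabilities.
pairs = [(1, 1), (1, 0), (1, 1), (0, 0), (1, 0)]
rho = [0.90, 0.40, 0.80, 0.95, 0.55]

before = precision(pairs)
after = precision(filter_by_confidence(pairs, rho, threshold=0.6))
```

For an uncertainty measure such as entropy or MI, the comparison would be reversed, keeping predictions whose uncertainty falls below the threshold. Raising the threshold discards more false positives but also suppresses some correct detections, which is the trade-off analyzed in the text.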

Metrics
For the evaluation metrics, we focus on the Area-Under-Curve (AUC) of the computed Receiver Operating Characteristic (ROC) curve and on precision. The AUC is known to be less sensitive to oscillations in predictions and a good indicator of generalization, in contrast to other metrics such as accuracy. The other metrics reported are recall and F1-score. For the Bayesian approach, the metrics above are calculated using the model output specified by Equation (15). The regular softmax output (as shown in Figure 2a,b) is used for non-Bayesian models.
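For reference, the reported metrics can be computed as in the following plain-Python sketch. The AUC is obtained through its rank interpretation (the probability that a positive sample scores higher than a negative one); the function names and toy scores are illustrative assumptions, not the evaluation code used in the experiments.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # Ties between a positive and a negative score count as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def precision_recall_f1(preds, labels):
    """Precision, recall and F1-score for binary predictions (1 = positive)."""
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1

scores = [0.9, 0.8, 0.3, 0.1]   # toy predicted probabilities
labels = [1, 0, 1, 0]
auc = roc_auc(scores, labels)
prf = precision_recall_f1([1 if s >= 0.5 else 0 for s in scores], labels)
```

Unlike precision, recall, and F1-score, the AUC does not depend on a decision threshold, which is one reason it is less sensitive to oscillations in the predictions.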

Bayesian Approach Compared to Current Methods
For the EDAQA Study 1 in Figure 3 top, one can observe that the AUC and F1-score obtained by Rad's approach are in general superior to those of the WiderNet models without a Bayesian approach (with legends "AUC FCN 8" and "AUC FCN 128"). By evaluating the curves for the AUC plot, one can observe that increased model complexity seems to be degrading the model's performance, most likely due to overfitting. The Bayesian approach stands out as being superior to Rad's approach by almost one standard deviation when using WiderNet 8× FCN 128. The figure also shows that the Bayesian approach does not require an aggressive increase in the capacity of the architecture in order to perform better, since WiderNet 2× already increases the AUC by 3%; all obtained AUC values are around 94% for the Bayesian approach.
We continue evaluating Study 1, but now for F1-score in Figure 3 top in the middle column. As expected, the performance degradation due to increase in capacity of non-Bayesian models is also reflected in their F1-score.
Furthermore, different improvements are obtained as the WiderNet's capacity is increased, which is more noticeable than when analyzing the AUC. The F1-score for the Bayesian approach increases from 61% to almost 66%. WiderNet 8× provided the highest F1-score of 65.8%, just slightly above WiderNet 8× FCN 128 with 65.5%, and more than 10 percentage points above the F1-score of 54.9% obtained with Rad's approach. Precision is also further improved by the Bayesian approach, from 62% with Rad's model up to 70% with WiderNet 16× FCN 128. Therefore, for F1-score and precision, the Bayesian approach provided greater improvement than for AUC.
For EDAQA Study 2, the reader can refer to Figure 3 middle. The performance in this portion of the EDAQA dataset is superior to the performance obtained in Study 1, as also shown in [27]. The same model degradation trend when increasing capacity observed in Study 1 is also present in Study 2 when considering non-Bayesian approaches. On the other hand, the WiderNet 8× provided an F1-score of 70% compared to 58% for Rad's model, an improvement of 12 percentage points. The precision when using WiderNet 16× was 73%, compared to 68% for Rad's model, an improvement of 5 percentage points.
Finally, for ESDB in Figure 3 bottom, one can observe a superior average performance by WiderNet variants in general when compared to Rad's model. This happens although all AUC values are within a range of about 4%. Thus, considering AUC, one can see that widening Rad's model provided improvements for the non-Bayesian approach of WiderNet 2× and 4× only. Increasing the model capacity any further degrades the AUC, as the non-Bayesian model variants WiderNet 8× and WiderNet 16× show, independently of how many nodes are placed in the FCN. The Bayesian approaches had an even higher performance, since AUC improvements were observed up to a capacity increase of 8×. The models that had FCN 128 seem to be less sensitive to capacity increase. The FCN 128 Bayesian variants seem to have plateaued in terms of performance, leading us to believe that the framework has reached a limit in performance. Applying the Bayesian framework to Rad's model slightly improved its AUC score from 92% to 93%, while the Bayesian WiderNet 8× had 95%. The results for WiderNet FCN 128 are averaged and summarized in Table 3. Finally, considering also AUC for model generalization as well as dealing with imbalanced data, the up-scaled model WiderNet 8× FCN 128 seems to be the best performing one. In Figure 4, one can verify how AUC and precision play a role between the best models and Rad's.

Effect of Transfer Learning (TF)
We evaluated all Bayesian WiderNet FCN 128 variants using TF. It was observed that TF from the EDAQA to the ESDB dataset was not effective for Rad's approach; rather, it substantially decreased its performance. We noticed that Bayesian WiderNet models obtained better performance with TF than Rad's model, but still slightly worse than training the models from scratch.
It is important to note that [27] obtained good results and improvements using TF from one subject to another. However, the same limbs were being used, while in our case, TF was attempted from torso data to right upper arm data. The sensing modalities are also different, since EDAQA dataset only uses an accelerometer, whereas in ESDB the body-rocking activity is more evident in the gyroscope data.
Transfer learning is accomplished by first training a Bayesian model from scratch on EDAQA Study 1. Study 1 was chosen because previous results suggest that it is somewhat more challenging for the models than Study 2, and because the class imbalance in that portion of the EDAQA dataset is less severe than in Study 2. Then, the first CNN layer and the FCN of the model are trained on ESDB data analogously as before, namely, for 35 epochs, using leave-one-subject-out for the testing set and repeating this procedure 10 times for each subject.
We also analyzed the reliability plots or calibration diagrams, which capture the correctness of the calibrated probability ρ (as described in Section 3.3). The preprint version of this manuscript (https://github.com/rafa-coding-projects/Body-Rocking, accessed on 6 July 2022) contains one example of a reliability plot. The x-axis of the plots captures the mean calibrated probability value (i.e., which value of ρ is predicted), and the y-axis corresponds to the fraction of positives in the dataset. Ideally, these numbers should match and the resulting curves should follow a diagonal line. To capture the offset from this ideal configuration, we report the Area From Diagonal (AFD), i.e., the area between the model curve and the diagonal. The AFD is reported for all WiderNet models with FCN 128 with and without TF in Figure 5. We observe that TF seems to help the Bayesian models to neutralize the effect of higher variability as the model capacity increases.
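As an illustration of the reliability analysis described above, the following minimal sketch bins calibrated probabilities, builds the reliability curve, and integrates the gap to the diagonal. The function names, the choice of ten equal-width bins, and the trapezoidal integration of the absolute gap are assumptions made for this example; the exact AFD computation used in our experiments may differ.

```python
def reliability_curve(probs, labels, n_bins=10):
    """Return (mean predicted probability, fraction of positives) per non-empty bin.

    probs  : calibrated probabilities in [0, 1]
    labels : ground-truth labels (0 or 1)
    n_bins : number of equal-width bins (assumed value, not taken from the paper)
    """
    sums = [0.0] * n_bins
    positives = [0] * n_bins
    counts = [0] * n_bins
    for p, y in zip(probs, labels):
        b = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        sums[b] += p
        positives[b] += y
        counts[b] += 1
    mean_pred = [sums[b] / counts[b] for b in range(n_bins) if counts[b]]
    frac_pos = [positives[b] / counts[b] for b in range(n_bins) if counts[b]]
    return mean_pred, frac_pos


def area_from_diagonal(mean_pred, frac_pos):
    """Trapezoidal area between the reliability curve and the diagonal y = x.

    Integrates |fraction of positives - mean predicted probability| along the
    x-axis; this is an approximation when the curve crosses the diagonal
    between two points.
    """
    gaps = [abs(y - x) for x, y in zip(mean_pred, frac_pos)]
    area = 0.0
    for i in range(1, len(gaps)):
        width = mean_pred[i] - mean_pred[i - 1]
        area += 0.5 * (gaps[i] + gaps[i - 1]) * width
    return area


# Toy usage: a model that is confidently wrong is far from the diagonal.
x, y = reliability_curve([0.2, 0.2, 0.8, 0.8], [1, 1, 0, 0])
afd = area_from_diagonal(x, y)
```

A perfectly calibrated model yields an AFD of zero under this definition, and larger values indicate a larger mismatch between predicted probability and observed frequency.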

Uncertainty-Based Detection Selection
In this section, we explore the use of uncertainty as a criterion to select only predictions with high confidence. The Bayesian WiderNet 8×, FCN 128 model is used for this discussion since it is the model that provided the best overall performance among all models. The exploration is performed in the ESDB dataset only, since this is the focus of this manuscript for developing real-time notification systems.
The distributions of the Entropy and MI measures of uncertainty for the correct and incorrect predictions are displayed in Figure 6.
One can observe that the distribution of incorrect predictions has a higher occurrence of entropy values greater than 0.4. No distinct pattern is apparent for MI, since both groups have a high concentration of values smaller than 10⁻¹; therefore, only entropy is used for the analysis detailed next. Rad's model displays similar patterns.
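For readers who wish to reproduce the two uncertainty measures, the sketch below computes the predictive entropy and the mutual information (MI) for one input window from a set of stochastic forward passes. It assumes a binary classifier whose passes each return a probability of body rocking, and it uses the natural logarithm; the function names and the example values are hypothetical and not taken from our code base.

```python
import math


def binary_entropy(p, eps=1e-12):
    """Entropy (in nats) of a Bernoulli distribution with parameter p."""
    p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def uncertainty_from_samples(sample_probs):
    """Summarize T stochastic forward passes for a single input.

    sample_probs : list of P(body rocking) values, one per stochastic pass
    Returns (mean probability, predictive entropy, mutual information).
    """
    mean_p = sum(sample_probs) / len(sample_probs)
    # Entropy of the averaged prediction: total predictive uncertainty.
    predictive_entropy = binary_entropy(mean_p)
    # Average entropy of the individual passes.
    expected_entropy = sum(binary_entropy(p) for p in sample_probs) / len(sample_probs)
    # MI is the part of the uncertainty caused by disagreement between passes.
    mutual_information = predictive_entropy - expected_entropy
    return mean_p, predictive_entropy, mutual_information


# Passes that agree give low MI; passes that disagree give high MI.
agree = uncertainty_from_samples([0.5, 0.5, 0.5])
disagree = uncertainty_from_samples([0.0, 1.0])
```

In both toy cases the predictive entropy is at its maximum (the averaged probability is 0.5), but only the second case has a large MI, which shows that the two measures capture different aspects of the uncertainty.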
We consider two uncertainty criteria for detection selection: (1) setting positive detections with entropy above a specified threshold to negative (i.e., only keeping those detections with entropy that is low enough) and (2) setting positive detections with calibrated probability below a specified threshold to negative (i.e., only keeping those detections with calibrated probability that is high enough). Using the calibrated probability as a selection criterion is more desirable since the probability values can be more easily interpreted, and also because, as seen in the discussion below, it provides a better trade-off than using a single uncertainty measure. The goal is to use these criteria to select predictions that are truly reliable in order to avoid alarm fatigue.

[Figure caption, truncated] ... predictions. This indicates that it is a good feature for quantifying the uncertainty (i.e., higher entropy means a higher likelihood of an incorrect prediction). Best viewed in color.

Figure 7 provides a visualization of the trade-off between F1-score and precision. Note that the calibrated probability with a threshold of 0.65 yields a slight drop in the F1-score while increasing precision by 7%. This is a beneficial trade-off for our use case, since a larger share of the retained detections are true positives while the overall performance of the system is largely preserved. Figure 8 explores this degradation in F1-score further by examining the impact on recall. We observe that an improvement of 7% in precision comes with a degradation in recall of around 6%.
Since the proposed criteria reject some of the original detections of the model, Figure 9 shows the trade-off between uncertainty threshold values and the percentage of samples that are kept unchanged (i.e., not set to a negative prediction by the threshold criterion). We notice that when more than 60% of the samples are kept unchanged, both criteria provide similar AUC values. This plot supports our previous analysis, showing the impact of sample selection on metrics such as precision and F1-score.
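The two selection criteria of this section can be sketched as follows. The helper names and the toy data are hypothetical; the probability threshold of 0.65 is the one discussed above, whereas the entropy threshold of 0.4 is only an illustrative value suggested by the distributions in Figure 6.

```python
def select_by_entropy(preds, entropies, max_entropy):
    """Criterion (1): keep a positive detection only if its entropy is low enough."""
    return [1 if (y == 1 and h <= max_entropy) else 0
            for y, h in zip(preds, entropies)]


def select_by_probability(preds, calibrated_probs, min_prob):
    """Criterion (2): keep a positive detection only if its calibrated probability is high enough."""
    return [1 if (y == 1 and p >= min_prob) else 0
            for y, p in zip(preds, calibrated_probs)]


def precision_recall_f1(preds, labels):
    """Precision, recall, and F1-score for binary predictions."""
    tp = sum(1 for y, t in zip(preds, labels) if y == 1 and t == 1)
    fp = sum(1 for y, t in zip(preds, labels) if y == 1 and t == 0)
    fn = sum(1 for y, t in zip(preds, labels) if y == 0 and t == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2.0 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy data (made up for illustration only).
labels = [1, 1, 1, 0, 0]
preds = [1, 1, 1, 1, 0]
calibrated = [0.90, 0.80, 0.60, 0.55, 0.10]
entropies = [0.10, 0.20, 0.50, 0.60, 0.10]

before = precision_recall_f1(preds, labels)
after_prob = precision_recall_f1(select_by_probability(preds, calibrated, 0.65), labels)
after_entropy = precision_recall_f1(select_by_entropy(preds, entropies, 0.4), labels)
```

On this toy data, both criteria remove the single false positive together with one true positive, so precision rises while recall falls; this is the same kind of trade-off reported in Figures 7 and 8, although the magnitudes here are not meaningful.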

Discussion
BNNs improve performance beyond what is obtained by simply increasing model capacity. The first experiments show that the Bayesian approach yielded a modest and inconsistent improvement over Rad's approach: it improved performance on Study 2 but slightly degraded it on Study 1 (see Table 3). Increasing model complexity for the EDAQA dataset degraded performance for non-Bayesian models, making a case for possible overfitting. For WiderNet, performance was generally enhanced by the Bayesian variants. One possible explanation is that applying the Bayesian framework to a model has a regularization effect [52], which, for a model with lower capacity such as Rad's model, results in lower performance. It is important to note that the AUC improvements for the Bayesian approach were insensitive to further increases in capacity, leading us to believe that the framework has reached a limit on its performance. Additionally, it is interesting to note that for the EDAQA dataset, precision increases with larger Bayesian models whereas it decreases for larger non-Bayesian models. According to [53], a DL model approaches a Gaussian process as the number of layers of the DL model goes to infinity. Another important aspect is that, as shown by [54], a sufficiently deep and wide model can even fit corrupted data, since DL models have enough capacity to model very complex and even noisy data. However, based on the observations so far, we have some evidence that the WiderNet model benefited from the Bayesian approach, showing that not only deeper models but also wider ones benefit from such an approach. It also shows that model capacity alone did not extract the "full potential" of the model. Additionally, the Bayesian approach gives us a relatively computationally cheap way of obtaining uncertainties from model predictions. One could argue that a simpler ensemble could also provide the same benefit, but in our previous work [9] we observed that random forests, for example, do not perform well for this dataset (and thus we used an SVM); therefore, a DL approach was chosen for this work.

Transfer learning reduces model variability. The evidence provided in Section 4.2 supports a claim that has intuitive appeal: a model that has learned a similar domain will exhibit less variability when retrained. An interesting extension of this result would be to investigate the impact on model generalization, which we leave as future work.
The calibrated uncertainty can serve as a prediction quality indicator. Section 4.3 shows that the calibrated probability provides slightly better improvements in precision when choosing good predictions. To further illustrate this, we computed the ROC after removing the samples (and associated ground truth values) that do not meet the selection criterion and obtained an AUC of 98%. Although this represents an unrealistic scenario, it provides evidence that the remaining samples are in very close agreement with their respective ground truth values, revealing that the uncertainty-based thresholding really eliminates predictions with "poor quality".

Limitations and Future Work. This study and the proposed pipeline have some limitations worth discussing.

(1) Runtime. Producing the ensemble of predictions for uncertainty quantification requires several predictions, which demands more computational power than a single prediction. This constraint must be taken into consideration when implementing such methods in real time. An Ubuntu desktop with a 3.7 GHz i7 CPU, 64 GB of RAM, and a GeForce GTX 1080 Ti GPU takes 3 ms per inference for the Bayesian approach and 9 µs for the non-Bayesian one, about 32 times more. Although this comparison was made on a desktop, implementing a deep architecture with the Bayesian approach for real-time processing on an embedded device is viable, as we showed in our previous work [51] with an architecture more complex than the ones presented in this manuscript. However, a clinical study should be considered with several subjects wearing a wearable system running the pipeline proposed in this work. With that, the shortcomings of the method in terms of comfort and effectiveness could be assessed.

(2) Dataset size. Although we used a public dataset with six subjects, as noted in Section 3.1, we only evaluated uncertainty quantification results on the data with one subject. Repeating such evaluations on the EDAQA dataset could bring extra insights in future work. Additionally, expanding the ESDB dataset could bring further insights into transfer learning between different subjects, since the domain of EDAQA did not provide many performance improvements to the domain of ESDB, which could be due to noise in EDAQA or simply a lack of domain similarity.

(3) Further fine-tuning of the pipeline. The results obtained from the pipeline can potentially be improved by using a validation set and further tuning the parameters of the models, which is left for future work.

(4) Prior sensitivity analysis. We have not comprehensively explored different priors for this work, but we acknowledge their importance in obtaining accurate posteriors depending on the size of the data. Studying the impact of different priors on body-rocking detection using BNNs could be interesting future work. However, priors are not expected to have a big impact on performance for large datasets.

(5) Fixed dropout probability. We acknowledge that better results have been obtained in the literature using BNNs with dropout by learning the dropout probability during model training. Therefore, the performance shown in this manuscript could potentially be increased with such an aid in future work.

Conclusions
In this work, a comprehensive comparative study of methods to classify the body-rocking activity was presented. The methods were evaluated in light of a Bayesian approach. It was observed that a shallower model tends not to take advantage of the Bayesian approach. Additionally, the Bayesian approach was shown to provide superior performance benefits when applied to higher-capacity models, as demonstrated by simple networks that were wider than the baseline model, which we called WiderNet. Although the experiments show this tendency, we acknowledge that more evaluations with other deeper and wider models, as well as other datasets, are needed in order to isolate the effect of capacity from that of the Bayesian approach. Assuming precision is a better metric than F1-score for body-rocking classification, the calibrated probability and estimated entropy turned out to be useful criteria to establish a "reliable level" for model predictions, significantly improving the model's precision, reducing the number of false positives, and making the case for using such methods for real-time detection.
Bayesian DL is still a growing research area for which new insights are being shared. We foresee that the performance observed in this paper can be further improved not only by comprehensively evaluating deep architectures, but also by exploring the effects of different priors for body-rocking classification and new ways of obtaining posteriors. The work of [55], for example, condensed a series of justifications for the use of cold posteriors, on top of the fact that there is theoretical and experimental evidence that the posterior predictive can be better than point estimators and that model averaging, in general, provides robust predictions. They show that cold posteriors improve the predictions of Stochastic Gradient Markov Chain Monte Carlo (SG-MCMC)-based ensembles, which could also bring benefits for body-rocking classification. Although there are claims in the literature that deep learning architectures have so much capacity that they can even fit corrupted data [54], we intend to show that the Bayesian approach does improve the predictions. Although we do not explore the impact of the prior on the predictions, we recognize its crucial role in accurate estimation of the posterior, as shown by [39,40,56]. Applications such as body-rocking detection can largely benefit from the constant new outcomes of Bayesian DL. In order to make "live" body-rocking detection with BNNs viable, a clinical study evaluating user feedback for real-time detection and uncertainty threshold adjustment should be conducted, with devices equipped with real-time BNNs, which we leave as future work.