It is realistic to assume that for most remote health monitoring technology, detailed user behavior information would never be available after deployment. Therefore, traditional supervised machine learning activity recognition systems are not applicable, and we turn to unsupervised learning. We propose two different methods for segmenting distinct behaviors.
The first method is based on fitting a GMM to the data generated from each clinimetric test. The method does not require any labeled data for training, but imposes the strong assumption that data violating the test protocols can be clustered into a different Gaussian component from data adhering to the protocol. Despite the simplicity of this method, we demonstrate that in some scenarios it manages to segment out most of the bad-quality data points with very little computation involved.
For more complex scenarios, we also propose a general technique that can be used to segment different behaviors based on the properties of the data into some estimated number of different “states”.
2.4.2. Segmentation with GMMs
The simpler proposal is a GMM-based approach that relies on the assumption that (at least most of the time) the magnitude of the sensor data adhering to the test protocols is different from the magnitude of the data violating them; in other words, that the magnitudes can be approximately clustered into two separate Gaussian components. For the accelerometry data, the magnitude is the length of the 3D acceleration vector; for the voice data, it is the spectral power of each frame. After we have applied the appropriate preprocessing depending on the data source (as described above), we fit a two-component ($K = 2$) GMM to each of the different sensor datasets. The GMMs are estimated in an unsupervised way using the expectation-maximization (E-M) algorithm, where after convergence, points are clustered to their most likely component using the maximum a posteriori (MAP) principle (more precisely, each point in time is assigned to the component that maximises the probability of its component indicator). Let us denote the preprocessed data by $y_{1:T} = (y_1, \dots, y_T)$, with $T$ being the number of sensor outputs for a given test after preprocessing. By fitting a two-component GMM to $y_{1:T}$, we estimate indicators $z_{1:T}$ that denote the component assignment of each time point (for example, $z_t = 1$ denotes that time point $t$ is associated with Component 1). We denote by $(\mu_1, \sigma_1^2)$ and $(\mu_2, \sigma_2^2)$ the component mean and variance for the first and second component, respectively. We use the estimated means $\mu_1$ and $\mu_2$ to identify whether a component corresponds to protocol adherence or violation. For walking tests and voice tests, we assume that if $\mu_1 > \mu_2$, all data points $\{y_t : z_t = 1\}$ represent adherence to the test protocols and, hence, the points $\{y_t : z_t = 2\}$ represent protocol violation. By contrast, for the balance tests, we assume that time points associated with the larger mean represent violation and the points associated with the smaller mean represent adherence to the protocols. This is because adherence in the walking tests results in higher acceleration, whereas adherence in the balance tests results in lower acceleration. Since the GMM ignores the sequential nature of the data (see Figure S1a, Supplementary Material), the estimated indicators $z_{1:T}$ can switch very rapidly between the two components, providing an unrealistic representation of human behavior. In order to partially address this issue, we apply moving median filtering [45] to the indicator sequence $z_{1:T}$, running the filter repeatedly until convergence. In this way, we obtain a "smoothed" sequence $\bar{z}_{1:T}$ that we use to classify whether each $y_t$ adheres to, or violates, the relevant protocol; time point $t$ is classified as adherence if $\bar{z}_t$ corresponds to the adherence component and as violation otherwise.
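As an illustration, the pipeline above (two-component GMM, MAP assignment, mean-based adherence labeling and repeated median filtering) can be sketched as follows. This is a minimal sketch using scikit-learn and SciPy; the function name, the kernel width and the `larger_mean_is_adherence` switch are hypothetical, not part of the original method description.

```python
import numpy as np
from scipy.signal import medfilt
from sklearn.mixture import GaussianMixture

def gmm_segment(magnitude, larger_mean_is_adherence=True, kernel_size=9):
    """Label each sample as adherence (1) or violation (0)."""
    y = np.asarray(magnitude, dtype=float).reshape(-1, 1)
    gmm = GaussianMixture(n_components=2, random_state=0).fit(y)
    z = gmm.predict(y)                       # MAP component indicators
    # Decide which component is adherence from its estimated mean
    # (larger mean for walking/voice, smaller mean for balance tests).
    means = gmm.means_.ravel()
    adh = int(np.argmax(means)) if larger_mean_is_adherence else int(np.argmin(means))
    u = (z == adh).astype(float)
    # Apply moving median filtering repeatedly until the labels stop changing.
    while True:
        u_new = medfilt(u, kernel_size=kernel_size)
        if np.array_equal(u_new, u):
            break
        u = u_new
    return u.astype(int)
```

For instance, on a signal whose first segment has low acceleration magnitude and whose second segment has high magnitude, the function would mark the high-magnitude segment as adherence for a walking test.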
2.4.3. Segmentation with the Switching AR Model
In order to extend the GMM to model long time-scale dependence in the data, we can turn to HMMs [
46]. HMMs with Gaussian observations (or mixtures of Gaussian observations) have long dominated areas such as activity [
47,
48,
49] and speech recognition [
46,
50,
51]. However, simple HMMs fail to model any of the frequency domain features of the data and are therefore not flexible enough to describe the sensor data; instead, we need to use a more appropriate model.
The switching AR model is a flexible discrete latent variable model for sequential data, which has been widely used in many applications, including econometrics and signal processing [52,53,54,55]. Typically, some number $K$ of different AR models is assumed a priori. An order-$r$ AR model is a random process that describes a sequence $y_{1:T}$ as a linear combination of previous values in the sequence and a stochastic term:

$$y_t = \sum_{i=1}^{r} a_i\, y_{t-i} + e_t,$$

where $a_1, \dots, a_r$ are the AR coefficients and $e_t$ is a zero-mean, Gaussian i.i.d. sequence, $e_t \sim \mathcal{N}(0, \sigma^2)$ (we can trivially extend the model such that $e_t \sim \mathcal{N}(\mu, \sigma^2)$ for any real-valued $\mu$). The order $r$ of the AR model directly determines the number of "spikes" in its spectral density, meaning that $r$ controls the complexity or amount of detail in the power spectrum of $y_{1:T}$ that can be represented.
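To make the recursion concrete, an order-$r$ AR process can be simulated directly from the equation above. This is a minimal sketch with hypothetical names; the coefficients in the usage line are an arbitrary stable example, not values from the study.

```python
import numpy as np

def simulate_ar(coeffs, T, noise_std=1.0, seed=0):
    """Simulate y_t = a_1*y_{t-1} + ... + a_r*y_{t-r} + e_t, e_t ~ N(0, noise_std^2)."""
    rng = np.random.default_rng(seed)
    a = np.asarray(coeffs, dtype=float)
    r = len(a)
    y = np.zeros(T + r)                 # zero-padded initial conditions
    for t in range(r, T + r):
        # y[t - r:t][::-1] is (y_{t-1}, ..., y_{t-r}), matching (a_1, ..., a_r)
        y[t] = a @ y[t - r:t][::-1] + rng.normal(0.0, noise_std)
    return y[r:]

# An order-2 AR process whose spectral density has a single sharp peak.
y = simulate_ar([1.5, -0.9], T=500)
```

Here the order $r = 2$ allows one resonant peak in the power spectrum, consistent with the remark that $r$ bounds the spectral complexity the model can represent.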
In switching AR models, we assume that the data comprise an inhomogeneous stochastic process, and multiple different AR models are required to represent the dynamic structure of the series, i.e.,

$$y_t = \sum_{i=1}^{r} a_i^{(z_t)}\, y_{t-i} + e_t^{(z_t)},$$

where $z_t \in \{1, \dots, K\}$ indicates the AR model associated with point $t$. The latent variables $z_{1:T}$ describing the switching process are modeled with a Markov chain. Typically, $K \ll T$, allowing us to cluster together data that are likely to be modeled with the same AR coefficients.
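The generative process just described (a Markov chain over $z_t$ selecting which AR coefficient set drives each step) can be sketched as follows. The regimes and transition matrix in the usage line are illustrative assumptions only.

```python
import numpy as np

def simulate_switching_ar(ar_models, trans, T, noise_std=0.1, seed=0):
    """Simulate y_t from the AR model selected by a Markov chain z_t."""
    rng = np.random.default_rng(seed)
    K = len(ar_models)
    r_max = max(len(a) for a in ar_models)
    z = np.zeros(T, dtype=int)
    y = np.zeros(T + r_max)             # zero-padded initial conditions
    for t in range(T):
        if t > 0:
            # Draw the next regime from the row of the transition matrix.
            z[t] = rng.choice(K, p=trans[z[t - 1]])
        a = np.asarray(ar_models[z[t]], dtype=float)
        past = y[t + r_max - len(a):t + r_max][::-1]   # y_{t-1}, ..., y_{t-r}
        y[t + r_max] = a @ past + rng.normal(0.0, noise_std)
    return z, y[r_max:]

# Two regimes (a resonant AR(2) and a near-white AR(1)) with "sticky" transitions,
# so the chain tends to stay in one regime for long stretches.
trans = np.array([[0.99, 0.01], [0.01, 0.99]])
z, y = simulate_switching_ar([[1.8, -0.95], [0.1]], trans, T=1000)
```

The sticky diagonal of `trans` plays the role of $K \ll T$ in practice: each regime persists long enough to form the contiguous behavioral segments the model is meant to recover.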
The switching AR model above is closely related to the HMM: like the switching AR model, the HMM assumes that the data are associated with a sequence of hidden (latent) variables that follow a Markov process. However, in the case of HMMs, we assume that, given the latent variables, the observed data are independent. In other words, the simplest HMM can be considered as a switching AR model where the order $r$ of each AR model is zero, with a non-zero mean error term. Neither of the models discussed here is necessarily limited to Gaussian data, and there have been HMM extensions utilising multinomial states for part-of-speech tagging [56], Laplace-distributed states for passive infrared signals [48] or even neural network observational models for image and video processing [57].
The segmentation produced with any variant of the HMM is highly dependent on the choice of K (the number of hidden Markov states, i.e., distinct AR models). In the problem we study here, the number K would roughly correspond to the number of different behavioral patterns that occur during each of the clinimetric tests. However, it is not realistic to assume we can anticipate how many different behaviors can occur during each test. In fact, it is likely that as we collect data from more tests, new patterns will emerge, and K will need to change. This motivates us to seek a Bayesian nonparametric (BNP) approach to this segmentation problem: a BNP extension of the switching AR model described above, which will be able to accommodate an unknown and changing number of AR models.
The nonparametric switching AR model (first derived as a special case of nonparametric switching linear dynamical systems in Fox et al. [58]) is obtained by augmenting the transition matrix of the HMM underlying the switching AR model with a hierarchical Dirichlet process (HDP) [59] prior. Effectively, the HMM component of the switching AR model is replaced with an infinite HMM [60]. The infinite HMM avoids fixing the number of states $K$ in the Markov model; instead, it assumes that the number of HMM states is unknown, potentially large, and depends on the amount of training data we have already seen. When fitting an infinite HMM, we typically start by assigning the data to a single hidden state (or a small fixed number of states) and, with some probability, increase the number of effective states at each inference pass through the signal. In this way, the number of effective states in an infinite HMM can be inferred from the data as a random variable. The parameters specifying how quickly the number of effective states grows are called the local and global concentration hyperparameters: $\alpha$ denotes the local and $\gamma$ the global concentration. The local $\alpha$ controls how likely it is that new types of transitions occur between the effective states, or essentially how sparse the HMM transition matrix is. The global $\gamma$ reflects how likely it is for a new effective state to arise, or how many rows the transition matrix has. Unlike the fixed $K$ in standard parametric HMMs, the hyperparameters $\alpha$ and $\gamma$ of the infinite HMM (or any of its extensions) can be tuned with standard model selection tools that compute how the complete data likelihood changes as $\alpha$ and $\gamma$ vary. This allows us to model the behavioral patterns in the smartphone clinimetric tests in a completely unsupervised way. For a lengthier discussion and derivation of the infinite HMM and the nonparametric switching AR model, we refer the reader to [58,59,60].
2.4.4. Segmentation Context Mapping
The switching AR model groups together intervals of the preprocessed data that have similar dynamics described by the same AR pattern, i.e., we group points according to their corresponding indicator values $z_t$ for $t = 1, \dots, T$. The generality of this principle allows us to apply the framework widely across different datasets generated from diverse clinimetric tests such as walking, balance or voice tests.
A trained expert can reasonably identify intervals of walking or balancing that adhere to the corresponding test protocols, whereas specific physical activities would be difficult to identify purely from the accelerometer output. A lack of behavior labels can challenge our understanding of the segmentation from the previous stage. This motivated the collection of additional controlled clinimetric tests to shed some light on the patterns we discover using the nonparametric switching AR model. The controlled smartphone tests have been performed by healthy controls. We collect 32 walking, 32 balance and 32 voice tests in which we vary the orientation and location of the phone during a simulated clinimetric test. During these tests, subjects are instructed to perform some of the most common behaviors that we observe during clinimetric tests performed outside the lab. Activities conducted during the tests include freezing of gait, walking, coughing, sustained phonation, keeping balance and several others. A human expert annotated the monitored behaviors with labels $b_{1:T}$, which associate each data point with a behavioral label (i.e., $b_t = \text{walking}$ means that point $y_t$ was recorded during walking). In contrast to the clinimetric tests performed outside the lab, here we have relatively detailed information about what physical behavior was recorded in each segment of these controlled tests.
Since we have the "ground truth" labels $b_{1:T}$ for the controlled clinimetric tests, we can be confident in the interpretation of the intervals estimated by the unsupervised learning approach. This allows us to better understand the different intervals inferred from data collected outside the lab, where labels $b_{1:T}$ are not available. Note that $b_{1:T}$ are not used during the training of the nonparametric switching AR model, but only for validation. Furthermore, the distribution of the data from the actual clinimetric tests collected outside of the lab departs significantly from the distribution of the data of the controlled tests.
We assess the ability of the model to segment data consisting of different behaviors. This is done by associating each of the unique values that the indicators $z_{1:T}$ can take with one of the behavioral labels occurring during a controlled test. For each $k \in \{1, \dots, K\}$, state $k$ is assumed to model behavior $\hat{b}_k$, with $\hat{b}_k$ being the most probable behavior during that state.
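This state-to-behavior association is a simple majority vote, which can be sketched as follows (function name and the toy labels are hypothetical):

```python
import numpy as np

def map_states_to_behaviors(z, b):
    """For each state k, pick the behavior label co-occurring with it most often."""
    z, b = np.asarray(z), np.asarray(b)
    mapping = {}
    for k in np.unique(z):
        labels, counts = np.unique(b[z == k], return_counts=True)
        mapping[int(k)] = labels[np.argmax(counts)]   # most probable behavior
    return mapping

z = np.array([0, 0, 0, 1, 1, 2, 2, 2])
b = np.array(["walk", "walk", "cough", "balance", "balance",
              "phonation", "phonation", "cough"])
mapping = map_states_to_behaviors(z, b)
# state 0 -> "walk", state 1 -> "balance", state 2 -> "phonation"
```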
Using this simple mapping from the numerical indicators $z_{1:T}$ to interpretable behaviors, we obtain estimated behavior indicators $\hat{b}_{1:T}$. Using the estimated behavior indicators $\hat{b}_{1:T}$ and the "ground truth" labels $b_{1:T}$, we compute the following algorithm performance measures: balanced accuracy (BA), true positive (TP) and true negative (TN) rates for the segmentation approach in Table 1. For example, given behavior $\beta$, these metrics are computed using:

$$\mathrm{TP} = \frac{\sum_{t=1}^{T} \mathbb{1}(\hat{b}_t = \beta \wedge b_t = \beta)}{\sum_{t=1}^{T} \mathbb{1}(b_t = \beta)}, \qquad \mathrm{TN} = \frac{\sum_{t=1}^{T} \mathbb{1}(\hat{b}_t \neq \beta \wedge b_t \neq \beta)}{\sum_{t=1}^{T} \mathbb{1}(b_t \neq \beta)}, \qquad \mathrm{BA} = \frac{\mathrm{TP} + \mathrm{TN}}{2},$$

where $\mathbb{1}(\cdot)$ denotes the indicator function, which is one if the logical condition is true and zero otherwise.
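These per-behavior rates are straightforward to compute; a minimal sketch (hypothetical names, toy labels):

```python
import numpy as np

def per_behavior_rates(b_hat, b, beta):
    """TP rate (recall), TN rate (specificity) and balanced accuracy for behavior beta."""
    b_hat, b = np.asarray(b_hat), np.asarray(b)
    tp = np.mean(b_hat[b == beta] == beta)   # fraction of true beta points recovered
    tn = np.mean(b_hat[b != beta] != beta)   # fraction of non-beta points recovered
    return tp, tn, (tp + tn) / 2.0

tp, tn, ba = per_behavior_rates(
    ["walk", "rest", "rest", "rest"], ["walk", "walk", "rest", "rest"], "walk")
# tp = 0.5, tn = 1.0, ba = 0.75
```

Balanced accuracy averages the two rates, so it is not inflated when one behavior dominates the recording, which is common in free-living clinimetric data.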
Outside the lab, we cannot always label physical behaviors with high confidence. Instead, we use binary labels $u_{1:T}$, where $u_t = 1$ if point $y_t$ adheres to, and $u_t = 0$ if it violates, the applicable test protocol (as described in Section 2.2). In order to classify a time point $t$ with respect to its adherence to the protocol, it is sufficient to simply classify the state assignment $z_t$ associated with that time point.
To automate this context mapping, we use a highly interpretable naive Bayes classifier. We train the classifier using the posterior probabilities of the indicators $z_{1:T}$ associated with the training data as inputs and the corresponding binary labels $u_{1:T}$ as outputs (we noticed that we can obtain very similar accuracy using just the modal estimates of the indicators as an input to the classifier, which takes substantially less computational effort than computing the full posterior distribution of the indicators). For a new test point $y_t$, we can then compute the vector of probabilities $\big(p(z_t = 1), \dots, p(z_t = K)\big)$ given the estimated switching AR parameters (properly denoted in Figure S1b, Supplementary Material) and rescale them appropriately to appear as integer frequencies; we write these vectors of frequencies as $x_t = (x_{t,1}, \dots, x_{t,K})$. The multinomial naive Bayes classifier assumes the following probabilistic model:

$$p(x_t \mid u_t = c) \propto \prod_{k=1}^{K} p_{k,c}^{\,x_{t,k}},$$

where $p_{k,c}$ denotes the training probability for attribute $k$ given that the observation is from class $c$. This model can then be reversed (via Bayes' rule) to predict the class assignment $\hat{u}_t$ for some unlabeled input $x_t$:

$$\hat{u}_t = \operatorname*{arg\,max}_{c \in \{0,1\}} \; p(u_t = c) \prod_{k=1}^{K} p_{k,c}^{\,x_{t,k}},$$

with the prior $p(u_t = c)$ enabling control over the prior probabilities for class adherence/violation of the protocols.
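As a sketch of this final stage, a multinomial naive Bayes classifier can be trained on per-point state-frequency vectors with scikit-learn. The training data below are invented toy counts, and the class prior is set uniform to mirror the controllable prior $p(u_t = c)$ above.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical training data: each row is the rescaled (integer-frequency)
# vector of switching-AR state probabilities for one time point; u holds the
# binary adherence (1) / violation (0) labels.
X_train = np.array([[9, 1, 0],   # mass on state 0 -> adherence
                    [8, 2, 0],
                    [0, 2, 8],   # mass on state 2 -> violation
                    [1, 1, 8]])
u_train = np.array([1, 1, 0, 0])

# class_prior plays the role of p(u_t = c) in the decision rule.
clf = MultinomialNB(fit_prior=False, class_prior=[0.5, 0.5]).fit(X_train, u_train)
print(clf.predict([[7, 3, 0], [0, 1, 9]]))  # -> [1 0]
```

Because the input is a single categorical feature expanded into state frequencies, the classifier's independence assumption is unproblematic here, as noted below.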
The multinomial naive Bayes model is linear in the log-space of the input variables, making it very easy to understand; we demonstrate this by plotting a 2D projection of the input variables and the decision boundary (Figure 7). The naive Bayes classifier requires very little training data to estimate its parameters, scales linearly with the data size and, despite its simplicity, has shown performance close to the state of the art in demanding applications such as topic modeling in natural language processing, spam detection in electronic communications and others [61]. One of the main disadvantages of this classifier is that it assumes, usually unrealistically, that the input variables are independent; however, this is not an issue in this application, since the classifier is trained on a single feature. Note that the multinomial naive Bayes classifier assumes that the data in the different classification classes follow different multinomial distributions.
For different clinimetric tests, we need to train different classifiers because when the test protocols change, so does the association between the z’s and the u’s. However, the overall framework we use remains universal across the different tests and can be extrapolated to handle quality control in a wide set of clinimetric testing scenarios.