Domain Adaptation Methods for Lab-to-Field Human Context Recognition

Human context recognition (HCR) using sensor data is a crucial task in Context-Aware (CA) applications in domains such as healthcare and security. Supervised machine learning HCR models are trained on smartphone HCR datasets that are either scripted or gathered in-the-wild. Scripted datasets have the most accurate labels because contexts are visited in prescribed, consistent patterns, and supervised HCR models perform well on them but poorly on realistic data. In-the-wild datasets are more realistic, but they cause HCR models to perform worse due to data imbalance, missing or incorrect labels, and a wide variety of phone placements and device types. Lab-to-field approaches learn a robust data representation from a scripted, high-fidelity dataset, which is then used to enhance performance on a noisy, in-the-wild dataset with similar labels. This research introduces Triplet-based Domain Adaptation for Context REcognition (Triple-DARE), a lab-to-field neural network method that combines three unique loss functions to enhance intra-class compactness and inter-class separation within the embedding space of multi-labeled datasets: (1) a domain alignment loss to learn domain-invariant embeddings; (2) a classification loss to preserve task-discriminative features; and (3) a joint fusion triplet loss. Rigorous evaluations showed that Triple-DARE achieved 6.3% and 4.5% higher F1-score and accuracy, respectively, than state-of-the-art HCR baselines and outperformed non-adaptive HCR models by 44.6% and 10.7%, respectively.


Introduction
There is great potential for context-aware (CA) systems to impact many fields, such as healthcare, smart homes, and security [1]. An important part of CA systems is Human Context Recognition (HCR), the process of determining the user's current state. Several definitions exist, but ours is as follows: Human Context is a tuple <Activity, Prioception> comprising the user's current activity (e.g., walking, standing) and the phone's placement on the user's body (the "prioception") (e.g., in a bag, pocket, or hand). We focus on CA and HCR on smartphones, which are now almost ubiquitously owned and possess a wide variety of sensors such as accelerometers, gyroscopes, and position detectors. There are two popular study designs for collecting HCR datasets for supervised machine learning involving human participants: (1) scripted [2] or (2) in-the-wild [3]. Scripted studies involve participants carrying out a series of tasks in a prescribed sequence while being monitored by a human proctor and having their smartphone sensor data continuously recorded by an app. After the sensor data has been collected, human proctors label it with the contexts the participants were in. On the other hand, in-the-wild studies entail data collection as participants go about their daily lives, periodically self-reporting context labels. To bridge these two settings, we adopt Unsupervised Domain Adaptation (UDA), which transfers knowledge learned from labeled (scripted) to unlabeled (in-the-wild) data samples [5,12]. Figure 2 presents an overview of the topic, its obstacles, and our strategy.

Challenges. For the application of UDA to the lab-to-field generalization of smartphone context recognition, two significant obstacles must be overcome. First, the previously noted data concerns with in-the-wild datasets (the diversity of phone placements, noisy labels, and the variety of smartphone models) must be resolved. Second, it is difficult to build a strategy for knowledge transfer from a scripted dataset to a more realistic, but significantly noisier, unscripted dataset with sparse labels.

Our approach. Recent empirical successes of the triplet loss function in facial recognition [13,14], where images of the same person's face are mapped close together in a learned embedding space, have inspired us. We believe that HCR sensor data can benefit from a similar approach, even though sensor signatures associated with the same context often vary. Our view is also consistent with the findings of Khaertdinov et al., who utilized triplet loss to reduce the impact of subject variability and enhance model generalizability [15].
We present Triple-DARE, a lab-to-field UDA approach that can harness the vast volumes of unlabeled smartphone HCR data collected in the wild, thereby reducing the requirement for human-annotated labels. Triple-DARE uses both handcrafted features and features autonomously extracted from raw sensor data by a CNN. Triple-DARE uses domain alignment and triplet losses to learn domain-invariant embeddings with discriminative capabilities for context prediction, learned from unlabeled samples. Triple-DARE captures domain-invariant features that increase the effectiveness of predicting contexts under unobserved prioceptions.
In addition, to support our DA strategy, we used HCR datasets with coincident scripted and in-the-wild data with equivalent context labels collected in both studies [1]. These coincident datasets and identical context labels guarantee that there is a feature representation of contexts shared between the scripted and unscripted datasets, a crucial requirement for the DA strategy. By using only context labels collected in a scripted study during model development, we demonstrate that our method is applicable to HCR models deployed in realistic environments, employing DA to reduce the impact of potentially noisy labels while maintaining HCR performance on a dataset collected in the wild. Triple-DARE outperforms state-of-the-art baselines by 3.79% and 1.89% in F1-score and accuracy, respectively, and outperforms HCR models without Triple-DARE by 39% and 14.7% in F1-score and accuracy, respectively.
State-of-the-art limitations. There is a paucity of research on laboratory-to-field generalization approaches for HCR. Previously proposed lab-to-field methods include importance re-weighting [9,16] and Positive Unlabeled (PU) classifiers [1]. DA has been used in the past to solve the problem of variable on-body locations of wearable sensors [5,17] but not for HCR. The majority of prior DA work for wearable sensors focuses on decreasing the global distribution gap across domains while learning common feature representations [5,17]. However, we observe that even if the global distribution is effectively aligned, samples from different domains with the same label may be mapped such that they are far apart in feature space. Thus, in addition to using a domain alignment loss [18,19], Triple-DARE improves intra-class compactness and inter-class separability by utilizing a joint fusion triplet loss [12,13] intended for multi-labeled datasets. Moreover, unlike other existing methods for dealing with domain shifts [1,9,17,20], we do not utilize target labels in the target (in-the-wild) dataset, instead following the UDA problem setting outlined by Chang et al. [5].
Contributions. The main contributions of this paper are:

1. We present Triple-DARE, a novel UDA deep-learning architecture that employs a scripted dataset to increase the HCR accuracy of predicting contexts in the wild. Triple-DARE employs a domain alignment loss for domain-invariant feature learning, a classification loss to preserve task-discriminative features, and a joint fusion triplet loss to improve intra-class compactness and inter-class separation;
2. We rigorously evaluated Triple-DARE against several state-of-the-art unsupervised domain adaptation approaches, including DAN [18], CORAL [19], and HDCNN [17], benchmarking HCR performance gains on target domains in multiple application scenarios. Our ablation study demonstrates that each component of Triple-DARE contributes non-trivially;
3. We illustrate that Triple-DARE mitigates in-the-wild dataset problems compared with state-of-the-art DA algorithms, delivering improved prediction accuracy on the target (in-the-wild) domain without requiring large amounts of source-labeled samples.
The rest of this paper is organized as follows. Section 2 includes the background. Section 3 reviews the related work. Section 4 describes our proposed approach. Section 5 presents our evaluation and findings. Section 6 outlines the limitations of our work and plans for future work. Section 7 finally concludes the paper.

Covariate Shifts
The term "covariate shift" was first introduced by Shimodaira [21] and describes changes in the distribution of the input x. While other types of shift exist [10], covariate shift is the most researched. Covariate shift occurs when data are generated by a model P(y|x)P(x) whose distribution P(x) varies between the training and test situations. While there is some ambiguity in the definitions of covariate shift in the literature, we found the definition provided by Moreno-Torres et al. [10] to be the most relevant, given by the following conditions:

P_training(y|x) = P_testing(y|x) and P_training(x) ≠ P_testing(x),

where P_training(x) and P_testing(x) represent the training and testing input distributions, respectively. Collecting smartphone sensor data in the wild often results in naturally occurring variations in the data. When trying to leverage models trained on scripted data to improve performance on an in-the-wild dataset with similar context labels, we encounter a data shift problem known as covariate shift, where the distribution of features differs between training and test scenarios. Specifically, the covariate shift problem is caused by substantial differences between the distributions of features extracted from scripted and in-the-wild datasets [9-11]. More broadly, because real-world applications must face some type of dataset shift, addressing the covariate shift problem is critical for the successful deployment of machine learning models in the wild [10].
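As a minimal, self-contained illustration of covariate shift (the distributions and the labeling rule below are hypothetical, not drawn from our datasets), the sketch draws training and test inputs from different distributions while keeping the conditional P(y|x) fixed:

```python
import random

random.seed(0)

# Same labeling rule P(y|x) in both domains: y = 1 iff x > 1.0.
label = lambda x: int(x > 1.0)

# Training (scripted) inputs from one distribution,
# test (in-the-wild) inputs from a shifted one: covariate shift.
x_train = [random.gauss(0.0, 1.0) for _ in range(5000)]
x_test = [random.gauss(1.5, 1.0) for _ in range(5000)]

mean_train = sum(x_train) / len(x_train)   # near 0.0
mean_test = sum(x_test) / len(x_test)      # near 1.5

# P(x) differs markedly across domains, but P(y|x) is identical
# by construction -- exactly the Moreno-Torres et al. conditions.
shift = abs(mean_test - mean_train)
```

A model fit only on `x_train` would see region x > 1 rarely, even though the test domain concentrates there; this is the mismatch that importance re-weighting and DA methods try to correct.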

Sensor Data Collection Studies
Inaccurate labeling and unrealistic user behavior are two common problems with context datasets. There are two types of research designs used to gather HCR datasets: scripted [2] or in-the-wild [3]. Scripted studies are usually conducted in a laboratory setting. Participants follow a scripted series of steps to complete a series of tasks in a predetermined order, while an app on their smartphones continuously logs data from the devices' sensors. Human proctors annotate the sensor data with corresponding context labels. In unscripted (or "in-the-wild") studies, data is gathered over days while people live their lives in the real world. A smartphone app continually gathers sensor data as individuals go about their daily lives, and subjects periodically annotate their data with labels for the contexts they have visited. While the scripted technique for HCR data collection produces exceptionally precise and consistent labels suited for supervised machine or deep learning, the contexts visited and the sensor data acquired in each context are not reflective of the real world. HCR research conducted in the wild yields more realistic data. However, certain context labels may be missing since people forget to label when their lives get busy, and some labels may be incorrect due to human labeling errors [8].

DARPA WASH Project: Motivation Use Case
The Warfighter Analytics utilizing Smartphone for Healthcare (WASH) program, a DARPA-funded project, investigates passive smartphone evaluation of traumatic brain injury and infectious disease [22]. This will offer an up-to-date evaluation of the warfighter's battle readiness. Initially, the target groups are active-duty military personnel and veterans, but the findings will also apply to civilians. In the intended use case, the WASH smartphone application will passively collect sensor data throughout each day; each day's data is then sent overnight to the cloud, where disease inference algorithms analyze it to produce a bioscore (i.e., a probability of illness) for each warfighter.
Program phases: The WASH program is separated into two phases. Phase one involves identifying particular smartphone user scenarios for conducting targeted health evaluations. Phase two entails the development of actual TBI and infectious illness assessment systems for smartphone users. In phase one, we conducted research and compiled a list of smartphone biomarkers indicative of TBI and infectious disorders, as well as their accompanying contexts. Our team performed user studies to acquire labeled data for these contexts and developed HCR models to infer these smartphone contexts from labeled sensor data. Table 1 details the intended disease-specific tests or biomarkers relating to each of these contexts. Specialists in traumatic brain injury (TBI) and infectious diseases at the University of Massachusetts Medical School (UMMS) were consulted while compiling our list of illness tests and contexts. As an example, trembling hands are a symptom of TBI: in phase one, our team performs user research and develops deep learning models to recognize smartphone users holding their devices; in phase two, we will analyze whether the user's hand is shaking. This study is limited to context recognition; actual context-specific disease assessment research is not covered.

Our Coincident Data Gathering Study Approach
Using an innovative coincident study design, we conducted scripted and in-the-wild data collection studies to collect labeled data in the same contexts shown in Table 2. This coincident design enables machine learning techniques that combine the precision of scripted labels with the natural context visit patterns of studies conducted in the wild. Our in-the-wild study followed a methodology similar to the Extrasensory study. The smartphone app constantly collected sensor data from 103 participants' smartphones as they went about their daily lives, and the users were periodically prompted to self-report context labels. Our scripted study was conducted in a specific laboratory, campus building, or route. The smartphone app systematically collected data from 100 participants who visited the contexts listed in Table 2. The scripted data collection session lasted approximately one hour per subject, and human proctors monitored and annotated the data manually.

Table 2. Contexts for which data was obtained as part of our WASH study, split into 25 binary labels.


Weakly Supervised Learning (WSL)
In supervised learning tasks, predictive models (commonly classification and regression models) are trained on annotated training examples. A training example consists of an input feature vector (also known as an instance) and an associated label (or ground truth). Due to the high costs associated with gathering labeled data, it is difficult to gather adequate labels of sufficient quality for fully supervised learning in many real-world scenarios, such as our study of HCR using data collected in the wild. Various types of weak (or inaccurate) labels can occur in such practical scenarios, including several encountered in our mobile HCR scenarios, requiring innovative learning methods. According to a recent survey by Zhou et al. [7], weakly supervised learning can be categorized into three types:

1. Inexact supervision, in which only coarse-grained labels are provided. Due to the nature of the sensor data annotation process, only a few selected sub-segments of each training sensor segment can be considered accurate representatives of their respective labels; however, their precise length, as well as their position within the segment, is unknown;
2. Inaccurate supervision, in which data labels are not always correct. For example, in-the-wild datasets often depend on self-reported labels, and users may provide wrong labels because they do not accurately recall which contexts they previously visited;
3. Incomplete supervision, which makes use of unlabeled training data. When study participants get busy with their lives, they may forget to label their data, so some context labels are missing from the dataset.
For these various forms of weak labeling, innovative learning methods that are trained under weak supervision are desired [7].

Related Work
Lab-to-field generalization. Our lab-to-field method leverages a scripted dataset containing high-quality ground-truth labels, which are relatively cheap to obtain, to improve HCR model performance on an in-the-wild dataset [9]. The ability of deep neural networks to generalize to real-world scenarios, where domain shift is expected, is a critical challenge in smartphone HCR developed for in-the-wild data [1,23]. Importance re-weighting [9,16] and Positive Unlabeled (PU) classifiers [1] are two methods previously presented to deal with covariate shifts. The transferability of HCR findings from the laboratory to the real world has received little attention; one related study employed importance re-weighting to adapt a linear logistic regression model for data from wearable electrocardiograms (ECGs). When applied to deep neural networks, however, these techniques have a diminished impact on performance [24]. Unlike other existing methods for dealing with domain shifts [1,9,17,20], our approach does not require target domain labels.
Domain Adaptation (DA). Prior research has demonstrated substantial progress in adapting deep neural networks to various related domains [11]. Recent deep DA methods are either discrepancy-based approaches that minimize a discrepancy metric over feature distributions [18,19], or adversarial-based approaches [25] that aim to maximize domain confusion. The Deep Adaptation Network (DAN) [18] minimized the mean distance between two feature distributions in a Reproducing Kernel Hilbert Space (RKHS), effectively matching higher-order statistics of the two distributions. On the other hand, the deep Correlation Alignment (CORAL) [19] technique proposed matching the mean and covariance of two distributions. Other strategies have used an adversarial loss to maximize domain confusion [25]. The domain alignment loss, one component we utilized in Triple-DARE, is based on DAN.
DA for wearable sensor data. In ubiquitous computing, several DA techniques have been developed to transfer a trained model to a new dataset with similar characteristics [5,17,26,27]. Previous work has shown that DA can learn domain-invariant accelerometer [5,17] and gyroscope [5] features from sensor data without supervision by minimizing a discrepancy distance in the Convolutional Neural Network (CNN) embedding, thereby mitigating the effects of variability in wearable sensor placement. HDCNN [17] investigated whether a model pre-trained on smartphone data could be used with unlabeled smartwatch data; the researchers used Kullback-Leibler (KL) divergence and a discrepancy-based technique to transfer the trained model from smartphones to the unlabeled smartwatch data. Stratified Transfer Learning (STL) [26] is a DA method for adapting on-body sensor-based activity recognition tasks to various sensor placements (wrist, chest, leg, etc.). It also maps source and target domain data into the same subspace where distances can be computed, exploiting the intra-affinity of classes to transform intra-class knowledge. UDA methods based on Variational Auto Encoders have been used to adapt models to other datasets, and have been applied to binary sensors for smart-home applications [27]. DA was also used to adapt models to subject variability [20], using multi-domain adaptation to address target label shift by incorporating the target domain label distribution in the training process.
The majority of existing work solely focuses on domain-general feature representation learning with the goal of decreasing the global distribution disparity [5,17]. While STL proposed a way to perform intra-class transfer by minimizing the discrepancy between feature distributions of instances of the same class, this approach does not scale well to large-scale datasets, especially datasets with a large number of class labels. By employing a joint fusion triplet loss, our study expands upon previous efforts to enhance intra-class compactness and inter-class separability [12,13]. A summary of the related work is included in Table 3.

Problem Formulation
Our study makes use of two datasets annotated with the same context labels as those in Table 2: (1) a scripted dataset (source) with high-fidelity labels, and (2) an in-the-wild dataset (target) with identically labeled data. These two datasets were acquired using a coincident data collection study design, where data was gathered for the same context labels in separate scripted and in-the-wild settings. With respect to UDA, there are labeled samples in the source domain and unlabeled samples in the target domain, with distinct data distributions. Our objective is to use the labeled source data and unlabeled target data to train a classifier that performs effectively on the target domain. More formally, we have labeled source samples D_s = {(x_i^s, y_i^s)}_{i=1}^{n_s} and unlabeled target samples D_t = {x_j^t}_{j=1}^{n_t}, with n_s and n_t standing for the number of samples in the source and target domains, respectively. The feature space and label space are identical across the source and target domains (X_s = X_t and Y_s = Y_t), but the marginal distributions differ (P_s(x_s) ≠ P_t(x_t)). The conditional distributions are assumed to be equal in the two domains: P_s(y_s|x_s) = P_t(y_t|x_t).
We refer to x as the feature vector and y as a multi-label output vector representing the human context, where each label is a binary output (e.g., walking vs. not walking). We assume the source and target tasks are identical. We first use the labeled source dataset to train the HCR model. Once trained, the HCR model can be used to predict contexts for the unlabeled target dataset, with the unlabeled target data integrated into the training set.

Overview
Figure 3 depicts the framework of Triple-DARE. Triple-DARE extracts two distinct kinds of features from the source and target datasets: (1) handcrafted features based on temporal and spectral information, processed by a feed-forward neural network, and (2) raw data from three-axis sensors, fed into a convolutional neural network (CNN) that uses a soft attention mechanism to identify prominent characteristics in the data. Triple-DARE consists of three main parts: (1) a domain alignment loss L_d to generate domain-invariant embeddings; (2) a classification loss L_cls to preserve task-discriminative characteristics; and (3) a joint fusion triplet loss L_tri that improves intra-class compactness and inter-class separation in the learned embedding space by learning that varying sensor inputs can represent comparable contexts. The final result is used for multi-label context predictions. For instance, according to our definition of context as an <Activity, Phone placement> tuple, a context could be "Sitting", "In Bathroom" with "Phone In Hand". Our ultimate aim is to minimize the cost function C(·) in order to perform context predictions by learning discriminative and domain-invariant embeddings:

C(θ) = λ1 L_d + λ2 L_cls + λ3 L_tri,

where θ are the model parameters and λ1, λ2, and λ3 are balancing coefficients. Subsequent sections elaborate on each loss objective. Each loss function is applied to the three feature encodings produced by our deep network, namely the MLP, CNN, and joint fusion encodings.
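As a minimal sketch of this weighted objective (the coefficient values and loss values below are hypothetical placeholders, not the settings used in our experiments):

```python
def triple_dare_objective(l_d, l_cls, l_tri, lam=(1.0, 1.0, 0.5)):
    """Weighted sum C(theta) = lam1*L_d + lam2*L_cls + lam3*L_tri.

    The balancing coefficients lam1..lam3 trade off domain alignment,
    classification, and triplet terms; the defaults here are illustrative.
    """
    lam1, lam2, lam3 = lam
    return lam1 * l_d + lam2 * l_cls + lam3 * l_tri

# Example with hypothetical per-batch loss values:
total = triple_dare_objective(l_d=0.8, l_cls=0.3, l_tri=0.4)
```

In training, each of the three terms would be computed on the MLP, CNN, and joint fusion encodings before being combined this way.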

Feature Generation
From the raw sensory inputs of a given smartphone context dataset, we generate two views. The first is a vector of handcrafted features computed over all accessible sensors. The second view comprises only the raw tri-axial sensors. We use a distinct feature encoder for each input view: (1) handcrafted feature encoding using a Multi-Layer Perceptron (MLP), adopted from Ref. [28], and (2) an attention-based CNN encoder [4] for raw sensor data. The two resulting feature encodings are then concatenated to yield a joint fusion encoding.
We use data from five sensors: accelerometer, gyroscope, GPS, magnetometer, and phone status (discrete attributes such as whether the phone screen is locked or unlocked). At the sliding-window level (10-s windows in this application), we compute statistical, time-based, and frequency-based features for each sensor modality. Then, we apply Z-score normalization, z_i = (x_i − μ)/σ, subtracting the mean and dividing by the standard deviation. Handcrafted features, including 188 features borrowed from Ref. [3], are used to build a vector that is then fed into a feed-forward network. Table 4 lists some of the handcrafted features incorporated in our work.

Table 4. A small selection of the handcrafted features applied to accelerometer, gyroscope, and magnetometer data that we use, taken from Refs. [3,29].
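A minimal sketch of this preprocessing step for a single-channel signal (the window size and step below are illustrative indices, not the 10-s windows used in our pipeline):

```python
import math

def windows(signal, size, step):
    """Split a 1-D signal into fixed-size sliding windows."""
    return [signal[i:i + size] for i in range(0, len(signal) - size + 1, step)]

def zscore(window):
    """Z-score normalize one window: subtract the mean, divide by the std."""
    mean = sum(window) / len(window)
    var = sum((v - mean) ** 2 for v in window) / len(window)
    std = math.sqrt(var) or 1.0  # guard against constant windows
    return [(v - mean) / std for v in window]

sig = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
w = windows(sig, size=4, step=2)  # two overlapping windows
z = zscore(w[0])                  # zero mean, unit variance
```

Per-window statistics such as mean, variance, and spectral energy would then be computed on each normalized window to form the handcrafted feature vector.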

Feature Formulation
Tri-axial sensor features. The CNN automatically learns features from the raw data of three tri-axial sensors (accelerometer, gyroscope, and magnetometer). The CNN we utilized, adapted from DeepContext [4], has a soft attention mechanism that aids the learning of prominent features by assigning greater priority to those parts of the raw sensor data that are more indicative of the user's context. The intuition behind this attention mechanism is similar to that proposed in Refs. [4,30]. The effectiveness of this architecture comes from applying attention layers both to features generated by single-sensor CNNs and to features generated by CNNs that assess the combined sensor outputs. This enables the model to emphasize CNN features that are context-specific. For more details about the DeepContext CNN architecture, we refer the reader to Ref. [4].

Domain Alignment Loss
The objective of the domain alignment loss is to transform the source and target feature encodings into a common feature distribution space in order to discover feature representations that are shared across domains. Gretton et al. [31] presented Multiple Kernel Maximum Mean Discrepancy (MK-MMD) as an improvement over Maximum Mean Discrepancy (MMD), which we employ in our method. MMD is a non-parametric distance metric that can be employed to evaluate the disparity between marginal distributions [18]. MMD maps the feature representations of the source and target domains (X_s and X_t) to a Reproducing Kernel Hilbert Space (RKHS) and then computes the distance between the means of the two distributions in the RKHS. MK-MMD has been proposed as an optimal kernel selection approach for MMD because it can find an ideal kernel formed by a weighted combination of various kernels based on the source and target datasets [18]. Let φ(·) be the feature map associated with a kernel k defined as a convex combination of G positive-definite kernels k_u with associated weights β_u ≥ 0:

k = Σ_{u=1}^{G} β_u k_u, with Σ_{u=1}^{G} β_u = 1,

where x_s and x_t represent feature embeddings for the source and target domains, respectively. The formulation of MK-MMD is thus:

d²(X_s, X_t) = || E[φ(x_s)] − E[φ(x_t)] ||²_{H_k},   (5)

where ||·||_{H_k} is the RKHS norm. The domain alignment loss is obtained by summing MK-MMD over the network layers:

L_d = Σ_{l=1}^{N_l} d²(X_s^l, X_t^l),

where N_l indicates the number of layers and (X_s^l, X_t^l) denote the source and target domain distributions retrieved from the l-th layer of the network. d²(X_s^l, X_t^l) is the MK-MMD calculated by Equation (5) between the source and target domain distributions evaluated on the l-th layer embeddings. Intuitively, the domain alignment loss is a regularizer that minimizes the distance between the distributions generating source domain data and target domain data.
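A toy, pure-Python sketch of the multi-kernel MMD idea with Gaussian kernels on scalar embeddings (the bandwidths and kernel weights are hypothetical, and this biased estimator omits the optimal kernel-weight selection of Ref. [31]):

```python
import math

def gaussian_kernel(a, b, bandwidth):
    """Gaussian RBF kernel between two scalars."""
    return math.exp(-((a - b) ** 2) / (2 * bandwidth ** 2))

def mk_mmd(xs, xt, bandwidths=(0.5, 1.0, 2.0), betas=(1 / 3, 1 / 3, 1 / 3)):
    """Biased MMD^2 estimate under a convex combination of Gaussian kernels."""
    def k(a, b):
        return sum(beta * gaussian_kernel(a, b, bw)
                   for beta, bw in zip(betas, bandwidths))

    def mean_k(u, v):
        return sum(k(a, b) for a in u for b in v) / (len(u) * len(v))

    # ||E[phi(x_s)] - E[phi(x_t)]||^2 expanded via the kernel trick.
    return mean_k(xs, xs) + mean_k(xt, xt) - 2 * mean_k(xs, xt)

# Identical samples give zero discrepancy; shifted samples do not.
same = mk_mmd([0.1, 0.2, 0.3], [0.1, 0.2, 0.3])
shifted = mk_mmd([0.1, 0.2, 0.3], [5.1, 5.2, 5.3])
```

Minimizing such a term over per-layer embeddings is what pulls the source and target feature distributions together during training.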

Classification Loss
The objective of the classification loss is to use source domain labels to learn discriminative features for context predictions. Both domains utilize the same context labels for classification. The overall learning process is guided by optimizing our model for context classification on the source domain. Given the availability of labels in D_s (the labeled source domain data), the classification loss is defined as:

L_cls = (1 / N_s) Σ_{i=1}^{N_s} Ψ( f_φ(x_i^s), y_i^s ),

where f_φ(·) denotes the classifier, N_s represents the number of labeled training samples, Ψ is a binary cross-entropy function with inverse-class-frequency weighting that corrects for class imbalance, and (x_i^s, y_i^s) represents labeled context data sampled from the source domain.
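A minimal sketch of a class-weighted binary cross-entropy of the kind described (the positive-class weight and predicted probabilities below are hypothetical; a weight near the inverse positive-class frequency would penalize missed rare positives more heavily):

```python
import math

def weighted_bce(y_true, y_pred, pos_weight, eps=1e-7):
    """Binary cross-entropy with a positive-class weight
    (e.g., inverse class frequency) to counter label imbalance."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # clamp to avoid log(0)
        total += -(pos_weight * y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# One rare positive among three samples; weight ~9 mimics ~10% positives.
loss = weighted_bce([1, 0, 0], [0.2, 0.1, 0.1], pos_weight=9.0)
```

With `pos_weight=1.0` the same predictions incur a much smaller loss, illustrating how the weighting keeps rare contexts from being ignored.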

Triplet Loss
In an embedding space, the triplet loss is primarily utilized to pull samples from the same or related classes together and to push samples associated with different classes apart. It saw empirical success in face recognition, where different images of the same person map very close together in the learned embedding space [13,14]. Since numerous variations in the sensor inputs can represent the same context, we believe the same approach can be applied to sensor data.
Given three samples, an anchor sample x_a (also called a query sample), a positive sample x_p (one that belongs to the same class as the anchor), and a negative sample x_n (i.e., a sample with a different class from the anchor), and a distance function d, we define the triplet loss as follows:

L_tri = Σ_{i=1}^{N} max( d(x̄_i^a, x̄_i^p) − d(x̄_i^a, x̄_i^n) + α, 0 ),

where α represents the margin between positive and negative samples and x̄ represents the embedding of x for ease of notation. We reduce the triplet loss by pushing d(x̄_i^a, x̄_i^p) towards zero and making d(x̄_i^a, x̄_i^n) greater than d(x̄_i^a, x̄_i^p) + α. In other words, pairs of positive samples are grouped together, while positive and negative sample pairs are separated by the margin α. To put this in perspective, we want the network to learn a feature space in which the squared distance between all feature embeddings of the same context is small, while the squared distance between sensor contexts associated with different labels is large.
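The per-triplet term above can be sketched as follows (squared Euclidean distance and an illustrative margin are assumed; the embeddings are hypothetical 2-D points):

```python
def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge on the squared-distance gap: max(d(a,p) - d(a,n) + margin, 0)."""
    d = lambda u, v: sum((a - b) ** 2 for a, b in zip(u, v))
    return max(d(anchor, positive) - d(anchor, negative) + margin, 0.0)

# Positive already closer than negative by more than the margin: zero loss.
easy = triplet_loss([0.0, 0.0], [0.1, 0.0], [2.0, 0.0])
# Negative closer than the positive: a positive loss pushes them apart.
hard = triplet_loss([0.0, 0.0], [1.0, 0.0], [0.1, 0.0])
```

Only "hard" triplets like the second one produce gradient, which is why the mining strategy in the next section focuses on them.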

Joint-Fusion Triplet Mining
The process of constructing triplets (anchor, positive, and negative) for triplet loss calculations is known as triplet mining. The two main strategies for selecting triplets are offline and online. Finding triplets offline is not recommended as it requires a complete pass over the training set [14]. In accordance with the method described in Ref. [13], we employ an online triplet mining strategy that does not require a prior pass on the training set. Because discovering triplets across two domains necessitates the presence of target domain labels, one of the most prevalent solutions for UDA problems is to use the classifier trained on the source domain to generate pseudo labels for samples of the target domain during training [12]. During this procedure, it is vital to remember that the pseudo labels generated may not be accurate. Nonetheless, we reassign pseudo labels every few iterations since the accuracy of the classifier on the target dataset improves continuously throughout training. In addition, domain alignment loss can help improve the accuracy of the classifier on the target dataset by reducing distribution disparity. Consequently, the quality of the pseudo label can improve automatically.
Our joint-fusion triplet mining technique operates as follows: after concatenating two mini-batches of samples from the source and target domains into one mini-batch, triplets are generated. We need a notion of similarity between multi-label vectors in order to construct triplets that are compatible with our multi-label setting. First, we define a compatibility score between two binary-labeled contexts y_1, y_2 as the dot product between them:

c(y_1, y_2) = y_1 · y_2.   (9)

Due to the imbalanced nature of our dataset, we consider all positive examples when constructing triplets. We use a strategy similar to Ref. [13] that focuses on the triplets that contribute the most to the learning process, but we modify it by using our compatibility score to select triplets that meet the following condition:

d(x̄_a, x̄_p) + α > d(x̄_a, x̄_n) and c(y_a, y_p) > c(y_a, y_n).   (10)

Our triplet mining strategy is detailed in Algorithm 1.
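A brute-force sketch of this mining rule on a toy mini-batch (the embeddings, label vectors, and margin are hypothetical; a real implementation mines online within each concatenated mini-batch rather than enumerating all index triples):

```python
def compatibility(y1, y2):
    """Dot product between binary multi-label vectors, as in Equation (9)."""
    return sum(a * b for a, b in zip(y1, y2))

def mine_triplets(batch, margin=0.2):
    """Keep (a, p, n) index triples satisfying condition (10):
    d(a,p) + margin > d(a,n)  and  c(y_a, y_p) > c(y_a, y_n)."""
    d = lambda u, v: sum((x - y) ** 2 for x, y in zip(u, v))
    triplets = []
    for a, (xa, ya) in enumerate(batch):
        for p, (xp, yp) in enumerate(batch):
            for n, (xn, yn) in enumerate(batch):
                if len({a, p, n}) < 3:
                    continue  # indices must be distinct
                if (d(xa, xp) + margin > d(xa, xn)
                        and compatibility(ya, yp) > compatibility(ya, yn)):
                    triplets.append((a, p, n))
    return triplets

# Tiny batch of (embedding, binary multi-label) pairs (hypothetical values).
batch = [([0.0, 0.0], [1, 1, 0]),   # anchor-like sample
         ([0.2, 0.0], [1, 1, 0]),   # same labels, nearby embedding
         ([0.3, 0.0], [0, 0, 1])]   # different labels, also nearby
trips = mine_triplets(batch)
```

On this batch, mining keeps exactly the triplets whose positive shares labels with the anchor while the label-disjoint sample sits within the margin, i.e., the informative ones.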

Experiments
We compared Triple-DARE and baseline models on both scripted and in-the-wild smartphone HCR datasets, where we performed multiple UDA use cases. Overall, Triple-DARE was used to obtain a robust representation from the scripted dataset (source), which was then applied to enhance HCR on the in-the-wild dataset (target).

Datasets
In-the-wild dataset: A total of 103 participants downloaded a smartphone app that passively collected sensor data for 2 weeks as they went about their daily lives. Participants were periodically prompted to self-report their current context labels. In addition to being more realistic, our in-the-wild dataset reflected a variety of manufacturer hardware because it was collected on the participants' own smartphones.
Scripted dataset: The smartphone application collected data from 100 participants who visited predetermined locations. During each data collection session, which lasted about an hour per subject, human proctors supervised and manually annotated the data. Both the scripted and in-the-wild datasets were preprocessed and featurized in the same way: the contexts were treated as multi-label vectors, and segments were created using a 10-s window. The scripted and in-the-wild datasets contain 21,846 and 631,026 samples, respectively. Table 5 lists the context labels used throughout this study. To increase the applicability of our model to unseen subjects, subject-wise cross-validation was used, wherein a given subject's data was included in either the training set or the test set, but never both. Each UDA experiment utilized 90% of the source domain data for training, 10% of the source domain data for validation, and 100% of the target domain data for testing. Figure 4 displays data extracted from the two datasets, showing only the accelerometer readings for three context examples.
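A subject-wise split of this kind can be sketched as follows; the helper name, the seeding, and the rounding of the subject count are illustrative assumptions rather than details from the paper.

```python
import numpy as np

def subject_wise_split(subject_ids, train_frac=0.9, seed=0):
    """Split sample indices so that each subject's data falls entirely in
    either the training set or the held-out set, never both."""
    rng = np.random.default_rng(seed)
    subjects = np.unique(subject_ids)
    rng.shuffle(subjects)
    n_train = int(round(train_frac * len(subjects)))
    train_subj = set(subjects[:n_train].tolist())
    train_idx = [i for i, s in enumerate(subject_ids) if s in train_subj]
    heldout_idx = [i for i, s in enumerate(subject_ids) if s not in train_subj]
    return train_idx, heldout_idx
```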

Baselines
We compared Triple-DARE with state-of-the-art deep-learning-based DA models: (1) CORAL [19]: a UDA model that aligns the second-order statistics of the source and target feature distributions using the deep-CORAL discrepancy loss; and (2) DAN: a UDA model that aligns the domains by minimizing the multiple-kernel maximum mean discrepancy (MK-MMD) between their feature distributions.

Implementation and Experimental Settings
(1) Hyper-parameters: Grid search was used to optimize the hyper-parameters of the MLP and CNN. The learning rate is initialized at 1 × 10^−1, and the balancing coefficients are initialized as λ_1 = 1, λ_2 = 0, and λ_3 = 0. Following the schedule outlined in Ref. [25], the balancing coefficients and the learning rate are adjusted during training, making our model more confident in source labels and less sensitive to low-quality pseudo labels during the early phases of training.
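The exact schedule from Ref. [25] is not reproduced in this excerpt, so the sketch below shows one commonly used form of such a schedule: a sigmoid ramp for coefficients weighting pseudo-label-dependent losses, and an annealed learning rate. All constants and function names here are assumptions.

```python
import math

def ramp_coefficient(progress, gamma=10.0):
    """Smoothly ramp a balancing coefficient from 0 toward 1 as training
    progresses (progress in [0, 1]), so that losses depending on pseudo
    labels gain influence only after the classifier has stabilized."""
    return 2.0 / (1.0 + math.exp(-gamma * progress)) - 1.0

def decayed_lr(lr0, progress, alpha=10.0, beta=0.75):
    """Anneal the learning rate from its initial value as training progresses."""
    return lr0 / (1.0 + alpha * progress) ** beta
```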
The batch size is set to 256, and the Adam optimizer was used. All trials use the same backbone layers as our DA technique: a two-layer MLP over handcrafted features with 16 hidden dimensions; a single-layer MLP domain classifier with 32 hidden dimensions; and a convolutional neural network with attention blocks for individual and combined sensor layers, followed by an average pooling layer, adopted from Ref. [4]. Each sensor's data is fed into a three-layer CNN; the outputs are then concatenated and passed to another three-layer CNN. Attention blocks are used to concentrate on salient regions of the inputs [4,30]. In triplet mining, pairwise distances are computed using the Euclidean distance, and α is set to 0.1. The final context prediction layer applies a LeakyReLU activation followed by a Sigmoid activation.
(2) Evaluation Protocol: In addition to reporting classification accuracy, we used the F1 metric to evaluate HCR performance in the UDA setting due to the class imbalance in our context datasets. Because the source and target domain datasets may differ in size, random sampling is used to iterate through the target domain dataset during training. At evaluation time, however, we evaluate our model on every sample in the target domain dataset.
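Micro-averaged F1 pools true positives, false positives, and false negatives across all labels before computing precision and recall, which makes it a reasonable summary under class imbalance. A minimal numpy sketch (the function name is ours):

```python
import numpy as np

def micro_f1(y_true, y_pred):
    """Micro-averaged F1 for binary multi-label predictions: counts are
    pooled over all labels before computing precision and recall."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0
```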

Results and Findings
(1) Notations: First, we define the notations used in our experimental results. S_Prioception denotes the scripted context dataset and W_Prioception the in-the-wild dataset; e.g., S_Bag refers to scripted contexts annotated with "Phone In Bag".
(2) Overall Results: In Table 6, we compare the overall performance scores of our Triple-DARE algorithm to those of the baseline models. Across the overall UDA tasks and the lab-to-field UDA tasks, Triple-DARE outperforms the baseline methods with a 4.5% increase in F1-score and a 6.3% increase in classification accuracy. Figure 5 displays the performance per context label across all UDA tasks, demonstrating that our approach outperforms state-of-the-art methods on multiple context labels. In general, UDA methods have an advantage over classifiers trained solely on the source domain without leveraging unlabeled data. In particular, UDA methods helped most with the Jogging, Running, and Going Up and Down Stairs labels, for which users are unlikely to provide labels while performing these activities in the wild. Our method instead makes use of the high-fidelity labels acquired during the scripted study to enhance adaptation. As shown in Table 5, predictions for the labels Sitting and Walking are the most difficult, which may be due to a significant difference in the target label distributions.

(3) Scripted contexts with cross-prioception UDA tasks: Figure 6 demonstrates that Triple-DARE consistently outperforms the baseline methods on all cross-prioception UDA tasks. The UDA tasks with "Phone In Hand" as their target domain benefited the most from the adaptation procedure, likely because of the signal noise introduced when the phone is in motion. In the majority of instances, CORAL performs better than DAN.

(4) Lab-to-field generalization UDA tasks: Figure 7 displays the results of our lab-to-field UDA generalization tasks. The large differences between the scores obtained for "Phone in Pocket" and those for "Phone in Bag" and "Phone in Hand" further illustrate the effect of diverse phone placements.
We hypothesize that when the phone is placed in a bag or held in the hand, the model is unable to map data from the scripted and in-the-wild datasets to a common feature space. However, when adapting models learned on scripted data to make context predictions on in-the-wild data with a "Phone in Pocket" prioception, we observe a notable improvement over state-of-the-art baseline methods.

(5) Training under insufficient labels: As shown in Table 7, we analyzed the performance of our model as a function of the number of labels available in the source domain. In Figure 8, we plot the prediction scores obtained across multiple scripted cross-prioception domains, averaged across the various source domains. The shaded region in this figure represents the variance obtained when utilizing different source domains; small shaded regions indicate that the scores are highly consistent across experiments. We observed a substantial difference when the target is "Phone in Bag" versus "Phone in Hand" or "Phone in Pocket". Table 7 provides a more detailed breakdown of this experiment: Triple-DARE attains superior prediction scores on the target domain using a small number of source labels, outperforming the baseline methods in nearly all UDA tasks.

(6) Intra-class compactness and inter-class separation: To quantify the compactness and separation of the learned feature embeddings, we employed the Silhouette score s_i = (b_i − a_i) / max(a_i, b_i), where a_i is the average distance between point i and all other points in its own cluster, and b_i is the smallest average distance between i and the points of any other cluster. This score accounts for both compactness and separation. To compute Silhouette scores for the learned feature embeddings, we assign each instance one of its binary context labels as a cluster label and calculate the mean score across labels. As shown in Figure 9, our Triple-DARE method achieves higher compactness and separation scores in most UDA tasks, and CORAL achieves higher scores than DAN in the majority of instances. The quality of the learned feature embeddings can also be inspected visually in Figure 10, which depicts the same context instances represented by feature embeddings learned with DAN and with Triple-DARE. The visualization is obtained by projecting the feature embeddings into a two-dimensional space using T-distributed Stochastic Neighbor Embedding (t-SNE) [32].
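The per-sample Silhouette score defined above can be computed directly; the sketch below is a plain numpy version for illustration (scikit-learn's `silhouette_score` offers an optimized equivalent), with the function name and the brute-force distance matrix being our choices.

```python
import numpy as np

def silhouette_scores(X, labels):
    """Per-sample Silhouette score s_i = (b_i - a_i) / max(a_i, b_i), where
    a_i is the mean distance from point i to the other points in its own
    cluster and b_i is the smallest mean distance from i to the points of
    any other cluster. Assumes at least two clusters."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    uniq = np.unique(labels)
    s = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False  # exclude the point itself from its own cluster
        if same.sum() == 0:
            s[i] = 0.0  # singleton cluster: score is conventionally 0
            continue
        a = d[i, same].mean()
        b = min(d[i, labels == c].mean() for c in uniq if c != labels[i])
        s[i] = (b - a) / max(a, b)
    return s
```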
(7) Ablation Study: We conducted an ablation study (shown in Figure 11) to rank the contribution of each Triple-DARE component across a variety of UDA tasks. The best results were obtained when all components were used together. To isolate the relative impact of each component, we employed a non-pretrained HCR model in this ablation. While the triplet loss and the domain alignment loss are each useful on their own, neither provides as much benefit as training all components jointly.

Limitations and Future Work
One limitation of our methodology is the assumption that the same set of sensors is available in both the scripted and in-the-wild datasets. We plan to investigate lab-to-field recognition algorithms that utilize only the small subset of sensors common to both domains. Another way our methodology might be improved is to increase the model's resilience to missing sensors during inference. We also hope that future studies in visual analytics will make use of our proposed representation-learning strategy for smartphone sensor data and of UDA for visualization.

Conclusions
The performance of machine learning HCR models on real-world datasets is hindered by diverse phone placements and smartphone models, as well as by weak, noisy, or missing labels. Lab-to-field methods aim to improve the performance of HCR models by first training them on scripted HCR datasets and then adapting them to predict context labels in comparable datasets collected in the wild. To our knowledge, this is the first work to apply lab-to-field techniques to HCR datasets collected from smartphones. This paper presents Triple-DARE, a UDA deep-learning model for HCR on smartphones, which comprises three components: (1) a domain alignment loss that utilizes MK-MMD; (2) a classification loss; and (3) a joint-fusion triplet loss designed specifically for multi-labeled datasets. Triple-DARE learns domain-invariant features common to both datasets, decreases the influence of noisy in-the-wild data by concentrating on salient areas of the sensor inputs, and achieves a high F1-score on multiple UDA tasks over both scripted and in-the-wild context datasets. With its domain alignment loss, Triple-DARE outperforms state-of-the-art baseline approaches at mapping the source and target feature embeddings into a common feature distribution. In addition, the triplet loss improves discrimination by increasing intra-class compactness and inter-class separation while utilizing large amounts of unlabeled data. Triple-DARE outperforms other state-of-the-art DA baselines, increasing the F1-score and classification accuracy by 4.6% and 1.89%, respectively, and outperforms models with no adaptation by 10.7% and 14.7%.

Funding: This work was supported in part by the Computer Science Department at Worcester Polytechnic Institute and the DARPA WASH program under Grant HR00111780032-WASH-FP-031, and in part by DARPA under Agreement FA8750-18-2-0077. The U.S.
Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of DARPA or the U.S. Government.

Institutional Review Board Statement:
The study was conducted in accordance with the Declaration of Helsinki, and approved by the Institutional Review Board of WPI (WPI IRB File 18-0232 "Warfighter Analytics using Smartphones for Health (WASH)" on 28 February 2018).

Informed Consent Statement:
Informed consent was obtained from all subjects involved in the study.

Data Availability Statement:
The data presented in this study are available on request from the corresponding author. The data are not publicly available due to privacy agreements.

Conflicts of Interest:
The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations
The following abbreviations are used in this manuscript: