Enhancing Human Activity Recognition in Smart Homes with Self-Supervised Learning and Self-Attention

Deep learning models have gained prominence in human activity recognition using ambient sensors, particularly for telemonitoring older adults’ daily activities in real-world scenarios. However, collecting large volumes of annotated sensor data presents a formidable challenge, given the time-consuming and costly nature of traditional manual annotation methods, especially for extensive projects. In response to this challenge, we propose a novel AttCLHAR model rooted in the self-supervised learning framework SimCLR and augmented with a self-attention mechanism. This model is designed for human activity recognition utilizing ambient sensor data, tailored explicitly for scenarios with limited or no annotations. AttCLHAR encompasses unsupervised pre-training and fine-tuning phases, sharing a common encoder module with two convolutional layers and a long short-term memory (LSTM) layer. The output is further connected to a self-attention layer, allowing the model to selectively focus on different input sequence segments. The incorporation of sharpness-aware minimization (SAM) aims to enhance model generalization by penalizing loss sharpness. The pre-training phase focuses on learning representative features from abundant unlabeled data, capturing both spatial and temporal dependencies in the sensor data. It facilitates the extraction of informative features for subsequent fine-tuning tasks. We extensively evaluated the AttCLHAR model using three CASAS smart home datasets (Aruba-1, Aruba-2, and Milan). We compared its performance against the SimCLR framework, SimCLR with SAM, and SimCLR with the self-attention layer. The experimental results demonstrate the superior performance of our approach, especially in semi-supervised and transfer learning scenarios. It outperforms existing models, marking a significant advancement in using self-supervised learning to extract valuable insights from unlabeled ambient sensor data in real-world environments.


Introduction
According to the United Nations World Social Report 2023, the global population aged 65 and older is projected to double over the next three decades, reaching 1.6 billion in 2050 [1]. Many countries have adopted "Aging in Place" policies to help older adults live in their own homes and communities safely, independently, and comfortably, regardless of age, income, or ability level [2][3][4]. Human activity recognition (HAR) based on sensors has made significant contributions to human-centric applications in smart homes for elderly care, encompassing home automation, ambient assisted living (AAL) systems, and remote monitoring [5,6]. Among the three main types of sensors used (cameras, wearable sensors, and ambient sensors), each has its predominant advantages and limitations. While camera-based sensors raise privacy concerns within the home environment, and wearable sensors are considered intrusive, the adoption of ambient sensors is growing [7]. Ambient sensors enable the collection of data streams to monitor older adults' daily activities in real-world environments [8].
In recent decades, deep learning models such as convolutional neural networks (CNNs) and long short-term memory (LSTM) networks have made significant advances in HAR applications [9]. Wang et al. [10] noted that CNN models can extract features without prior knowledge, making them suitable for HAR tasks. LSTM networks, in particular, excel at capturing long-term dependencies and acquiring hierarchical representations from sequential ambient sensor data, a characteristic often observed in HAR [11,12].
However, achieving competitive performance with deep learning models in HAR requires a substantial amount of annotated data. Annotating activity data in real-world smart homes poses a significant challenge, as it is time-consuming and tedious for researchers and experts [13]. For instance, the Soutien à l'Autonomie des Personnes Âgées (Support for the Autonomy of Older Adults, SAPA) project [14], in collaboration with the Domotique et Informatique Mobile (DOMUS) laboratory, the Centre de Recherche de l'Institut Universitaire de Gériatrie de Montréal (CRIUGM), and the Laboratoire d'Intelligence Ambiante pour la Reconnaissance d'Activités (LIARA), has installed thirty-eight smart home frameworks in private residences in collaboration with three Canadian universities since February 2019. The project aims to support cognitive assistance and telemonitoring of frail elderly individuals with cognitive deficits using Internet of Things (IoT) technologies deployed in senior living communities [15]. Over four years, the project has generated massive amounts of ambient sensor data. However, the lack of ground truth data hinders the effectiveness of deep learning models.
Recently, self-supervised learning (SSL) has gained popularity in fields such as computer vision (CV), natural language processing (NLP), and speech and audio processing [16]. SSL is attractive because it eliminates the need for large-scale annotated datasets, which are expensive and time-consuming to create [17]. In time-series classification tasks such as HAR, contrastive learning has emerged as a prominent SSL technique. The SimCLR framework [18] has proven to be a powerful transfer-based encoder for learning feature representations from unlabeled sensor data in HAR [19]. Several researchers have proposed novel models based on the SimCLR framework, including SimCLRHAR [20], CSSHAR [19], and ClusterCLHAR [21]. These models have demonstrated impressive accuracy in HAR tasks using sensor data from smartphones and smartwatches. However, SimCLR-based models for HAR using ambient sensor data from real-world smart homes remain relatively uncommon.
Motivated by the success of SSL in HAR tasks with limited labeled data, and by the effectiveness of LSTM models and self-attention in handling time-series sequence data [22,23], we propose a model called AttCLHAR that integrates self-attention, SimCLR, and sharpness-aware minimization (SAM). The model is specifically tailored for ambient sensor data in real-world smart homes. This journal paper is an extended version of our previous work [24], with two main extensions: (1) an enhanced encoder that incorporates self-attention and LSTM to identify the most relevant parts of the input sequence; and (2) the use of SAM in both the linear evaluation and fine-tuning scenarios to improve generalization, inspired by [25]. In summary, the principal contributions of this work are as follows:

• A novel adapted contrastive AttCLHAR framework is proposed, which incorporates two layers of one-dimensional CNN, one LSTM layer, and a self-attention layer as the encoder. This encoder is designed to learn robust representations from time-series data. The proposed framework is evaluated on real-world smart home datasets, demonstrating its effectiveness in capturing and recognizing human activities in this context.

• Extensive experiments were conducted on real-world smart home datasets to evaluate the proposed model's performance in HAR and compare it with three other models: SimCLR [24], SimCLR with SAM, and SimCLR with the self-attention layer (AttSimCLR). The results demonstrate that all the models' representations are highly effective for downstream tasks, and that AttCLHAR performs better, especially in semi-supervised learning.

• A transfer learning experiment shows that the robust features extracted from the Aruba-2 dataset [26] can be successfully applied to the Aruba-1 dataset [26], achieving competitive performance compared to the semi-supervised learning approach applied directly to the Aruba-1 dataset.
The paper is structured as follows: Section 2 presents related work on HAR with ambient sensors, SSL, SimCLR for HAR, and the self-attention mechanism. The methodology of the proposed approach is introduced in Section 3. Comprehensive experiments on CASAS datasets are detailed in Section 4, and the results and discussion are presented in Section 5. Finally, Section 6 provides the conclusion.

Related Work
This section introduces related work on HAR from ambient sensors and SSL, focusing explicitly on the SimCLR framework for HAR and the incorporation of the self-attention mechanism.

Human Activity Recognition from Ambient Sensors
Most older adults wish to age in place, even with incapacities. To support this, unobtrusive ambient sensors have emerged as the preferred solution for monitoring the daily activities of older adults living alone [27]. The rise of remote elderly home monitoring has spurred numerous studies on HAR as a fundamental aspect of AAL [28].
Gochoo et al. [27] proposed a deep convolutional neural network (DCNN) as an unobtrusive activity recognition classifier using anonymous binary sensors, such as passive infrared motion and door sensors. Although the DCNN model effectively extracted features from binary activity images, it overlooked temporal dependencies between activities, such as "Wash dishes" mostly occurring after "Meal preparation". Hybrid models combining LSTM and CNN [29] were employed to address this limitation. These models considered the importance of upcoming short-term sensor data and the continuity from past activities in preceding long-term sensor data. The combined model successfully learned the spatiotemporal dependencies of the fused features. However, instead of highlighting crucial features in the sensor data, the model treated all hidden states as equally important. Fahad et al. [30] performed activity recognition by applying a probabilistic neural network (PNN) to pre-segmented activity data obtained from sensors deployed at various locations in CASAS smart home datasets, including Aruba and Milan [26]. However, the achieved performance was not competitive.
Due to the lack of annotations in real-world smart homes, Gupta et al. [31] proposed kernel density estimation (KDE) for preprocessing data, combined with t-distributed stochastic neighbor embedding and uniform manifold approximation and projection for visualizing changes, thereby discovering patterns in unlabeled sensor data to track changes in user activity. While this model can detect patterns in activities over longer time frames, precise HAR on unlabeled sensor data remains essential for AAL.

Self-Supervised Learning
Self-supervised learning (SSL) is a powerful technique that extracts feature representations by training encoders on massive amounts of unlabeled data. The training process comprises two main stages: a pretext task and downstream tasks, where the knowledge learned from the pretext models is transferred to specific tasks by fine-tuning the features [17].
Self-supervised learning can be categorized into two main approaches: contrastive and non-contrastive. Contrastive learning aims to distinguish between similar and dissimilar data within the input data [32,33]. For instance, contrastive predictive coding (CPC) [34] learns representations by predicting future samples in the latent space, using powerful autoregressive models for generating predictions and employing a probabilistic contrastive loss. Momentum contrast (MoCo) [35] uses a momentum encoder to extract representations from negative samples stored in a memory bank. SimCLR [18] instead uses a large batch of negative samples without a memory bank, improving contrastive learning performance. In contrast, non-contrastive learning models learn from positive examples only, consisting of original and augmented data. Examples of non-contrastive approaches include bootstrap your own latent (BYOL) [36], simple Siamese (SimSiam) [37], Barlow Twins [38], and variance-invariance-covariance regularization (VICReg) [39], which show promising results on unlabeled data without relying on contrastive loss or pairwise comparisons.
Given the recent popularity of SSL in HAR research [19,40], this paper emphasizes SSL approaches for HAR, as they have demonstrated their effectiveness in capturing meaningful representations from unlabeled sensor data for HAR.

SimCLR for Human Activity Recognition
SimCLR [18] is a highly advanced self-supervised contrastive learning model, widely used in HAR research. It offers a straightforward framework for learning visual representations through contrastive learning, achieved by leveraging data augmentation, a nonlinear projection head, and a large batch size, all without the need for a memory bank.
An adapted version of SimCLR [40] was applied to HAR, specifically for recognizing activities such as walking down and up stairs, jogging, sitting, and standing. The results demonstrated competitive performance, especially in healthcare applications, even with limited labeled data. Another study introduced CSSHAR [19], which replaced SimCLR's backbone network with a custom transformer encoder to enhance feature representations extracted from unlabeled sensory data in HAR. ClusterCLHAR [21], following the SimCLR framework, proposed a novel contrastive learning approach for HAR by incorporating negative selection through clustering. The experimental results of ClusterCLHAR show competitive performance in both self-supervised and semi-supervised learning for HAR.
Although SimCLR has been extensively used for HAR based on smartphones or wearable sensors, there is currently a lack of research focusing on applying self-supervised learning to HAR based on ambient sensor data.

Self-Attention Mechanism
The attention mechanism has been widely employed in deep learning models. It was originally introduced by [41] and formed the basis for subsequent developments. Zeng et al. [42] proposed temporal and sensor attention for HAR, focusing on important signals and sensor modalities; this approach helps to interpret recurrent networks and gain insights into model behavior. Raffel and Ellis [43] asserted that attention mechanisms can address certain long-term memory problems. The work by Vaswani et al. [44] demonstrated that Transformers can rely solely on attention mechanisms, eliminating the need for recurrence and convolutions.
Mahmud et al. [45] introduced a self-attention-based neural network model that eschews recurrent architectures, employing different attention mechanisms to generate higher-dimensional feature representations for classification. Other researchers [23,46] applied attention to recurrent convolutional attention networks or LSTM networks for sensor-based human activity recognition. The self-attention mechanism proves effective in capturing important features in ambient sensor sequences.
Wang et al. [47] proposed a novel deep multi-feature extraction framework based on the attention mechanism (DMEFAM). Singh et al. [22] utilized a self-attention mechanism to learn crucial time points for decoding human activity. This paper leverages self-attention in the AttCLHAR model to extract important information from sequential time-series ambient sensor data.

Methodology
This section elaborates on the components of the proposed AttCLHAR framework, which extends the SimCLR approach [18,19,40] for HAR. The framework comprises two primary stages: pre-training and fine-tuning. Figure 1 illustrates the overall architecture of the model.

Pre-Training
The pre-training stage aims to extract representative features from extensive amounts of unlabeled data and typically involves the following key components.
Preprocessing of unlabeled ambient sensor data involves extracting multiple relevant features or performing necessary transformations on raw, unlabeled sensor data. It ensures the data are in an appropriate format for subsequent processing and analysis.
The augmentation module is a critical component of the proposed framework, as it transforms the preprocessed ambient sensor data into two distinct views. Views from the same sample are considered positive pairs, whereas views from different samples within the same batch are considered negative pairs. As emphasized in [18], proper data augmentation is essential for defining effective predictive tasks in self-supervised learning. Given a sample x, the augmentation module T produces two views, x_i and x_j. The details of augmentation selection are discussed in Section 4.2.2.
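The two-view generation described above can be sketched as follows. This is a minimal NumPy illustration of ours, not the authors' code: `scale_transform` (with sigma = 0.1) anticipates the scaling augmentation selected in Section 4.2.2, and all function names are hypothetical.

```python
import numpy as np

def scale_transform(batch, sigma=0.1, rng=None):
    """Multiply each sample by a random per-sample scaling factor around 1."""
    rng = rng if rng is not None else np.random.default_rng()
    factors = rng.normal(1.0, sigma, size=(batch.shape[0], 1, 1))
    return batch * factors

def make_views(batch, transform):
    """Apply a stochastic transform twice to obtain two correlated views.

    Views of the same sample form a positive pair; views of different
    samples within the batch serve as negatives (SimCLR convention).
    """
    return transform(batch), transform(batch)

# A batch of 4 windows, 10 timesteps, 84 features (the Aruba-1 shape).
x = np.ones((4, 10, 84), dtype=np.float32)
x_i, x_j = make_views(x, scale_transform)
```

Because the transform is stochastic, the two views differ while remaining semantically close to the original window.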
A neural network-based encoder f(·) generates embedded representations s_i and s_j in the latent space from the augmented inputs x_i and x_j. Due to the suitability of 1D convolution for HAR using sensor data [48], the encoder comprises two layers of one-dimensional CNN with ReLU activation and dropout after each convolutional layer, one LSTM layer, and an attention mechanism layer. The details of each of these layers and the training are presented in Section 4.2.3. The purpose of the attention mechanism is to focus on essential features rather than the entirety of the LSTM output. L2 regularization is applied to mitigate overfitting. After the LSTM generates the hidden state representations ⟨h_t⟩, an attention layer uses an attention matrix A to capture the similarity of any sequence while considering all neighboring sequences. Here, α_{t,t′} ∈ A represents the attention weight between hidden state representations h_t and h_{t′} at timesteps t and t′, respectively. The utilized attention mechanism is multiplicative attention, defined as [49][50][51]:

e_{t,t′} = h_t^⊤ W_a h_{t′} + b_a,
α_{t,t′} = exp(e_{t,t′}) / Σ_{t′′} exp(e_{t,t′′}),

where e_{t,t′} represents the attention scores, W_a denotes the weight matrix for the combination of hidden states, and b_a is the bias. Subsequently, the attention-focused context vector c_t at timestep t is computed as the sum of the hidden states h_{t′} weighted by their respective attention weights α_{t,t′} to the current hidden state h_t, as outlined in [49]:

c_t = Σ_{t′} α_{t,t′} h_{t′}.

At the end of the encoder, a 1D global max-pooling operation is performed over the timestep dimension to extract the maximum vector.

The projection head is composed of three fully connected neural network layers. It receives latent space representations as input and projects them into a different space, which facilitates the calculation of the contrastive loss used to train the self-supervised learning model. The expression g(·) denotes the neural network projection head that maps the latent representation to another space for the contrastive loss. A multilayer perceptron (MLP) with two hidden layers is used to obtain z = g(s) = W^(2) σ(W^(1) σ(W^(0) s)), where σ is the ReLU activation function and the W^(i) are the weights of the MLP. The projected representations generated by the projection head are subsequently used to calculate the NT-Xent loss.
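Under our reading of this description, the encoder and projection head could be sketched in Keras roughly as follows. Layer sizes follow the hyperparameters reported later (32/64 conv channels, kernel size 3, LSTM 64, projection 256-128-64); realizing multiplicative attention as a `Dense` projection followed by Keras's dot-product `Attention` layer is one plausible implementation, not the authors' exact code.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_encoder(window=10, n_features=84):
    """Sketch of the AttCLHAR encoder: 2x Conv1D -> LSTM -> attention -> max pool."""
    inp = layers.Input((window, n_features))
    h = layers.Conv1D(32, 3, padding="same", activation="relu")(inp)
    h = layers.Dropout(0.1)(h)
    h = layers.Conv1D(64, 3, padding="same", activation="relu")(h)
    h = layers.Dropout(0.1)(h)
    h = layers.LSTM(64, return_sequences=True)(h)   # hidden states <h_t> for all timesteps
    # Multiplicative attention: scores e_{t,t'} from (W_a h_t + b_a) . h_{t'}
    scores = layers.Dense(64, use_bias=True)(h)
    att = layers.Attention()([scores, h])           # softmax over t', weighted sum of h_{t'}
    s = layers.GlobalMaxPooling1D()(att)            # 1D global max pool over timesteps
    return tf.keras.Model(inp, s)

def build_projection_head(encoder):
    """Three fully connected layers (256-128-64); ReLU on the first two."""
    z = layers.Dense(256, activation="relu")(encoder.output)
    z = layers.Dense(128, activation="relu")(z)
    z = layers.Dense(64)(z)
    return tf.keras.Model(encoder.input, z)
```

After pre-training, the projection head is discarded and only the encoder output s is reused downstream.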
A contrastive loss function is employed to maximize the similarity between positive data pairs and minimize the similarity scores for negative data pairs [19]. The NT-Xent loss (normalized temperature-scaled cross-entropy loss), used in previous work [18,20], is defined as follows:

ℓ(i, j) = −log [ exp(sim(z_i, z_j)/τ) / Σ_{k=1}^{2N} 1_{[k≠i]} exp(sim(z_i, z_k)/τ) ]

In this equation, τ is a hyperparameter called the temperature, which scales the similarity scores between positive and negative pairs; sim(i, j) is the cosine similarity function; N is the batch size; and 1_{[k≠i]} is an indicator function whose value is 1 when k is not equal to i. The loss is computed over all positive pairs ℓ(2k − 1, 2k) and ℓ(2k, 2k − 1) in the mini-batch: views 2k − 1 and 2k form a positive pair, and view 2k − 1 combined with any other view forms a negative pair. SAM is then applied to improve model generalization by penalizing loss sharpness. Given a loss function L : W × X × Y → R+ and training loss L_S, SAM seeks parameters w whose entire neighborhood of radius ρ has both low loss and low curvature, by optimizing the min-max objective [25,52]:

min_w max_{∥ϵ∥_p ≤ ρ} L_S(w + ϵ) + λ∥w∥²₂

where ρ ≥ 0 is a hyperparameter, p ∈ [1, ∞], and λ∥w∥²₂ is a standard L2 regularization term with hyperparameter λ. In the maximization over ϵ, the p-norm generalizes slightly from the usual L2-norm [52].
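The NT-Xent computation can be sketched in NumPy as follows (our illustrative sketch; variable names are ours, and rows k of `zi` and `zj` are assumed to be the two views of sample k):

```python
import numpy as np

def nt_xent(zi, zj, tau=0.1):
    """NT-Xent loss over a batch of projected views.

    zi, zj: (N, d) projections of the two augmented views; row k of zi
    and row k of zj form a positive pair. tau is the temperature.
    """
    z = np.concatenate([zi, zj], axis=0)                 # 2N views
    z = z / np.linalg.norm(z, axis=1, keepdims=True)     # unit norm -> dot = cosine sim
    sim = z @ z.T / tau                                  # sim(i, k) / tau
    np.fill_diagonal(sim, -np.inf)                       # indicator 1_[k != i]
    n = zi.shape[0]
    # Index of each row's positive partner: i-th row of zi pairs with i-th row of zj.
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    log_prob = sim[np.arange(2 * n), pos] - np.log(np.exp(sim).sum(axis=1))
    return -log_prob.mean()
```

Matched views yield a much lower loss than mismatched ones, which is exactly the signal the encoder is trained on.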

Fine-Tuning
Fine-tuning is conducted on preprocessed, sparsely labeled ambient sensor data using the pre-trained encoder for feature extraction. During this phase, the previously frozen encoder weights are unfrozen to allow updates, and the projection head used in pre-training is discarded. For the downstream task of activity classification, a two-layer multilayer perceptron (MLP) prediction model is used, with a ReLU activation for the hidden layer, a Softmax activation for multiclass activity classification, and a cross-entropy loss function.
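A minimal Keras sketch of this fine-tuning step, under our assumptions: the hidden width of 64 is an assumption (the paper does not state it), and any pre-trained encoder exposing `.input`/`.output` can be plugged in.

```python
import tensorflow as tf
from tensorflow.keras import layers

def add_classifier(encoder, n_classes=10):
    """Unfreeze the pre-trained encoder, drop the projection head, and
    attach a two-layer MLP classifier (ReLU hidden layer, Softmax output)."""
    encoder.trainable = True                         # weights are unfrozen for fine-tuning
    h = layers.Dense(64, activation="relu")(encoder.output)
    out = layers.Dense(n_classes, activation="softmax")(h)
    model = tf.keras.Model(encoder.input, out)
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

The compiled model is then trained on the small labeled subset with cross-entropy, as described above.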

Experimental Setup
This section provides an overview of the experimental datasets and details the implementation of the experiments conducted to evaluate the proposed approach.

Datasets
The experiment utilized three smart home datasets, Aruba-1, Aruba-2, and Milan, publicly available from the CASAS project [26]. These datasets involve motion sensors, door closure sensors, and temperature sensors, with the floor layouts depicted in Figure 2. The Aruba-1 dataset is an annotated collection of sensor data gathered over approximately seven months in a resident's smart home, shown in Figure 2a. The resident conducted daily activities without supervision or scripted scenarios, with occasional visits from her children. This dataset comprises 6477 instances annotated with 11 activity categories: Meal_Preparation, Relax, Eating, Work, Sleeping, Wash_Dishes, Bed_to_Toilet, Enter_Home, Leave_Home, Housekeeping, and Resperate. The raw sensor data include information such as date, time, sensor ID, sensor state, activity, and activity status, as shown in Table 1. Due to the limited number of instances of the Resperate activity, this activity, along with sensor data lacking activity labels in the Aruba-1 dataset, was excluded from the experiment. In contrast, the Aruba-2 dataset is an unannotated collection of raw ambient sensor data spanning nearly one year. This dataset maintains the same testbed layout and ambient sensor positions as the Aruba-1 dataset, as illustrated in Figure 2a.
The Milan dataset comprises labeled sensor data collected over three months from a single person living with a pet, as depicted in Figure 2b. During this period, the resident's children and grandchildren visited regularly. As outlined in [11], the dataset initially included fifteen distinct activity categories, which were later consolidated into ten activities of daily living (ADLs) for consistent result comparison. The selected ADL categories, commonly employed to assess an individual's functional health in a clinical context, are presented in Table 2. For instance, Dining Rm Activity corresponds to Eating, and Kitchen Activity maps to Meal_Preparation. The ten ADL categories used in the experiment are: Meal_Preparation, Relax, Eating, Work, Sleeping, Bed_to_Toilet, Leave_Home, Medication, Bathing, and Other. The Other category is assigned to ambient sensor data when the resident is not engaged in any specific activity.

Implementation Details
This section provides the implementation details for each component of the proposed model AttCLHAR shown in Figure 1.

Data Preprocessing
The data preprocessing steps were consistent across the three datasets (Aruba-1, Aruba-2, and Milan). To ensure the interpretability of the raw sensor events presented in Table 1, several cleaning procedures were performed, including handling missing values, correcting format errors, and removing duplicate data. The locations of the motion and contact sensors for the Aruba-1, Aruba-2, and Milan datasets are mapped in Table 3. The feature extraction process was based on the AL Activity Learning-Smart Home approach [53], which generates a feature vector for each non-overlapping sensor sequence with a fixed window length of 30. The number of features, which depends on the total number of sensors in the dataset (num_sensors), is presented in Table 4. In particular, the Aruba-1 and Aruba-2 datasets, each with 35 sensors, produced 84 features, while the Milan dataset, equipped with 33 sensors, resulted in 80 features [53].

Feature Number | Feature
1 | The time of the last sensor event in a window (hour)
2 | The time of the last sensor event in a window (seconds)
3 | The day of the week for the last sensor event in a window
4 | The window size in time duration
5 | The time since the last sensor event
6 | The dominant sensor in the previous window
7 | The dominant sensor two windows back
8 | The last sensor event in a window
9 | The last sensor location in a window
10 | The last motion sensor location in a window
11 | The window complexity (entropy calculated from sensor counts)
12 | The change in activity level between the two halves of a window
13 | The number of transitions between areas in a window
14 | The number of distinct sensors in a window
15 ∼ num_sensors+14 | The counts for each sensor
num_sensors+15 ∼ 2×num_sensors+14 | The time since each sensor last fired

The preprocessed data were then divided into fixed windows. Taking the Aruba-1 dataset as an example, the preprocessed sensor data, with a shape of (731106, 84) encompassing 84 features, were partitioned into windowed datasets using a sliding window of (10, 84) with 50% overlap. This process resulted in three-dimensional data with dimensions (samples, sliding window size, features) of (146220, 10, 84). The same processing procedure was applied to the Aruba-2 and Milan datasets. Subsequently, each dataset was split into training, evaluation, and test sets according to the experiment's requirements; the proportions of these splits vary depending on the specific experiment. Finally, the sensor features were standardized to zero mean and unit standard deviation. This normalization ensures that the features are on a consistent scale, preventing differences in feature scales from influencing model performance.
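The windowing and standardization steps can be sketched as follows (a NumPy sketch of ours; function names are illustrative). With a window of 10 and 50% overlap the step is 5, so 731,106 rows yield (731106 − 10)//5 + 1 = 146,220 windows, matching the shape reported above.

```python
import numpy as np

def window_data(features, win=10, overlap=0.5):
    """Slice a (T, F) feature matrix into overlapping (win, F) windows."""
    step = int(win * (1 - overlap))
    starts = range(0, features.shape[0] - win + 1, step)
    return np.stack([features[s:s + win] for s in starts])

def standardize(train, other):
    """Zero-mean / unit-variance scaling per feature, using training statistics."""
    mu = train.mean(axis=(0, 1), keepdims=True)
    sd = train.std(axis=(0, 1), keepdims=True) + 1e-8
    return (train - mu) / sd, (other - mu) / sd
```

Standardizing with statistics from the training split keeps the evaluation and test sets on the same scale without leaking their statistics into training.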

Augmentation Selection
The augmentation module applied five augmentation techniques from [40,54] to the ambient sensor data: random signal inverting, reversing, time warping, random noise, and scaling. Experiments were conducted on the Aruba-1 dataset to identify the most effective augmentation function using the SimCLR framework [24] and AttCLHAR. The performance of the augmented dataset was evaluated in both linear evaluation and fine-tuning scenarios, using the F1 score and accuracy as evaluation metrics. The results are presented in Table 5. The analysis indicates that the scaling augmentation with a scale of 0.1 outperforms the other augmentation methods for both models, SimCLR and AttCLHAR, in both the linear and fine-tuning scenarios.
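The five candidate augmentations can be sketched for a single (T, F) window as follows. This is our illustrative NumPy sketch; the parameter values (except scaling's sigma = 0.1, which the selection experiments favoured) and the particular time-warp construction are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def invert(x):                 # random signal inverting: flip the sign
    return -x

def reverse(x):                # reverse the time axis
    return x[::-1]

def jitter(x, sigma=0.05):     # additive random (Gaussian) noise
    return x + rng.normal(0.0, sigma, x.shape)

def scale(x, sigma=0.1):       # multiply by a random scalar around 1
    return x * rng.normal(1.0, sigma)

def time_warp(x, knot_sigma=0.2):
    """Crude time warping: resample along a randomly perturbed time axis."""
    T = x.shape[0]
    warp = np.cumsum(rng.normal(1.0, knot_sigma, T))       # monotone warp
    warp = (warp - warp[0]) / (warp[-1] - warp[0]) * (T - 1)
    return np.stack([np.interp(np.arange(T), warp, x[:, f])
                     for f in range(x.shape[1])], axis=1)
```

Each transform preserves the window shape, so any of them can be dropped into the two-view generation of the augmentation module unchanged.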

Other Components
Table 6 shows the hyperparameters for the proposed model, AttCLHAR. The input data for the three datasets differed, with dimensions (146220, 10, 84), (400083, 10, 84), and (83325, 10, 80) for Aruba-1, Aruba-2, and Milan, respectively. The one-dimensional CNN used two layers with a fixed kernel size of 3; the numbers of output channels for these layers were 32 and 64, respectively. L2 regularization with a strength of 1 × 10−4 was applied, and a dropout rate of 0.1 was added after each convolutional layer. Following the convolutional layers, the model incorporated a one-layer LSTM with 64 units. The attention layer employed multiplicative attention and a regularizer with a strength of 1 × 10−4. The projection head consisted of three fully connected layers with dimensions of 256, 128, and 64, respectively; the outputs of the first two layers pass through the ReLU activation function. The projected representations generated by the projection head are then used to calculate the NT-Xent loss, with a temperature parameter of 0.1. The model parameters were optimized using stochastic gradient descent (SGD) [55] with cosine decay, as described in [40]. This schedule gradually reduces the learning rate during training, allowing the model to converge effectively. The initial learning rate was set to 0.1, and the decay steps were set to 1000. The pre-training batch size was 512, and the model was trained for 100 epochs. In the linear evaluation setup, the model parameters were optimized using the Adam optimizer with an initial learning rate of 0.001, as in [20]; a batch size of 128 was used, and linear evaluation was performed for 300 epochs. For the fine-tuning phase, the Adam optimizer was also employed with an initial learning rate of 0.001 and a batch size of 128. Both linear evaluation and fine-tuning used cross-entropy as the loss function, and the evaluation metrics were the F1 score and accuracy.
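The optimizer settings above map directly onto standard Keras objects; a sketch under the stated hyperparameters (initial rate 0.1, 1000 decay steps for pre-training; Adam at 0.001 for linear evaluation and fine-tuning):

```python
import tensorflow as tf

# Pre-training: SGD with a cosine-decayed learning rate.
schedule = tf.keras.optimizers.schedules.CosineDecay(
    initial_learning_rate=0.1, decay_steps=1000)
pretrain_opt = tf.keras.optimizers.SGD(learning_rate=schedule)

# Linear evaluation and fine-tuning: Adam with a fixed initial rate.
finetune_opt = tf.keras.optimizers.Adam(learning_rate=0.001)
```

The cosine schedule starts at 0.1 and decays smoothly to zero over the 1000 decay steps.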

Results and Discussion
This section presents the experimental results for the contrastive learning, semi-supervised learning, and transfer learning scenarios. The experiments were executed on an Intel(R) Xeon(R) W-2133 CPU @ 3.60 GHz using the TensorFlow library. The pre-trained model had 109,217 trainable parameters (426.63 KB) for Aruba-1 and Aruba-2, and 108,833 parameters (425.13 KB) for Milan. Taking the Aruba-1 training as an example, pre-training for 100 epochs took approximately two hours, linear evaluation for 300 epochs took around ten minutes, and fine-tuning for 200 epochs required about fifteen minutes.

Contrastive Learning Scenario
This section evaluates the quality of the contrastive learning features in the latent space with a linear evaluation protocol on the Aruba-1 and Milan datasets. The encoder was first pre-trained on the full Aruba-1 and Milan datasets without activity labels. After obtaining the representation vectors from the encoder, the encoder weights were frozen, the projection head was dropped, and a trainable one-layer dense classifier was added for activity classification.
Different proportions of training data were evaluated in the linear evaluation scenario using the Aruba-1 dataset.The training set was divided into proportions of 1%, 5%, 10%, 20%, and 30%, while the remaining data was used for evaluation and testing.The performances of four models, SimCLR [24], SimCLR with SAM, AttSimCLR (SimCLR with attention), and AttCLCHAR, were compared using F1 score and accuracy.The results are presented in Table 7.The Aruba-1 linear evaluation indicates that even with just 1% of the training data, all models achieved F1 scores and accuracy above 80.67% and 80.87%, respectively.When the training data proportion increased to 20% or more, all models surpassed 85% in both F1 score and accuracy except SimCLR with SAM.Although SimCLR performed better with 1% and 5% proportions, AttCLCHAR outperformed as the label proportions increased.SimCLR and AttSimCLR showed similar performance for all the different training proportions.These findings suggest that the proposed model, AttCLCHAR, can learn meaningful feature representations even without the availability of activity labels and has better performance compared to other models.For the evaluation of the Milan dataset, the same model settings were applied as for the Aruba-1 dataset, except for the training set proportions.The training set proportions of the Milan dataset were set to 10%, 20%, 30%, 40%, and 50%.This adjustment was made because linear evaluation performance tended to be lower when the training set constituted a small proportion of the data.The results are shown in Table 8, where the F1 score and accuracy of Milan's linear evaluation for different ratios range between 59.74% and 64.39% for SimCLR and 61.28% and 64.17% for AttCLCHAR.The model AttCLCHAR performs better when the proportions are below 40%, but the model SimCLR is better when the proportion reaches 50%.However, SimCLR with SAM outperforms other models in all the label fractions.These performances of the linear evaluation on 
the Milan dataset are not as good as those on the Aruba-1 dataset. This discrepancy can be attributed to several factors noted in [12] for the Milan dataset: the presence of a resident living with a pet, which introduces more noise; duplicate records in the dataset; and improper ordering of sensor activations during certain periods. These factors can degrade the activity recognition model's performance and result in lower F1 scores and accuracy.

Figure 3 visualizes the feature representations for Aruba-1 and Milan using the t-SNE algorithm [56] to project the high-dimensional representations into two-dimensional space. These representations were generated from the outputs of the AttCLHAR model's pre-trained encoder using unlabeled data and were then evaluated using 20% of the total data. Aruba-1 shows a clearer visual separation than Milan, in line with the F1 scores and accuracy presented in Tables 7 and 8. Although the model effectively learns the majority of activities, class imbalance among activities impacts its performance, a challenge that will be tackled in future work.

Overall, the proposed model demonstrates its ability to learn useful representations through the extensive evaluations above. The F1 scores exhibit minimal variation, approximately 2% or less, across the different training proportions in the linear evaluation of the Aruba-1 and Milan datasets. This suggests that contrastive learning features can effectively capture the underlying structure of the data.
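The linear evaluation protocol described above, where a frozen pre-trained encoder supplies features and only a linear output layer is trained on a fraction of the labels, can be sketched as follows. The "encoder" here is a fixed random projection standing in for the pre-trained AttCLHAR encoder, and all sizes, class counts, and label fractions are illustrative stand-ins rather than the paper's configuration.

```python
# Minimal sketch of linear evaluation: only a linear classifier is trained
# on top of frozen features, for several label fractions. The "encoder" is a
# fixed random projection -- a stand-in, not the paper's CNN-LSTM-attention model.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score

rng = np.random.default_rng(0)
n_samples, n_classes = 2000, 4
y = np.tile(np.arange(n_classes), n_samples // n_classes)  # balanced labels

# Simulated sensor-window features with class-dependent signal.
X_raw = rng.normal(size=(n_samples, 32))
X_raw[np.arange(n_samples), y] += 3.0

W_frozen = rng.normal(size=(32, 16))     # frozen "encoder" weights
features = np.tanh(X_raw @ W_frozen)     # frozen representation

results = {}
for frac in (0.01, 0.05, 0.10, 0.20, 0.30):  # label fractions, as in Table 7
    n = int(n_samples * frac)
    head = LogisticRegression(max_iter=1000).fit(features[:n], y[:n])
    pred = head.predict(features[n:])
    results[frac] = (accuracy_score(y[n:], pred),
                     f1_score(y[n:], pred, average="weighted"))
    print(f"{frac:.0%}: acc={results[frac][0]:.3f}, f1={results[frac][1]:.3f}")
```

Because the encoder is never updated, accuracy in this setting reflects how linearly separable the frozen representation already is, which is why it varies little with the label fraction.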

Semi-Supervised Learning Scenario
This section focuses on the fine-tuning of the proposed encoder and prediction MLP model using a small amount of labeled data from the Aruba-1 and Milan datasets.
For the Aruba-1 dataset, we compared the performances of four models, SimCLR [24], SimCLR with SAM, AttSimCLR, and AttCLHAR, as shown in Table 9. Although SimCLR exhibits superior performance with an F1 score of 89.16% using just 1% labeled data, AttCLHAR surpasses the other three models once the label proportion reaches 5% and beyond. Notably, all four models achieve an F1 score of 88.20% or above even with only 1% labeled data. Remarkably, AttCLHAR demonstrates a 6% performance increase with just 1% labeled data compared to the linear evaluation results in Table 7. These findings underscore the efficacy of AttCLHAR, which yields competitive results even with a limited amount of labeled sensor data.

AttCLHAR was subsequently compared with other models: a probabilistic neural network (PNN) for supervised learning [57], known for its proficiency with sparse and noisy sensor data; a deep neural network (DNN) [58]; and a bidirectional long short-term memory (Bi-LSTM) network tailored for recognizing daily activities [59]. We also assessed a Transformer-based semi-supervised learning (SSL) approach that integrates multiple sensor sequence inputs for HAR [57]. The results, detailed in Table 10, demonstrate the competitive performance of AttCLHAR. Even with only 10% labeled activities, AttCLHAR achieves an F1 score of 95.60%, surpassing the supervised learning models that require the entire set of labels. When the label rate increases to 20%, AttCLHAR reaches an F1 score of 97.04% and an accuracy of 97.02%, outperforming not only the supervised learning models but also the Transformer-based SSL approach.

For the Milan dataset, we maintained consistency by applying the same fine-tuning settings as in the Aruba-1 experiment. Additionally, a similar approach to that described in [11,12] was used for grouping activities during data 
preprocessing. The performances of the SimCLR, SimCLR with SAM, AttSimCLR, and AttCLHAR models were then compared, as shown in Table 11. The results demonstrate that all models achieve an F1 score and accuracy exceeding 88.97% in the fine-tuning phase when the training set comprises approximately 20% or more of the data. AttSimCLR exhibits superior performance at label proportions of 20% and 30%, whereas AttCLHAR outperforms the other three models when the label proportion reaches 40%. AttCLHAR was then compared with four supervised learning models: PNN [30], a HAR algorithm based on a wide time-domain convolutional neural network and multi-environment sensor data (HAR_WCNN) [60], LSTM [11], and Ens2-LSTM, which combines the outputs of a Bi-LSTM and a unidirectional LSTM (Uni-LSTM). The detailed results are shown in Table 12. The experimental results highlight the competitive performance of AttCLHAR relative to these supervised learning models, as evidenced by an F1 score of 95.21% and an accuracy of 95.19% using 40% labeled activities.

Figure 4 compares the performance improvement from the linear evaluation scenario to the fine-tuning scenario, presenting F1 score and accuracy for Aruba-1 and Milan using the AttCLHAR model. The higher the label fraction, the more significant the improvement from linear evaluation to fine-tuning. Comparing this improvement across datasets, Milan exhibits the greater enhancement, with an increase exceeding 31.8% when the training proportion reaches 30%. This larger improvement could be attributed to the lower quality of the Milan dataset, which may limit the effectiveness of contrastive learning during pre-training. Nevertheless, the fine-tuning scenarios have demonstrated their effectiveness in refining and optimizing the parameters of the pre-trained encoder model, resulting in 
notable performance improvements for both the Aruba-1 and Milan datasets. Furthermore, as the training set proportion for Aruba-1 increases from 1% to 10%, the fine-tuning F1 score improves by a larger margin than it does at higher proportions. Similarly, Milan shows more improvement from 10% to 30% than from 30% to 50%. This finding suggests that high accuracy can be achieved even with a small number of labeled samples.
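Since every comparison in this section rests on F1 score and accuracy, a minimal sketch of the two metrics may be helpful. It assumes the reported F1 is the class-weighted average over activity classes (an assumption; the averaging scheme is not restated in this section), which would explain why F1 and accuracy track each other so closely in the tables despite imbalanced activities.

```python
# Sketch of the two evaluation metrics on a small imbalanced example.
# Weighted F1 averages per-class F1 scores weighted by class support.
from sklearn.metrics import accuracy_score, f1_score

y_true = [0, 0, 0, 1, 1, 2, 2, 2, 2, 2]  # imbalanced activity labels
y_pred = [0, 0, 1, 1, 1, 2, 2, 2, 0, 2]  # two misclassified windows

acc = accuracy_score(y_true, y_pred)                      # 8/10 = 0.8
f1_weighted = f1_score(y_true, y_pred, average="weighted")
f1_macro = f1_score(y_true, y_pred, average="macro")
print(acc, f1_weighted, f1_macro)
```

With weighted averaging, the dominant class pulls F1 toward accuracy, whereas macro averaging (equal weight per class) is lower here because it exposes the weaker minority-class performance.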
These evaluations provide strong evidence that the proposed model achieves competitive performance by leveraging a pre-trained encoder with limited annotated data. This finding is particularly significant because it addresses the challenge of limited annotation availability in real-world environments. The results highlight the model's potential to learn effectively from unlabeled data and overcome the limitations of relying solely on annotated datasets.

Transfer Learning Scenario
Transfer learning involves transferring knowledge learned on one dataset to another. Transfer learning performance was evaluated on the Aruba-1 and Aruba-2 datasets using the linear evaluation and fine-tuning settings.
The proposed approach followed two steps for transfer learning. First, the encoder was pre-trained on the Aruba-2 dataset using self-supervised learning; the encoder weights were then frozen, and the projection head was removed. Second, the downstream tasks were evaluated on Aruba-1 with linear evaluation and fine-tuning. For linear evaluation, only the output MLP was trainable and was trained on top of the pre-trained encoder's features. For fine-tuning, the entire network, including the pre-trained encoder, was trainable.
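The two downstream settings above can be sketched in PyTorch. The encoder below is a toy stand-in for the paper's pre-trained CNN-LSTM-attention encoder, and the layer sizes, class count, and learning rates are illustrative assumptions, not the paper's values.

```python
# Hedged sketch of the transfer-learning setup: freeze the pre-trained
# encoder for linear evaluation, then unfreeze it for fine-tuning.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 8))
projection_head = nn.Linear(8, 4)  # used only during SimCLR pre-training

# Step 1 (pre-training on Aruba-2) would train encoder + projection_head
# with the contrastive loss; it is omitted here. The projection head is
# then discarded for the downstream tasks.

# Step 2a: linear evaluation on Aruba-1 -- freeze the encoder and train
# only the output head on top of its features.
for p in encoder.parameters():
    p.requires_grad = False
linear_head = nn.Linear(8, 5)  # 5 = assumed number of activity classes
opt_linear = torch.optim.Adam(linear_head.parameters(), lr=1e-3)

# Step 2b: fine-tuning -- unfreeze everything and train the whole network.
for p in encoder.parameters():
    p.requires_grad = True
opt_finetune = torch.optim.Adam(
    list(encoder.parameters()) + list(linear_head.parameters()), lr=1e-4)

# One illustrative fine-tuning step on random stand-in data.
x, y = torch.randn(32, 16), torch.randint(0, 5, (32,))
loss = nn.functional.cross_entropy(linear_head(encoder(x)), y)
loss.backward()
opt_finetune.step()
```

The only structural difference between the two settings is which parameters the optimizer updates; the frozen encoder makes linear evaluation a direct probe of representation quality.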
The experiments used varying proportions of the training set, aligned with the linear evaluation scenario for the Aruba-1 dataset using the SimCLR and AttCLHAR models, as detailed in Table 7. The evaluation metrics, F1 score and accuracy, are presented in Table 13. In the transfer learning scenario, the linear evaluation results for the SimCLR model show a higher F1 score and accuracy than all linear evaluations conducted directly on the Aruba-1 dataset (Table 7). This improvement can be attributed to the greater volume of unlabeled ambient sensor data available for pre-training in Aruba-2 compared to Aruba-1: the richer representation learned during pre-training transfers better to downstream tasks. While AttCLHAR's linear evaluation performance does not match SimCLR's, its fine-tuning performance surpasses SimCLR's when the training proportion exceeds 5%, and is competitive with AttCLHAR's fine-tuning results in Table 9. These results underscore the promising performance of the proposed AttCLHAR model: features pre-trained on one unlabeled dataset can be effectively optimized for downstream tasks on another dataset with limited labeled data. This observation highlights the potential of training models in real-world settings by leveraging abundant unlabeled data together with annotations from other datasets.

Conclusions
A novel approach, AttCLHAR, is proposed based on the SimCLR framework with self-attention and SAM for HAR using ambient sensor data from smart homes. The model combines CNN, LSTM, and self-attention networks in the encoder to capture both local and global features in time-series data. Extensive experiments on three publicly available smart home datasets compare the performances of four models: SimCLR, SimCLR with SAM, AttSimCLR, and AttCLHAR. The results demonstrate the robust performance of AttCLHAR in semi-supervised learning and transfer learning, where it outperforms both the other three models and state-of-the-art deep learning models. This effectiveness allows the model to learn meaningful contrastive representations from a wide range of real-world unlabeled data.
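An encoder in the spirit described above, two convolutional layers followed by an LSTM and a self-attention layer, can be sketched in PyTorch. The kernel sizes, channel counts, hidden size, head count, and pooling choice are illustrative guesses, not the paper's configuration.

```python
# Hedged sketch of a CNN + LSTM + self-attention encoder for windowed
# ambient sensor sequences. All hyperparameters are illustrative.
import torch
import torch.nn as nn

class SketchEncoder(nn.Module):
    def __init__(self, n_features=8, hidden=64):
        super().__init__()
        # Two Conv1d layers extract local (spatial) patterns per window.
        self.conv = nn.Sequential(
            nn.Conv1d(n_features, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
        )
        # LSTM captures temporal dependencies across the window.
        self.lstm = nn.LSTM(64, hidden, batch_first=True)
        # Self-attention lets the model focus on informative segments.
        self.attn = nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)

    def forward(self, x):  # x: (batch, seq_len, n_features)
        h = self.conv(x.transpose(1, 2)).transpose(1, 2)
        h, _ = self.lstm(h)
        h, _ = self.attn(h, h, h)
        return h.mean(dim=1)  # pooled window representation

enc = SketchEncoder()
z = enc(torch.randn(4, 50, 8))
print(z.shape)  # torch.Size([4, 64])
```

In the SimCLR-style pipeline, this representation would feed a projection head during contrastive pre-training and an output MLP during linear evaluation or fine-tuning.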
Future research can enhance the proposed approach by addressing specific challenges in real-world applications. One such challenge is dealing with imbalanced daily activities in real-world datasets, where certain activities occur far more frequently than others. It is crucial to develop strategies that address these imbalances so that the model performs effectively across all activities, including those less represented. Additionally, the model can be tested on smart homes with different layouts and sensor installations to evaluate its transfer learning capabilities across diverse environments. As part of future work, we plan to implement and evaluate the model on unlabeled ambient data from the SAPA project of the DOMUS laboratory [14]. A large volume of unlabeled data is essential for better representation learning during pre-training, and this will enhance the model's scalability and feasibility for real-world applications.
By addressing the identified areas for improvement, such as handling imbalanced daily activities and testing the model in smart homes with various layouts and sensor installations, AttCLHAR can be further enhanced for human activity recognition in diverse smart home environments. This research has the potential to advance the field and contribute to the development of more effective and adaptable ambient sensing systems, ultimately benefiting the aging population and improving their quality of life in smart home environments.

Figure 1 .
Figure 1. The architecture of the AttCLHAR framework for HAR.
Temperature sensors were excluded from feature extraction and model training. Both motion sensors and door closure sensors used the same feature-extraction method in the model. (a) Aruba-1 and Aruba-2; (b) Milan.

Figure 2 .
Figure 2. Layout of door and motion sensors in Aruba and Milan testbeds.

Figure 3 .
Figure 3. Comparative visualization of Aruba-1 and Milan feature representations using t-SNE.

Figure 4 .
Figure 4. Comparison of the performance improvement from the linear evaluation scenario to the fine-tuning scenario, with F1 score and accuracy for Aruba-1 and Milan using the AttCLHAR model.

Table 2 .
Activity categories within the Milan dataset were reorganized to create new categories of activities [24].

Table 3 .
Locations of motion and contact sensors in Aruba and Milan.

Table 4 .
Extracted features from raw ambient sensor data.

Table 5 .
Comparison of experimental results of augmentation methods on the Aruba-1 dataset with linear evaluation and fine-tuning using SimCLR and AttCLHAR.

Table 7 .
The performance of linear evaluation of different models with different training set proportions on the Aruba-1 dataset.

Table 8 .
The performance of linear evaluation of different models with different training set proportions on the Milan dataset.

Table 9 .
The performance of fine-tuning of different models with different training set proportions on the Aruba-1 dataset.

Table 10 .
Evaluation metrics for the semi-supervised learning scenario for the Aruba-1 dataset.

Table 11 .
The performance of fine-tuning of different models with different training set proportions on the Milan dataset.

Table 12 .
Evaluation metrics for the semi-supervised learning scenario for the Milan dataset.

Table 13 .
The performance of linear evaluation and fine-tuning with different training set proportions on the Aruba-1 dataset.