A Graph-Attention-Based Method for Single-Resident Daily Activity Recognition in Smart Homes

In ambient-assisted living facilitated by smart home systems, the recognition of daily human activities is of great importance. It aims to infer the household's daily activities from triggered sensor observation sequences with varying time intervals between successive readouts. This paper introduces a novel deep learning framework based on embedding technology and graph attention networks, namely the time-oriented and location-oriented graph attention (TLGAT) network. The embedding technology converts sensor observations into corresponding feature vectors. Afterward, TLGAT treats a sensor observation sequence as a fully connected graph to model both the temporal correlation and the sensor-location correlation among sensor observations, and it refines the feature representation of each sensor observation by receiving and weighting the other sensor observations. Experiments were conducted on two public datasets under diverse setups of sensor event sequence length. The experimental results reveal that the proposed method achieved favourable performance under these diverse setups.


Introduction
With the development of Internet of Things (IoT) technology and the progress of deep learning algorithms, there is currently widespread interest in smart home systems that use sensors to improve people's daily lives, for example, detecting dementia-related abnormal behaviours [1,2], analysing context [3], controlling indoor equipment automatically [4], and helping disabled people or patients to live more independently [5,6]. Human activity recognition (HAR) has been one of the research hotspots, since it plays a significant role in ambient-assisted living (AAL). HAR is the process of utilising sensors to understand human motion processes or behavioural logic, so as to infer human daily activities. If the occupant's daily activities can be identified by smart home systems, the systems can provide rational and automated services to residents.
In recent years, several studies have focused on deep neural networks (DNNs), convolutional neural networks (CNNs), and recurrent neural networks (RNNs) to investigate solutions for HAR [7][8][9]. To perform the primary task of HAR, deep learning algorithms can be employed to build a model in either a supervised or an unsupervised manner. These are data-driven methods that take advantage of the massive volume of data collected by the sensors available in smart home systems. With the development of IoT-based smart home platforms, this study focuses on HAR based on environmental sensor data instead of video or wearable sensor data. The related research on IoT-based HAR is presented in Section 2.
In general, the existing deep learning methods pay more attention to inferring the relationship between daily activities and sensor observations. Indeed, the type and location of the sensors, as well as user habits, affect the performance of HAR tasks. Several practical use cases collect data using multi-type sensors, such as the CASAS dataset [10]. In smart home systems with multi-type sensors that can be activated by residents or at a specific frequency, ambient sensors are irregularly activated, and the time interval between every two successively activated sensors varies. This irregularity means that the correlation between the observations of every two consecutively activated sensors may be inconsistent. Thus, HAR challenges remain because of the sparsity and irregularity of the data [11]. Additionally, in smart home systems, HAR is expected to assist with other tasks; therefore, real-time performance is important. However, for the available state-of-the-art methods, it is difficult to maintain high accuracy when identifying activities from fewer sensor events, which causes delays in the smart home system.
In the proposed method, we sought to model the time interval between two sensor activation events and the relationship between their location information, thus automatically extracting features to better understand the context of sensor signals, so as to deal with sparsity and irregularity. Furthermore, we aim to introduce more relevant information to represent sensor events and further improve the feature representation of sensor event sequences using the proposed algorithm. The contributions of this paper are summarised as follows:

1. A novel representation method incorporates sensor location information and time-slice information in smart homes and generates features for each sensor event through an embedding process;
2. We first apply the time-encoding method and vector operations to automatically capture the correlations between sensor events relying on their timestamps;
3. Two novel parallel modules based on graph attention (GAT), namely a location-oriented GAT module and a time-oriented GAT module, are proposed to capture the correlations between different sensor events based on location information and activation time information. These modules automatically improve the feature representation of the sensor event sequences;
4. A novel end-to-end framework, namely the time-oriented and location-oriented graph attention (TLGAT) network, is established to address HAR issues. The experimental results reveal that the model achieves superior performance on public datasets compared with other state-of-the-art methods.
In the rest of this paper, Section 2 gives a brief overview of state-of-the-art research based on deep learning. Section 3 explains the preparation process for the data from two public datasets and describes the basic principles of the modules that make up our proposed framework. Section 4 presents the experimental setup and the evaluation results. Lastly, Section 5 summarises the results, highlights the advantages of the method, and draws brief conclusions.

Related Work
In this section, we will summarise the related research of deep learning methods for HAR in smart homes and describe graph neural networks in more detail.

Human Activity Recognition
The task of human activity recognition has been studied for decades. In smart home systems, there are typically two kinds of HAR models: wearable-sensor-based HAR and ambient-sensor-based HAR [12]. Wearable-sensor-based HAR models use sensors integrated into mobile devices, clothes, or other portable devices, such as accelerometers, gyroscopes, and magnetometers. Ambient sensors, mainly including infrared sensors (e.g., RFID and PIR) and temperature sensors, are commonly installed at fixed locations to recognise activities. In contrast to wearable sensors, ambient sensors can detect multi-occupant activities in smart homes in a manner that is non-invasive for users, but it is harder to achieve high accuracy with these sensors due to many influencing factors (e.g., people's movement, people's identity, etc.).
Currently, there are several kinds of ambient-sensor-based HAR methods [13][14][15][16][17], including naive Bayes, hidden Markov models, support vector machines, conditional random fields, etc. Nevertheless, these traditional methods rely heavily on handcrafted feature extraction techniques. With the rapid development of deep learning in many fields, deep-learning-based studies on HAR have been conducted to address the problems of feature extraction.

Deep Learning Algorithms
Deep learning is a promising field of artificial intelligence that is able to learn high-level features from a great volume of raw data and achieve satisfactory results in HAR. In smart home systems, the existing HAR methods focus on the preprocessing strategy of transforming a series of sensor events into specific feature representations, which are further used to model the spatial and temporal dependencies to predict the daily activities of occupants.
Deep-learning-based methods comprise CNN-based and RNN-based methods; the former are useful for capturing local dependency, i.e., the importance of nearby observations correlated with sensor events. In addition, CNN variants can capture either one-dimensional (1D) or two-dimensional (2D) features of event sequences with different receptive field sizes and strides. Singh et al. [18] exploited a 1D-CNN structure on raw data sequences for activity recognition and extracted the local dependencies of activity signals at an invariant feature scale. Bouchabou et al. introduced natural language processing technology to automatically generate feature representations of raw sensor observations and applied a fully convolutional 1D-CNN to improve performance [19]. In these works, reducing the receptive field of the convolution kernels layer by layer achieved better outcomes. Although these works focused on the initial representations of raw sequences, they ignored the spatial and temporal characteristics of sensor event sequences. In addition, some researchers proposed converting sensor event sequences into 2D images. In a smart home with only binary sensors, Gochoo et al. transformed activity sequences into binary or RGB images to apply a 2D-CNN model [7,20]. Later, they used coloured images to represent sensor event sequences containing non-binary sensors, such as temperature sensors [20]. In comparison, Mohmed et al. utilised grayscale images instead of coloured images and selected the AlexNet structure for the feature extraction phase [21,22]. Their experiments showed that such image construction methods and 2D-CNN structures could not achieve better performance: the constructed images have sparse pixel information, which makes it difficult for 2D-CNNs to extract local features.
RNNs are another advantageous deep learning approach for dealing with sequential dependencies. Previously, Arifoglu et al. explored and compared the performance of several types of RNN models in activity recognition for elderly people with dementia [23]. Liciotti et al. summarised several types of long short-term memory (LSTM) variants and evaluated these models on the activity recognition task in a smart home scenario [24], showing that RNN-based approaches are highly effective for HAR. Furthermore, Bouchabou et al. used an NLP word-encoding method and an LSTM to encode sensor observations [25]. Although they achieved a slight improvement with the help of word-encoding technology, the ability of RNNs to capture dynamic time-varying sequences is still limited.
As introduced above, deep-learning-based methods have shown superiority in HAR. Generally speaking, the above-mentioned methods statically represent the characteristics observed by each sensor, but none of them take into account the time-specific and location-specific characteristics of sensor events. In contrast, this study takes the activation moment and spatial location of the sensors into consideration and automatically refreshes the feature representations of sensor observations based on these data.

Graph Attention Network
The graph attention network (GAT) is an increasingly prevalent development with extensive applications in various fields [26], owing to its effectiveness and simplicity in transmitting and aggregating messages from neighbouring nodes in a graph structure. For instance, the data in a recommendation system can be represented as a graph structure to construct the user-item interaction [27]. Social networks can accordingly be denoted as topological graphs to describe the relationships between people based on different factors (e.g., emails, transactions, etc.) [28,29]. Moreover, a sequence can be treated as a sequence graph, where each graph node can be connected with one or more nodes [30]. Building on the prior progress of graph convolutional networks (GCNs) [31], GAT produces node representations with an attention mechanism applied to each pair of nodes to distinguish the contribution of each node to the target node. A variant of the algorithm based on the key-value attention mechanism can be viewed as an extension and application of the GAT framework to activity recognition in a smart home setting. In our case, we formulate the links between sensor events as graphs, where nodes refer to sensor events and edges encode the intensity of their connections. The details are described in Section 3.3.2. The variant takes advantage of GAT to determine the specific importance of neighbouring nodes for the central node, so as to improve the representation of each node.

Methodology
This section describes the proposed method in detail. The overall framework involves four steps, namely (i) the preprocessing of several open datasets from the CASAS project, (ii) data representation, (iii) the model architecture, and (iv) training and evaluation.
The flowchart of the proposed method is shown in Figure 1. The implementation process is described in detail in this section.

Dataset Description
In this study, two publicly available datasets from the CASAS project [10], namely ARUBA and MILAN, were used to evaluate the proposed approach. Each dataset is a file containing more than one million raw sensor events, which were continuously recorded in the respective smart home testbed over several months. Table 1 summarises the types of sensors, the number of residents, the number of daily activity categories, and the total raw sensor events in these two datasets [32,33].

Table 1. Dataset description (D = door switch sensor, M = motion sensor, T = temperature sensor).

Figure 2 illustrates samples of the ARUBA raw dataset, including annotations. One sensor event comprises a date, time, sensor ID, and sensor status or indication, implying that a sensor was activated and its indication or status changed. In particular, an activity annotation marks the corresponding sensor event at the beginning or end of the activity. Moreover, the open datasets in the CASAS project provide the sensor layout of the smart home as image files.

Data Preprocessing
After acquiring the publicly available datasets, we preprocessed the data for validation experiments, and the preparation process can be divided into several phases.

Activity Segmentation
Based on the labels provided by the dataset, a series of activity samples were extracted, each of which represents a sensor event sequence recorded from the beginning to the end of an activity. For example, a total of 6413 annotated activity segments can be extracted from the ARUBA dataset. However, the activity segments are composed of different numbers of sensor events: some activity segments last longer than others, and the maximum and minimum numbers of events per segment are 2341 and 20, respectively.
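As a sketch of this segmentation step, the following minimal parser collects the events between an activity's begin and end markers. It assumes the CASAS-style line format shown in Figure 2, with optional trailing activity annotations such as `Sleeping begin` / `Sleeping end`; the exact field layout is an assumption for illustration.

```python
def segment_activities(lines):
    """Group raw sensor-event lines into (activity label, events) segments."""
    segments, current, label = [], [], None
    for line in lines:
        parts = line.split()
        event = tuple(parts[:4])  # date, time, sensor ID, status/indication
        if len(parts) >= 6 and parts[5] == "begin":
            label, current = parts[4], [event]   # activity starts here
        elif len(parts) >= 6 and parts[5] == "end":
            current.append(event)                # closing event of the activity
            segments.append((label, current))
            current, label = [], None
        elif label is not None:
            current.append(event)                # event inside an open activity
    return segments
```

Events that fall outside any begin/end pair are discarded, matching the description above where only annotated segments are kept.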

Data Representation
Index Representation. The sensor ID and the sensor status or indication of a sensor event in the raw dataset are each represented as one word. For example, in the ARUBA dataset, the motion sensor IDs are represented as four-character words from 'M001' to 'M031'. The complete set of words forms a limited-size word list. To make such discrete data usable by a neural network, we transformed each word into an index. Specifically, an N_s-length word list recording the unique sensor IDs and an N_v-length word list recording the unique sensor statuses or indications were generated, respectively. The index tokens are then passed through an embedding layer, and the continuous feature vectors are learned in an unsupervised manner.
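A minimal sketch of building such index word lists, assuming first-seen ordering (the paper does not specify how indexes are assigned):

```python
def build_vocab(tokens):
    # Map each unique token (e.g. 'M001' or 'ON') to a stable integer index,
    # in first-seen order; in practice an extra reserved index would be
    # needed for tokens never seen during training.
    vocab = {}
    for tok in tokens:
        vocab.setdefault(tok, len(vocab))
    return vocab

sensor_ids = ["M001", "M002", "M001", "D001"]
sensor_vocab = build_vocab(sensor_ids)           # {'M001': 0, 'M002': 1, 'D001': 2}
indexed = [sensor_vocab[t] for t in sensor_ids]  # [0, 1, 0, 2]
```

The resulting integer indexes are what get fed to the embedding layer described above.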
Moreover, the layouts of sensors vary across smart home settings. As shown in Figure 3, for both datasets, the index representations based on the layout of the sensors in different rooms were manually defined; for example, the layout indexes of sensors located in the same room are the same. The key idea is to introduce sensor layout information based on the minimal semantic location information of smart homes (e.g., kitchen or bathroom), as there is a strong correlation between daily indoor activities and the continuous activation of multiple sensors in a specific space. First, a smart home is divided into several subspaces according to the room layout image file provided by the dataset, and then an index is defined for each subspace. Finally, the layout index information is added to each sensor event. This idea is also consistent with the human understanding of a housing layout.

Furthermore, in this study, we considered the fact that temporal features play a certain role in activity recognition because of the regularity of household daily activities; more precisely, an activity is related to the time slice in which it occurs. In this method, the 24 h timeframe is divided into 1440 time windows, and an index is assigned to each window in temporal order. If the time recorded in a sensor event falls within a time slice, the assigned index is the position order of that time slice, which is defined as the time-slice representation; this avoids isolating the temporal information contained within the slices. In general, daily household activities start at a certain time of the day and are continuously performed for a period of time, meaning that the characteristics of the activities in two different time windows may be related. Therefore, it is reasonable to use a time-slice representation to segment and analyse the temporal information.
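Since the 24 h timeframe is split into 1440 windows, the time-slice index reduces to minutes elapsed since midnight; a minimal sketch:

```python
def time_slice_index(hour, minute):
    # 24 h divided into 1440 one-minute slices; the index is the slice's
    # position order, i.e. minutes elapsed since midnight (0..1439).
    return hour * 60 + minute
```

For example, a sensor event recorded at 05:40 falls into slice 340.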
Time Encoding. Herein, a high-dimensional feature is generated to represent the exact time point rather than the time slice. Each given fine-grained timestamp t is transformed into a multi-dimensional feature vector p_t as follows [34]:

p_t(2i) = sin(t / T^(2i/d)), p_t(2i+1) = cos(t / T^(2i/d)), i = 0, 1, . . . , d/2 − 1,

where d ∈ N+ refers to the dimensionality of the generated time representation, and T is the time scale of the time-feature transformation, that is, the specific frequency. We set d = 32 and T = 1440 (unit: minutes) in all the experimental settings.
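A sketch of this time encoding, assuming the transformer-style sinusoidal scheme of [34] with d = 32 and T = 1440:

```python
import math

def time_encoding(t, d=32, T=1440):
    # Sinusoidal encoding of a timestamp t (in minutes): each frequency
    # contributes a (sin, cos) pair, giving a d-dimensional vector.
    p = []
    for i in range(d // 2):
        angle = t / (T ** (2 * i / d))
        p.append(math.sin(angle))
        p.append(math.cos(angle))
    return p
```

Nearby timestamps map to nearby vectors, which is what lets the attention layers below compare activation times continuously rather than via discrete indexes.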

Sliding Windows
A smart home system generates sensor event sequences of uncertain length whenever a user performs an activity. Therefore, the event stream is split using a fixed-size, event-count-based window. The window size is denoted as N, which represents the event count in a window. It is assumed that the residents perform daily activities to generate a sensor event sequence a_i, with an individual event denoted as e_j. As shown in Figure 4, the sliding-window method transforms an event sequence a_i = {e_1, e_2, . . . , e_l} into a_i = {S_1, S_2, . . . , S_{l−N+1}}, where S_k = {e_k, e_{k+1}, . . . , e_{k+N−1}}. Sliding the window by one sensor event at a time generates a large number of window samples, which is helpful for obtaining more data. Obviously, the complexity and accuracy of the recognition model differ significantly for different window sizes N. Commonly, the length of the window corresponds to the time latency of predicting from past sensor events: the smaller the window length, the lower the model precision, but the better the real-time performance of the model.
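The window transformation above can be sketched directly:

```python
def sliding_windows(events, N):
    # Split an event sequence into overlapping windows of N events,
    # stepping one event at a time: l events yield l - N + 1 windows.
    return [events[k:k + N] for k in range(len(events) - N + 1)]
```

For instance, a segment of 10 events with N = 4 yields 7 window samples, which is how the method multiplies the amount of training data.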

Model Architecture
This section introduces the proposed model architecture in detail with the purpose of learning an appropriate sensor observation embedding for a given sample S k to predict an activity label.

Embedding Layer
In the first step, the prepared index representations of a sensor event e_i^ind = {si_i^ind, o_i^ind, l_i^ind, ts_i^ind} are passed through four embedding layers to obtain fixed-dimensional feature vectors:

si^emb = ED_si(si^ind), o^emb = ED_o(o^ind), l^emb = ED_l(l^ind), ts^emb = ED_ts(ts^ind),

where si, o, l, and ts represent the sensor ID, sensor observation, sensor location, and time slice, respectively. The superscripts ind and emb distinguish the indexed representation from the embedded representation.
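A numerical sketch of the four embedding lookups, using random, untrained tables in place of learned `nn.Embedding` layers; the vocabulary sizes are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_embedding(vocab_size, dim=32):
    # A lookup table standing in for a trainable embedding layer:
    # index -> row of a (vocab_size, dim) matrix.
    table = rng.normal(size=(vocab_size, dim))
    return lambda idx: table[idx]

# One table per index stream: sensor ID, observation, location, time slice.
ED_si, ED_o, ED_l, ED_ts = (make_embedding(n) for n in (31, 4, 8, 1440))

event = {"si": 3, "o": 1, "l": 2, "ts": 340}  # one indexed sensor event
vecs = [ED_si(event["si"]), ED_o(event["o"]),
        ED_l(event["l"]), ED_ts(event["ts"])]  # four 32-dim vectors
```

In the actual model these tables are trained jointly with the rest of the network; here they only illustrate the index-to-vector mapping.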

Graph Attention Layer
This section introduces the core graph attention layer, which is used in the proposed model, in detail. Overall, the embedding process is learned using a hierarchical architecture composed of four levels to model the irregular intervals and latent spaces of the sensor events in the samples.
Generally, a graph attention layer is adopted to model the relationships between nodes in an arbitrary graph [35]. Therefore, for a given graph and one of its nodes u, the proposed time-oriented graph attention layer and location-oriented graph attention layer can borrow information from node u's neighbouring nodes.
Firstly, a directed weighted graph G_i = {V, E_i} is built for every sample S_k: each node in V represents a sensor event, and each edge in E_i describes the dependency between two sensor events in sample S_k. Since there is bidirectional, asymmetric information exchange between all sensor event nodes, the graph is complete. The two proposed graph attention layers build the weighted graph in the same way.
Secondly, let u indicate a sensor event whose original observation embedding is X_u = {si^emb_u, o^emb_u}, with location embedding l^emb_u, time-slice representation ts^emb_u, and time-encoding representation p_u at time t. The time-oriented and location-oriented enhanced observation embeddings T_u and L_u for sensor event u, as depicted in Figure 5, are computed as follows. First, the time-oriented graph attention layer captures the relationships between u and its adjacent nodes to estimate the enhanced observation embedding T_u. There are N nodes in each sample S_k, so an N-dimensional attention weight vector α_u = {α_u,v | α_u,v ∈ [0, 1]} is generated. With an intermediate LeakyReLU nonlinearity (negative-input slope of 0.2), a dropout operation (dropout rate of 0.2), and a normalisation operation, the attention weight is calculated as

α_u,v = σ(LeakyReLU((X_u ⊕ p_u) A (X_v ⊕ p_v)^T)),

where A is a trainable weight matrix shared by every pair of nodes u and v, p_u and p_v are the multi-dimensional time-encoding vectors, ⊕ denotes concatenation, and σ is a SoftMax function normalising α_u,v over all neighbours v. The time-oriented GAT layer then computes the enhanced representation for each node as

T_u = Σ_v α_u,v X_v.

The location-oriented graph attention layer proceeds analogously, replacing the time encodings p with the location embeddings l^emb and the weight matrix A with a trainable weight matrix B, to estimate the enhanced observation embedding L_u. Finally, the high-level representation of each sensor event is obtained by concatenating T_u and L_u.
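A minimal NumPy sketch of one such attention pass over a fully connected event graph (dropout omitted for clarity; the bilinear scoring form is our reading of the description above and may differ in detail from the authors' implementation):

```python
import numpy as np

def gat_layer(X, P, A):
    # One time-oriented attention pass over a complete event graph.
    # X: (N, d) event features; P: (N, dp) time encodings; A: (d+dp, d+dp).
    Z = np.concatenate([X, P], axis=1)                   # node features + time info
    scores = Z @ A @ Z.T                                 # pairwise compatibility
    scores = np.where(scores > 0, scores, 0.2 * scores)  # LeakyReLU, slope 0.2
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    alpha = e / e.sum(axis=1, keepdims=True)             # softmax over neighbours v
    return alpha @ X                                     # weighted aggregation T_u
```

The location-oriented branch would be identical with the location embeddings in place of `P` and a second weight matrix `B` in place of `A`.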

Overall Network Architecture
The overall network architecture based on GATs is presented in Figure 6, which includes several modules that are arranged in sequence. More details are provided in this section.

1. The proposed method applies a total of four embedding layers [ED_si, ED_o, ED_ts, ED_l] in the first layer, each of which converts an index representation into a 32-dimensional vector;
2. The original embedding representations of the sensor ID and sensor observation are processed by two parallel multi-layer GAT modules, which automatically refine the characteristics of the sensor events with respect to time-wise and location-wise specificity;
3. The outputs of the two parallel GAT modules are concatenated with the location representation and the time-slice representation. They are then passed through a four-layer 1D convolution stack with a kernel size of 5, followed by an average pooling layer, to capture the high-level features of the sensor event sequence for the inference of daily activities;
4. The outputs of the average pooling layer are fed to a fully connected layer to infer the activity categories.
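The convolution stage of this pipeline can be sketched numerically with a minimal 'valid' 1D convolution (channel counts and shapes are illustrative):

```python
import numpy as np

def conv1d_valid(x, w):
    # x: (C_in, N) feature sequence; w: (C_out, C_in, K) kernels.
    # 'Valid' 1-D convolution: output length is N - K + 1.
    C_out, C_in, K = w.shape
    N = x.shape[1]
    out = np.zeros((C_out, N - K + 1))
    for j in range(N - K + 1):
        # Each output column is the kernel applied to one K-wide patch.
        out[:, j] = np.tensordot(w, x[:, j:j + K], axes=([1, 2], [0, 1]))
    return out
```

With a kernel size of 5, each of the four stacked convolution layers shortens the sequence by 4 events, so a window of N = 20 events is reduced to length 4 before average pooling.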

Figure 6. The overall network architecture of the proposed TLGAT.

Model Complexity Analysis
This section introduces the computational complexity of the proposed model in detail. As mentioned in Section 3.2, the sensor event count in a window is denoted as N, which is the length of the sensor event sequence. In addition, O(·) is commonly used in computer science to describe the asymptotic behaviour of the functions that define the time cost of algorithms. F_i and F'_i represent the dimensions of the input and output vectors of the i-th layer, where i = 1, 2, 3 refers to the embedding layer, the GAT layer, and the convolutional layer, respectively.
Firstly, the embedding layer converts indexes into vectors. Since the number of indexes is constant, the computational complexity of an embedding layer is O(N F_1). Afterwards, the GAT layer changes the feature dimension of the input vectors through matrix operations and computes the coefficient between every two sensor events, so its computational complexity is O(N F_2 F'_2 + N^2 F'_2). In addition, the computational complexity of a 1D convolutional layer is linear in the kernel size (K), the numbers of input and output channels (F_3 and F'_3), and the length of the input sequence (N), i.e., O(N F_3 F'_3 K). Therefore, the total computational complexity of the proposed model is described as

O(α_1 N F_1 + α_2 (N F_2 F'_2 + N^2 F'_2) + α_3 N F_3 F'_3 K),

where each α_i refers to the constant coefficient (number of layers) of the corresponding term.

Experimental Setups
First, as previously described, the datasets were segmented using the sliding-window method with different window sizes. In general, the cost of the proposed method and the other existing methods is mostly related to the sequence length; the smaller the sequence length, the faster an activity can be recognised. To verify the performance of the models on small sequences, the window sizes were set to 20, 25, 30, and 35, so as to evaluate the models at various latencies.
Next, to evaluate the proposed method, the data samples of each category were split into two parts, i.e., 70% for training and 30% for testing. Random-shuffle processing ensured that each model could be trained to achieve better generalisation performance.
Then, the proposed TLGAT method was compared with state-of-the-art models for human daily activity recognition in publicly available smart home systems, including embedding-based FCNs (E-FCNs) [19], ImageDCNN [7,20], and ELMoBiLSTM [25]. For our method, the hidden dimension sizes of the embedding layer and the convolutional layer were empirically set to 32 and 128. Other hyperparameter settings are presented in Table 2. In particular, the Adam optimiser was used to train all the models for a specific number of iterations with an initial learning rate (LR) of 0.01, which was decayed at a specific rate during training. Each experiment was performed 30 times, with a 99% confidence interval calculated using three-sigma (3σ) limits. All the experiments were carried out on a server with an 11th Gen Intel(R) Core(TM) i7-11700K @3.60GHz, 32GB RAM, and an NVIDIA GeForce RTX 3070 (8GB). The PyTorch framework was used for the implementation of the algorithms.
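The step-wise learning-rate decay can be sketched as follows; the decay factor and interval are illustrative assumptions, since the paper only states that the initial rate of 0.01 was decayed at a specific rate during training:

```python
def lr_schedule(step, lr0=0.01, decay=0.95, every=10):
    # Step-wise exponential decay: multiply the initial LR by `decay`
    # once every `every` training steps (both values are assumptions).
    return lr0 * (decay ** (step // every))
```

In PyTorch this corresponds to pairing the Adam optimiser with a step-decay scheduler such as `torch.optim.lr_scheduler.StepLR`.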

Training and Evaluation
The overall loss for training was computed using the cross-entropy function, which is a suitable measurement option for the difference between the predicted and true probability distributions.
In this task, the sample share of each category was imbalanced, so the performance of the models was measured using the mean (macro) F1-score:

Precision = TP / (TP + FP), Recall = TP / (TP + FN), F1 = 2 · Precision · Recall / (Precision + Recall),

where TP, FP, and FN refer to true positives, false positives, and false negatives, respectively.
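These definitions translate directly into a per-class F1 computation averaged over classes (macro F1), which is robust to the class imbalance noted above:

```python
def f1_per_class(y_true, y_pred, cls):
    # Counts for one class treated as the positive label.
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

def macro_f1(y_true, y_pred):
    # Unweighted mean of per-class F1 scores over the classes present.
    classes = sorted(set(y_true))
    return sum(f1_per_class(y_true, y_pred, c) for c in classes) / len(classes)
```

This matches the behaviour of `sklearn.metrics.f1_score` with `average='macro'`.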

Comparison with State-of-the-Art Models
To show the satisfactory performance of the proposed model, we compared the F1-score of our model and other state-of-the-art models on the ARUBA dataset. Tables 3 and 4 show the overall performance under different sensor event window sizes. Overall, the proposed model outperformed the traditional state-of-the-art models in HAR under the delays corresponding to different numbers of sensor events. The F1-score increased as the sensor event window size increased: when the length of the sensor event sequences was no more than 35, longer sequences provided more information for these models to infer the activities. Furthermore, TLGAT appears to automatically generate more relevant features than the other state-of-the-art methods on small sequences and thus provides better recognition performance. In addition, the F1-score values of all activities were determined, except for "Enter Home" and "Leave Home" in the ARUBA dataset and "Bed to Toilet" and "Eve Meds" in the MILAN dataset (for which there were too few samples for testing); the results are shown in Tables 5 and 6. It can be observed that the model performed well in the classification of each activity. In particular, some activities appear more difficult to infer from small sensor event sequences with lengths of 20 and 25 because these activities are more complex than others. With more than 30 sensor events, TLGAT could better represent the sensor event sequence and improve the performance of inferring complex activities.

Ablation Experiments
In this work, the proposed model consists of several modules based on GAT and CNN. To validate the effect and importance of the proposed GAT-based modules, i.e., the time-oriented and location-oriented GAT modules (measured with the F1-score on the test set), we separately trained four kinds of models, namely TLGAT, LGAT, TGAT, and the baseline, on the ARUBA and MILAN datasets. Among them, the first model is our final architecture, the second and third models are versions of the first with the time-oriented GAT or the location-oriented GAT module removed, respectively, and the last model removes both modules. Once the architectures of the above models were identified, ablation studies were performed on the datasets.
Each trial was performed using the parameters defined in Section 4.1, so as to obtain the model with optimal performance through training. Tables 7 and 8 present the experimental results of the four models under different sensor event window sizes on the ARUBA and MILAN datasets, respectively. Compared with the baseline CNN model, the models using the time-oriented or location-oriented GAT module provided a modest positive improvement in the F1-score for recognising activities. These two modules learn the temporal and spatial correlations between different sensor events from the sensor sequence, so as to improve the representation of the sensor sequence. Moreover, merging these two modules with a feature concatenation strategy effectively leverages the strengths of both. Therefore, TLGAT was found to have favourable performance in HAR.

Conclusions and Discussion
This paper introduces a novel GAT-based model, namely TLGAT, for HAR in IoT-based smart home systems. TLGAT is able to learn the time-oriented and location-oriented dependencies between sensor events and leverages a feature concatenation strategy, thus significantly improving the recognition precision of HAR tasks. The graph attention structure enables TLGAT to naturally capture the correlations between different sensor events with respect to irregular activation time intervals and the layout of the sensors. We validated the effectiveness of the proposed architecture on two public datasets, and extensive experiments showed that it consistently outperformed three state-of-the-art models. In addition, the ablation studies showed that the framework applies rational strategies to integrate the proposed GAT modules and achieves remarkable improvement. Therefore, GAT can provide a better understanding of HAR tasks.
In addition, there are still several challenges in multi-resident HAR tasks because of data complexity, redundancy, and heterogeneity. We plan to explore GAT-based methods of understanding such data and address these challenges. Moreover, we aim to evaluate other methods, such as fuzzy window methods [36], to explore a novel framework to improve HAR tasks in smart homes.

Institutional Review Board Statement: Ethical review and approval were waived for this study because the experiments were conducted on already published public datasets, without the immediate participation of humans.
Informed Consent Statement: Patient consent was waived because the experiments were conducted on already published public datasets, without the immediate participation of humans.