Device-Free Human Activity Recognition with Low-Resolution Infrared Array Sensor Using Long Short-Term Memory Neural Network

Sensor-based human activity recognition (HAR) has attracted enormous interests due to its wide applications in the Internet of Things (IoT), smart homes and healthcare. In this paper, a low-resolution infrared array sensor-based HAR approach is proposed using the deep learning framework. The device-free sensing system leverages the infrared array sensor of 8×8 pixels to collect the infrared signals, which can ensure users’ privacy and effectively reduce the deployment cost of the network. To reduce the influence of temperature variations, a combination of the J-filter noise reduction method and the Butterworth filter is performed to preprocess the infrared signals. Long short-term memory (LSTM), a representative recurrent neural network, is utilized to automatically extract characteristics from the infrared signal and build the recognition model. In addition, the real-time HAR interface is designed by embedding the LSTM model. Experimental results show that the typical daily activities can be classified with the recognition accuracy of 98.287%. The proposed approach yields a better result compared to the existing machine learning methods, and it provides a low-cost yet promising solution for privacy-preserving scenarios.


Introduction
Human activity information plays a crucial role in human health monitoring and life enhancement in intelligent buildings. The advent of the Internet of Things (IoT) era has integrated human activity recognition (HAR) with intelligent life, bringing great changes to human life [1]. For example, home automation tasks, e.g., intelligent control of the lighting and air conditioning, can be achieved by HAR. Moreover, it can be implemented in monitoring the health of the elderly to obtain their health condition indirectly.
The conventional approach to detect human activities mainly relies on wearable devices [2], where users monitor their activity by wearing a wearable device on their bodies. For example, accelerometers have been applied to joints and torso positions to recognize the movements of each part [3][4][5]. Inertial sensors and gyroscopes have been incorporated into movable devices for HAR [6]. Igor B et al. [7] proposed a method based on an accelerometer combined with the SVM algorithm to identify common behaviors of humans. Gui H et al. [8] proposed a wearable device using RFID combined with a three-dimensional accelerometer to realize the recognition of human activities. Xue Q et al. [9] proposed a hidden Markov model (HMM)-based data analysis approach, where the approach learns from raw RFID data and applies this to analyze the data. HAR based on RFID is also a wearable device approach. The difficulty for the portability and convenience of wearable devices to meet the requirements of most use scenarios is readily apparent. The camerabased approach has been applied to HAR with artificial intelligence, and image recognition has made enormous strides [10]. Images were collected automatically by camera for HAR neural network, a powerful variation of the recurrent neural network (RNN) architecture, is utilized to learn the representative features from the sequential infrared signals and build the classification model. LSTM has great advantages in dealing with time series data. The infrared signal used for training is composed of time series data. The input of LSTM is composed of three parts, namely, the input at the current moment, the output at the previous moment and the long-term state at the previous moment which saves the long-term information. In particular, the long-term state retains information from the previous moment.
In addition, LSTM has three special gate structures: a forget gate, an input gate and an output gate. Therefore, the network has the function of long-term memory, which improves the relevance of time series data. To reduce the influence of temperature variations, a noise reduction method called J-filter is combined with the Butterworth filter to handle the original infrared signals. In addition, a real-time HAR interface equipped with the LSTM model is designed to realize high-accuracy recognition processes.
The main contributions of this paper are as follows:

1.
A low-resolution infrared array sensor of 8×8 pixels is applied. It is a low-cost, privacy-protective and device-free sensor that establishes a good trade-off between performance and cost.

2.
A noise reduction method called J-filter is proposed to address the heterogeneity issue of the background temperature for infrared signal preprocessing. The original absolute infrared signals are replaced by the relative values extracted by J-filter to eliminate the influence of the temperature and improve the robustness for HAR.

3.
A deep learning model based on LSTM is used to extract infrared signal features. The features of low-dimensional infrared signals can be depth mined in the model training process. The well-trained LSTM model can quickly and accurately recognize the human activities through the designed real-time interface.
The rest of this paper is organized as follows. In Section 2, we describe the proposed framework of HAR. The experimental results and analysis are presented in Section 3. Section 4 draws conclusions from the results and discusses our future work plan.

System Framework Overview
An HAR method based on deep learning is proposed. A low-resolution infrared array sensor of 8 × 8 pixels is leveraged in the HAR approach. The system is divided into four steps, as shown in Figure 1: infrared signal acquisition, signal segmentation and preprocessing, model training and real-time detection. The infrared signals of different activities are collected by the infrared array sensor installed on the ceiling of the room. With the Message Queuing Telemetry Transport (MQTT) protocol, the data receiver obtains data by subscribing to related topics directly. After collecting the data, the data stream is segmented according to a specific number of frames, and the signal samples corresponding to different activities are represented by continuous multi-frame data. After the segmentation, the data are preprocessed, and the J-filter eliminates the background noise caused by the temperature variation. The data are converted into a standard data format and then enter the feature extraction process. The infrared signal is trained through the constructed model based on deep learning, and the effect of the model is tested. After training the model, the infrared signal collected in real time is input to the real-time detection module for judgment. Then, the information of human activities is output.

Infrared Signal Acquisition
The Grid-EYE AMG8833, a low-resolution infrared array sensor of 8 × 8 pixels, made by Panasonic is used in this study. As shown in Figure 2, the sensor contains 64 thermocouples with 8 vertical rows and 8 horizontal rows. The output temperature of the thermocouples ranges from −20 to 80 • C, and the temperature accuracy is ±2.5 • C. It has a detection range of 7 m [28]. It is suitable for HAR because of its low price, ease of installation and steady performance. The synchronous serial bus I 2 C can be supported by the sensor, and the temperature value obtained by the sensor is returned to the connected microprocessor through I 2 C. The microprocessor chooses ESP8266, as it has the advantages of low power consumption and stable performance [29]. The processor supports the standard IEEE 802.11 b/g/n protocol, and NodeMcu can be developed through the WiFi serial port communication module.  The infrared signals of different activities are collected by the low-resolution infrared array sensor. Then, the signals are sent to ESP8266. The sensor collects signals at a fixed frequency to generate sequential infrared signals. The sequential infrared signal is published and transmitted through the WiFi serial port communication module of the microprocessor and the MQTT protocol. The sequential infrared signal corresponding to different activities of humans is obtained after the data receiver subscribes to the topic. A server acts as a data receptor, and the sequential infrared signal is stored in the server. Data analysis and processing are performed concurrently on this server.

Segmentation
Human activity occurs in a time frame. A single infrared signal frame cannot completely reflect human activity changes, particularly for dynamic activity. Therefore, a period of infrared signals (hereinafter referred to as sequential infrared signals) is used to reflect the human activity. Infrared signals are collected for a period of time for each activity to ensure data continuity. It becomes a necessary step to segment the sequential infrared signals. Human activity is not shown for small-cut length data, while if the length of each sample is too large, the cost of training increases. It is challenging to cut the sequential infrared signals into suitable size samples. As shown in Figure 3, the sequential infrared signals of m frames are divided into a sliding window of length L. In other words, a sequential infrared signal with m frames is segmented. The segmented sequential signal generates a sample with L frames. Eventually, there are m/L samples that can be generated in total by the sequential signal with m frames.

Signal Preprocessing
The infrared signal is interfered with by many factors during the collection process: for example, the influence of temperature changes and thermal dissipation of electrical equipment. It is essential to filter and reduce the noise from the collected infrared signals. Figure 4 shows the heatmaps of infrared signals corresponding to the standing activity collected by the sensor at 14 • C, 17 • C and 20 • C. Through observation, it can be found that a larger disturbance appears in the heatmaps when the room temperature changes. It can be seen that the temperature change has a great influence on the infrared signal under the same activity, leading to the difficulty in extracting features under different activities. Reducing the effect of temperature changes on infrared signals is an absolute priority. A more realistic noise reduction method is proposed specifically for infrared signals, called Jfilter. After the infrared signal of a frame is input to the J-filter, the elements of the frame are sorted by size first, and the smallest value T min is selected. T min is the lowest temperature signal detected by the sensor at this moment, and the current indoor temperature can be indirectly reflected by T min . Then, the relative change value T n of each element in the frame at the current temperature is obtained by subtracting T min from the element T n in the frame. The element at the original position in the frame is covered by T n . Finally, the updated infrared signal of the frame is used as the output after noise reduction. Figure 5 corresponds to the original infrared signals of Figure 4 and heatmaps of the infrared signal after noise reduction by the J-filter. Through observation and comparison, it can be found that the infrared signals corresponding to the same activity at different temperatures after J-filter processing have a more obvious characteristic distribution. Under the application of the J-filter, the influence of temperature changes is eliminated to a large extent, which is more beneficial to the data feature extraction in subsequent model training. The effect of the J-filter will be demonstrated in subsequent chapters.  The fluctuation of the sensor itself and the random noise generated in the environment can easily cause signal mutation. This signal mutation is not caused by changes in human activity or location. The Butterworth filter can reduce fluctuations to achieve smooth data [30]. For an abrupt infrared signal, the Butterworth filter is used to reduce its influence on the real signal. The effect of the Butterworth filter on the infrared signal is shown in Figure 6. Figure 6a is the original sequential infrared signal corresponding to the activities of lying, sitting, standing and walking. Figure 6b is the infrared signal processed by the J-filter and the Butterworth filter. By comparison, it can be easily found that the infrared signals corresponding to various activities have more obvious differences after being processed by the J-filter and the Butterworth filter. Due to the complexity of human activity and the influence of factors such as the changes in human location, it is impossible to accurately achieve HAR through the similarity matching of infrared signals. The deep learning method is used to extract the features of the data automatically and train the model to obtain the depth-sensing HAR model of the sequential infrared signals.

Model Training
LSTM has the ability to process long-term information [31]. It solves the problem of gradient disappearance in recurrent neural networks (RNN) [32]. In particular, it has achieved great success in processing time series data [33]. LSTM is selected to train time series infrared signals. The structure of LSTM is shown in Figure 7. The main structure of the network is the rectangular part in the middle. There are three inputs in LSTM totally, which are the input value x t at time t, the output value h t−1 at the previous time and the long-term state C t storing long-term information at the previous time. Compared with an RNN, there are three more control structures in LSTM: a forget gate, an input gate and an output gate, in order to realize the control of the long-term state C t . It is thanks to these special structures that LSTM has control over the long-term state C t . The forget gate controls the long-term state C t−1 at the previous moment, and it controls how many long-term states C t−1 at the previous moment are retained for the long-term state C t at the current moment by calculating the probability, which can be expressed as where f t is the output of the forget gate, σ represents the use of sigmoid as the activation function in the forget gate, W f is the weight matrix and b f is the bias term of the forget gate. The input gate mainly controls the input at the current moment. How many inputs will be saved to the current long-term state C t is also controlled by calculating the probability. The input gate consists of two parts: they are sigmoid activation and tanh activation. The first part uses the sigmoid function to process the data and then outputs them. The second part uses the tanh function to process the data and obtain the outputC t . The corresponding processing formula is as follows: where W i and W C are the weight matrices of the sigmoid activation part and the tanh activation part in the input gate, and b i and b C represent the bias terms of the two activation parts. Combine the two parts: C t is the updated product of the combination of the current time and the long-term stateC t and C t−1 at the previous time. Then, the output h t−1 at the previous time and x t at the current time are processed by the activation function sigmoid to obtain the output o t : where W o is the weight matrix of the output gate, and b o is the bias matrix. Further, through the tanh activation function, the long-term state C t at the current moment is processed by the product of the value and o t to obtain the output h t at the current moment: Each one-dimensional input sample will obtain an output, and the sequential infrared signals corresponding to the L frame will obtain L output values h t . The softmax [34] activation function is further used to normalize the output at the current moment: whereŷ t is the output value obtained after the above processing of the sequential infrared signals of the L frames, τ is the activation function softmax, V is the weight matrix and c is the bias matrix. Through the above formula, the output of each moment will be integrated into the calculation of the next moment. Thanks to the correlation between the input and output of the previous moment and the current moment in the LSTM, the network has the "memory" ability. Finally, the outputŷ t at the last moment is used as the final output. Theŷ t is a one-dimensional vector whose element values are between 0 and 1, and each element corresponds to an activity label. The element value is the probability value calculated by Formula (9). Next, compare the size of each element and output the element position k corresponding to the largest value. It is the activity tag value corresponding to the sequential infrared signal of the L frame. By comparing the output value with the label value, the gradient descent [35] method is used to iteratively update the parameters in the network through the backpropagation [36] process of the network. The collected sequential infrared signals will be trained through the above method. The model is tested and saved during each step of training. The training is over, and the optimal parameters and results obtained during the training process are fixed and saved.

Real-Time Detection
The collected infrared signals are sent to the server through the MQTT protocol in real time. Once the collected infrared signals reach the set number of frames, they are converted into a format that conforms to the model input and sent to the trained LSTM model to obtain probability values corresponding to the activities. The value of the tag corresponding to the largest probability is selected as the predicted value and input into the real-time detection module. It has a picture switching function in the real-time detection module. This function can switch to the corresponding picture according to the activity corresponding to the input value of the prediction and display the real-time activity through the picture. The received data are cleared and repeat the above steps to achieve real-time detection.

Experimental Setup and Data Description
The proposed method was verified in an indoor environment, as shown in Figure 8. The detection area was in the rectangular red area in Figure 8, with an area of approximately 6.6 m 2 , and a length and width of 3.3 m and 2 m, respectively. The infrared array sensor of 8 × 8 pixels, Grid-EYE AMG8833, was installed about 3 m directly above the center of the detection area. The sensor was connected with ESP8266 for data collection, and the frequency was 20 frames/s. The infrared signal was trained and tested in the server, and it was uploaded to the server through a TP-Link886N router. The main hardware configuration of this server has the following: Intel Core i7-8750H CPU, 16G memory and an NVIDIA GeForce GTX1060 GPU. Regarding the software, the server was equipped with the Windows 10 operating system, the deep learning framework was based on Keras 2.2.4 and the construction of the network structure and programming were carried out in Python 3.6.4. There are five states of lying, standing, sitting, walking and empty. The four activities and the corresponding infrared signals' heatmaps correspond to (a), (b), (c) and (d) in Figure 9. During the experiment, an experimenter acted as the experiment area. The experimenter was a male 175 cm tall and 75 kg in weight, a typical body shape in the population. The experimenter collected the infrared signals of lying at the six sampling points shown in Figure 10a. The infrared signals of standing and sitting were collected at 12 sampling points in Figure 10b. The human walked irregularly in the detection area to collect walking infrared signals. In the process of data acquisition, the participant performed different activities in as many different directions as possible. The dataset contains the signal in different directions for each activity.
The infrared signals with different activities in the room temperature range of 14 to 20 • C were fed into the model for training. The robustness and effectiveness of the model were improved. A low-pass Butterworth filter was utilized in this paper, and the cut-off frequency and order were 0.15 and 5, respectively. There were 43,200 frames of infrared signals collected in each of the five states, and the total of the infrared signals in the five states was 216,000 frames. Every 20 frames of infrared signals were taken as a sample, and the total number of samples was 10,800. After processing the collected data with the J-filter and the Butterworth filter mentioned in Section 2, the samples of each state were labeled. The label values of the five states of lying, standing, sitting, walking and empty correspond to 0, 1, 2, 3 and 4. The sample was randomly shuffled and divided into the training set and testing set according to the ratio of 80% and 20%.

Results
Before training, the sequential infrared signals need to be processed into a type suitable for model input. A sample has 20 frames of data, and each frame of data has 64 infrared signals. Each 20 × 64 dimension matrix signal is converted into a 1 × 1280 dimension vector. Therefore, there are 1280 nodes in the input layer of LSTM. Other structural parameters of LSTM are as follows: the number of nodes in the hidden layer is 640 and each hidden layer node is connected to a full connection layer with 100 backward nodes, and there is a ReLU [37] activation layer behind the full connection layer. Finally, the softmax activation function is used to normalize the data to obtain the output. Additionally, the timesteps of LSTM are set to 16. The training steps total 500 times. After each training is completed, the testing set is used to test the effect of the model and save the training and test results of each step. The training and testing results of each step are plotted. As shown in Figure 11, the model was trained to step 99, and the recognition accuracy of the training set reached 100%. At the end of the training, the recognition accuracy of the testing set based on the model constructed by LSTM reached 98.287%. Specifically, it can be seen from the confusion matrix in Figure 12 that the test accuracy of the model for the five states is as high as 96% or more. In particular, the recognition accuracy of lying reached 100%, and only a very small number of false recognition cases occurred in the other states. Through the above results, the proposed method effectively extracts the data characteristics of infrared signals in different human activities and achieves high recognition accuracy for different human activities.  It is worth mentioning the tested samples include infrared signals collected at room temperature from 14 to 20 • C. The effectiveness of the J-filter at different indoor temperatures is further verified. The strong robustness of the proposed method and model has been verified.
The PyQt5 package was used to write a visualization program to verify the actual detection effect of the proposed method. PyQt5 uses Python to encapsulate all classes of Qt. The server obtains the infrared signals collected by the sensor by subscribing to MQTT. Once 20 frames of infrared signals are collected, they are processed by the J-filter and the Butterworth filter and sent to the trained model for detection. As shown in Figure 13, the real-time predicted value can be seen on the left, and the display interface on the right shows the picture corresponding to the recognized activity. Through the setting of the slot function, the picture switching function required for the compilation interface display is realized. The current activity of humans in the detection area is shown through the display interface. The real-time detection module is used to test the activity detection of an on-site human. During the real-time detection process, the experimenters performed the following four activities in the detection area: lying, standing, sitting and walking. The situation in the empty state is also detected. The experimental results are presented in Figure 14. The predicted value of real-time detection is consistent with the real movement, and it is very effective for real-time detection in lying, standing, walking and empty states. According to statistics, the real-time recognition accuracy rates of lying, standing, sitting, walking and empty states reached 100%, 96.250%, 85%, 98.750% and 99.583%, respectively, and the overall recognition accuracy reached 95.917%. The results show the proposed method is still effective in practical applications and achieves a high real-time detection accuracy. The high practical value of this method has been proved.

Impact of the Frames in the Sample
The various frames of samples are used as research objects. Different numbers of frames are selected as samples for experimentation. The frame numbers of 10, 20, 40 and 80 are selected as the number of frames for each sample. The more frames the sample contains, the longer it takes to train the model, as shown in Figure 15. The increase in the number of frames does not lead to an increase in the accuracy of HAR but shows a downward trend when the number of frames exceeds 20. Ten frames of infrared signals are used as a sample, although the training time of the model is the shortest, and the 96.644% accuracy of the model at this time is not as good as the 98.287% accuracy of 20 frames as a sample. The experimental results are analyzed, and the sequential infrared signals' characteristics corresponding to different human activities cannot be well extracted by the model with the length of signal segmentation being too small. Nonetheless, the cost of model training increases if the signal segmentation is too large. The number of frames for each sample of 20 is chosen after considering the model performance and training cost.

Impact of the Number of Hidden Layer Nodes
Although some parameters of the deep learning network cannot play a decisive role in the effect of the training model, they still have a greater impact on the detection results. The impact of the number of hidden layer nodes on the model effect was researched. Five types of numbers are selected for the experiment, and the number of hidden layer nodes is set to 160, 320, 640, 800 and 1280. Figure 16 shows the experimental results under different numbers of hidden layer nodes. The number of hidden layer nodes increases from 160 to 640, and the test accuracy of the model increases from 95.093% to 98.287%. The loss of the model is also reduced from 0.2668 to 0.1257. The number of nodes is further increased, yet the test accuracy of the model shows a downward trend, and the loss value gradually increases. The increase in the number of hidden layer nodes improves the test accuracy and effectively extracts features of the data. The model also has better data-fitting capabilities and appears to over-fit, resulting in a decrease in test accuracy if there are too many nodes. The experimental results show the hidden layer nodes at 640 are the best choice.

Impact of the Timesteps
Timesteps [38] are one of the important parameters of LSTM. The impact of timesteps on the model was studied. The timesteps were set to 4, 8, 16, 32 and 64 for model training. After 500 steps of training, the impact of the model on the accuracy of HAR under different timesteps can be seen in Figure 17. Although the recognition accuracy of the model was 95.046% with the timesteps set to 4, the training time of the model was the longest-the training model took 14.45 × 10 ∧3 s. It is easy to find that the training time of the model decreases with the increase in timesteps, but the downward trend gradually slows down. However, the highest recognition accuracy reached 98.287%, and the training time was reduced to 7.59 × 10 ∧3 s as the timesteps increased to 16. Compared with timesteps set to 4, the recognition accuracy of the model is not only improved by 3.241% but also the training time is reduced by nearly half. The recognition accuracy of the model did not increase with the increase in the timesteps but rather declined. The change in timesteps has an impact on the recognition accuracy of the model; the impact on the training time is particularly significant. Comprehensive consideration of the recognition accuracy and training time of the model and setting the timesteps to 16 represent a cost-effective choice.

Impact of the Filtering Algorithms
The impact of the preprocessing of infrared data on the experimental results is discussed. The effectiveness of the combination of the J-filter and the Butterworth filter method is also verified in this section. The raw data and the data processed by background subtraction, removing the average pixel temperature and just the J-filter were fed into the same model for training. The training results are shown in Table 1. The data processed by the J-filter and the Butterworth filter have a higher accuracy, recall and F1-score than the others. Compared with background subtraction and the method of removing the average pixel temperature, the accuracy of the proposed method is improved by 7.731% and 6.018%, respectively. In addition, it can observed that the accuracy of just the J-filter reaches 96.644%, which is obviously better than background subtraction (90.556%) and removing the average pixel temperature (92.269%). It can be proved that the J-filter is a very suitable filtering approach for infrared signals.

Comparative Analysis
To further prove the efficiency of the proposed method, a comparison was carried out between the proposed approach based on LSTM and other machine learning methods. Extreme learning machine (ELM) [39], support vector machine (SVM) [40], k-nearest neighbors (KNN) [41] and convolutional neural networks (CNNs) [42] were selected to train and test the same sequential infrared signals. The recognition results of each algorithm in different activities are presented in Figure 18. The recognition accuracy of the algorithms is ranked from high to low: for the recognition algorithms based on LSTM, CNN, KNN, SVM and ELM, the corresponding recognition accuracy is 98.287%, 89.242%, 87.490%, 79.108% and 53.402%. It is easy to observe the proposed method is higher than the other algorithms in terms of the recognition accuracy of a single activity and the overall recognition accuracy. The algorithm proposed can still achieve more than 96% accurate recognition of walking in the case of other algorithms misrecognizing walking. The equilibrium of the proposed algorithm to the recognition effect of different activities is verified. Compared with other algorithms, the advantages of the algorithm proposed are more obvious. The results in this paper are compared with wearable device-based [7], WiFi-based [17] and other infrared-based approaches [22,25,27]. The same activities, sitting and standing, are selected as the references and shown in Table 2. Compared with WiFi-and wearable device-based approaches, only one sensor is needed to achieve higher accuracy with the proposed method . The low-resolution infrared sensor proposed in this paper achieves accuracy similar to the higher-resolution infrared method. It can be observed that the proposed method can utilize low cost and, meanwhile, maintain the optimal performance.

Conclusions
This paper presented an HAR approach based on a low-resolution infrared array sensor using a deep learning network. In the proposed HAR method, the infrared thermal images of 8 × 8 are collected from the infrared sensor for recognition. In the data processing stage, the combination of the J-filter and the Butterworth filter is performed to reduce the noise of the raw infrared signal and eliminate the impact of temperature fluctuations. The infrared signal matrix is converted to a temporal sequence for the training stage. The LSTM model network is utilized to automatically extract the characteristics of the infrared signal and train the detection model. The results indicate that the daily activities of lying, standing, sitting, walking and empty can be accurately distinguished with an accuracy of 98.287%. Our approach yields a better performance compared with other machine learning methods.
The proposed device-free HAR system offers a low-cost and high-accuracy sensing solution for application in privacy-preserving smart home scenarios. Future work is needed to extend the system to more participants and conduct the research about transition activities. The performance assessment of infrared sensors at different heights is also planned for the future.