Privacy-Preserved Fall Detection Method with Three-Dimensional Convolutional Neural Network Using Low-Resolution Infrared Array Sensor

Due to the rapid aging of the population in recent years, the number of elderly people in hospitals and nursing homes is increasing, resulting in a shortage of staff. The situation of elderly residents therefore requires real-time attention, especially when dangerous situations such as falls occur; if staff cannot find and deal with them promptly, the consequences may be serious. For such situations, many kinds of human motion detection systems have been developed, most of which are based on portable devices attached to a user's body or on external sensing devices such as cameras. However, portable devices can be inconvenient for users, while optical cameras are affected by lighting conditions and raise privacy issues. In this study, a human motion detection system using a low-resolution infrared array sensor was developed to protect the safety and privacy of people who need care in hospitals and nursing homes. The proposed system overcomes the above limitations and has a wide range of applications. The system can detect eight kinds of motion, of which falling is the most dangerous, by using a three-dimensional convolutional neural network. In experiments with 16 participants and cross-validation of fall detection, the proposed method achieved an accuracy of 98.8% and an F1-measure of 94.9%, which were 1% and 3.6% higher, respectively, than those of a long short-term memory network, showing the feasibility of real-time practical application.


Introduction
With the trend of global aging, various social problems have emerged one after another, such as economic, political, and social development issues and population projections [1]. A large number of countries are stepping up research on population aging [2]. As the number of elderly people in hospitals and nursing homes surges, expectations are increasing for advanced systems that accurately alert staff to dangerous motions and emergencies in real time [3], and the combination of security care and intelligent management can address this problem. By developing artificial intelligence and leveraging big data, many everyday applications can be automated [4]. Although some nursing systems that focus on monitoring and physical condition measurement have been developed, some requirements have still not been satisfactorily met, which shows that this research field has broad application prospects.
There are various forms of motion that need to be detected. Of these, falling is one of the most dangerous and should receive particular attention [5], as it might cause serious consequences such as severe injuries. The main contributions of this study are as follows:
1. We proposed a motion detection method that mainly detects falling while preserving the privacy of elderly people by using a low-resolution infrared array sensor. In the proposed system, one sensor can cover a 4 × 5 m detection area, and the area can be expanded by using more than one independent sensor, which gives the system high flexibility and scalability.
2. We proposed to adopt a three-dimensional convolutional neural network as a suitable classifier for the proposed system. It detected occurrences of falling in real time with a very high accuracy of 98.8% and an F1-measure of 94.9%, which were 1% and 3.6% higher than those of a long short-term memory network, respectively. This means that the proposed system can support the construction of an alert system for health care and assist caregivers.
This paper is organized as follows: Section 2 provides an overview of related work on fall detection. Section 3 describes the materials and methods of the proposed system. Section 4 explains the details of the experiments, and Section 5 shows their results. Section 6 then discusses the results. Finally, Section 7 concludes this study.

Related Work
There are mainly two kinds of fall detection systems: systems based on portable devices and systems based on external sensing devices. First, portable or wearable device-based methods require sensors such as accelerometers and gyroscopes to be attached to parts of the human body, where they detect the acceleration and angular velocity of specific body parts [14]. Mathie et al. used an accelerometer attached to the waist to detect falls [15]: the movement of the human body from standing to falling causes the acceleration to increase suddenly, so a fall can be judged accordingly. Bianchi et al. introduced a barometric pressure sensor as a tool for height measurement to improve existing accelerometer-based human fall detection. The device is worn on the waist and records acceleration and barometric pressure data, which are then processed, and finally a trained hierarchical decision tree model performs fall detection [16]. Haoyu et al. considered the risk level of a fall, including the direction of the fall and whether it was supported by a hand, by constructing a three-layer structure with machine learning algorithms based on wearable sensors [17].
Of the methods based on external sensing devices, camera-based methods take advantage of the rapid development of image processing technology. Computer vision-based motion detection records a video of a person's motion with an optical camera and uses advanced image processing algorithms to determine whether any frame contains a motion feature to detect [18]. De Miguel et al. developed a camera-based fall detection system for the elderly that applies various algorithms to extract better features and used a K-nearest neighbors algorithm for recognition [19]. Yu et al. used an enhanced one-class support vector machine as the recognition algorithm, with input features including the differences in a person's barycenter position and orientation over a time period [20]. Merrouche et al. used a depth camera to detect falls by combining human shape and movement: they analyzed the shape features and movement of the human, tracked the human head with a particle filter, and expressed movement by the covariance of the center-of-mass distance over time [21]. There are other types of external sensing devices. Chelli et al. and Wang et al. used Wi-Fi to detect human motions with a support vector machine (SVM) [22,23]. Mokhtari et al. developed a fall detection system based on an ultra-wide band (UWB) radar with an SVM [24]. Sadreazami et al. and Liang Ma et al. also used a UWB radar for fall detection with a convolutional neural network (CNN) [25,26].
During actual application, different living environments and their limitations should be considered to ensure the convenience, reliability, and practicability of the system. For instance, wearable devices inevitably lead to decreased comfort. Moreover, in private areas such as restrooms and bathrooms, privacy should be protected, which means that the application of an optical camera is restricted in such situations. When designing the system, these problems should be considered carefully [27].
According to a survey, infrared sensors have been used for human detection [28], target tracking [29], and motion recognition [30]. High-resolution infrared sensors and sensor arrays are commonly used in motion detection but are insufficient for privacy protection and are expensive. Studies that focused on fall detection using low-resolution infrared sensors are summarized in Table 1; the accuracy of each study is discussed in Section 7. Among them, Tao et al. proposed a privacy-preserved fall detection system using an infrared ceiling sensor network [31], which was similar to our proposed system. The resolution of the sensors used was quite low, just 4 × 5 pixels, so they adopted a sensor network, which made the system complex.

Materials and Methods
This section explains the sensor used in this study, the system construction, the system flow chart of the proposed method, the data processing of thermal images, the method of target existence detection and positioning, and the two kinds of classifiers selected in this study.

Infrared Array Sensor
An infrared sensor is a kind of sensor that uses the physical properties of infrared light to make measurements. Infrared sensors are usually used for non-contact temperature measurement, gas composition analysis, and non-destructive testing, and are widely used in the fields of medicine, military, space technology, and environmental engineering [42].
Commonly, infrared sensors are divided into two types: near-infrared (NIR) and far-infrared (FIR). Compared with the NIR sensor, the FIR sensor has a stronger anti-interference ability. Compared with other infrared sensors such as passive infrared detectors and single-point infrared sensors, infrared array sensors can provide more comprehensive information including position information, movement states, surface temperature, and others [43].
In this research, from the perspective of versatility and privacy, a detection method based on a low-resolution infrared array sensor is proposed. In the proposed motion detection system, the low-resolution infrared array sensor protects the privacy of the elderly, which means it can be used in private places such as locker rooms and restrooms. This method can effectively recognize elderly people's fundamental motions and detect their emergencies.
A far-infrared thermal sensor, the MLX90640 manufactured by Melexis, was applied in this research [44]. The 32 × 24 thermopile elements in the sensor detect the infrared radiation of objects in the detection area and return a low-resolution thermal image that reflects the temperatures of the objects [32]. The normal working temperature of the sensor ranges from −40 to 85 °C, which means that the sensor should perform well under all common conditions. The sampling rate of the sensor is programmable up to 64 Hz, so it can be used for real-time detection [45,46]. An I2C-compatible digital interface is provided for data transmission, with a transmission frequency of up to 1 MHz. In this research, the sensor was connected to a micro-controller unit, an M5Stack including a Wi-Fi module, through a GROVE I2C port, as shown in Figure 1. The data collected by the M5Stack can be sent over Wi-Fi via routers to a PC in a separate location, such as a nurse station or a staff operation room, in a remote monitoring system. The data gathered on the PC are continuously processed to detect a fall, and when one is detected, the system announces an alert to the staff.
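As a software-side illustration, each frame arriving at the PC can be unpacked into a 24 × 32 temperature matrix before further processing. This is a minimal sketch assuming the payload is a flat list of 768 pixel temperatures; the function name and payload format are assumptions, not the paper's implementation.

```python
import numpy as np

SENSOR_ROWS, SENSOR_COLS = 24, 32   # MLX90640 resolution
FRAME_RATE_HZ = 15                  # acquisition frequency used in this system

def payload_to_frame(payload):
    """Convert a flat 768-value temperature payload (degrees C),
    as received over Wi-Fi, into a 24 x 32 thermal frame."""
    values = np.asarray(payload, dtype=np.float32)
    if values.size != SENSOR_ROWS * SENSOR_COLS:
        raise ValueError("expected 768 pixel temperatures")
    return values.reshape(SENSOR_ROWS, SENSOR_COLS)
```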
In this system, the acquisition frequency of the sensor was set to 15 Hz considering the data size to be processed because the data processing time can be relatively short. An example of an infrared array image is shown in Figure 2. Here, Figure 2a shows the real hand image and Figure 2b shows the RGB colored infrared array image. Originally, the infrared image is grayscale (8 bits); however, here it is shown in color for clear and easy viewing.

System Flow Chart of the Proposed Method
The flow chart of the proposed motion detection system is shown in Figure 3. The processing methods are mainly divided into three parts: preprocessing, target detection, and motion detection. In the first part, the raw data frames obtained from the sensor are denoised, and then separated into two parts, background and objects. The background can be removed to extract an area of the objects. In the next part, methods of target presence detection and target location are used for target detection. Finally, in the last part, a three-dimensional convolutional neural network (3D CNN) [47] is applied to motion detection to deal with the time series image data. A long short-term memory (LSTM) network with feature extraction is also applied for comparison.


Data Preprocessing
After the data frames are collected from the sensor, preprocessing including denoising and background subtraction is executed. In consideration of the noise distribution, Gaussian filtering, a linear smoothing filter that has been commonly used to reduce noise in image processing [48], was chosen. In this research, the data size and noise intensity were considered when determining the filter; therefore, the smallest filter, shown in Equation (1), is sufficient:

G = \frac{1}{16} \begin{bmatrix} 1 & 2 & 1 \\ 2 & 4 & 2 \\ 1 & 2 & 1 \end{bmatrix} (1)

Figure 4a shows a typical frame of raw data where the target exists. The data after Gaussian filtering are shown in Figure 4b. After denoising, background subtraction is executed to extract the target. A common method is to record some frames when there is no target in the area and to regard these data as the stable background.
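As a sketch, assuming the standard smallest (3 × 3) Gaussian kernel, the denoising step could look like the following; the function name is illustrative:

```python
import numpy as np
from scipy.ndimage import convolve

# Smallest Gaussian kernel (3 x 3); its weights sum to 1.
GAUSSIAN_3X3 = np.array([[1, 2, 1],
                         [2, 4, 2],
                         [1, 2, 1]], dtype=np.float32) / 16.0

def denoise(frame):
    """Smooth a 24 x 32 thermal frame with the 3 x 3 Gaussian filter,
    replicating edge pixels at the border."""
    return convolve(frame.astype(np.float32), GAUSSIAN_3X3, mode="nearest")
```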

In real applications such as hospital wards, there is usually some electronic equipment that heats up when operational, such as heart rate monitors, which may be detected and judged as a human and cause false detection. To deal with this problem, background subtraction was applied. The distribution of the background data is shown in Figure 4c.
In general, non-human items do not move frequently, so we collected a relatively long series of data and used the average as the initial background, renewing it by Equation (2):

BG_{i+1} = \alpha \, BG_i + (1 - \alpha) \, FG_i, (2)

where BG_i means the background in the i-th frame and FG_i means the foreground data of the i-th frame. The weight \alpha is a constant decided by the application environment. This background subtraction and renewing method shows suitable performance without much calculation cost. Figure 4d shows the data of Figure 5 after background subtraction.
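A minimal sketch of this running-average background update and the subtraction step follows; the exact update form, the default value of α, and the clipping of the foreground at zero are assumptions for illustration:

```python
import numpy as np

def update_background(bg, frame, alpha=0.95):
    """Exponential running-average background update.
    An alpha close to 1 makes the background adapt slowly,
    so briefly present warm objects do not enter it."""
    return alpha * bg + (1.0 - alpha) * frame

def subtract_background(frame, bg):
    """Foreground = frame minus background, clipped at zero
    (only pixels warmer than the background are of interest)."""
    return np.clip(frame - bg, 0.0, None)
```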

Target Existence Detection and Positioning
After the background subtraction, target detection is first performed for each frame. It can be assumed that a human is warmer than the environment and emits more heat radiation, so the human image consists of spots with higher temperatures in the thermal image. In the application environments of hospitals and nursing rooms, few moving non-human items emit a large amount of heat radiation, so human detection can be simplified to the detection of larger heated spots. First, the local highest-temperature pixel and its 5 × 5 neighborhood pixels are selected. Then, the selected pixels with temperatures at least 1 °C higher than the background are marked. When the size of a marked area is greater than or equal to N_H pixels, the area is detected as a human [44], as in Equation (3):

\text{target} = \begin{cases} \text{human}, & N_{marked} \ge N_H \\ \text{non-human}, & \text{otherwise}, \end{cases} (3)

where N_H is a threshold for human detection and N_{marked} is the number of marked pixels in the area. In Figure 5, the highest-temperature pixel of the detected target is shown as a black point. Here, N_H is decided based on the target size in the system environment.
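The marked-region rule above can be sketched as follows; the connected-component grouping and the default values of the temperature margin and N_H are assumptions chosen for illustration:

```python
import numpy as np
from scipy.ndimage import label

def detect_human(diff, temp_margin=1.0, n_h=6):
    """Detect a human as a connected region of pixels at least
    temp_margin degrees C above the background whose size is >= n_h pixels.
    diff: background-subtracted frame; n_h plays the role of N_H."""
    mask = diff >= temp_margin
    labeled, num = label(mask)                    # 4-connected regions
    for region_id in range(1, num + 1):
        if np.count_nonzero(labeled == region_id) >= n_h:
            return True
    return False
```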
Sensors 2020, 20, 5957

The representative pixel with the highest value is not always the middle point of a human, because it is influenced by clothes, gestures, and noise. To correctly express the position of the target and obtain better performance, the target position should be corrected. The barycenter of the neighborhood in the local highlighted pixels is calculated by Equation (4).
(x_c, y_c) = \left( \frac{\sum_{x,y} a_{x,y}\, x}{\sum_{x,y} a_{x,y}},\ \frac{\sum_{x,y} a_{x,y}\, y}{\sum_{x,y} a_{x,y}} \right), (4)

where (x_c, y_c) is the coordinate of the barycenter in this frame of the image and a_{x,y} is the value at point (x, y). This process is repeated with the renewed middle point and the same neighborhood size until the barycenter stops moving. Figure 6 shows the adjusted center point of the target as a black point [45]. The barycenters obtained from sequential frames can be used to detect the moving direction, moving distance, and speed of the target in this system. The motion detection is based on the result of the target tracking. When there are multiple targets in the frame, motion detection can be performed accordingly, and when one target is judged to have disappeared from the area, the motion detection for that target is stopped.
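The iterative barycenter refinement can be sketched as below, assuming a 5 × 5 window to match the neighborhood size used above; the function name and the iteration cap are illustrative:

```python
import numpy as np

def barycenter(frame, start, half=2, max_iter=20):
    """Iteratively refine the target centre: compute the intensity-weighted
    barycenter of the (2*half+1) x (2*half+1) neighbourhood (5 x 5 here)
    and move the window there, until the barycenter stops shifting."""
    rows, cols = frame.shape
    cy, cx = start
    for _ in range(max_iter):
        y0, y1 = max(cy - half, 0), min(cy + half + 1, rows)
        x0, x1 = max(cx - half, 0), min(cx + half + 1, cols)
        window = frame[y0:y1, x0:x1]
        total = window.sum()
        if total == 0:                      # nothing warm in the window
            break
        ys, xs = np.mgrid[y0:y1, x0:x1]
        ny = int(round(float((window * ys).sum() / total)))
        nx = int(round(float((window * xs).sum() / total)))
        if (ny, nx) == (cy, cx):            # converged
            break
        cy, cx = ny, nx
    return cy, cx
```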

Three-Dimensional Convolutional Neural Network
A traditional two-dimensional convolutional neural network (2D CNN) applied to video with continuous frames can easily lose information along the time axis, resulting in low recognition accuracy. To address this problem, a method based on a 3D CNN has been proposed [49]. The 3D CNN has received considerable attention over the years for understanding consecutive frames, mainly owing to its effective extraction of spatio-temporal features: the 3D convolution kernel extracts temporal and spatial features to capture the motion information of the object [50,51].
In Figure 7a, the traditional 2D CNN uses a 2D convolution kernel to convolve the image. The temporal and spatial feature information in consecutive frames can be better handled by a 3D CNN. The 3D convolutional neural network includes a convolutional layer, a pooling layer, a fully connected layer, and a softmax layer. The 3D convolutional layer extends the 2D convolutional neural network: the size of the convolution kernel and the filter size of the pooling layer in each layer of the network are upgraded to three dimensions, and the output of the convolutional layer is cube-shaped data. In the fully connected layer, neurons are connected to all neurons in the adjacent layer; the convolved 3D feature maps of the previous layer are flattened into a neuron vector that serves as its input. The final output layer is the softmax layer, in which the last neuron vector is converted into the probability of each category. In Figure 7b, the 3D CNN uses a 3D convolution kernel to perform convolution operations on the image sequence (video) [52,53]. The time dimension of the convolution operation in the figure is three, that is, the convolution operates on three consecutive frames. The 3D convolution in Figure 7b forms a cube by stacking multiple consecutive frames, and the convolution operation is then performed on this cube [54]. In this structure, each feature map in the convolutional layer is related to multiple adjacent consecutive frames in the previous layer, which is how it captures motion information. The value at a certain position of a convolution map is obtained by convolving the local receptive field at the same position in several consecutive frames of the previous layer [55].
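To make this concrete, a single 3 × 3 × 3 kernel convolving a stack of frames mixes each pixel with its spatial neighbors in the current frame and in the two adjacent frames. The following is an illustrative sketch only; the uniform averaging kernel and the 8-frame clip length are assumptions, not the trained network's weights:

```python
import numpy as np
from scipy.ndimage import convolve

# A stack of 8 consecutive 24 x 32 frames: shape (time, height, width).
clip = np.random.rand(8, 24, 32).astype(np.float32)

# A 3 x 3 x 3 kernel spans three consecutive frames at once, so each
# output value mixes temporal and spatial neighbours.
kernel = np.ones((3, 3, 3), dtype=np.float32) / 27.0
features = convolve(clip, kernel, mode="constant")

# With same-size padding, the output keeps the (time, height, width) shape.
```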

The flow chart of the 3D CNN classifiers applied in this system is shown in Figure 8. In this research, the 3D CNN has the following structure:
- A convolutional layer with 32 filters and a 5 × 5 × 5 cube size for each filter;
- A ReLU layer to complete the activation;
- A max-pooling layer with a cube size of 3 × 3 × 3;
- A fully connected layer with 128 points;
- A ReLU layer with 50% dropout to avoid overfitting;
- A fully connected layer with 8 points;
- A softmax layer and a classification layer.
Each parameter was decided according to the complexity of the target and the number of classes [46]. The effect of a convolutional layer is to convolve the input and pass the result to the next layer. The number of free parameters can be markedly reduced by the convolution operation, making it possible for the network to be deeper with fewer parameters. The vanishing and exploding gradient problems that occur during backpropagation in traditional neural networks are mitigated by applying regularized weights over fewer parameters. Pooling layers combine the outputs of neuron clusters into a single neuron, reducing the dimensions of the data passed to the next layer and thus the risk of overfitting. In our research, a pre-test was performed to choose the best sizes of the convolution kernel and pooling window by comparing detection accuracy. In the pre-test, 50 fall samples were gathered from each of 10 participants as a dataset, of which 80% was used for training and 20% for testing.
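The listed structure can be sketched in PyTorch as follows. The paper does not state a framework, and the 16-frame clip length and single input channel are assumed values, so this is an illustration rather than the authors' implementation:

```python
import torch
import torch.nn as nn

# Input clips assumed as (batch, channels=1, frames=16, height=24, width=32).
model = nn.Sequential(
    nn.Conv3d(1, 32, kernel_size=5, padding=2),   # 32 filters, 5x5x5 kernels
    nn.ReLU(),                                    # activation
    nn.MaxPool3d(kernel_size=3),                  # 3x3x3 max pooling
    nn.Flatten(),                                 # to a neuron vector
    nn.Linear(32 * 5 * 8 * 10, 128),              # 128-point fully connected layer
    nn.ReLU(),
    nn.Dropout(p=0.5),                            # 50% dropout against overfitting
    nn.Linear(128, 8),                            # one output per motion class
    nn.Softmax(dim=1),                            # class probabilities
)
```

The `Linear` input size follows from the assumed clip shape: same-padded convolution keeps (16, 24, 32), and 3 × 3 × 3 pooling reduces it to (5, 8, 10) over 32 channels.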


A Long Short-Term Memory
An LSTM network has a structure used to process sequence data that changes with time, space, and other factors. The LSTM network can solve the long-term dependence of the data model, that is, the time-based dependence of data. In continuous frame data, the LSTM network predicts the timeseries sequence data of the future frame based on the previous sequence data, and expresses the spatio-temporal characteristics in a vector, as input of the cascaded an LSTM unit. The output of the LSTM unit contains the spatio-temporal information. The LSTM network adds a forgetting unit to

A Long Short-Term Memory
An LSTM network has a structure suited to processing sequence data that changes with time, space, and other factors. The LSTM network can model long-term dependence in data, that is, time-based dependence. In continuous frame data, the LSTM network predicts the time-series data of future frames based on the previous sequence, and expresses the spatio-temporal characteristics in a vector that is input to the cascaded LSTM unit. The output of the LSTM unit contains the spatio-temporal information. The LSTM network adds a forgetting unit to the structure of a recurrent neural network (RNN) to solve the problems of gradient disappearance and gradient explosion during long-sequence training, making it perform better on longer sequences [56]. The LSTM network replaces the hidden- and output-layer nodes of the RNN with memory cells, which consist of input gates, forget gates, and output gates that control the update of the memory state vectors [57,58]. This alleviates the problem of gradient loss during recurrent neural network training.
Although the LSTM network is by nature advantageous for prediction problems, some studies have used the LSTM network, or hybrid methods built on it, as a classifier for time-series data of human motions [34,59]. In this research, an LSTM network was used with 128 hidden neurons, 6 dimensions at each time step, and a fully connected layer with 8 hidden neurons. Here, the time step size is the number of frames. Considering the characteristics of human motions, six features were extracted and input into the LSTM network. The flow chart of the LSTM network is shown in Figure 9. Before obtaining the features, the Otsu method is applied to separate the target from the complete picture [60].
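The gating mechanism described above can be illustrated with a minimal NumPy sketch of a single LSTM step, using the configuration stated here (6 features per time step, 128 hidden units, 15 frames). The weights are random placeholders, not trained values, and the layout of the stacked weight matrices is an illustrative convention.

```python
import numpy as np

# Minimal NumPy sketch of one LSTM step with input, forget, and output gates.
# Weights are random placeholders; this is not the paper's trained network.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step; W: (4H, D), U: (4H, H), b: (4H,)."""
    H = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    i = sigmoid(z[0:H])            # input gate
    f = sigmoid(z[H:2*H])          # forget gate
    o = sigmoid(z[2*H:3*H])        # output gate
    g = np.tanh(z[3*H:4*H])        # candidate cell state
    c = f * c_prev + i * g         # memory cell update
    h = o * np.tanh(c)             # hidden state passed onward
    return h, c

rng = np.random.default_rng(0)
D, H, T = 6, 128, 15               # 6 features, 128 hidden units, 15 frames
W = rng.normal(0, 0.1, (4 * H, D))
U = rng.normal(0, 0.1, (4 * H, H))
b = np.zeros(4 * H)

h, c = np.zeros(H), np.zeros(H)
for t in range(T):                 # unroll over the 6 x 15 feature matrix
    x_t = rng.normal(size=D)
    h, c = lstm_step(x_t, h, c, W, U, b)
print(h.shape)                     # final hidden state fed to the 8-way dense layer
```

The final 128-dimensional hidden state would then pass through the fully connected layer with 8 outputs for classification.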
The Otsu method is an efficient algorithm for binarizing images by automatically obtaining a threshold from a bimodal histogram; it divides the image into two parts, the background and the target, according to the grayscale characteristics of the image. After that, six features are extracted for detection: the moving distance, the area size, the change rate of the area size of the target, the highest temperature, the average temperature, and the directional distribution ratio.
Among them, the change rate of area size is calculated by dividing the area size of the target in the current frame by that of the previous frame as follows:

change rate of area size = Area_n / Area_(n−1)

The directional distribution ratio is calculated from the directional distribution of the target. This is a spatial statistics algorithm that calculates the lengths of an ellipse in the x and y directions centered on the average center of the element space distribution, thereby defining the axes of the ellipse containing the element distribution [61]. This feature is defined as the ratio between the longer axis and the shorter one, as shown in Figure 10. The equation to calculate the directional distribution ratio is as follows:

directional distribution ratio = D_L / D_S

where D_L and D_S represent the lengths of the longer axis and shorter axis of the detected ellipse, respectively.
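The first two steps of this feature pipeline can be sketched as follows: Otsu binarization to isolate the target, then the change rate of the area size between consecutive frames. The 8 × 8 synthetic frames and function names are illustrative assumptions, not the paper's data or code.

```python
import numpy as np

# Sketch of Otsu thresholding plus the change-rate-of-area-size feature.
# Synthetic 8x8 "thermal" frames stand in for real sensor data.

def otsu_threshold(img, levels=256):
    """Threshold maximizing the between-class variance of the histogram."""
    hist, _ = np.histogram(img, bins=levels, range=(0, levels))
    p = hist / hist.sum()
    best_t, best_var = 0, -1.0
    for t in range(1, levels):
        w0, w1 = p[:t].sum(), p[t:].sum()          # class probabilities
        if w0 == 0 or w1 == 0:
            continue
        mu0 = (np.arange(t) * p[:t]).sum() / w0    # background mean
        mu1 = (np.arange(t, levels) * p[t:]).sum() / w1  # target mean
        var = w0 * w1 * (mu0 - mu1) ** 2           # between-class variance
        if var > best_var:
            best_var, best_t = var, t
    return best_t

def area_size(img, thresh):
    """Number of pixels classified as target."""
    return int((img >= thresh).sum())

# Two consecutive frames: a warm target on a cool background, shrinking.
frame1 = np.zeros((8, 8)); frame1[2:6, 2:6] = 200   # 16 target pixels
frame2 = np.zeros((8, 8)); frame2[3:6, 3:6] = 200   # 9 target pixels
t = otsu_threshold(frame1)
rate = area_size(frame2, t) / area_size(frame1, t)  # change rate of area size
print(rate)
```

A shrinking target area across frames (rate well below 1) is exactly the kind of signal the change-rate feature is meant to capture for motions such as falling.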

Experiment
This section explains the experimental environment and setting, dataset of the target motions, and experimental ethics.

Experiment Environment
The sensor, with a visual angle of 110° × 75°, was attached to the ceiling of the room above the center of the detection area, as shown in Figure 11. The experiments were conducted in an area of 4 m × 5 m. In addition, the experiment environment was complex, containing chairs, desks, and computers, which accurately simulated the actual application environment. As the figure shows, the height of the experiment room was 2.8 m, a common ceiling height in the intended application environments such as hospitals and nursing rooms. During the experiments, the ambient temperature was set to around 25 °C, which is comfortable for people and common in real application conditions. All the participants were asked to perform the target motions detected by the proposed motion detection system by following their own habits.

Motion Dataset
In this motion detection research, there were eight target motions, as shown in Table 2, and each motion was recorded for 3 s to obtain motion frames. Each motion was repeated 50 times to construct a dataset. Among them, the motion type 'one motion' means that the motion started at the beginning of the frames and finished by the end of the frames. The motion type 'static motion' means that the motion started at the beginning of the frames and the posture was kept, with slight movement, until the end. The motion type 'continuous motion' means that the motion continued from the start to the end of the frames while changing direction randomly. A total of 16 participants' data were collected, which means that 800 sets of data were collected for each motion. During the experiment, the participants were asked to perform the corresponding motions in different directions within the detection area. For static motions, such as sitting and standing, the participants were asked to hold the posture for five seconds. For the other, dynamic motions, the participants were asked to follow their own habits. For all motions, the experiments were performed many times at different positions in the detection area to ensure the diversity of the obtained data.

Experiment Ethics
The risk of experimenting with elderly people is prohibitively high. As the resolution of the infrared array sensor used in this system is relatively low, it can be supposed that the characteristics of the data acquired from young people are roughly similar to those of the elderly. Therefore, the participants in this study were all in their twenties.
Obtaining fall motion data from the participants was required, which might be dangerous. Therefore, the safety of the participants was considered: during the experiment, a sofa and a large sponge pad were prepared to protect them.
All participants gave their informed consent for inclusion before they participated in the study. The study was conducted in accordance with the Declaration of Helsinki, and the protocol was approved by the Ethics Committee of H29-230.

Results
This section explains calculation setting and cross-validation results of two classifiers, selection of input frame number, computation time of training and testing for two classifiers, and application to multi-target detection.

Cross-Validations and Results
In this research, datasets from 16 participants were collected with the infrared array sensor. After gathering the data, a series of data processing steps such as noise reduction and background removal were performed. However, it was unclear how many frames of data could accurately represent a motion for one detection. Hence, 15 frames were tentatively set as the sequential data length. The 15 frames extracted from the sequential frames were directly input into the 3D CNN to detect a motion. In the LSTM network, after calculating the six features, a 6 × 15 feature matrix was used for detection.
In this system, the parameter α of Equation (2) was set to 0.999 by considering the frame rate of the sensor, and the parameter N H of Equation (3) was set to 8 by considering the target size in the actual data. In the training processes of both classifiers, the learning rate was set to 0.001, and the number of maximum epochs was set to 100.
In order to improve the versatility and reliability of the system, multiple experiments were used to examine the effects of individual differences. The detection performance was evaluated using commonly used cross-validation [62]. The cross-validations were performed with data from 14 participants for training and data from two participants for testing. This process was repeated eight times by changing the combination of the two test data sets (8-fold cross-validation). The results and the correct answer rates of every motion under the LSTM network and the 3D CNN are shown in Tables 3 and 4, respectively. Table 5 shows the average, maximum, and minimum correct answer rates of each motion for the 3D CNN cross-validation.
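The subject-wise splitting scheme described here can be sketched directly: 16 participants, two held out for testing in each fold, the remaining 14 used for training. The function name and the pairing of consecutive participant IDs are illustrative assumptions; the paper does not specify how the test pairs were combined.

```python
# Sketch of the subject-wise 8-fold cross-validation: each fold holds out a
# disjoint pair of participants for testing and trains on the other 14.
# Pairing consecutive IDs is an assumption for illustration.

def subject_folds(participants, test_size=2):
    """Return (train_ids, test_ids) folds with disjoint test groups."""
    folds = []
    for i in range(0, len(participants), test_size):
        test = participants[i:i + test_size]
        train = [p for p in participants if p not in test]
        folds.append((train, test))
    return folds

folds = subject_folds(list(range(1, 17)))   # participants 1..16
print(len(folds))                           # 8 folds
```

Splitting by participant rather than by sample ensures that no person's data appears in both the training and test sets, which is what makes the reported accuracy reflect generalization to unseen individuals.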

Input Frame Number Selection
A human motion is a continuous process involving a change of body status. From observation of the motions in the experiment, all the motions other than 'walking' took 0.7 to 1 s from the start to the end of the motion. This means that, in the data obtained at a 15 Hz sampling rate, the motion appeared within 15 frames. In order to confirm the proper input frame number, frame numbers between 10 and 20 were tested for comparison. The result of this experiment is shown in Figure 12.

Computation Time
The computation time of each classifier at the 15 frames input data with and without a graphics processing unit (GPU) for the cross-validations in Section 5.1 is shown in Table 6. The calculations were executed on a laptop PC with an Intel i7-4710HQ CPU (max: 3.5 GHz) and an NVIDIA GF850M GPU. The LSTM network, with or without a GPU, was six to nine times faster than the 3D CNN.

Multi-Target Detection
In the actual application of the system, it is very likely that multiple targets will exist in the detection area at the same time. Therefore, by using the Otsu method, the number of targets can be counted from a binarized image. The areas of the different targets in the image were separated into different images, with the background set to the lowest temperature, and then motion detection was performed for each image, as shown in Figure 13. Data of two and three targets in the detection area were obtained to confirm the capability of the system in multi-target situations. In total, 100 kinds of situations for two and three targets, in which one target fell while the others were walking, were processed and classified with the 3D CNN. The results are shown in Table 7.
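Counting separated targets in a binarized frame amounts to counting connected regions of foreground pixels. The following is a minimal sketch of that idea using 4-connected flood fill on a toy grid; the paper does not specify its exact counting procedure, so this is one plausible implementation, not the authors' code.

```python
# Sketch of counting targets in an Otsu-binarized frame by labeling
# 4-connected regions of 1s with an iterative flood fill.

def count_targets(binary):
    """Number of 4-connected regions of 1s in a 2-D list."""
    rows, cols = len(binary), len(binary[0])
    seen = [[False] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if binary[r][c] == 1 and not seen[r][c]:
                count += 1
                stack = [(r, c)]               # flood fill this region
                while stack:
                    y, x = stack.pop()
                    if 0 <= y < rows and 0 <= x < cols \
                            and binary[y][x] == 1 and not seen[y][x]:
                        seen[y][x] = True
                        stack += [(y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)]
    return count

frame = [[0, 0, 0, 0, 0, 0],
         [0, 1, 1, 0, 0, 0],
         [0, 1, 1, 0, 1, 1],
         [0, 0, 0, 0, 1, 1],
         [0, 0, 0, 0, 0, 0]]
print(count_targets(frame))   # two separated warm regions
```

This also illustrates the failure mode discussed later: if two targets stand close enough that their warm regions touch, the flood fill merges them into a single region and the count drops.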

Discussion
In this section, we discuss the correct answer rates of motion detection for each motion and frame rate, the performance metrics and comparison with other studies, and the limitations of the proposed system.

Correct Answer Rates of Motion Detection
(1) Each motion detection
As seen in the results shown in Table 3, the average correct answer rate of the LSTM network for the eight kinds of motions in this research was 82.5% for 16 participants. It can be concluded that the LSTM network performs well in the classification of the proposed eight motions; in particular, for the motion of falling, the correct answer rate was over 92%. However, some of the correct answer rates could not reach 80%. The reason for these results is that the six extracted features do not contain enough information for the classifier to distinguish the eight motions. As seen in the results in Table 4, the average correct answer rate of the 3D CNN for the eight motions was 94.2% for the data of 16 participants. It can be concluded that the 3D CNN performs better than the LSTM network in the classification of the eight motions, as it automatically extracts spatial and temporal information from the original image data. However, in some cases, falling was misclassified as walking. The reason is probably that, among the eight motions, falling and walking have more variety of movement than the others.
In Table 5, for each motion, the difference between the average and the minimum is obviously larger than the difference between the maximum and the average. The reason could be that a few participants' motions were much different from the others', resulting in a large part of those motions being classified as other motions. In order to avoid this problem, more data need to be collected to eliminate individual differences.
(2) Frame rates
According to the results in Figure 12, the test using 15 frames had a higher correct answer rate than the tests using other frame numbers in the LSTM network. For the 3D CNN, the correct answer rates were close between 15 and 20 frames, while those between 10 and 14 frames were slightly smaller. Therefore, considering the calculation time, it can be said that the setting of 15 frames is better than the others. Of course, this depends on the sampling rate, and thus it should be selected based on the target movement.

(3) Computation time
On the PC with a GPU, the LSTM network could provide full real-time motion estimation at speeds of 15 fps; meanwhile, the 3D CNN can realize 8 fps estimation. The calculation time required for the 3D CNN is obviously longer than that of the LSTM network. However, due to the low-resolution grayscale image used in this system, it does not affect the actual application process.
(4) Multi-target detection
As the number of targets increases, the accuracies of motion detection and participant number estimation decrease. The main reason for this result is that when the targets are close to one another, the system regards them as one larger target and makes a misjudgment. If a higher-resolution sensor were used, the accuracy might increase; however, the privacy invasion problem must also be considered. It becomes a trade-off between privacy and accuracy; therefore, a suitable sensor resolution and setting need to be chosen.

Performance Metrics and Comparison
In order to compare our system with other studies, the effectiveness of the classifiers is confirmed by five metrics [63]. Each metric is expressed as follows:

Accuracy = (TP + TN) / (TP + FP + TN + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
Specificity = TN / (TN + FP)
F1-measure = 2 × Precision × Recall / (Precision + Recall)

where TP, FP, TN, and FN are true positives, false positives, true negatives, and false negatives, respectively. In these metrics, the seven motions other than falling are categorized as non-falling (negatives). Table 8 shows the performance and characteristics of our proposed system and other studies using infrared sensors.
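These five metrics can be computed directly from the confusion counts. The sketch below uses the standard definitions with illustrative counts, not the paper's actual confusion matrix.

```python
# The five evaluation metrics, computed from TP/FP/TN/FN counts where the
# seven non-fall motions are pooled as negatives. Counts are illustrative.

def metrics(tp, fp, tn, fn):
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                       # also called sensitivity
    specificity = tn / (tn + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, specificity, f1

acc, pre, rec, spe, f1 = metrics(tp=90, fp=5, tn=95, fn=10)
print(round(acc, 3), round(f1, 3))
```

Because falls are rare relative to ordinary motions, accuracy alone can look high even when falls are missed; reporting recall, specificity, and F1-measure alongside it, as done here, guards against that.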
The recent research of [26] used a CNN with an LSTM network for fall detection using UWB and compared their proposed methods to other research with a CNN in the cross-validation results. The results of their methods showed better evaluation values of 95.78%, 98.04%, and 95.33% for accuracy, recall, and specificity, respectively.
In the research of [30], using pyroelectric infrared sensors for fall detection, the proposed method with a second-layer random forest had 92.51% mean accuracy. In the research of [31], which used infrared sensors to detect falling, the proposed method with an SVM showed evaluation values of 92.45%, 98.00%, and 95.14% for precision, recall, and F1-measure, respectively. In the research of [37], an infrared sensor with 32 × 32 resolution was used, and it could detect only falling and walking using a pixel-number threshold method. In the research of [38], using a deep convolutional neural network, they could achieve high accuracy; however, because of the small number of participants, the availability of the method is unclear.
From the comparison results, even though the performance metrics of these studies cannot be compared directly due to different experiment settings, it can be said that our proposed method with a 3D CNN is sufficiently accurate and highly useful. Note: the symbol '-' means the corresponding term is not indicated in the paper clearly.

The research of [31] used a very unique and interesting idea to realize motion detection using multiple low-resolution infrared sensors. Each sensor was connected to a sensor network, and the detection area could be easily expanded in their system. However, to detect motions in an area of 3 × 3.75 m, they needed 20 sensor nodes, which leads to an increase in system construction cost. In contrast, our proposed system uses one device to detect motions in an area of 4 × 5 m, and we can expand the detection area by using several devices independently. When a wide area needs to be watched, devices can be located on a grid on the ceiling and connected with Wi-Fi. They can also be placed separately at the entrance of a room or in a dangerous place. This means that our system has high flexibility in the construction of a fall detection system.

Limitation of the System
After subtraction of the background in a thermal image, in almost all cases, the temperature distribution of a human is within 5 °C, as shown in Figure 5. The noise equivalent temperature difference of this sensor is 0.1 to 0.25 K, depending on the sampling rate. Therefore, we can detect the target human as a gradient object by using this sensor. However, when the height at which the sensor is attached to the ceiling changes to a certain extent, the pixel size of the target changes as well, which may cause a decrease in detection accuracy. If a higher-resolution infrared sensor were used, the accuracy could be improved further; however, as mentioned above, the risk of privacy invasion should be considered at the same time.
When a heat source other than a human exists in the detection area and its temperature fluctuates frequently, the system may misidentify it as a human. Provided that the position of the heat source is known in advance, a masking process can be applied on the thermo image in the preprocessing to ignore the effect of the heat source.
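The masking idea above can be sketched in a few lines: if the position of a non-human heat source is known in advance, its region of the background-subtracted thermal image is suppressed before detection. The region coordinates and array sizes below are hypothetical.

```python
import numpy as np

# Sketch of preprocessing that masks a known non-human heat source so it
# cannot be misidentified as a person. Coordinates are hypothetical.

def mask_heat_source(thermo, region):
    """Suppress a known heat source; region = (row0, row1, col0, col1)."""
    out = thermo.copy()
    r0, r1, c0, c1 = region
    out[r0:r1, c0:c1] = 0.0       # treat the masked area as background
    return out

frame = np.zeros((8, 8))          # background-subtracted thermal image
frame[1:3, 1:3] = 4.0             # fluctuating heater at a known position
frame[5:7, 4:6] = 3.0             # human target
clean = mask_heat_source(frame, (1, 3, 1, 3))
print(clean.max(), int((clean > 0).sum()))   # only the human region remains
```

After masking, only the human's warm region survives, so the downstream binarization and classification steps are not confused by the heater.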

Conclusions
In this research, a human motion detection system based on an infrared array sensor was developed for recognizing elderly people's fundamental motions and detecting emergencies while protecting the privacy of the people being cared for. Compared with existing human motion detection systems, the proposed system faces fewer limitations and has wider applicability. Six features were extracted, considering the characteristics of human motions, for the LSTM training. The 3D CNN was applied to motion detection with time-series image data as input.
During the human motion detection experiments, 50 sets of data for each of the eight motions, including falling, standing, standing-to-sitting, sitting, sitting-to-standing, bowing, crouching, and walking, were obtained from 16 participants, which means that 800 sets of data were collected for each motion.
In this research, the correct answer rates of the LSTM network and the 3D CNN were 82.5% and 94.2%, respectively. The results show that the proposed system with the 3D CNN has the ability to detect eight motions among 16 participants with high accuracy.

Conflicts of Interest:
The authors declare no conflict of interest.
Ethical Statements: All participants gave their informed consent for inclusion before they participated in the study. The study was conducted in accordance with the Declaration of Helsinki, and the protocol was approved by the Ethics Committee of H29-230.