Using Direct Acyclic Graphs to Enhance Skeleton-Based Action Recognition with a Linear-Map Convolution Neural Network

Research on human activity recognition could be used to monitor elderly people living alone and thereby reduce the cost of home care. Video sensors can be easily deployed in different zones of a house for this monitoring. The goal of this study is to employ a linear-map convolutional neural network (CNN) to perform action recognition with RGB videos. To reduce the amount of training data, the posture information is represented by skeleton data extracted from the 300 frames of each film. The two-stream method was applied to increase recognition accuracy by using the spatial and motion features of the skeleton sequences. The relations of adjacent skeletal joints were employed to build the directed acyclic graph (DAG) matrices: the source matrix and the target matrix. The two features were transferred by the DAG matrices and expanded as color texture images. The linear-map CNN had a two-dimensional linear map at the beginning of each layer to adjust the number of channels. A two-dimensional CNN was used to recognize the actions. We used the RGB videos from the action recognition datasets of the NTU RGB+D database, which was established by the Rapid-Rich Object Search Lab, for model training and performance evaluation. The experimental results show that the obtained precision, recall, specificity, F1-score, and accuracy were 86.9%, 86.1%, 99.9%, 86.3%, and 99.5%, respectively, in the cross-subject source, and 94.8%, 94.7%, 99.9%, 94.7%, and 99.9%, respectively, in the cross-view source. An important contribution of this work is that, by using the skeleton sequences to produce the spatial and motion features and the DAG matrices to enhance the relations of adjacent skeletal joints, the computation was faster than in traditional schemes that apply convolution to single-frame images. Therefore, this work shows practical potential for real-life action recognition.


Introduction
Lifespans worldwide have been increasing in recent years, and society is gradually aging. According to a United Nations report [1], the number of elderly people (over 65) in the world was 703 million in 2019, and this is estimated to double to 1.5 billion by 2050. From 1990 to 2019, the proportion of the global population over 65 years old increased from 6% to 9%, and this proportion is expected to increase further to 16% by 2050. In Taiwan, a report of the National Development Council indicated that the population aged over 65 will exceed 20% of the national population by 2026 [2].
Taiwan will thus become a super-aged society in 2026. This means that the labor force will gradually shrink in the future, and the cost of home care for elders will increase significantly. In home care, the monitoring of elderly people living alone is a major issue, because their activity patterns are closely related to their physical and mental health [3,4]. Therefore, how to use artificial intelligence (AI) techniques to reduce the cost of home care is an important challenge.
There are two major techniques for recognizing body activities. One uses physical sensors, such as accelerometers [5,6], gyroscopes [7], and strain gauges [8]. Their advantage is that they can be worn on the body to monitor dangerous activities throughout the day; their disadvantage is that only a few activities can be recognized. Therefore, physical sensors are typically not used to identify daily activities. The other technique uses a charge-coupled device (CCD) camera [9,10]. Its advantage is that it can recognize many daily activities; its disadvantage is that it can only monitor people within a local area. Thus, it is suitable for use in a home environment.
Many previous studies have used deep learning techniques to recognize daily activities, including two-stream convolutional neural networks (CNNs), long short-term memory networks (LSTMNs), and three-dimensional CNNs (3D CNNs). For the two-stream CNN, Karpathy et al. used context and fovea streams to train a CNN [11]. The two streams proposed by Simonyan et al. were spatial and temporal streams, which represent the static and dynamic frames of each action's film [12]; however, the duration of each action was different.
Thus, Wang et al. proposed a temporal segment network to normalize the spatial and temporal streams [13]. Jiang et al. used two streams as the input to a combined CNN and LSTMN [14]. Ji et al. proposed a 3D CNN to obtain the features of the spatial and temporal streams [15]. The two-stream methods, which use images and optical flow to represent the spatial and temporal streams, performed better at recognizing activities than the one-stream methods. However, their weakness is that the doubled amount of data requires more time to train the model.
Studies have used skeletal data as a common input feature for human action recognition [16-18], where the 3D skeletal data were typically obtained using a depth camera. In these studies, the number of recognized actions was less than 20 [17], and the skeletal data had to be processed to extract the features. Machine learning methods were used in these studies. The spatiotemporal information of skeleton sequences has also been exploited using recurrent neural networks (RNNs) [19,20].
Both the amount of data and the number of recognized actions were smaller than in the video datasets [16-20]. An RNN tends to overemphasize the temporal information and ignore the spatial information, leading to low accuracy. However, the advantage of methods employing skeletal data is that they require less training data and training time than those using image data. Hou et al. used a CNN to recognize actions with skeletal features [21]. Therefore, an effective method to encode the spatiotemporal information of a skeleton sequence into color texture images that can be recognized by a CNN is a relevant issue.
A directed acyclic graph (DAG) consists of nodes and directed edges. Each edge points from one node to another. Following the edge directions never forms a cycle, so every path ends at an extremity of the graph. DAGs are usually used to represent causal relations among variables, and they are also used extensively to determine which variables need to be controlled for confounding in order to estimate causal effects [22]. The physical posture of a person can be described by the positions of the skeletal joints. Adjacent joints have a causal relation when the body is moving. Thus, we can define a DAG of skeletal joints to describe the relations within the physical skeleton.
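The two properties described above (no cycles, and paths that end at the extremities) can be checked mechanically. The following is a minimal sketch using Kahn's algorithm; the five-joint edge list is hypothetical and is not the joint layout of this paper.

```python
from collections import deque

# Hypothetical 5-joint skeleton given as (source joint, target joint) pairs.
# The indices are illustrative only, not the paper's joint table.
EDGES = [(0, 1), (1, 2), (1, 3), (3, 4)]


def is_acyclic(num_nodes, edges):
    """Return True if the directed graph has no cycle (Kahn's algorithm)."""
    indegree = [0] * num_nodes
    children = [[] for _ in range(num_nodes)]
    for src, dst in edges:
        children[src].append(dst)
        indegree[dst] += 1

    # Start from the nodes that have no incoming edge.
    queue = deque(n for n in range(num_nodes) if indegree[n] == 0)
    visited = 0
    while queue:
        node = queue.popleft()
        visited += 1
        for child in children[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)
    # Nodes on a cycle never reach indegree 0, so they are never visited.
    return visited == num_nodes


def extremities(num_nodes, edges):
    """Nodes with no outgoing edge, where every directed path ends."""
    sources = {src for src, _ in edges}
    return [n for n in range(num_nodes) if n not in sources]
```

For the hypothetical skeleton, `is_acyclic(5, EDGES)` holds and `extremities(5, EDGES)` returns the two leaf joints, 2 and 4.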
This study aims to recognize daily activities from films recorded by CCD cameras. To reduce the large amount of data needed for model training, we converted body images into physical postures with an open-source system, AlphaPose [23]. The posture information consists of the skeleton sequences captured from the films of the actions, which are used to build the spatial and motion features. These features include both the spatial and temporal characteristics of the actions. The relations of adjacent joints were used to build the directed acyclic graph (DAG) matrices: the source matrix and the target matrix.
These features are expanded by the DAG matrices as color texture images. The linear-map CNN has a two-dimensional linear map at the beginning of each layer to adjust the number of channels. Then, a two-dimensional CNN is used to recognize the actions. A structure with two streams was used to increase the accuracy of the action recognition. The dataset used in this study (NTU RGB+D) is an open dataset provided by the Rapid-Rich Object Search Lab, Nanyang Technological University, Singapore [18].
In our work, a total of 49 actions, including daily actions, medical conditions, and mutual actions, were considered for action recognition. The total number of films was 46,452 for both the cross-subject and cross-view sources. For the cross-subject source, 32,928 films were used for training and 13,524 films were used for testing. For the cross-view source, 30,968 films were used for training and 15,484 were used for testing. The experimental results show that the performance of our method was better than that of the previous studies.

Methods
Figure 1 shows the flowchart of action recognition in this study, which has three phases. In the feature phase, RGB images were processed by AlphaPose [23] to obtain the coordinate values of the skeletal joints of the subject in an image as a vector. A film contained 300 images that were used to build a posture matrix as the feature. We defined the spatial features and motion features from the coordinate values of the skeletal joints for each film. Spatial features are the position information of the skeletons and joints, and motion features are the optical-flow information of the skeletons and joints. Each feature was expanded into two features, source and target, by the DAGs. In the recognition phase, a 10-layer linear-map CNN was used to recognize the activities. The cross-subject and cross-view evaluations were used to test the performance of this linear-map CNN. In the output phase, the results of the spatial and motion features were fused to give the recognized actions.

NTU RGB+D Dataset
The action recognition datasets provided by the Rapid-Rich Object Search Lab, Nanyang Technological University, Singapore [24] were used in this study. There were 56,880 files covering 60 action classes. Each file consists of RGB, depth, and skeleton data of human actions. All actions were recorded by three Kinect V2 cameras. The size of the RGB images was 1920 × 1080. There were 40 classes for daily actions, 9 classes for medical conditions, and 11 classes for mutual actions. Forty distinct subjects were invited for this data collection.
Only the physical activities of a single person were recognized, so a sample of 46,452 films covering 49 physical activities was used in this study. To ensure standard evaluations for all the reported results on the benchmark, two types of action classification evaluation (cross-subject and cross-view) were used [24]. For the full dataset of 56,880 samples, the cross-subject evaluation has training and testing sets of 40,320 and 16,560 samples, respectively, and the cross-view evaluation has training and testing sets of 37,920 and 18,960 samples, respectively.

Spatial and Motion Features
The RGB images were processed by AlphaPose [23] to obtain the coordinate values of the skeleton joints of the people in the image, and the format is shown in Equation (1),
where x_i and y_i are the coordinate values of the ith joint, c_i is the confidence score, and M is the index of the joints. According to the coordinate values of the joints, the spatial and motion variables were defined, as shown in Table 1, where n is the index of the frames. The spatial variables are the joint data (v_i) and the skeleton data (s_{i,j}). The motion variables are the motion data of the joints and skeleton (m_vi and m_si). Thus, there are four features in this study: F_v, F_s, F_mv, and F_ms. Table 2 shows the indexes and definitions of the 18 joints and 17 edges in the body. The 17 edges, e_i, are defined as the relations between two adjacent joints, the (i-1)th and the ith.
Table 1. The indexes of the joints and relations between every two adjacent joints at the 17 edges.
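To make the four kinds of variables concrete, the sketch below computes them for a toy pose. Because the exact definitions in Table 1 are not reproduced in this text, it assumes that the skeleton data of an edge is the target joint position minus the source joint position, and that the motion data is the difference between consecutive frames. Both definitions, the edge list, and all function names are illustrative assumptions.

```python
# One frame is a list of (x, y, confidence) triples, one per joint.
# EDGES is a hypothetical (source, target) list, not the paper's joint table.
EDGES = [(0, 1), (1, 2)]


def joint_data(frame):
    """Joint variable v_i: the (x, y) position of each joint."""
    return [(x, y) for x, y, _conf in frame]


def skeleton_data(frame, edges=EDGES):
    """Skeleton variable (assumed): target joint minus source joint per edge."""
    v = joint_data(frame)
    return [(v[t][0] - v[s][0], v[t][1] - v[s][1]) for s, t in edges]


def motion(values_n, values_n1):
    """Motion variable (assumed): change from frame n to frame n + 1."""
    return [
        (bx - ax, by - ay)
        for (ax, ay), (bx, by) in zip(values_n, values_n1)
    ]


frame0 = [(0.0, 0.0, 0.9), (1.0, 2.0, 0.8), (3.0, 2.0, 0.7)]
frame1 = [(0.0, 1.0, 0.9), (1.0, 3.0, 0.8), (4.0, 3.0, 0.7)]

joint_motion = motion(joint_data(frame0), joint_data(frame1))
skeleton_motion = motion(skeleton_data(frame0), skeleton_data(frame1))
```

Stacking such per-frame values over all frames of a film would give one matrix per feature, in the spirit of F_v, F_s, F_mv, and F_ms.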

Directed Acyclic Graph
A DAG was used to describe the relations of the 18 joints. The nodes of the DAG represent the joints, and the flows represent the edges. Each edge has a source joint and a target joint. Thus, two DAG matrices, the source matrix and the target matrix, can be defined. If the ith joint is the source point of the jth edge, the element (j, i) of the source matrix is set to 1. Otherwise, the element (j, i) is set to 0.
The target matrix is defined in the same way as the source matrix, using the target joint of each edge. Then, each row of the source and target matrices is normalized to avoid excessively large values. To match the size of the feature, e(0, 0) = 1 is defined as a virtual edge. The sizes of the source and target matrices are therefore 18 × 18. Figure 2a shows the source matrix, S, in which color-none is 0, color-black is 1, and color-original is 0.25. Figure 2b shows the target matrix, T.
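The construction rule above can be sketched for a small hypothetical skeleton of five joints (four real edges plus the virtual edge, giving 5 × 5 matrices instead of 18 × 18). The edge list and the helper names are assumptions for illustration. The final helper only shows what multiplying by S or T does to a vector of joint values; it is not claimed to be the exact expansion used in this study.

```python
# Hypothetical 5-joint skeleton; edge j runs from a source to a target joint.
# Edge 0 is the virtual edge e(0, 0) that keeps the matrices square.
EDGES = [(0, 0), (0, 1), (1, 2), (1, 3), (3, 4)]
NUM_JOINTS = 5


def incidence_matrices(edges, num_joints):
    """Build the source matrix S and target matrix T (rows: edges)."""
    S = [[0.0] * num_joints for _ in edges]
    T = [[0.0] * num_joints for _ in edges]
    for j, (src, dst) in enumerate(edges):
        S[j][src] = 1.0  # joint src is the source of edge j
        T[j][dst] = 1.0  # joint dst is the target of edge j
    return S, T


def normalize_rows(matrix):
    """Divide each row by its sum; all-zero rows are left unchanged."""
    out = []
    for row in matrix:
        total = sum(row)
        out.append([v / total for v in row] if total else list(row))
    return out


def gather(matrix, joint_values):
    """Matrix-vector product: one output value per edge."""
    return [sum(m * v for m, v in zip(row, joint_values)) for row in matrix]


S, T = incidence_matrices(EDGES, NUM_JOINTS)
S, T = normalize_rows(S), normalize_rows(T)
x = [10.0, 20.0, 30.0, 40.0, 50.0]  # one coordinate value per joint
from_sources = gather(S, x)  # value at the source joint of each edge
from_targets = gather(T, x)  # value at the target joint of each edge
```

In this edges-by-joints orientation every row contains a single 1, so row normalization leaves the values unchanged; it only has an effect when a row collects several entries, for example when a joint that starts several edges is stored as a row.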

Input Features
We used 300 frames for every film. The joint data built the joint feature (F_v), the skeleton data built the skeleton feature (F_s), the joint-motion data built the joint-motion feature (F_mv), and the skeleton-motion data built the skeleton-motion feature (F_ms). Figure 3 shows the contents of a feature with the x and y values of the data. Thus, the information of the film was reduced to four 600 × 18 matrices for the four features, F_v, F_s, F_mv, and F_ms. F_v was expanded by the DAG matrix into two features, F_vin and F_vout.
F_s was expanded by the DAG matrix into F_sin and F_sout; F_mv was expanded into F_mvin and F_mvout; and F_ms was expanded into F_msin and F_msout. Table 3 shows the contents of the spatial feature (F_spatial) and motion feature (F_motion). F_spatial is the combination of F_spatial_joint and F_spatial_skeleton. F_motion is the combination of F_motion_joint and F_motion_skeleton. We used the two features (F_spatial and F_motion) to evaluate the performance of the linear-map CNN.
Figure 4 shows the structure of the 10-layer linear-map CNN. The linear map was used to adjust the number of channels at the beginning of each layer. Batch normalization (BN) mitigates vanishing gradients and thus allows a larger learning rate to be used. In the CNN, the kernel size is 9 × 1, the stride is (1,1), and the padding is (4,0). In the input feature, columns represent the different joints, and rows represent the time sequence of the actions. Because the relation of the adjacent joints was already enhanced by the DAG matrix, the convolution kernel is a 9 × 1 matrix that spans only the time axis. Table 4 shows the detailed information of the linear-map CNN. The output layer has 49 nodes representing the 49 action classes. The optimization method was momentum. The batch size was 32.
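The effect of the quoted kernel, stride, and padding on the feature size can be checked with the standard convolution output-size formula, out = floor((in + 2 × padding − kernel) / stride) + 1. The sketch below applies it to a 600 × 18 input, the feature size stated above; whether every layer receives exactly this size depends on Table 4, which is not reproduced here, so the input size is an assumption.

```python
def conv_output_size(size, kernel, stride, padding):
    """Standard convolution output length along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


def conv2d_output_shape(height, width,
                        kernel=(9, 1), stride=(1, 1), padding=(4, 0)):
    """Output (height, width) for the kernel, stride, and padding above."""
    return (
        conv_output_size(height, kernel[0], stride[0], padding[0]),
        conv_output_size(width, kernel[1], stride[1], padding[1]),
    )


# Rows are time steps and columns are joints, as in the input feature.
shape = conv2d_output_shape(600, 18)
```

With these settings the shape stays 600 × 18: the padding of 4 on the time axis exactly compensates for the 9-row kernel, and the kernel width of 1 means that joints in different columns are never mixed by the convolution itself.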

Statistical Analysis
For each action class, a film is counted as a true positive (TP) when the action is correctly identified, a false positive (FP) when the action is incorrectly identified, a true negative (TN) when the action is correctly rejected, and a false negative (FN) when the action is incorrectly rejected. The performance of the proposed method was evaluated using the precision, recall, specificity, F1-score, and accuracy computed from these counts.
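The five measures follow their standard definitions: precision = TP / (TP + FP), recall = TP / (TP + FN), specificity = TN / (TN + FP), F1-score = 2 × precision × recall / (precision + recall), and accuracy = (TP + TN) / (TP + FP + FN + TN). The sketch below computes them per class from a confusion matrix with one-vs-rest counting. The one-vs-rest counting and the toy confusion matrix are assumptions for illustration, not data from this study.

```python
def per_class_counts(confusion, k):
    """One-vs-rest TP, FP, FN, TN for class k; confusion[true][predicted]."""
    n = len(confusion)
    tp = confusion[k][k]
    fn = sum(confusion[k]) - tp
    fp = sum(confusion[r][k] for r in range(n)) - tp
    tn = sum(map(sum, confusion)) - tp - fn - fp
    return tp, fp, fn, tn


def metrics(tp, fp, fn, tn):
    """Standard measures; assumes no denominator is zero."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
        "accuracy": accuracy,
    }


# Toy 3-class confusion matrix (rows: true class, columns: predicted class).
toy = [[8, 2, 0],
       [1, 9, 0],
       [0, 0, 10]]
class0 = metrics(*per_class_counts(toy, 0))
```

Under one-vs-rest counting with many classes, TN dominates each class's counts, which would be consistent with accuracy and specificity close to 100% alongside lower precision and recall.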

Results
In this study, the hardware was an Intel Core i7-8700 CPU and a GeForce GTX1080 GPU. The operating system was Ubuntu 16.04 LTS, the development platform was Anaconda 3 with Python 3.7, the deep learning framework was PyTorch 1.10, and the code was run in Jupyter Notebook. We evaluated the performance of the DAG with the cross-subject and cross-view sources and four features (F_spatial_joint, F_spatial_skeleton, F_motion_joint, and F_motion_skeleton). Finally, we used the two-stream concept, class score fusion of F_spatial and F_motion, to evaluate the performance of the proposed method with the cross-subject and cross-view sources. Table 5 shows the results without the DAG transfer. The best feature was F_spatial under the cross-subject source, which resulted in an accuracy and F1-score of 99.3% and 82.8%, respectively. There were 10 actions with recall rates below 70%: 0, 4, 9, 10, 11, 16, 28, 29, 43, and 48. The worst feature was F_motion under the cross-subject source; its accuracy and F1-score were 99.2% and 79.6%, respectively. There were 11 actions with recall rates below 70%: 3, 10, 11, 16, 24, 28, 29, 31, 36, 43, and 45. Table 6 shows the results with the DAG transfer. The best feature was F_spatial under the cross-view source; its accuracy and F1-score were 99.9% and 96.2%, respectively.
Only four actions, 10, 11, 28, and 29, had recall rates below 70%. The worst feature was F_motion under the cross-subject source, which obtained an accuracy and F1-score of 99.1% and 79.1%, respectively. There were 10 actions with recall rates below 70%: 2, 10, 11, 16, 28, 29, 31, 43, 44, and 45. We found that the DAG transfer could significantly improve the recognition rate of different actions, not only for spatial features but also for motion features. Table 7 shows the results of class score fusion with and without the DAG transfer. We found that the DAG transfer performed better in the cross-view source than in the cross-subject source, with an accuracy of 99.9% vs. 99.5% and an F1-score of 94.7% vs. 86.3%. None of the 49 actions had a recall rate below 70%.
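Class score fusion of two streams can be sketched as follows. The fusion rule used here (softmax per stream, then an equally weighted average of the class scores, then the highest-scoring class) is a common choice and is an assumption; the text does not state the exact rule or weights. The function names and the three-class scores are hypothetical.

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_scores(spatial_logits, motion_logits, spatial_weight=0.5):
    """Weighted average of the two streams' class probabilities."""
    p_spatial = softmax(spatial_logits)
    p_motion = softmax(motion_logits)
    w = spatial_weight
    return [w * a + (1.0 - w) * b for a, b in zip(p_spatial, p_motion)]


def predict(spatial_logits, motion_logits):
    """Index of the class with the highest fused score."""
    fused = fuse_scores(spatial_logits, motion_logits)
    return max(range(len(fused)), key=fused.__getitem__)


# Hypothetical scores for three classes from each stream.
spatial = [2.0, 1.0, 0.0]  # spatial stream prefers class 0
motion = [0.0, 3.0, 0.0]   # motion stream strongly prefers class 1
winner = predict(spatial, motion)
```

In this toy case the spatial stream alone would pick class 0, but the confident motion stream moves the fused decision to class 1, which is how a second stream can correct errors of the first.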
We used the two-dimensional joint and skeleton features to train and test the linear-map CNN, which reduced the running time compared with methods that use two- or three-dimensional joint and skeleton images. Table 8 shows the training and testing times with and without the DAG transfer. We found that the GPU could process about 30 frames/second (fps) in the training phase and about 125 fps in the testing phase. Although the DAG transfer required processing time, the added delay was only about 30 min in the training phase. The maximum testing time was 141 s.

Discussion
In this study, we used the DAG transfer and the two-stream method to improve the accuracy of action recognition. When the input features were transferred with the DAG matrices, the precision, recall, specificity, F1-score, and accuracy improved by 1.2%, 1.1%, 0.1%, 1.3%, and 0.1%, respectively, for the cross-subject source, and by 9.1%, 7.4%, 0.1%, 7.4%, and 0.5%, respectively, for the cross-view source. In the two-stream method, previous studies have typically used spatial and temporal (or optical flow) features to perform the class score fusion [11-14]. They also showed that two streams performed better than one stream.
We used joint and skeleton sequences as the spatial and motion features, which had temporal characteristics. We used 300 frames to describe an action. Thus, the spatial feature of one action included both spatial and temporal characteristics. However, the motion relations of the joints and skeletons differed from one action to another. Therefore, we defined the motion variables, m_vi and m_si, as shown in Table 1, to establish the motion features.
Table 9 compares the recall rates of our results with those of previous studies. These studies all used the cross-subject and cross-view sources from the NTU RGB+D database to recognize actions and also used the three-dimensional characteristics of each posture as the input features [19,25-31]. Our method had the best recall rates of 86.1% and 94.7% in the cross-subject and cross-view sources, respectively.
We analyzed the actions with lower recall rates in the cross-subject and cross-view sources in Tables 5-7. The four actions that often had lower recall rates were A10 (reading), A11 (writing), A28 (phone call), and A29 (playing with laptop). Figure 5a is the posture of reading, and Figure 5b is the posture of writing. The subject is standing up, looking down, and holding something. The difference between the two images is only in the gestures of two hands. However, according to the description of the body posture in Table 2, only the right and left wrist joints are marked, which cannot show the gestures of two hands.
The postures of the subject making a phone call (Figure 5c) and using the laptop (Figure 5d) had the same problem. The difference between the two images was also in the gestures of the two hands. These actions were difficult to recognize using spatial features, such as the movement trajectories of the arms, elbows, and wrists. Nevertheless, the results of our method with the DAG transfer and the two-stream method in Table 7 show that no action had a recall rate below 70% in the cross-view source.

Conclusions
The large scale of the collected data in the NTU RGB+D database enabled us to apply the posture-driven learning method for action recognition. The posture information represented by the skeleton data was obtained from the 300 frames of the film. The joint and skeleton sequences were used to build spatial and motion features that included the spatial and temporal characteristics of the actions. The relations of the adjacent skeletal joints were used to build the DAG matrices.
The spatial and motion features were expanded by the DAG matrices as color texture images. The results with the expanded features all indicated that the relations between adjacent joints were enhanced. Our method effectively reduced the amount of data for training the linear-map CNN, and its performance was superior to that of previous schemes using deep learning methods. Notably, since the computation speed can reach around 125 fps in the testing phase with a GPU, our scheme could be used to monitor the daily activities of elders in real time in home care applications.