Skeleton Graph-Neural-Network-Based Human Action Recognition: A Survey

Human action recognition has been applied in many fields, such as video surveillance and human computer interaction, where it helps to improve performance. Numerous reviews of the literature have been done, but rarely have these reviews concentrated on skeleton-graph-based approaches. Connecting the skeleton joints as in the physical appearance can naturally generate a graph. This paper provides an up-to-date review for readers on skeleton graph-neural-network-based human action recognition. After analyzing previous related studies, a new taxonomy for skeleton-GNN-based methods is proposed according to their designs, and their merits and demerits are analyzed. In addition, the datasets and codes are discussed. Finally, future research directions are suggested.

HAR is supported by multi-modalities. Specifically, one kind of modality is structured data, e.g., images or videos and auxiliary data, such as semantic information. The common use of sensors (including cameras) and cloud databases makes structured data easy to be captured and shared. Moreover, they are visually or semantically informative, e.g., the shape or motion difference of subjects, the space-time trajectory [10] and the names of joints.
With the help of carefully designed representation learners, such as deep-learning (DL) models, these informative representations are obtained in a task-related way so as to help solve the problem more accurately. However, the performances are upper-bounded by the data, which emphasizes less on the intrinsic relations between the joints of skeletons. The other is unstructured data that are non-Euclidean, such as human skeletons. Extractors, e.g., Openpose, Google PoseNet and Nuitrack, are capable of working in real-time and thus generate sufficient skeleton graphs.
These poses contain intrinsic information among spatial joints and temporal frames as well as 3D information if the depth data are offered. Additionally, compared with an image that requires a storage space proportional to the image width, height and number of channels, skeletons only require the 3D coordinates and confidence score of every joint, and normally there are no more than 30 joints, which decreases the storage cost significantly.
Moreover, while image-based methods suffer from varied brightness, changing of backgrounds, chromatic differences, different subjects etc, 3D skeletons can work on various scenes once they are detected. As HAR should label the same activity with the same label even when performed by different persons under different conditions or styles, a skeleton graph is undoubtedly a promising choice.
Models to find representations of human skeletons are classified into three categories.
• The traditional method is handcrafted descriptors, such as principle components analysis (PCA) based on 3D position differences of joints [11], selecting joint pairs by top-K Relative Variance of Joint Relative Distance [12]. These descriptors are interpretable; however, they are limited as they tend to extract shallow and simple features and normally fail to find significant deep features. • The other idea is redefining the problem a deep learning problem in Euclidean space, such as serializing the graph nodes into a sequence and then adopting the well-known Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN) etc. In this way, deep features are extracted mechanically but without paying attention to the intrinsic spatial and temporal relations between graph joints, e.g., the serialization of joints ignores their natural structures in skeletons. • Recently, Graph Neural Networks (GNNs), especially graph convolution networks (GCNs), have come into spotlight, and were imported into skeleton graphs. The earliest milestone is ST-GCN [13] ( Figure 1). Thereafter, multiple works based on ST-GCN were proposed. Among them, 2s-AGCN [14] ( Figure 2) is another typical work, which adopted an attention mechanism. As GNNs are professional in discovering the intrinsic relations between joints, GNN HAR methods have achieved a new state-ofthe-art (SOTA).  Therefore, this survey focuses on skeleton GNN HAR, which regards the input skeleton as a graph and process it via GNNs. It is true that there are many HAR surveys; however, most of them emphasize sensors, DL methods, datasets or HAR applications. The rapid development of GNNs and emergence of GNN HAR methods call for a constant update. As far as we know, there is only one survey that analyzed GCN-based methods; however, it did not cover all aspects and missed some approaches. In this respect, our survey updates and completes the aforementioned existing survey. The main contributions of this survey are as follows: 1. New taxonomy: We propose a new taxonomy for previous methods, which relate to GNNs and skeleton graphs. They are grouped into spatial-based approaches, spatiotemporal-based approaches and generated approaches. Figure 3 illustrates the idea. Their common frameworks are also summarized. 2. Comprehensive review: Apart from analyzing methods, we also review the categories of skeleton graphs in applications and the construction of them. 3. Abundant resources: To give a complete summary for the skeleton-GNN-based HAR, we collected commonly used datasets and published codes. The details of each collected dataset and method are summarized in Tables A1 and A3, respectively, in Appendices A and B. 4. Future directions: Further directions are presented and discussed after having a look at the challenges in this field, with the hope of offering some inspiration for the benefit of other researchers.
This paper is organized as follows: Section 2 gives a literature review on GNNs and HAR. Then, Section 3 analyzes skeleton graphs more specifically, leading to Section 4, which focuses on how to build skeleton graphs. Afterwards, the new taxonomy of methods on skeleton GNN HAR is proposed in Section 5, and their common frameworks are described in Section 6, which gives a comprehensive review for all related approaches. Various datasets are collected in Section 7, and then we end with a discussion of the challenges and our final conclusions in Sections 8 and 9, respectively. The Appendices A and B collect all mentioned datasets and methods. A detailed framework of this survey is presented in Figure 4.
(a) Figure 3. Illustration of (a) the spatial-based approach, (b) the spatiotemporal-based approach and (c) the generated approach. Among the figures, green trapezoid blocks denote the GNN-HAR model with dashed lines marking the unfixed model structure; pink trapezoid blocks denote action classifiers; orange trapezoid blocks for hidden states and purple blocks for tasks rather than HAR. The multiple skeletons together in (b,c) stand for skeleton graphs of video clips. The purple arrow in (c) denotes supervision from tasks, such as adversarial learning and knowledge distillation.
Although these surveys attempted to review the new emergence of HAR, only Tasweer et al. [32] focused on GCN-based approaches. Papers [26,29] mention GCN-based methods but do not take them as their main purpose.
Tasweer et al. set their point on GCNs, and proposed a taxonomy that categorizes GCN-based HAR into five architectures, which are spatiotemporal GCN, RNN-Attention GCN, Two-Multi Stream GCN, Encoder-decoder GCN and Miscellaneous GCN. They are the first to discuss such taxonomy for GCN-based HAR; however, they emphasized this taxonomy without analyzing other aspects, such as the types of graphs, the construction of graphs, and thus it is not a thorough survey. Moreover, they mix the category of GCN methods and generalized frameworks together, e.g., the 'Two-multi Stream' is a common framework, while 'RNN-Attention GCN' is a method with specific networks. Their taxonomy is kind of confusing when one attempts to understand a proposed method systematically. This motivated our paper where a deeper analysis of skeleton-GNN-based HAR is offered.
Specifically, rather than mixing GCN categories and frameworks together, we summarize the ways to construct skeleton graphs, the category of used networks, and the frameworks that can be generalized, into four sections. Sections 3 and 4 discussed the categories and structures of skeleton graphs, Section 5 collects the main approaches, and Section 6 demonstrates the common frameworks. For graphs, this paper emphasizes the categories of skeleton graphs in applications and their structures and also introduces the ways to build them.
For approaches based on whether to create an end-to-end model and the input skeletons, this paper classifies skeleton-GNN-based HAR methods into spatial approaches, spatiotemporal approaches and generated approaches. Among them, the spatial approaches and spatiotemporal approaches use static models and are trained in end-to-end mode. However, for generated approaches, models are first trained in a non-end-to-end way. A non-task-specific model is trained before, and the task-specific model is obtained based on this pretrained model.

Input Graphs
Although the most direct idea is extracting a graph for each frame, the used graphs vary in real applications.

Spatial Graphs
The simplest idea is taking the skeleton from each frame as an independent graph. In this way, the graph size will be the number of skeleton joints, and the graph links are the physical connections. If representing the graph as G = (V, E), where the node set V is a set consisting of body joints, featured by 3D/2D coordinates and confidence scores; and the edge set E contains the links of physical connections.
Usually, edges can be represented as an adjacency matrix A for further utilization, with each item denoting whether there is a link between two nodes. Concretely, A ij = 1 denotes there is a link between joint i and joint j, and A ij = 0 for null edges.

Spatiotemporal Graph
The other idea is extending the spatial graph by connecting skeleton joints along temporal dimension to build a spatiotemporal graph, e.g, the graph used in ST-GCN. In this way, both the spatial and temporal information are combined to use. However, taking the meaning of edges into consideration, this graph is heterogeneous since one type of edges coming from physical connections and the others explicitly display temporal relations. Figure 5a demonstrates the idea.
(a) The spatiotemporal graph (b) The interaction graph Figure 5. The input graphs, (a) with the green color to denote the temporal dimension and blue to denote the spatial dimension; (b) demonstrates shaking hands [47].

Interaction Graphs
When performing GNNs on human-interaction applications, a graph that consists of two or more skeletons was proposed [48], one example is shown in Figure 5b. In this way, nodes are heterogeneous since they come from different subjects, and edges are also heterogeneous because one type of edges are the physical connections within each skeleton, and the other type is the connections between subjects.

Generated Graphs
The graphs mentioned before were all manually built; however, in some applications [49], the authors assumed that the graph was redundant or uncompleted, which is frequent if there are occlusions, and thus they attempted to generate a new graph. In this way, the graph is modified with automatically generated nodes or edges, which are beyond the idea of spatial, temporal or interaction links. Figure 6 demonstrates an example where action specific edges are added. Figure 6. The generated graph [49], where (a) the original skeleton for the action phone call and (b,c) illustrates the inferred action-specific skeletons for the action phone call with new edges in green.

Problem Definition
The problems to be solved for different input graphs vary. In the spatial graph, the main point is how to extract spatial information and then perform the temporal aggregation to obtain long-term information. For the spatiotemporal graph, the essential difficulty is to process the spatial and temporal information simultaneously. In the interaction graph, another problem is how to preserve interactions. As for the generated graph, adding the relevant information is the main challenge.
Moreover, when the graph is built as a directed graph (see next section), the main problem is how to pass messages efficiently. Since, in this graph, messages can only pass along the predefined directions; therefore, the graph convolution kernels working on undirected graphs have to be modified to fit. Based on the forms of graphs used in HAR, the next section will take a deep look at the structure of skeleton graphs.

Graph Structures Based on Directions
Before extracting the spatial and temporal information, the essential step is to build a graph. Based on whether the input graphs are directed or undirected, usually, graphs are further combined with GNNs after converting edges as a connection matrix, such as an adjacency matrix. For directed graphs, connection matrices are more complex since they have to denote the predefined directions to pass messages accordingly.

Undirected Graphs
Most methods are built upon undirected graphs. ST-GCN is one well-known milestone. Most Graph Neural Networks (GNNs) are first proposed for undirected graphs, among which, messages are conducted in bi-directions. The adjacency matrix in this case is symmetrical and therefore leads to even-handed passing of information on each orientation. One clear demerit of the undirected graph is that it does not cover direction information. For example, bone information, which represents the direction and length of a bone, has been proven to be a good modality for skeleton graph-based HAR [16,50].

Directed Graphs
Methods built on directed graphs are rare, e.g., [50][51][52]. Though the computations are more expensive in these graphs since not each connection direction is equal, they are capable of emphasizing the messages for action related parts, such as arms for clapping and legs for jumping.
Lei Shi et al. [50] and B. Fu et al. [51] handled joints and bones information simultaneously and thus proposed to take joints as nodes and bones as edges. The center of gravity in the skeleton is defined as the root node, and the direction of each edge (bone) is determined by the distance between the node and the root node. The node closer to the root node is designated to point to the node farther from the root node.
Q. Zuo et al. [52] built a directed graph for each body part, more precisely, left arm, right arm, left leg, right leg and trunk. The connection matrices used in their directed graph consist of a self-loop matrix (an identity matrix), an inward and an outward adjacency matrix. The inward and outward connections are determined with respect to the center point in that part. Each connection matrix is tackled by one GCN layer, and finally features from these three matrices are fused together as the part features.
J.L. Gao et al. [53] regarded an undirected spatial temporal graph as two directed graphs with opposite directions for each edge, and therefore the message passing between joints is bi-directional, leading to two directed graphs, namely the focus graph and diffusion graph. During message passing, they first convey information forwardly through focus graphs and then transform the updated features back via diffusion graphs.

The Construction of Graph Topology
The construction of graph topology is essential since topology holds a great value of structural information. Here, the topology not only covers the edges in the graph but also the nodes. In this respect, the frame-level, edge-level or node-level will be discussed. The construction absorbs ideas from handcrafted topology and learnable topology, while handcrafted topology is simple and cheap in computation, learnable topology is more adaptive as it attempts to build a new graph to preserve relevant information.

Handcrafted Graph Topology Modality Level
Rather than directly take the skeletons as a graph, some papers created the skeleton graph from implicit features to complete the information that original skeleton graphs missed. For example, J. Cai et al. [54] took the joint-aligned optical flow patch sequence as an orthogonal cue to the skeleton sequence, and then tackled features from this sequence as an implicit graph.

Frame Level
In this level, the easiest way is downsampling the frames, such as what Z. Liu et al. [55] proposed. They chose a frame every d frames and named this operation as dilated windows with the aim to enforce the information across spatiotemporal dimension but decrease the redundant information aggregated from an increasingly larger spatiotemporal receptive field.

Subgraph Level
Some researchers are more interested in subgraphs-in other words, sub-connection matrices, e.g., divided or factorized adjacency matrices. Since, after division, different matrices can be flexibly manipulated with different weights, this provides the ability of paying diverse attention on different parts. This idea also helps to capture local information.
• Body-part partition Papers, such as [56][57][58][59][60][61][62], directly divided the original skeleton graph into several body parts. Typically, the group comprising left arm, right arm, left leg, right leg and trunk, is intuitive and easy to be implemented. Normally, our limbs are more flexible than our trunk and interact more with other parts. Moreover, when people are moving, the diverse parts of the human bodies are capable of making distinct gestures. Based on this, many strategies can be designed to assign different weights strategy to these parts. • Distance-based partition In this category, the definition of distance mainly focuses on centrifugal and centripetal partition, which divides the neighbors of each node into two or more parts. For one node v i , a node v j in the centripetal part is closer to the gravity center than v i , and a node v k in the centrifugal part is farther away from the gravity center than v i . Usually, the gravity center is the average of all skeleton joints. Although this idea does not partition the graph topology explicitly, the adjacency matrix is implicitly classified into different groups with an allowance of applying different weights. The idea was first proposed by ST-GCN that divides node v i 's neighbors into the group v i , the group v i 's centripetal joints and the group v i 's centrifugal joints. The other case comes from [63], which suggests making use of neighbor bones of node v i and therefore augmenting the three partitions to five, with the addition of centripetal bones and centrifugal bones, where the centripetal bones are those that are closer to the gravity center than v i , and the centrifugal bones are those that are farther from the gravity center than v i . • Multiscale partition One partition is based on geometry. For example, B. Parsa et al. [64] performed GNN on node-level, part-level and global-level graphs. The global level graph is the output of the group average pooling on the part-level graph, and the part-level graph is the output of group average pooling on the node-level graph (the original skeleton graph). The other partition directly makes use of downsampling so as to implement it mechanically. For example, Y. Fan et al. [65] conducted two more downsampling operations to extract additional graphs of different scales from the original graph.
Edge Level Apart from the common used self-loop matrix and adjacency matrix, some researchers argue that there are more implicit edges. Some emphasize the edges in temporal dimension.
O. Yuya et al. [66] found new edges by a proposed temporal extension module, which adds connections to multiple adjacent joints on inter-frames. To expand the sampling area for the temporal dimension gradually, the proposed temporal extension module is applied between conventional spatial graph and temporal graph convolution. P. Ghosh et al. [67] added more temporal connections that can span multiple timesteps, e.g., the left arm joint at timestep t can have connections with corresponding joint at timestep t + 1, t + 2, · · · rather than only t + 1 in ST-GCN. To capture the dependence of non-physical connections between joints, some researchers simply add edges between joints, while others focus on edges between the parts of interest.
Z. Bai et al. [68] fully connected each joint in the skeleton with other joints arguing the division of root, centripetal and centrifugal in ST-GCN is not optimal.
R. Zhao et al. [69] added edges between limbs and head, while treating all other joints independently and named this graph as the global graph. The original skeleton graph was taken as the local graph. Finally, they computed the sum of the output from both local and global graphs.
Y. Li et al. [70] focused on hand gesture recognition. They added three types of edges, one by linking the tip of each finger except the pinkie with the base of the finger to its right, and the tip of the pinkie is linked with the base of the ring finger. The second type is to link the third joint of each finger except the pinkie to the second joint of the little finger. The third type is to link the tip and third joint of the same finger. Among them, the first type offers a way to measure the distance of two adjacent fingers, the second better measures the opening degree of the hand. The third type directly provides the information of some actions, such as grabbing in which fingers bend.
Moreover, some methods inherit the characteristics of actions and physical nature, such as symmetry and movements. Q. Zuo et al. [52] added a symmetric matrix and edge matrix, among which, the symmetric matrix considers the symmetric structure of the human body, and the edge matrix tends to contain significant movements of some edge joints. The inspiration comes from the fact that, if one waves his hands, then the movement of hands is more clear than the arms, considering the acceleration.
Additionally, some methods choose a constant hyperparameter to generate edges. In addition to the self-loop and common adjacency matrix, X. Hao et al. [71] added global hyperedges capturing global information and local hyperedges capturing local information. These edges are generated by varying the regularization parameter of sparse representation objective functions. In other words, global and local hyperedges are constructed under different sparsity assumptions, controlled by a hyper parameter β. The choice of β is in a specified range.

Learnable Graph Topology
Learning graph topology is more adaptive, which allows finding more valuable information and keeping the information of interest.

Frame-Level
At the frame-level, except for manually selection, an automatic mechanism armed with reinforcement learning (RL) is preferred. To build a RL system, three components-namely rewards, values and the set of actions-are unavoidable. For each action, such as choosing one frame or dropping one frame, a predefined reward will be assigned. After evolving for a sequence of actions, values are used to estimate the current state. The objective of a RL system can be defined to obtain a highest value at the preferred state. After carefully designing the reward mechanism, RL is capable of choosing key frames and ignoring frames with unclear motions or other irrelevant information.
Y. Tang et al. [72] proposed the Progressive Reinforcement to detect the best key frame, the set of actions consists of a shift to the left, shift to the right and staying still. The state is at the chosen frame at time t, and rewards are [1, −1] at the first iteration for correctly predicting or not; after that, the rewards are [Ω, −Ω], where Ω is a constant. Deep Q-learning and policy gradient are adopted in two branches. Frame distillation network (FDNet) and GCN promote each other mutually, as GCN provides rewards for FDNet and FDNet selects key frames to refine GCN. The better the GCN is, the more accurate rewards will be provided.

Edge-Level
To capture more information among physically nonadjacent joint but action-related or task-related joints, most methods focus on learning new connections. Some authors formalize the learning procedure as a learnable matrix by modifying existing methods. L. Shi et al. [73] adopted two new parameterized adjacency matrices rather than the original adjacency matrix, namely B k , C k , where k denotes the index of layer in the model.
Moreover, B k , the global graph learned from data, represents the graph topology that is more suitable for the action recognition task; C k is the individual graph learned by normalized embedded Gaussian function, which has unique topology for each sample.
H.Y. Yang et al. [74] proposed a learnable matrix that can learn pseudo connections, covering the dependencies between connected joints and joints that are not connected. Other researchers propose an explicit inference module to generate a new adjacency matrix.

•
For every three consecutive frames, X. Gao et al. [75] proposed learning a new graph with a graph regression (GR) module. The graph regression problem is formulated as the optimization of the graph Laplacian matrix L. For intra-joints, the weights for weakly connected and strongly connected joints are different, where strong connections include physical connections and some physical disconnections among joints, and weak connections denote potential connections, such as those between head and hands. For the inter-joints, connections between corresponding joints along the temporal dimension and their neighborhoods are assigned with different weights, and others are set as zero. • To capture the intrinsic high-order correlations among joints, B. Li et al. [76] proposed spatiotemporal graph routing, consisting of a spatial graph router (SGR) and temporal graph router (TGR). SGR captures the connectivity relationships among joints based on sub-group clustering. TGR focuses on structural information with the correlation degrees of joints trajectories. • M. Li et al. [77] estimated actional links (A-links) and structural links (S-links), where A-links are estimated by encoder-decoder (AE)-based A-links inference module (AIM), and S-links are estimated by high-order polynomials of an adjacency matrix. The A-links capture the latent dependencies among joints, and S-links indicate higher order relationships. • F. Ye et al. [78] proposed a joint relation inference network (JRIN) to aggregate the spatiotemporal features of every two joints globally and then infer the optimal relation between every two joints automatically. The relations of joints are quantified as the optimal adjacency matrices. • F.F. Ye et al. [79] estimated edges by joint-relation-reasoning (JRR). JRR is trained by RL, optimized with policy gradient. In detail, the state equals to E ⊗ M, where E contains the global edges information, M represents the connectivity weights of every tow joints, and ⊗ denotes the element-wise product; rewards comes from the output of GCN; action is the output of JRR, which indicates temporal relevance of every two joints under the current action.

•
To extract the implicit connections and properly balance them for each action, W.S. Chan et al. [49] created three inference modules, namely the ratio inference, implicit edges inference and bias inference. The finally estimated matrix is the combination of the output from these three modules. The ratio of the implicit and structural edges is vital. Adjacency matrices A and M bias present the structural edges. M bias is updated with back propagation and is kept the same for all actions.
Other works show their interest in context-enriched skeleton, samples variety etc. Fanfan Ye et al. [80] proposed learning a context-enriched dynamic skeleton topology with a Context encoding Network (CeN). CeN simply consists of three 1 × 1 convolutional layers and permutations, which maps the input tensors into an adjacency matrix. The convolution is alongside the joint coordinate dimension, temporal dimension and then joint dimension, and thus CeN can generate sample different graphs.
K. Liu et al. [63] proposed learning additional connections among joints and bones for various action samples. Precisely, the adaptive joint-bone adjacency matrix and adaptive joint adjacency matrix are all learned by softmax, which uses normalized embedded Gaussian functions to measure similarity.

Node-Level
Rather than focusing on new connections, some methods aim to remove redundant information or aggregate messages from multiple nodes. Some works learned new vertexesin other words, they aggregate multiple nodes as one node. Heidiri et al. [81] proposed a spatiotemporal bilinear network (ST-BLN) with no requirement of predefined adjacency matrix. ST-BLN forces the attention matrix to be symmetric. The selection of the nodes in the first layer will lead to an aggregation of joint information or the addition of new nodes.
Other works selected nodes first. To distinguish the most informative joints for each stream, Y.F. Song et al. [82] only passed the information from unactivated joints to the next stream. The activated class activation maps (CAM) obtained from previous GCN streams are accumulated as a matrix to inform the new stream about which joints have been already activated. Others aim at learning node embeddings.
• W.W. Ding et al. [83] emphasized the learning of localized correlated features. By projecting each part of human body into a node, a fully connected similarity graph is formed to capture relations among the disjoint and distant joints of the human body. The learned mapping of spatial matrices and temporal matrices can determine which part of the human body across several consecutive frames should be mapped to a node in the similarity graph. • W.J. Yang et al. [84] merged nodes in the same part of the skeleton into one node. Each new generated node takes the weighted summation of the original nodes that it covers as its feature, using trainable weights. This integration is done part-wise and channel-wise. • Y.X. Chen et al. [85] proposed structural pooling since the motion information contained in human body is highly related with the interaction of five body parts, and therefore graph convolution on the graph with these five-part nodes can capture more global motion information. By graph pooling, the new compressed graphs in different sizes are input to the model. • G.M. Zhang et al. [86] proposed learning a new topology by topology-learnable graph convolution, which is decomposed as feature learning and node fusion. Node fusion is performed by a learnable fusion matrix L that is initialized with a normalized adjacency matrix and added with an additional constant bias.

A New Taxonomy for Skeleton-GNN-Based HAR
The skeleton-GNN-based HAR approaches are classified as spatial methods, spatiotemporal methods and generated methods, while spatial ones take spatial graph as input, spatiotemporal ones use spatiotemporal graphs as inputs, and generated approaches are supervised by tasks rather than HAR, such as knowledge distillation, or have an unfixed model structure before training. The idea is illustrated in Figure 3.

Spatial-Based Approaches
Approaches in this category take GNN as a spatial feature extractor, and the temporal evolution is handled by other modules. Two major candidates are proposed to evolve states in temporal dimension. One category is traditional conditional random field (CRF) methods, including Hidden CRF (HCRF). The other one prefers the family of RNN, such as RNN, long-short temporal memory network (LSTM) and Gated Recurrent Units (GRU). Examples in each category are shown in Figure 7. Figure 7. Examples of spatial methods: (a) the CRF approach [87], the others are RNN methods. Precisely, (b) a separated approach [88], (c) a bidirectional approach [53] and (d) the aggregated approach [89].

CRF
CRF is an undirectional graph model whose nodes are divided into exactly two disjoint sets X and Y, the observed and output variables, respectively. The conditional distribution p(Y|X) is then modeled. It is suitable for labeling action sequences since Markov chain models are able to track the evolution among temporal dimension.
K. Liu et al. [87,90] argued that GCN is powerful in extracting spatial information but weak on state evolution and then performed HCRF on extracted features. After obtaining features by GCN, HCRF will learn hidden states on each node and perform directed message passing on these hidden states. Finally, under the minimum negative conditional log-likelihood rule, the label for an action sequence sample is defined. By viewing the skeleton graph as a CRF, K. Liu et al. [63] adopted CRF as a loss function to improve performance.

RNN
Although CRF works as a graph model and handles the state evolution, there are situations when they are non-Markov chains. For example, the current state may rely on states from all previous timesteps. This is why RNN was proposed and started becoming popular. The family of RNN is capable of preserving the relationships between states in multiple timesteps compared with CRF in k predefined timesteps. Among the family, LSTM is capable of solving gradient explosion and gradient vanishing that exists in vanilla RNN, while GRU can be regarded as a simplification of LSTM.
The RNN methods are classified as separated strategy, bidirectional strategy and aggregated block.

Separated Strategy
Some methods perform spatial information extraction, usually by GCN (either GCN in spectral space or in spatial space) and perform state evolution separately. In [88], to further encode continuous motion variations, the deep features learned from skeleton graphs by GCN in spectral space were gathered along consecutive temporal slices and then are fed into a recurrent gated network. Finally, the recurrent temporal encoding was integrated with the spectral graph filtering and action-attending to jointly train.
R. Zhao et al. [69] performed GCN and LSTM separately, the spatial information from GCN in each frame was directly input into LSTM cell. Z. Y. Xu et al. [91] proposed using RL combined with LSTM as the feature selection network (FSN) consisting of a policy network and a value network. To be precise, both the policy network and value network are based on LSTM for sequential action or value generation. The feature selection is done along temporal dimension and the input features are the spatial features from GCN.
S. Xu et al. [92] worked on two-subjects interaction graphs. After performing GNN on skeleton graphs in one frame to extract spatial information, the attentioned LSTM is preformed on the joint-level, person-level and scene-level so as to pass information in different scales. To leverage these three types of features, a Concurrent-LSTM (Co-LSTM) is applied to further balance their temporal dynamics for action recognition.
M.S. Li et al. [77] used GRU to update the joint features while inferring the future pose conditioned on the A-links and previous actions. The prediction from GRU evolution was then handled and later adopted by GNN.
In the work proposed by J.M. Yu et al. [93], RNN was used as an autoregressive model to predict the hidden state of noisy skeleton graphs. The hidden state was later used to predict action class. Q.Q. Huang et al. [94] worked with the same idea except for changing the basic GNN to attentioned GNN. Others, such as [62,64,95] extract state evolution information similarly after various GNN modules but not based on attentioned GCN.

Bidirectional Strategy
Considering the bi-directional information of video sequence, some use bidirectional LSTM to keep forward information and backward information simultaneously.
In order to utilize the past and future temporal information, X.L. Ding et al. [96] choose the bidirectional RNN to model skeleton sequences and adopt it before extracting spatial information by GNN. To capture the temporal contextual information over frames, J.L. Gao et al. [53] provide a context-aware module consisting of bidirectional LSTM cells, aiming at modeling temporal dynamics and dependencies based on the learned spatial latent nodes.
Except for the basic bidirectional LSTM, J. Huang et al. [97] deployed GCN on LSTM to enhance its ability of extracting spatial features. Precisely, they provided a LSGM that consists of one original LSTM cell followed by two GCN layers. Then, the LSGM was used to build Bi-Direction LSGM modules, which comprises of a forward LSGM and a reverse LSGM. The forward LSGM and reverse LSGM work in parallel, and the outputs from them are added together to pass to the next layer.

Aggregated Block
Some argue that the extraction of spatial information and temporal information can be stacked together as a basic building block; however, they process the spatial information before performing temporal convolution. Papers [89,98] integrated GCN with LSTM, in other words, each gate in LSTM-namely, the input gate, forget gate and output gate-is armed with GCN so as to operate LSTM directly on the extracted spatial information from each frame.

Spatiotemporal Approaches
The methods mentioned above tackle spatial information and temporal information separately. However, spatial information and temporal information are correlated. For example, the similar actions of waking up and lying on the bed have similar spatial information but distributed at different timestamps. Examples in each category are shown in Figure 8.
To work on multiple timescale dynamically so as to take either long term dependencies or short term dependencies into consideration, P. Ghosh et al. [67] also used STG but they allowed flexible temporal connections, which can span multiple timesteps. For example, the joint left arm at timestep t can have connections with left arm joint at timestep (t + 1, t + 2, · · · ) rather than only at t + 1 in ST-GCN. Their method is based on Hourglass (a CNN framework), combined with ST-GCN.
Z.T. Zhang et al. [99] attempted to handle temporal information with two gated temporal convolutional network (TCN), herein 1DCNN and 2DCNN with tanh and sigmoid activation functions working as gates. They argued that TCN will not overfit to some extent since it inherits the stable gradient of CNN. After performing filtering in temporal dimension, the outputs are combined together and then tackled by GCN and MLP.
In addition to making progress on temporal dimensions, some approaches attempted to modify GNN to take multi-scale in spatiotemporal dimension into consideration. Z. Hu et al. [100] established dependence relationships for different bone nodes with a bone joint module, which is based on multiscale dynamic aggregated GCNs. GCNs describe and aggregate the bone joint semantic information. In this way, either the spatial information or the multiscale temporal information are all handled together.

RNN
Based on GCN, to tackle long-term information, W.W. Ding et al. [83] used LSTM as a vertex updater during message passing. Therefore, the features of each vertex will contain the temporal information and thus handle spatiotemporal information simultaneously.

Generated Approaches
The generated approaches cover two categories, one includes self-supervised methods, also known as unsupervised methods, and the other is neural architecture search (NAS), which aims at generating the best model by combining candidate components.
Both categories work in a non-end-to-end way. For the self-supervised methods, they first use priors, like pretext tasks, to generate a pretrained model, and then adapt it to fit the target task. For NAS, it aims at generating a best model on the target task. They emphasize the combinations of given components first, and chose the best model from these combinations. Then, the chosen model will be fine tuned on the target task.

Self-Supervised
Self-supervised learning is a means for training computers without manually labeled data. It is a member of unsupervised learning methods where outputs or goals are derived by machines. The machines are thus capable of labeling, categorizing and analyzing information on their own and then drawing conclusions based on connections and correlations. We classify methods in this category as AE, adversarial learning and teacher-student mechanism. AE M. Li et al. [77] built an A-links inference module (AIM) based on AE, where the output of the encoder is the probability of each joint pair with type-c link, and the decoder requires the output of encoder and joints positions in the previous frame. Thus, the loss of AIM is the difference between part of the input from the encoder and decoder's prediction. In this way, no more labeled data are required during pre-training the AIM except for the input poses.

Adversarial Learning
Inspired by adversarial learning, in [69], they incorporated it into the Bayesian inference framework and formulated it as a prior that targets regularized model parameters as to improve the generalization. The discriminator was implemented as a fully connected layer. The loss function while training is similar as what is adopted in generative adversarial network (GAN).

Teacher-Student Mechanism
For transferring knowledge between two graphs, such as one obtained in the lab and the other from real life, Y.S. Tang et al. [115] used a teacher-student mechanism. The teacher network guides the student network to transfer the knowledge across the weight matrices by a task-specific loss function, so that the relation information is well preserved during transfer. By doing so, no more action labels for the target domain are required during training.

NAS
In addition to self-supervised methods to generate task-specific models, some researchers showed their interest on automatic machine learning (AutoML), among which, NAS has gained more attention.
W. Peng et al. [116] discussed the best architecture of skeleton GCN methods, given components: the dynamic graph modules with various spatiotemporal cues and Chebyshev approximations in different orders. All candidates have residual connections. The proposed NAS framework works to find the most accurate and efficient network. Moreover, instead of providing a pre-defined graph, they generate dynamic graphs based on the node correlations captured by different function modules.
N. Heidari et al. [117] progressively adjusted the model topology by increasing the width of the model layers until the performance converges. If the addition of the last layer does not improve the performance, this newly added layer is removed and the algorithm stops growing the topology.

The Common Frameworks
In addition to the proposed taxonomy, most methods use some specific frameworks to improve their performance.

Attention Mechanism
The attention mechanism helps emphasize the inference related information. Based on the dimension where the attention mechanism is used, approaches are divided as self-attention methods and other attention methods.
In a general attention mechanism, there are three components, namely the query Q, the key K and the value V. After comparing the similarity between the key and the given query, one attention map is obtained and then employed on V so as to select discriminative values. The more similar are K and Q, the higher attention score that the corresponding value will have. Usually, the sigmoid functions or softmax functions are good candidates of similarity functions. One example of an attention mechanism is shown in Figure 10a. Figure 10. Examples of (a) self-attention [118], (b) the skip connections [119] and (c) the effective squeeze-excitation (eSE) block [120], where (b,c) are dense blocks.
6.1.1. Self-Attention Self-attention is the basic component in Transformers, commonly used in natural language processing (NLP). It is well-known for its capability of reweighting features. However, self-attention in NLP specifically includes an attention mechanism for the source domain or the target domain. Here, the definition of self-attention is extended. Since the model for HAR may not always have the target domain and source domain that occurs in NLP, as long as the features used for generating attention maps have the same meaning-in other words, are homogeneous-they are all regarded as self-attention. One common example is the attention maps between joints in skeleton graphs.
The first milestone of this idea is AGCN, which integrates attention with GNN by taking the attention map of the input joints as an added weight matrix. The attention map is then summed to the mask matrix, which is also used in ST-GCN. The self-attention matrix is estimated between input joints. All of the weight matrices are summed together to re-balance the relation of joints. Methods, such as [66,73,93,114,121] also used the same idea.
W. Peng et al. [122] adopted self-attention on joints and measure the similarity with a Gaussian similarity function. J. Shi et al. [123] proposed using the self-attention mechanism to calculate the weighted sum of the values from all nodes so as to aggregate features from the entire graph and also use the gating mechanism to adjust the weight of self-attention.
Y.B. Fan et al. [113] used a cross-attention module that consists of a self-attention branch and a cross-attention branch. Apart from paying more attention to informative joints in the self-attention branch, the combination of these two branches also suppress the influence of joints that are less relevant to the context information. T. Ahmad et al. [124] proposed a self-attention graph pooling to retain local properties and graph structures while pooling.
C. Plizzari et al. [125] highlighted a spatiotemporal Transformer network (ST-TR). In this model, the Spatial Self-Attention module (SSA) was used to understand intra-frame interactions between different body parts. They also adopted a Temporal Self-Attention module (TSA) to model inter-frame correlations.
W. Li et al. [126] proposed using self attention in spatial, temporal and channel dimension, which takes the features after global average pooling and max pooling as the original features, after one-dimensional convolution operation, they are regarded as the query and the key and then used softmax to calculate the attention map.
Instead of simple self-attention, such as what AGCN used, J. Xie et al. [127] integrated channel attention module (CAM) into their Vertex Attention Mechanism (VAM) to extract the global co-occurrence features of actions. The CAM generates channel weights by performing a fast 1DCNN in adaptive kernel size. This aggregated feature is used to generate attention map and later is added to sub-adjacency matrices. Finally, the complete summation of adjacency matrices is applied on the input node features.

Other Attention Mechanism
Instead of calculating the relation between homogeneous features in self-attention, other attentions work on multi-domains, including spatial attention, temporal attention, channel attention and auxiliary data attention.

Spatial Attention
• Joint-level The joint-level spatial attention is the most used attention module, since it helps to re-weight joints and emphasize those task-informative joints. This is especially helpful in discovering the long-distance dependency. For each channel, Y.X. Chen et al. [85] converted the the relationship between local motion pattern and global motion pattern to an attention map, where the local features come from the rescaled graph and the global features come from the original skeleton graph. This can be regarded as a channel-separated joint-wise attention. The other one follows the popular recipe while using LSTM, which emphasizes the informative hidden state extracted by LSTM cell. For example, C. Si et al. [89] integrated the attention operation in LSTM cells. They take the weighted summation of all nodes' hidden states as the query, and use sigmoid similarity function to select discriminative spatial information so that to enhance the information from key joints. • Part-level Some researchers argue that the attention mechanism on joints is too localized and fails to detect the intra-part relation (global topologically relations).
Since each action comprises of multiple interactions that happen in different parts, G. Zhang et al. [129] adopted multi-heads attention, which will generate multiple attention maps so as to focus on different parts. The attention model identifies key joints of every action by introducing two regularization terms, spatial diversity and local continuity. The spatial diversity is multi-head. It works by maximizing the distance between attention maps so as to focus on different parts. The local continuity is controlled by the attention map on graph's Laplacian matrix. Y.F. Song et al. [119] concatenated the features of all parts and perform average pooling in temporal dimension, and then pass them through a fully connected layer with a BatchNorm layer and a ReLU function. Subsequently, five fully connected layers are adopted to calculate the attention matrices and a softmax function is utilized to determine the most essential body parts. Q.B. Zhong et al. [109] emphasized the joints with more motions and propose a novel local posture motion-based attention module (LPM-TAM) to filter out low motion information in temporal domain. This operation helps improve the ability of motion-related feature extraction. The attention map of skeleton sequence in the spatiotemporal graph is represented by the attention of local limbs estimated in temporal dimension.

Temporal Attention
Some approaches argue to use the temporal attention to select the most informative frames or joints that are instructive in temporal dimension. N. Heidari et al. [130] proposed a temporal attention module (TAM) to increase the efficiency in skeleton-based action recognition. It selects the most informative skeletons of an action, in other words, skeletons corresponding to the top T highest attention values at the shallow layers of the network.
Most papers [73,74,97,109,126] calculated the temporal attention map with the most popular recipe, which adopts a sequence consisted of a pooling layer, a fully connection (FC) layer or one 1DCNN layer, followed by Relu or softmax activation functions. For example, L. Shi et al. [73] performed attention by 1DCNN. They first conducted average pooling for features and then processed the result by 1DCNN, and the attention map was calculated by softmax.
H.Y. Yang et al. [74] proposed a temporal and channel-wise attention (TCA) module, among which, the query is the node features after temporal and joint-wise global pooling, while the key is the node features after channel and joint-wise global pooling. The later follows the common recipe of attention. Q.B. Zhong et al. [109] processed the motion feature map by local posture motion-based temporal attention module and further by local posture motion-based channel attention module, with an aim at selecting the strongest discriminative representations between different posture movements.

Channel Attention
Methods, such as [49,73,74,100,120,126,127,131], assumed that features in different channels have various importance, and thus they attempted to balance the importance of each channel while inferring, known as channel-wise attention.
In [49], channel-wise attention was adopted in both GCN and gated CNN for information filtering. Squeeze-and-excitation (SE) block is the specific channel wise attention mechanism and was used after their action-specific graph convolutional module and gated convolution.

Auxiliary Data Attention
In addition to the attention on skeleton graphs, some also include attention on auxiliary data with the hope of completing the information that graph skeletons may be lack of. T. Ahmad et al. [132] used attention on RGB images to generate attention masks. These masks will be used on skeletons to pick attention joints.

Dense Block
The dense block attempts to compress features and also preserve gradients. There are mainly two types of dense block structures. One is the SE block [120], working on channel dimension, which performs a channel-wise attention for adaptive aggregation along the channel dimension. The other one is skip connection, also known as residual connection, which adds more connections between hidden states. One well-known example of skip connection is ResNet. It proposes to pass residual features by shortcut passing.

Skip Connection
The skip connection is in fact a shortcut operation while passing information. By doing so, there are at least two advantages: one is that shortcuts help preserve gradients, preventing gradient vanishing after a long sequence and also ensure stable training. The other is that by carefully designing skip steps, one can control the dependencies along the temporal dimension both locally and adaptively. One example is demonstrated in Figure 10b.
C.Y. Si et al. [136] designed a skip-clip connection by adding shortcuts between the final hidden states from each clip and also between the adjacent skip-clip LSTM layers. In this way, gradients along adjacent clips and gradients among each frame in one clip are all taken care of. Paper [62] used the same idea. Some designed a more complex connection strategy to pass dependencies in multi-paths. Y.F. Song et al. [119] (Figure 10b) proposed three types of skip connections, namely block residual, module residual and dense residual. The block residual, which adds connections between spatial-block and temporal-block, are regarded as the basic component of their modules. The module residual adds connections between modules, and the dense residual combines the connections in block residual and module residual.
Some researchers exploited the benefits of skip connections in different aspects. Xia. H et al. [137] used skip connections to fuse the information of the previous hop and the information of the next hop to collect information in different spatial scales. K. Papadopoulos et al. [112] preserved the information of short-term dependencies by skip connections.

Squeeze-and-Excitation Block
Methods with SE block highlight the non-homogeneous features in each channel and attempt to squeeze features by pooling. One example of effective SE (eSE) is demonstrated in Figure 10c. F. Li et al. [120] concatenate consecutive layers by a one-shot aggregation (OSA) and the effective squeeze-excitation (eSE) block, an improved version of SE, which well balances the performance and efficiency. The eSE explores the interdependency between the output channels and squeezes the temporal and spatial dependencies.
Z.T. Zhang et al. [99] compress and extract multi-channel data through SE block to obtain the structure and feature weight. The weight is fused with the extracted features to obtain the spatial structure features.

Multi-Modalities
Multi-modalities is a very common framework using data in various domains. It can greatly help to improve the efficiency of a method since the weaknesses of one modality can be compensated by the strengths of another. The summary of possible candidates for modalities is shown in Figure 11.

Multi-Stream
Multi-stream is the most commonly used structure in skeleton-GNN-HAR, methods, such as [14,49,52,54,68,69,71,73,87,89,90,93,99,104,114,119,120,123,125,126,134,[137][138][139] all use this framework. This framework utilizes different types of data, such as joint stream, bone stream, part stream, relative coordinates of the joints, temporal displacements. One example is shown in Figure 2. Different types of data make up the information that one single stream is lacking of; however, the more streams that one approach adopts, the higher the computation price is.

Multi-Sensors
Usually the sensors are the same but are fixed at different positions to obtain a multiperspective dataset. For example, Y. Jiang et al. [102] used two Kinect sensors, which were orthogonal to each other, to help extract skeleton graphs from different perspectives.

Semantics
Semantics information is also helpful in enhancing the feature representation capability. For example, P. Zhang et al. [135] used frame index and joint index in cooperation with their model. The joint index is important since two joints of the same coordinates but different semantics would deliver very different information. The temporal information (frame index) is also important to distinguish actions, such as sitting down and standing up from a chair, which are different only in posture occurrence order along the temporal dimension.

Images
J.M. Cai et al. [54] used Joint-aligned optical Flow Patches (JFP) to capture the local subtle motion around each joint. The extracted features are taken as pivotal joint-centered visual information. T. Ahmad et al. [132] used attention on RGB images to generate attention masks, which helps to pick attention joints.

Change Feature Space
Rather than working in the original features space, some approaches attempted to explore a better modeling space for skeleton graphs, such as manifold and spectral space. Examples are shown in Figure 12a [140] are examples, which change space, and (c,d) [141] are examples of neighbor convolution.

Manifold
The idea of manifold is inspired by the non-Euclidean characteristics of skeleton graphs. W. Peng et al. [106] define their model on a Riemann manifold, which they argue is more suitable to model the graph data. Their model is built via the Poincaré geometry to better model the latent anatomy of the data structure.

Spectral Space
One common idea of spectral space approaches is spectral GCN. Some implementations of spectral GCN are based on Chebyshev expansion. For example, methods [88,116] all adopt k-order Chebyshev expansion. J. kao et al. [140] designed graph representations for motion data, which is implemented by performing Graph Fourier transformation (GFT) on Laplacian matrix.
T. Ahmad et al. [124] performed spectral sparsification by exploiting similarity of the original graph, which is in Laplacian quadratic form, and that of the sparsed graph. It aims at discarding some redundant information by eliminating noisy nodes and edges.

Neighbors Convolution
The ways of convolution decide the characteristics of the aggregated features. The traditional way is k-neighbors convolution, while the one-hop convolution used in ST-GCN is the specific case of k-neighbors convolution when k = 1. Others consider to modify the basic weighted summation in traditional convolutions, targeting at containing more structural information. Examples are shown in Figure 12c,d.

The k-Neighbors Convolution
This kind of convolution aggregates the features of the node itself and its neighbors. According to the order of neighbors, the module are classified as one-hop and multihop. The method ST-GCN [13] and related methods [29,96,[110][111][112][113] all adopt one-hop convolution. Methods [55,107,128] change one-hop to multi-hop, since the k-order of neighbors will capture more local information around the center node and also preserve the structural information around it.

Other Convolution
One example is shift-convolution [141]. During shift convolution, the input features are first shifted along one predefined direction along body parts and then are weightly aggregated by 1DCNN. The paper proposes two kinds of spatial shift graph operations for modeling spatial skeleton graph, namely local shift graph convolution and non-local shift graph convolution. The first one's receptive field is specified with the physical structure of the human body, while the later one makes the receptive field of each node cover the full skeleton graph.

Means for Collecting Datasets
At the beginning, the 3D skeletons were directly captured by retro-reflective markers (RRMs), which were adhered to a suit and taken as joints. For example, with the VICON (Oxford, UK), a tracker system comprising RGB cameras, infrared-sensitive cameras and markers, movements and the 3D coordinates of every marker can be captured. This offers exact joint coordinates but no depth maps. Also, the requirement of markers and tracker system makes capturing intrusive and only possible under laboratory context.
Afterwards, as the emerging of DL-based skeleton extractors, e.g., Staked Hourglass [142], Google PoseNet [143], and especially the popularity of ToF cameras (Kinect [144]), binocular disparity-based cameras (Intel Realsense [145]), the marker-suit and complex tracking system are no longer or less necessary. Though the extracted joint position might still be influenced by the environment, for example when the subject's clothes look similar to the background, this makes non-intrusive extraction in real-life possible. In addition to the simulation dataset captured under laboratory conditions, video platforms, such as YouTube are excellent candidates for action datasets, since their users have shared lots of videos under various contexts. Videos collected in this way are closer to real cases where HAR methods will be applied and therefore will help to fill the incompleteness shown in simulation datasets. Apart from collecting data from scratch, existing datasets are free to be used to recreate a new dataset, such as the UOW LSC [146] dataset, which combines various existing datasets.

Dataset Taxonomy
The datasets are grouped as simulation datasets and real-life datasets, each group is divided into single-subject actions, interacted actions and hybrid actions. The details of all mentioned datasets are summarized in Tables A1 and A2 of the Appendix A, where  Table A1 collects the basic information of the datasets, and Table A2 collects the activity list and related methods. The example of each dataset is illustrated in Figures A1-A3 of Appendix A.

Simulation Datasets Single-Subject Actions
MSR Action3D [147]: This dataset consists of 20 actions that interacting with gameconsoles and each is performed three times by seven subjects. The depth maps are captured at about 15 frames per second (fps). The actions are classified into AS1, AS2 and AS3, where AS1 and AS2 intend to group actions with similar movements, and AS3 groups complex actions together. Depth maps and skeletons are offered without RGB videos.
CAD 60 [148]: RGB videos and aligned depth maps are captured by Kinect. Videos are taken in five different environments and the 12 unique activities (composed of several sub-activities) are performed by four actors. These actions cover the daily activities in office, kitchen, bedroom, bathroom and living room.
3D Action Pairs [149]: Six pairs of actions are carefully collected such that the two actions in each pair are similar in motions and shapes, with an aim to emphasize the importance of analyzing the shape and motion cues jointly during action recognition. Actions in every pair have similar trajectories and similar objects. Each action is performed three times by 10 actors. Depth sequences are offered but without skeletons.
MSR DailyActivity3D [150]: The dataset was captured by a Kinect. Sixteen activities in living room were recorded. Each of the ten subjects performs every activity twice: once in standing position, and once in sitting position. The depth maps, skeletons, and RGB videos are recorded; however, RGB videos and depth maps are not strictly synchronized.
CAD 120 [151]: After collecting 120 human-object interaction videos, this dataset annotates each video with the human skeleton tracks, object tracks, object affordance labels, sub-activity labels, and high-level activities. In total, four actors help to perform 10 sub-activities in 10 different contexts, e.g., making cereal, taking medicine, arranging objects.
ORGBD [152]: This dataset targets human-object interaction recognition based on RGBD videos. Seven actions were collected and performed by 36 subjects. The depth map, skeletons and RGB videos were offered. SYSU [153]: It is a human-object interaction dataset performed by 40 subjects. There are 12 different activities, among which, each participant manipulated one of six different objects. The RGB videos, depth sequence and skeleton data were captured by Kinect in one view.
UW-IOM [154]: The University of Washington Indoor Object Manipulation dataset, designed for ergocentric risk detection, comprises videos from twenty subjects. They are classified into 17 action classes. Videos were recorded by Kinect at an average rate of 12 fps. Each participant performed to pick up six objects.
In addition to, the pure human-object datasets summarized above, others below in this section cover other single-subject action dataset (with or without objects).
HDM05 [155]: This dataset was introduced in 2005 and contains more than 70 motion classes performed by five actors. The resulting 3D trajectory data are stored in the C3D mocap file format. To capture videos, six RGB cameras and six infrared cameras were used within a VICON MX system.
IEMOCAP [156]: The Interactive Emotional Dyadic Motion Capture (IEMOCAP) database is an acted, multimodal and multispeaker dataset, collected in the SAIL lab at USC. Video, speech, motion capture of face and text transcription are all provided. During recording, 53 markers were attached to the face of the subjects, and they also wore wristbands, an extra marker on hand and headband. Finally, 9 emotions are labeled. Markers are captured by VICON optical tracker.
TUM [157]: The dataset is recorded in kitchen, tracked by four overhead cameras. The 28-joint skeletons were extracted and formated as BVH files. Since actions happening in a kitchen do not always involve every joint of the entire skeleton, the dataset is labeled separately. In nine actions, only the left hand is labeled. Two actions concentrate on the right hand, and two actions focus on the trunk of the skeleton.
UT-Kinect [158]: The videos were captured by a single stationary Kinect. There are 10 indoor actions, performed by 10 subjects. Each subject performs each action twice. The RGB videos, depth maps and skeleton are recorded.
Florence3D [159]: Captured by a Kinect camera, this dataset includes 9 daily activities. During capturing, 10 subjects were asked to perform the given actions.
N-UCLA [160]: It contains RGB, depth and human skeleton data, captured simultaneously by three Kinect cameras. Ten indoor daily actions are undertaken, each action is performed by 10 subjects.
UWA3D Multiview [161]: This dataset consists of 30 indoor daily activities from ten subjects' performances at different scales, but all are taken in front view. This was captured by Kinect. The self-occlusions and high similarities in this dataset contribute to more challenges.
UWA3D Multiview Activity II [162]: In this dataset, 30 indoor actions are performed by 10 subjects from four different views at different scales. RGB videos, depth videos and skeletons are all provided.
UTD MHAD [163]: One Kinect camera and one wearable inertial sensor (WIS) are used to capture the RGB videos, depth maps and 20-joints skeletons. The dataset contains 27 indoor actions performed by eight subjects and repeated four times.
Additionally, there are some datasets created from other existing datasets. UOW LSC [146]: By combing nine publicly available single-view RGB-D action datasets, this dataset with 94 actions was constructed. Data are all captured by Kinect, including datasets MSRAction3DExt, UTKinect, DailyActivity, ActionPair, CAD60, CAD120, G3D, RGBD-HuDa and UTD-MHAD. There are more viewpoints and subjects in this dataset. Actions include large motions of all body parts, e.g., spinal stretch, raising hands and jumping, and small movements of one part, e.g., head anticlockwise circle.
The following hand gesture datasets are also included into HAR action dataset, actions in these datasets are more similar and thus more challengeable since only the hand joints are recorded.
DHG-14/28 [164]: The DHG-14/28 dataset contains 14 gestures performed by 20 subjects in two ways: using one finger and using the entire hand. The captured sequences are labeled following their gestures, the number of fingers used, the performer and the trial. Each frame contains a depth image, the coordinates of 22 joints both in the 2D depth image space and in the 3D world space. The Intel RealSense short range depth camera was adopted as the capture device.
SHREC'17 [165]: The dataset contains 14 hand gestures sequences performed in two ways: using one finger and the entire hand. In these ways, each gesture is performed between 1 and 10 times by 28 participants. Each frame contains a depth image, the coordinates of 22 joints both in the 2D depth image space and in the 3D world space. The capturing device is Intel RealSense short range depth camera.

Interacted Actions
This part only discusses human-human interaction datasets. SBU Kinect Interaction [166]: Using Kinect, this two-person interaction dataset was collected. Eight interaction actions are covered. Seven subjects perform activities in a laboratory and repeat. Apart from RGB videos and depth maps, the 15 joint skeletons were extracted by OpenNI with NITE middleware provided by PrimeSense.

Hybrid Actions
The dataset comprising hybrid actions include single-subject actions and interacted actions.

Real-Life Datasets
Datasets captured in real-life context are reliable and helpful for the generalization of methods, and also contain more challenges, such as occlusions, various environments. Collective Activity (CA) [171]: CA was proposed for group activity recognition. It contains 44 video clips, which are labeled with six individual action classes (null, crossing, walking, waiting, talking and queueing) and five group activity labels (crossing, walking, waiting, talking and queueing). Each clip has 10 frames, which was recorded by consumer hand-held digital cameras with varying view points. J-HMDB [172]: Joint-annotated HMDB, an annotated subset of HMDB51 database, contains 21 single-person actions. Actions were collected from movies or the Internet. The 13-joint skeletons were extracted by 2D puppet model, while the model was constructed in 16 viewpoints.
Charades [173]: By distributing and crowdsourcing the entire process of video creation from script writing to video recording and annotation, Charades was collected as an indoor daily activity dataset, with 267 subjects performing 157 actions. Over 15% of the videos have more than one person. It can be used for egocentric vision research.
Volleyball [174]: This dataset was collected from publicly available YouTube volleyball videos. The 4830 frames, handpicked from 55 videos, were annotated with nine player action labels and eight team activity labels. The eight activities are right set, right spike, right pass, right winpoint, left pass, left spike and left set, while the nine actions are waiting, setting, digging, falling, spiking, blocking, jumping, moving and standing. Both RGB videos and bouding boxes are provided.
StateFarm [175]: This dataset was provided for a Kaggle competition in 2017, with an aim of classifying the RGB videos of drivers into 10 categories in order to check whether the driver is driving safely.
Kinetics [176]: It was built in 2017 by DeepMind, based on YouTube videos. In total, it contains 400 human action classes, and laterly was expanded to Kinetics-600, Kinetics-700. The actions include human-object interactions and human-human interactions. Skeleton dataset is not included.
ICVL-4 [177]: This is a subset of ICVL [178] real-time video surveillance dataset, with only object-related subsets being picked. Actions are divided into 13 categories, and each is performed by a single subject. Only the RGB videos and action labels are offered.
IRD (Illegal Rubbish Dumping) [177]: This human-object dataset comes from the post-processed videos by CCTV cameras. Actions are classified as garbage dumping and normal actions. All data were captured in unconstrained environments rather than indoor.
HiEve [179]: This dataset focuses on human-centric analysis in various crowd and complex events. Videos are collected from nine different scenes and 32 real-world video sequences are captured. Each subject in a video is annotated with a bounding box, 14 joint skeletons, human identities and human actions. In total, there are 14 action categories.

Performance
This part discusses the properties of datasets and the performance of the methods tested on them quantitatively and qualitatively. Precisely, we divide this section as the statistics of datasets, model performance, and hard activity cases. The datasets cover those summarized in Table A1, and the methods are those in Table A3.

The Statistics of Datasets
We classify the datasets according to action types, and count the methods evaluated on them. The results are shown in Figure 13a,b, respectively. From Figure 13a, most datasets are simulated rather than captured in real-life, and are performed by one subject. Only a few of the datasets focus on pure human-object or human-human interaction.

Model Performance
The model performance can be estimated from the accuracy, model complexity and model size.
• Accuracy The accuracy of methods is shown in Figure 14, with Figure 14a on NTU RGB+D, NTU RGB+D 120 and Kinetics, Figure 14b on SYSU, N-UCLA, MSRAction3D, HDM05 and SBU. Figure 14a demonstrates that the Kinetics dataset is more challengeable (scores are below 50% ) than simulated NTU, considering it is a real-life dataset and only provides RGB videos. Under real-life context, because of the occlusions, illuminations, complex environments etc., it is difficult to infer 3D skeleton graphs accurately. This huge challenge proves that the accurate 3D information is neccesary for skeleton-GNN-HAR. Moreover, it is clear that cross-subjects is more challengeable than cross-views, either on NTU or NTU120. This is because 3D skeletons are view-invariant, and under cross-view case, the 3D skeletons from different view points complement each other. When skeletons are from multiple subjects, the different sizes of subjects, separated clothes etc. all contribute to increase the recognition error. On other datasets, performance on MSRAction3D varies severely. MSRAction3D is challengeable because of the 3D information without RGB videos, and high interaction similarities. Specifically, ST-GCN [13] and ST-GCN-jpd [29] underperform others [12,69], where [12] used temporal pyramid, and [69] takes LSTM as the backbone. Methods that are good at temporal tracking outperform ST-GCN-based methods. This can be explained that the temporal evolution in ST-GCN is handled by CNN.  (Table A3)  The performance (logarithmic) on SYSU, N-UCLA, MSRAction3D, HDM05 and SBU. Each method is denoted by its index in Table A3, marked as RefIndex in the figure. The colors identity each dataset. In (b), the numbers around dots denote [12,13,29,56,59,61,69,72,74,75,83,85,[87][88][89][90]92,97,110,115,135,136,138,141,182] respectively in ascending order.
According to [141], most GCN methods are over 15 FLOPs(G). Figure 15 illustrates the complexity and model size on NTU RGB+D in cross-subjects case. Dots that closed to the bottom right of the figure are the best models, with both lower complexity and lower model size but higher performance. JOLO-GCN [54] adopts multi-modalities-the Joint-aligned optical Flow Patches (JFP) to capture the local subtle motion around each joint, which proves the importance of local subtle motions. Figure 15. The logarithmic complexity and model size of the most popular methods (denoted as dots) performed on NTU cross-subjects. The text around each dot indicates the index of reference of the method. Green texts annotate the methods measured by 'Params(M)', and blue texts annotate 'FLOPs (G)/Action'. The numbers around dots denote [13,14,50,54,55,57,59,62,73,77,80,82,85,86,89,110,116,117,121,127,130,135,141,183] respectively in ascending order. Number 94 [54] is a remarkable one, with both relatively low complexity and small model size.

Hard Activity Cases
Despite many successes, there are hard activitiy cases that require further analysis. Unfortunately, only a few methods list the hard activities (less than 20% in our method set). We present them in this part.
• Similar single subject actions without objects For actions involving no objects, activities are mainly misclassified due to similar motion patterns with low pose resolution or inappropriate standardized axis coordinates. When actions only differ slightly around hand joints, the low hand pose resolution will increase the classification error. For instance, the NTU RBD+D dataset only records three joints for each hand, namely the wrist, the tip of the hand and the thumb. These joints are not enough to help distinguish actions with subtle movements around hands. Therefore, for actions with subtle hand movements, the NTU skeletons are less supportive in recognition. For example, Ref. [83] observed that actions, such as rubbing two hands together and clapping, are easily confused with each other on NTU, Ref. [71] misclassified stand up as check time (from watch) on NTU, Ref. [92] mistook make victory sign as make ok sign, snapping fingers as make victory sign. Ref. [64] misclassified standing and walking especially when the back of subjects faces the camera, which is due to the low pose resolution (missing of joints) caused by self-occlusions. Authors of [64] also discovered that actions like twisting are difficult because the standardized axis coordinates (Cartesian coordinates) erases the subtle rotation around the wrist. • Similar single subject actions with objects Similar human-object interactions usually differ in the subtle movements of hands and have similar action trajectories. Generally, the errors are mainly caused by low pose resolution or lacking object information.
Ref. [54,91] failed while classifying reading, writing, playing with phone/tablet, and typing on a keyboard. The authors argue these actions only differ for hand movements, while the skeletons provided by NTU RGB+D are less supportive for hand joints.
Ref. [114] mentioned that when the body movements are not significant, and the sizes of the objects are relatively small, e.g., counting money, and playing magic cube, and the skeletons only provide three hand joints, the model can easily become confused.
Ref. [89] also blamed the low NTU hands resolution, which leads to misclassify reading as writing, writing as typing on keyboard. As for distinguishing actions with subtle movements of two hands, such as wearing a shoe, taking off a shoe, Ref. [96] failed, and expects more precise hand joints to help. Similarly, Ref. [119] made mistakes on reading and writing, and holds the same opinion for fixing it.
Ref. [71] misclassified stapling book into cutting paper, counting money into playing magic cube. The authors explained that this is because the information about objects is missing. This is supported by [83], where the authors observed that although actions, such as drinking water and brushing teeth, have similar motion patterns, the objects involved are different. Ref. [89] expects that their failure cases, such as reading and writing, can be erased by combining object appearance information. • Human-human interaction In human-human interactions, one important reason of recognition errors given by the method set is occlusions. For instance, Refs. [82,111] observed that if important joints, such as right arms, are occluded, actions, such as handshaking cannot be inferred with the rest of the joints.

Challenges
Though the proposed approaches have shown their efficiency, there are still multiple challenges that call for further investigation.

Real-Life Context
If datasets are collected under real-life context, the colorization, illumination, backgrounds etc. vary, caused by the complex and various environments, the capturing time, the exposure condition of capturing devices etc. These all contribute to increase the error of detected skeletons.
Occlusions is another natural problem in the real-life context, and is caused by surrounded objects, or by the subject himself (self-occlusions). Self-occlusions are common while doing actions, due to the interactions between body parts, such as playing basketball. The most rude way to deal with this situation is simply discarding those skeletons who have occlusions, which will lead to a huge loss. It is not only because of the cost of data acquisition but also the plenty of implicit information among occluded skeletons. One possible suggestion from [183] is RL-based pose refinement.

Pose Resolution
The number of joints are different in different datasets. Normally, skeletons detected by DL methods, such as Openpose, PoseNet, have around 18 or 25 joints. Skeletons captured under simulation context, with RRM markers are more precise because of more joints. While distinguishing subtle movements of two hands, such as clapping and rubbing two hands together, more subtle hand joints are expected [71,96]. One possible solution is to mimic the hand gesture datasets during data capturing so as to provide more subtle hand joints.

Pose Topology
Usually the pose topology is directly extracted from frames, following the physical human skeleton, and is fixed while inputing. Each joint has the similar weights during meassage passing, and follow the prefixed edges. However, Ref. [135] observed that different actions correspond to different informative joints. This indicates that there exist potential connections that are informative but missed by physical connections.
Ref. [137] discovered that the correlations of joints in learned spatial-temporal graph differ for different actions and different frames. Therefore, combining action information, spatial and temporal information with pose topology is one possible way to provide complementary information for recognition. The most common way is using self-attention to reweight joints.

View-Invariant
Human can easily recognize the same action, even when looking from different angles. However, combining the captured skeletons of one action observed from different views and recognizing them as the same action is more challengeable for machines, since one action appears quite differently if observed from another view. Features extracted from one viewpoint cannot be identified from another viewpoint accurately.
Currently, armed with RGBD cameras, reconstructing 3D human skeletons with captured depth videos is quite easy. Once the exact locations of skeleton joints are known, one can directly use estimated transformation matrix to make skeletons strictly viewinvariant as standardized poses. Figure 14a demonstrates that 3D skeletons are helpful for HAR. However, when skeletons in different views are standardized to one pose, some partial relative motions among the original skeletons could be erased. For example, the action rotating wrists might be lost if all skeletons are standardized to one specific direction, such as facing the front.
Moreover, depth information is not always available. When only 2D skeletons and 2D videos in different views are provided, it is even more difficult to preserve view invariance due to the lack of 3D information.

Multi Spatial Scale
Most approaches only extract spatial information at the joint scale, without considering feature extraction at multiple scales. However, for different actions, features just from joint scale are not subtle enough for accurate recognition. For example, actions cutting and writing all move hands and arms. If all joints are taken into consideration equally, rather than paying more attention to arms and hands, the result will under-perform since many local information will be ignored. Therefore, capturing local features without losing global features is critical. One hint to solve this is based on skeleton partition, such as part-based partition [57,64,86,95].

Multi Temporal Scale
Temporal scale also contains implicit features, e.g., the actions race walking and walking are visually similar; however, if one compares the frames at different time scales, they are different. Moreover, the order of frames (timesteps) is also important when taking sitting down and standing up into consideration. To deal with this, some previous works performed dilated temporal convolution [55,84], while others simply sampled specific frames [104,137].

Multi Subject Scale
There are two major problems caused by this situation, one is for interacted actions, how to balance features from inter subjects and features from intra subjects. Paper [92] offers a hint to perform attentioned LSTM in joint-level, person-level and scene-level so as to trade-off both the intra-subject and inter-subject information. The other is when there are numerous subjects, such as in the Kinetics dataset. From the performance of methods on NTU RGB+D (Figure 14a), the cross-subjects case is more challengeable compared with the cross-views case.

Multi-Modalities Fusion
Understanding video content precisely benefit from multimodalities. Ref. [123] proves that each modality (skeleton, audio, text) had its unique strengths, and believes that a better fusion strategy may improve performance. Ref. [73] clarifies the complementarity between the skeleton modality and the RGB modality. Ref. [113] argues that, except for RGB video, video captions can also provide context information.
However, the challenges in modalities fusion come from resolution, fusion strategy, alignment etc. Inspired approaches for solving these problems include [54], which proposes JFP (optical flow patches) sequences to capture the local subtle motion information, Ref. [184] that used speech texts, Ref. [185], which used context information, and AdaMML [186] adaptively selected which modality to use for each video segment.

Inner Heterogeneous
From Section 3, we know that graph edges are heterogeneous if considering temporal edges, spatial edges etc. Normally, the GNN layer is used on the entire graph rather than treating these edges heterogeneously. Traditional methods, such as the recipe combining of GNN and LSTM, treat spatial edges and temporal edges separately; however, they ignore the inter relations of spatiotemporal data. Tackling the heterogeneous nodes or edges in graphs but without ignoring the original relations among them has created a large question for the scientific community.
For that case, X. Gao et al. [75] suggested distinguishing intra-frame and inter-frame. Precisely, joints in intra-frame are grouped as weakly connected and strongly connected joints, and their corresponding edges are further emphasized by two constant weights. Temporal edges in inter-frames are also assigned with two constant weights to distinguish edges between corresponding joints and edges between each point and the neighborhood of its correspondence in the adjacent frames.

Interactions
Most methods focus on single skeleton graph-based HAR. However, in addition to this simple case, there are two more complex cases: the human-human interaction and human-object interaction, errors among them are not the same [187].

Human-Human Interaction
For human-human interactions, such as hugging, kicking, there are at least two skeletons in each frame. In some complex cases, such as walking through zebra crossing, more skeletons are obtained. Either grouping multiple skeletons to detect sub-activities or handling local and global features simultaneously all can be a problem.
Coping with this, papers [48,187,188] concentrate on relative features, paper [101] splits the motion model into single person and double person motion model, paper [115] attempted to transfer the knowledge between the two interacted skeletons by the teacherstudent mechanism. Ref. [125] discovered the long-range relations on two people actions benefit from self-attention, because self-attention detects the correlation along the entire action. Ref. [189] used distance, orientation and O-space features to describe the relations between subjects.

Human-Object Interaction
There are two main tasks in human-object interactions. One is object detection, the other is HAR according to the semantics of object detection. Human-objects interactions that have subtle movements of hands can be easily misidentified [83,91,92]. However, Ref. [61] observed that almost no skeleton motion and the differences are illustrated as human-object interaction.
Some existing papers give inspiration. S. Kim et al. [177] split the problem into two sub-problems and build two streams upon it, which are the human pose stream and the object-related pose stream. X. Shi et al. [190] treated the related objects as joint points and linked them to hands. Refs. [61,89] recommend to combine object appearance for similar movements, Ref. [119] preferred adding more hand joints while collecting datasets.

Conclusions
In this paper, we thoroughly analyzed skeleton graph-based HAR. Concretely, the types of skeleton graphs and the means to build graph structures were discussed. We proposed a new taxonomy that classifies the skeleton-GNN-based HAR into spatial approaches, spatiotemporal approaches and generated approaches. The most common frameworks were also summarized, including the attention mechanism, dense blocks, multi-modalities, changing feature space and neighbor convolution. Finally, the most frequently used datasets were also collected and described.
Based on the previous discussion, the possible future directions are apparent. Usually, the built graphs are undirected graphs, and many researchers attempted to develop models based on ST-GCN or AGCN, the two milestones in this field. Additionally, the two frameworks (the attention mechanism plus skip connections and multi-stream with multi-modalities) were preferred. The details are shown in Figure 16.