Video Abnormal Behavior Recognition and Trajectory Prediction Based on Lightweight Skeleton Feature Extraction

Video action recognition based on skeleton nodes is a prominent topic in the computer vision field. In real application scenarios, the large number of skeleton nodes and the occlusion of behavior between individuals seriously affect recognition speed and accuracy. Therefore, we propose a lightweight multi-stream feature cross-fusion (L-MSFCF) model to recognize abnormal behaviors such as fighting, vicious kicking, and climbing over a wall. The model significantly improves recognition speed through lightweight skeleton node calculation, and improves recognition accuracy through occluded skeleton node prediction, effectively addressing the behavior occlusion problem. The experiments show that our proposed All-MSFCF model achieves an average video action recognition accuracy of 92.7% across eight kinds of abnormal behavior. Although our proposed lightweight L-MSFCF model has an 87.3% average accuracy rate, its average recognition speed is 62.7% higher than that of the full-skeleton recognition model, making it more suitable for real-time tracing problems. Moreover, our proposed Trajectory Prediction Tracking (TPT) model can predict moving positions in real time based on dynamically selected core skeleton nodes; in particular, short-term predictions within 15 and 30 frames have lower average loss errors.


Introduction
Video surveillance technology is widely used in daily life for public safety management. Abnormal behavior recognition and tracking technology, as an important video surveillance application, has great research significance and practical value. Abnormal behavior has different definitions in different scenarios; in our research, the abnormal behavior studied is drawn from the field of public security. Based on the definitions of "behavior" and "abnormal", the abnormal behavior classification is clarified. "Behavior" [1] refers to the most basic and meaningful interactions of individuals with their surrounding environment. "Abnormal" [2] refers to phenomena that differ from the normal state. Therefore, we give the following definition of "abnormal behavior": any action, gesture, or event that is not suitable for the current scene.
Abnormal behavior recognition and tracking technology can improve the efficiency and accuracy of video surveillance, reduce operators' workloads, and detect and handle abnormal behavior events early. Abnormal behavior recognition and tracking based on skeleton nodes is one of the important methods in current research. The detection of skeleton nodes continues to improve, and it is increasingly becoming one of the core technologies of intelligent video surveillance. However, in real application scenarios, the large number of skeleton nodes and the occlusion between individuals seriously affect recognition speed and accuracy, limiting the application of abnormal behavior recognition and tracking algorithms. Therefore, we propose the LSFE method. Before feature extraction, to handle occluded skeleton node information, the L-MSFCF model is proposed; it uses the skeleton node information of past frames to predict the occluded skeleton node information, thus improving accuracy. For tracking abnormal targets that become occluded or disappear from view, we propose the TPT model.
The structure of this paper is as follows: the background of abnormal behavior recognition is provided in Section 1, and the current status of the field is introduced in Section 2. The study's motivation, materials, and procedures are presented in Section 3. The LSFE method is described in full in Section 4. A comprehensive explanation of the L-MSFCF model is given in Section 5. The TPT model is explained in depth in Section 6. The experimental results and analysis are presented in Section 7. A thorough discussion is given in Section 8, and the study is concluded in Section 9.

Related Work
Computer vision techniques are gradually becoming the mainstream of abnormal behavior recognition, and the main challenge is to accurately extract and analyze representative appearance features and dynamic motions. In the early stages of the research, objects were typically treated as particles. By simulating the tension in every pixel, Mehran R et al. [3] created a particle flow network to extract interaction forces as features from video data. To capture the space-time properties of a crowd, new global features were proposed by Xie S et al. [4] to describe the position, speed, and direction of particles. Furthermore, Yu B et al. [5] enhanced the representation capacity of the particles by utilizing several comparable particles to describe objects. However, these feature-extraction techniques were unable to recover the subtle aspects of motions. To better capture motion information, numerous researchers turned to feature extraction from space-time cubes: Sabokrou M et al. [6] treated the sub-region of continuous frames as a space-time cube and extracted 3D gradient characteristics for the cubes, and Fayyaz M et al. [7] collected global features from space-time cubes using an auto-encoder. Martinel N et al. [8] extracted deep features by rebuilding the cubes of interest using stacked sparse auto-encoders. Since features extracted from space-time cubes do not maintain the correlation of motion features between cubes, Coşar S et al. [9] learned velocity and trajectory from real tracking data pixels and clustered the trajectories using a clustering tree to predict the most probable paths of the tracked objects. Xu M et al. [10] captured group and personal trajectories at the same time and performed separate abnormal behavior detection.
Target motion trajectory prediction algorithms are vital to research in the computer vision and robotics areas. The purpose is to forecast a target's future motion trajectory from existing motion data and environmental information so that a robot or other intelligent system can react appropriately. Kerdvibulvech C et al. [11,12] proposed a method of 3D human motion analysis for reconstruction and recognition. They used 3D gait signatures computed from 3D data obtained with a triangulation-based projector-camera system, and their results demonstrated that the proposed 3D gait signature-based biometrics provide valid results on real-world 3D data. Houenou A et al. [13] combined trajectory prediction based on maneuver recognition with trajectory prediction based on constant yaw rate and acceleration motion models, and Czyz J et al. [14] proposed a mixed-value sequence state estimation algorithm. Shao X et al. [15] presented a unique filtering technique to follow a target's movement utilizing GPS sensors. Vashishtha D et al. [16] and Kapania S et al. [17] improved particle filtering, combining color sequences and constrained Bayesian state estimation to achieve motion trajectory prediction of the target. Choi D et al. [18] proposed a method using a maximum-likelihood multi-filter to obtain an overall estimate of the target trajectory by combining independent multiple kinematic model correlation estimates through a maximum-likelihood rule. Predicting target trajectories by building kinematic and kinetic models does not lose accuracy even when a large part of the data is missing, but parts of a target's motion are nonlinear and prone to many curvilinear trajectories, so model-based trajectory prediction algorithms suffer from low accuracy. Another important line of work is data-driven trajectory prediction [19], which uses both classification and regression algorithms to treat trajectory prediction problems. Semwal et al. [20] suggested a target trajectory prediction technique based on long short-term memory networks (LSTMs) and convolutional neural networks (CNNs). The deep neural network of Shirazi M S et al. [21], the Faster R-CNN of Zhou H et al. [22], and the YOLO network of Yoon Y C et al. [23] also perform well in target trajectory prediction.
To solve the occlusion problem, Sabokrou M et al. [24] first applied fully convolutional neural networks to abnormal behavior detection; they used AlexNet's fully convolutional layers to extract deep features and cascaded Gaussian classifiers to identify abnormal behaviors. Chu W et al. [25] extracted temporal characteristics using 3D convolutional neural networks, and Liqian Yan [26] suggested a 3D convolutional residual network structure in light of this. To lessen the load on the network, Fang Z et al. [27] described the motion characteristics of the footage using a visual system to define spatial features together with a multi-scale histogram of optical flow. Ye O et al. [28] extracted initial features through a CNN-LSTM network and used a feature expectation subgraph to filter unexpected feature values; the remaining predicted feature values were fed into an SVM to detect abnormal behavior. Tay N C et al. [29] created a shallow convolutional neural network to extract appearance characteristics, added spatial attention, and integrated it with an LSTM network.
Although previous work can handle target trajectory prediction and occlusion problems effectively, accuracy and time complexity still need to be improved. Therefore, further research on abnormal behavior recognition is needed.

Motivation
In our research, lightweighting the skeleton is an effective way to cope with the effects of an excessive number of skeleton nodes.In addition, we found that the multi-stream feature cross-fusion method has significant advantages in feature extraction.Therefore, the flowchart of abnormal behavior recognition and tracking is demonstrated in Figure 1.

Datasets
The experiment utilized the Human3.6M dataset, comprising 3.6 million 3D human posture examples and their related images. These data were collected from six males and five females across 17 diverse scenes, such as discussions, smoking, and taking photos. The videos were captured by four calibrated cameras capable of recording precise 3D joint positions and joint angles. For more information about the Human3.6M dataset, see [30]. The UCF-Crime dataset [31], a vast collection of actual surveillance footage containing 1900 long, unedited recordings with 13 distinct kinds of abnormal events, was also used in the studies. Furthermore, the ShanghaiTech Campus dataset [32] was employed; it includes over 270,000 training frames and 130 occurrences of abnormal events.

Methods
This paper studies abnormal behavior recognition and tracking in surveillance videos. First, the skeleton nodes of various behaviors are lightweighted by the LSFE method, and optimal skeleton node architecture graphs are constructed for the various behaviors. Second, we construct the L-MSFCF model for abnormal behavior recognition; after predicting the information of the occluded skeleton nodes, it takes the lightweight feature skeleton coordinate information and the skeleton vectors as dual-stream inputs and uses cross-feature fusion to carry out feature extraction. Finally, the TPT model is proposed for trajectory prediction, providing a reference for the tracking of abnormal behavior targets.

Skeleton Feature Extraction
We propose a lightweight skeleton feature extraction (LSFE) method to solve the problem of a large number of skeleton nodes. First, we design an adaptive computation of the video frame window and find the optimal video frame window length for optimizing the skeleton nodes. Second, we design a trigonometry-based formula for calculating the skeleton nodes, triangulate all the skeleton nodes, and find the motion laws of the individual behavioral movement process using an association rule mining algorithm under the optimal video frame window length. Finally, we identify the skeleton nodes that can represent the action by data mining and filter out the redundant skeleton nodes, thereby achieving skeleton node optimization.

Data Preprocessing
The definition of behavior in this paper divides behavior into two categories: normal behavior and abnormal behavior. We construct a normal behavior video database and an abnormal behavior video database. The definitions are as follows (Table 1).

N (Normal behavior)
Behavior that is consistent with an individual's usual behavior in the current scene.

A (Abnormal behavior)
Can be divided into two kinds: one is the disturbance of order in public places, and the other refers to criminal acts. Examples: fighting, vicious kicking, climbing over walls, throwing suspicious objects, and slashing devices.
Before action recognition based on skeleton joint points, it is necessary to convert the original video data into skeleton joint data, from which the spatial representation of the action is detected and recognized. Through an existing posture estimation algorithm, the video data can be transformed into corresponding skeleton joint data, and the skeleton node corresponding to each number in Figure 2 is as follows: 0-nose, 1-neck, 2-right shoulder, 3-right elbow, 4-right hand, 5-left shoulder, 6-left elbow, 7-left hand, 8-right hip, 9-right knee, 10-right foot, 11-left hip, 12-left knee, 13-left foot, 14-stomach, and 15-head. Because the different angular positions of the subject relative to the camera can produce differences in the coordinate origin, we apply a unified coordinate transformation to the skeleton data. We reconstruct the coordinate system with the triangle formed by the three points v_1, v_2, and v_3 in Figure 3. The points v_1, v_2, v_3 in space form a triangle with three sides l_1, l_2, l_3, and v_t is set as the projection point on the side l_3. Through Equation (1) we obtain the three basis vectors of the transformed coordinate system.
v_1, v_2, v_3 - the three skeleton joints in Figure 3; v_t - the projection point on the side l_3; U_t - the three basis vectors of the transformed coordinate system.
The conversion process also requires the three basis vectors of the original coordinate system, represented in Equation (2).
Using Equation (3), the three basis vectors of the original coordinate system and those of the transformed coordinate system are combined to obtain the corresponding transformation matrix.
U_t^{-1} - the inverse matrix of U_t. Using the transformation matrix, the original coordinates are transformed by Equation (4).
v - a skeleton node in the original coordinate system; v' - the corresponding transformed node; R - transformation matrix; v_1 - new coordinate origin.
During the transformation, we designate the origin of the new coordinate system as v_1 in the existing coordinate system; all 3D skeleton coordinates are transformed by the above equation into new skeleton data with v_1 as the origin.
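The transformation can be sketched in code. Since Equations (1)-(4) are not reproduced here, the basis construction below is an assumption: a minimal sketch that builds an orthonormal frame from the triangle v_1, v_2, v_3 (one axis along a triangle side, one along the plane normal) and re-expresses every node with v_1 as the origin; the exact basis in the paper may differ.

```python
import numpy as np

def basis_from_triangle(v1, v2, v3):
    """Build an orthonormal basis (rows of U) from three non-collinear
    skeleton joints; a plausible reading of Equations (1)-(3)."""
    e1 = v2 - v1
    e1 = e1 / np.linalg.norm(e1)          # first axis along one triangle side
    n = np.cross(v2 - v1, v3 - v1)
    n = n / np.linalg.norm(n)             # axis normal to the triangle plane
    e2 = np.cross(n, e1)                  # completes the right-handed frame
    return np.stack([e1, e2, n])

def transform_skeleton(V, v1, v2, v3):
    """Re-express every skeleton node with v1 as the new origin (Equation (4))."""
    R = basis_from_triangle(v1, v2, v3)
    return (V - v1) @ R.T

V = np.array([[0., 0, 0], [1, 0, 0], [0, 1, 0], [2, 3, 1]])
out = transform_skeleton(V, V[0], V[1], V[2])
# the new origin maps to (0, 0, 0)
```

Any orthonormal basis derived from the same triangle would serve the paper's purpose of making the skeleton pose-invariant; only the choice of axes is our assumption.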

Adaptive Sliding Window Selection Calculation
Activities are characterized by continuity and periodicity. However, the length of this cycle cannot be determined in advance; therefore, this paper uses an adaptive sliding window selection calculation to determine a cycle length that meets the requirements. First, we apply the method to the segmentation of action sequences. For the action sequence a = a_1, a_2, ..., a_n, we set the window width to T and the step size to K, so that each window contains R_i = r_{i1}, r_{i2}, ..., r_{iT}. In this way, the initial action sequence is divided into F_T action segments, represented as R = A_1, A_2, ..., A_{F_T}. Each segment contains T poses describing the local information of the body. Figure 4 shows the complete process of segmenting an action sequence by adaptive sliding window selection; the original video frames are taken from the Human3.6M dataset [30]. The window width parameter T determines the size and the number of segments that can be obtained from an action sequence. A larger T means that each segment contains more poses and gives a coarser description of the movement; conversely, a smaller T means that each segment contains fewer poses and gives a more precise description. Although a smaller T describes movements more accurately, smaller segments are more susceptible to noise in the 3D skeleton position tracking results, which in turn affects the recognition of movements. Defining the set of stored window sizes as L, we calculate the accuracy of each action for window sizes from 3 to 23 and store the top three window sizes for each action in L; the window size with the highest number of occurrences across all actions is then taken as the window length. The most frequent size is 15, so a window length of T = 15 is taken as a basic action sequence in this paper.
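The segmentation described above can be sketched directly. `sliding_windows` is a hypothetical helper name of ours; it slices an action sequence into windows of width T with step K, matching the paper's T = 15 choice.

```python
def sliding_windows(seq, T=15, K=1):
    """Segment an action sequence into overlapping windows of width T,
    advancing K frames per step (adaptive sliding window segmentation)."""
    return [seq[i:i + T] for i in range(0, len(seq) - T + 1, K)]

# a 20-frame sequence with T = 15, K = 1 yields 6 windows of 15 frames each
windows = sliding_windows(list(range(20)), T=15, K=1)
```

Each element of `windows` plays the role of one segment R_i above; a pose-estimation frame (a list of 16 joint coordinates) would replace the integer placeholders in practice.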

Lightweight Skeleton Feature Extraction Method (LSFE Method)
We put forward the lightweight skeleton feature extraction (LSFE) method, based on association rule mining over similar vectors. It converts the 3D skeleton data into a series of vectors with a length of 15 frames using the adaptive sliding window selection calculation, and then uses vector similarity to mine the similarity association rule set of each node. If a similarity association rule set exists, the skeleton node is considered a strongly associated skeleton node of the action. The computation is as follows: Step 1: Take any skeleton node v_i from the original skeleton node data V. Define the node data of the following two frames as v_{i+1}, v_{i+2}, and the 3D coordinates of v_i, v_{i+1}, v_{i+2} as (x_i, y_i, z_i), (x_{i+1}, y_{i+1}, z_{i+1}), (x_{i+2}, y_{i+2}, z_{i+2}).
Step 2: Calculate the angle change between the skeleton nodes v_i, v_{i+1}, v_{i+2} in the time dimension according to Equation (5).
α, β, θ - the values of the angle change. The plane ρ can be obtained through the skeleton nodes v_i, v_{i+1}, v_{i+2}, as shown in Equation (6).
a, b, c, d - the plane equation parameters. The normal vector n = (a, b, c) of the plane ρ can then be obtained, and the distance R from the origin of the space coordinates to this plane follows from Equation (7).
After obtaining the height R of the proposed triangular pyramid, the volume X_i is calculated according to Equation (8).
α, β, θ - the three included angles of the proposed triangular pyramid; R - height of the proposed triangular pyramid; X_i - volume of the proposed triangular pyramid. According to Equation (8), we can find the set of vectorized data X_i^{M_j} = {X_i} for a certain skeleton node v_i of a single video M_j of a certain action; the vectorized dataset over all data of that skeleton node for the action is then denoted as in Equation (9).
X_i^{M_j} - the vectorized dataset of the ith skeleton node of the jth video under the action classification; M_i - the vectorized set of all data for a certain skeleton node.
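Equations (6) and (7) amount to fitting a plane through three joint positions and measuring its distance from the origin; a minimal sketch follows (function names are ours, and the volume formula of Equation (8) is not reproduced since its exact form is not shown here).

```python
import numpy as np

def plane_through(p1, p2, p3):
    """Plane rho through three skeleton positions: returns (a, b, c) and d
    with a*x + b*y + c*z + d = 0, as in Equation (6)."""
    n = np.cross(p2 - p1, p3 - p1)   # normal vector n = (a, b, c)
    d = -np.dot(n, p1)
    return n, d

def distance_origin(n, d):
    """Distance R from the coordinate origin to the plane, Equation (7)."""
    return abs(d) / np.linalg.norm(n)

n, d = plane_through(np.array([1., 0, 0]),
                     np.array([0., 1, 0]),
                     np.array([0., 0, 1]))
R = distance_origin(n, d)
```

For the three unit-axis points, the plane is x + y + z = 1 and R = 1/sqrt(3), which is easy to verify by hand.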
Step 3: Construct the frequent item set. Scan all the {X_i} data in the set M_i in a single pass to determine the support of each {X_i}. Since each {X_i} is vector data, a similar-vector calculation rule is used: if two vectors are similar, their frequency increases by one. The similar-vector criterion is shown in Equation (10).
X_a - data a of a certain skeleton node in the frequent item set; X_b - data b of that skeleton node in the frequent item set; cos δ - similarity of the vectors X_a and X_b. X_a and X_b are two vectors of the same length, and cos δ lies between 0 and 1. When cos δ > 0.9, the vector X_a is considered similar to the vector X_b.
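The support counting of Step 3 under the cos δ > 0.9 rule can be sketched as follows (`support` is a hypothetical helper name; the paper's exact scan may differ):

```python
import numpy as np

def cos_sim(a, b):
    """Cosine similarity of two equal-length vectors, Equation (10)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def support(items, threshold=0.9):
    """For each vector, count how many other vectors in the set are
    'similar' to it under the cos delta > threshold rule."""
    counts = []
    for i, a in enumerate(items):
        counts.append(sum(1 for j, b in enumerate(items)
                          if i != j and cos_sim(a, b) > threshold))
    return counts

counts = support([np.array([1., 0]), np.array([1., 0.1]), np.array([0., 1])])
```

Here the first two vectors point in nearly the same direction (cosine about 0.995), so each supports the other once, while the third is orthogonal to both and gains no support.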
Step 4: Mine the association rule sets. Define the association rule set of an individual's behavior as a set of key-value pairs. Mine the 16 skeleton nodes of an individual to obtain the association rule set L for a single node v_i, and add L to Y_i. When Y_i ≠ ∅, (v_i : Y_i) is stored in the association rule set J as a key-value pair; Y_i = ∅ means that the current skeleton node has no obvious regularity, cannot represent the behavior action, and should be discarded. Define the final association rule set J = {(v_i : Y_i) | v_i ∈ V, Y_i ≠ ∅}; the set of all skeleton nodes in J is {v_i}, a non-empty subset of V.
Step 5: Determine the lightweight skeleton nodes. Let n be the size of the maximum frequent item set in J. When n > µ, v_i is considered a feature skeleton node of the current action. Finally, calculate all feature skeleton nodes in this way.
The highest accuracy of skeleton node recognition is when µ = 3.The extracted lightweight skeleton nodes for each action are shown in Table 2.
Step 6: Based on the above lightweight skeleton nodes, construct the LSFE model to recognize actions and verify the feasibility of lightweight skeleton nodes.

Lightweight Multi-Stream Features Cross-Fusion Model (L-MSFCF Model)
Lightweight feature skeleton node extraction is a core processing step supporting the L-MSFCF model; compared with full-skeleton processing, it greatly reduces the number of model parameters and the computation time. In fact, as shown in the experiments of Section 7.2 (Section 4, LSFE model testing), recognizing video behaviors from the optimized skeleton nodes alone does not give ideal accuracy. In order to improve recognition accuracy and further reduce computation time, our proposed L-MSFCF model enhances the lightweight features through a multi-stream feature cross-fusion process to obtain more behavior feature information.

L-MSFCF Model Abnormal Behavior Recognition Process
The L-MSFCF model differs from traditional multi-stream feature fusion action recognition methods: it processes the occluded skeleton nodes and also utilizes a feature cross-fusion extraction method. First, the skeleton nodes are lightweighted. Second, the occluded skeleton node information is predicted using the skeleton node data of past frames. Finally, action features are obtained through the skeleton stream, the node stream, and the feature cross-fusion stream. The L-MSFCF model strengthens the recognition of abnormal behaviors.
The L-MSFCF model's abnormal behavior recognition process has two main steps: the first is occluded skeleton node prediction and lightweight processing; the second is lightweight skeleton data feature extraction through a dual stream, after which feature fusion is performed on all the features to obtain the final classification results. Figure 5 shows the flowchart of the L-MSFCF model. The steps are as follows: Step 1: Preprocess the skeleton data. Create a skeleton joint dataset and a skeleton vector dataset. Because each skeleton vector is composed of two skeleton nodes and the whole skeleton graph is not a ring structure, the number of skeleton vectors is always one less than the number of skeleton nodes. We therefore add an empty skeleton vector with the value 0 so that there are as many skeleton vectors as skeleton nodes.

Step 2: Lightweight the skeletons. Lightweight skeleton data are based on the lightweight feature skeleton nodes for each action. Skeleton node data and skeleton vector data are processed similarly; we take skeleton node data as an example. According to Table 2, the corresponding feature skeleton node information is retained and the other skeleton information is set to 0. Taking fighting as an example, its original skeleton data for a certain frame are expressed as in Equation (11).
Step 3: Determine whether a lightweight skeleton node is occluded; if the space coordinates of the skeleton node are all 0, the node is determined to be occluded, and its data are predicted.
Step 4: Process the skeleton node data and skeleton vector data separately by convolution to obtain features that can represent each action.
Step 5: Combine the skeleton node features and skeleton vector features to form the overall action features utilizing feature fusion.
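Steps 2 and 3 above can be sketched together. The feature-node set below is a placeholder for illustration, not the actual Table 2 entry for fighting.

```python
import numpy as np

# Hypothetical feature-node set for one action; Table 2 lists the real ones.
FEATURE_NODES = {2, 3, 4, 5, 6, 7}

def lightweight(frame, feature_nodes):
    """Step 2: keep only the feature skeleton nodes, zero out the rest
    (the form of Equation (11))."""
    out = np.zeros_like(frame)
    idx = sorted(feature_nodes)
    out[idx] = frame[idx]
    return out

def occluded(frame, node):
    """Step 3: a node whose space coordinates are all zero is treated
    as occluded and is sent to the prediction module."""
    return bool(np.all(frame[node] == 0))

frame = np.arange(1, 49, dtype=float).reshape(16, 3)   # 16 nodes x (x, y, z)
lw = lightweight(frame, FEATURE_NODES)
```

Note that zeroing and occlusion share the same encoding (all-zero coordinates), which is why the occlusion test in Step 3 is applied only to the retained feature nodes in practice.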

Occluded Skeleton Node Prediction
Occluded skeleton nodes introduce noise into abnormal behavior recognition and reduce accuracy. To solve this problem, we propose a generative-network-based method for occluded skeleton node prediction, which utilizes the skeleton node data of past frames to predict the skeleton node information of the next frame.
The advantages over existing methods are: the GRU at the lowest level can learn the motion information of the smallest unit frame without interference from higher levels, and the higher levels can capture different features of the motion of specific length frames; moreover, the latest GRU outputs from different levels are used as inputs during the prediction period of each time step, which makes the motion information more adequate and the features of the next frame more comprehensive.
In Figure 6, the skeleton data of the previous, current, and future frames are represented by the vectors e_{t-1}, e_t, e_{t+1}; the predicted skeleton data at moments t and t + 1 are ê_t and ê_{t+1}. The skeleton data for each time step are fed as a series of inputs to the GRU units of the first layer. We define K distinct GRU unit sequences at the second level, each of which accepts inputs only from the first level's GRU units at matching time steps. If K = 2, for instance, the second layer contains two GRU sequences: the first derived from the frames t = {1, 3, 5, . . .} and the second from the frames t = {2, 4, 6, . . .}. GRUs at the same hierarchical level share weights, which improves the learned characteristics of the skeleton data and strengthens long-term dependency learning. There are K^2 GRU sequences at the third layer, because each of the K GRU sequences at the second layer has K different GRU sequences corresponding to it at the third layer; each GRU sequence takes the inputs from its parent sequence whose indices are congruent modulo K. This process of creating new, higher-level GRU sequences continues up to level M, where K^{M-1} GRU sequences exist. Finally, a two-layer fully connected network is introduced to produce skeleton vector predictions from the hidden units of all hierarchies; these predicted skeleton vectors then serve as inputs to the skeleton vector prediction of subsequent frames.
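The modulo routing of frames to GRU sequences described above can be illustrated with a small helper (our own construction, using 0-indexed frames, with level m holding K^(m-1) sequences):

```python
def assign_levels(num_frames, K, M):
    """Map each frame index to the GRU sequence that receives it at every
    level of the hierarchy: at level m there are K**(m-1) sequences, and
    frame t feeds sequence t % K**(m-1)."""
    table = {}
    for m in range(1, M + 1):
        n_seq = K ** (m - 1)
        table[m] = {s: [t for t in range(num_frames) if t % n_seq == s]
                    for s in range(n_seq)}
    return table

routing = assign_levels(num_frames=6, K=2, M=3)
```

With K = 2 and six frames, level 2 splits the frames into the two alternating sub-sequences described in the text, and level 3 splits each of those again, so every level-3 sequence sees every fourth frame.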

Lightweight Multi-Stream Feature Cross-Fusion Process
Behavior recognition networks with multi-stream feature fusion, such as the dual-stream network 2s-AGCN [33], typically use single-stream networks to extract characteristics independently before fusing them. Such feature fusion performs weight fusion only at the end, and the average pooling layer overrides the fusion step, so the network cannot fully exploit each stream's features. To solve this problem, this subsection proposes the L-MSFCF model, which performs feature cross-fusion during pooling to fully utilize each tributary's features. The model is introduced in two parts: the network architecture and the basic convolution module.
The whole L-MSFCF network consists of three sub-stream networks: the skeleton vector stream network, the skeleton joint stream network, and the feature cross-fusion stream network. Each sub-stream network uses the 2s-AGCN graph convolution network as its backbone. Either joints or skeleton vectors can be used as input data. Formally, the skeleton sequence data are V ∈ R^{C×T×S}, where C, T, and S denote the channel, time, and space dimensions, respectively. Spatial characteristics are extracted from the input data by the spatial stream network. The features of shallow sub-networks contain much inaccurate and localized information; conversely, features in the network's deeper levels contain less false information and more global information. Many conventional networks are bottom-up, end-to-end systems that employ only a subset of top-layer features and therefore lack the local information that facilitates action recognition classification. For this reason, the network proposed in this paper selects features from multiple layers; features extracted from different levels have different receptive fields and contain different local and global information. The whole feature fusion process is as follows: Step 1: Collect the skeleton vector features from the skeleton vector stream network. The skeleton joint stream network is almost identical to the skeleton vector stream network, and its features are extracted and denoted in the same way; L is the maximum feature layer. In the experiments, we set L to 3.
Step 2: Calculate the weights of the skeleton vector stream network and the skeleton joint stream network. The skeleton vector stream network N_bv(D) and the skeleton joint stream network N_bn(D) are represented as shown in Equations (13) and (14).
N_bv(D) - skeleton vector stream network; N_bn(D) - skeleton joint stream network; P_bv - skeleton vector stream network weight; P_bn - skeleton joint stream network weight.
Step 3: The fusion stream network takes as input the features collected from the basic dual-stream network, and the weights of the fusion stream network are calculated. For the case where L is 3, the fusion stream network is represented as shown in Equation (15).
N_fus() - fusion stream network; P_fus() - fusion stream network weight.
Step 4: Use the weighted average fusion function w(·) to compute the prediction weight of the whole network.
Step 5: The feature data of the three tributaries are fused in the fusion layer by weighted averaging, and finally passed through a Softmax function in the fully connected layer. All the information is fused to output a single feature that represents the whole action.
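Steps 4 and 5 (weighted-average fusion followed by Softmax) can be sketched as below; the per-stream weights are assumed to be given, since Equations (13)-(15) define how they are actually learned.

```python
import numpy as np

def softmax(x):
    """Numerically stable Softmax over a feature vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def fuse(f_bv, f_bn, f_fus, weights):
    """Weighted-average fusion of the three stream features, then Softmax
    to produce class scores. `weights` stands in for the learned
    P_bv, P_bn, P_fus stream weights (assumed precomputed here)."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                        # normalize so fusion is an average
    fused = w[0] * f_bv + w[1] * f_bn + w[2] * f_fus
    return softmax(fused)

scores = fuse(np.array([1., 2, 3]),
              np.array([0., 1, 0]),
              np.array([2., 0, 1]),
              weights=[1, 1, 1])
```

The output is a probability vector over action classes; with equal weights the fusion degenerates to a plain average, while learned weights let one stream dominate when it is more reliable.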
The convolution module's goal is to extract deep features. This paper utilizes an adaptive graph convolutional network; its advantage is that the whole process is a bottleneck structure, which first reduces noise and then extracts highly effective information. Its specific structure is shown in Figure 7. The entire convolutional block can be represented with: f_in - input features; f_out - output features; K_v - kernel size in the space dimension; W_k - 1 × 1 convolution operation; A_k - N × N adjacency matrix whose elements indicate whether a vertex is in a subset of another vertex; δ - weighting parameter; B_k - data-driven matrix.
Throughout the computation, we set the space-dimension kernel size K_v to 3. The adjacency matrix is normalized as Ā_k = ρ^{-1/2} A_k ρ^{-1/2}, where A_k is the N × N adjacency matrix whose elements indicate whether the weak feature skeleton nodes are in the subset of lightweight feature skeleton nodes, and ρ is the normalized diagonal degree matrix with ρ_k^{ii} = Σ_j A_k^{ij} + σ; σ is set to 0.001 to avoid empty rows. W_k denotes a 1 × 1 convolution operation. B_k is a data-driven matrix of shape N × N produced by a non-local block, which goes through the computation of Figure 8 once before participating in a second computation. The value of δ directly determines the impact of B_k on the quadratic convolution. In the experiment, we set δ = 0.3 to obtain high-level valid information, and any parameters and matrix elements that were not otherwise initialized were set to 0.01.
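The role of σ in the ρ definition can be illustrated with a simple degree normalization; this sketch uses row normalization rather than the symmetric form for brevity, and is only meant to show how σ prevents division by zero on empty adjacency rows.

```python
import numpy as np

def normalize_adjacency(A, sigma=0.001):
    """Degree-normalize an N x N adjacency matrix. Adding sigma to each
    row degree keeps rows with no connections from dividing by zero,
    mirroring the sigma term in the rho definition above."""
    deg = A.sum(axis=1) + sigma
    return A / deg[:, None]

# node 2 has no connections: without sigma its row would be 0/0
A = np.array([[0., 1, 1],
              [1., 0, 0],
              [0., 0, 0]])
A_norm = normalize_adjacency(A)
```

Connected rows sum to just under 1 (degree / (degree + σ)), and the isolated row stays all-zero instead of producing NaNs.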

Trajectory Prediction Tracking Model (TPT Model)

Five-Bit Skeleton Screening Method
To lower the whole model's time complexity and reduce the impact of skeleton node occlusion on trajectory prediction, this paper proposes a five-bit screening method. First, the skeleton nodes are divided into five parts, A, B, C, D, and E, whose partitions are shown in Figure 9. Then, the skeleton nodes in each partition are ranked using the lightweight feature skeleton node extraction results in Table 2. After that, the feature skeleton nodes that can represent each partition are selected. Finally, the mass point is found based on the five feature skeleton data and taken as the starting point for trajectory prediction tracking. The probability of occurrence of each node is calculated from the feature skeleton node extraction results in Table 2, and the skeleton nodes in each partition are sorted accordingly. We select the characteristic point in each partition and denote the five points as e_A, e_B, e_C, e_D, e_E. These five points form a pentagon, whose mass point e_x is calculated as in Equation (18).
N v -vertices number; v i -vertices space coordinates; v x -mass point space coordinates.
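As described, Equation (18) reduces the five representative points to their coordinate-wise mean; a short sketch (the helper name is ours):

```python
import numpy as np

def mass_point(vertices):
    """Mass point of the pentagon e_A..e_E (Equation (18)):
    v_x = (1 / N_v) * sum_i v_i, with N_v = 5 vertices here.
    vertices: (N_v, 3) array of 3D skeleton-node coordinates."""
    vertices = np.asarray(vertices, dtype=float)
    return vertices.mean(axis=0)
```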
In most cases, complete skeleton nodes can be detected; however, sometimes some are not. For example, skeleton node 10 is occluded in partition D of Figure 9. To solve this problem, we select the top-ranked skeleton nodes of the partition in turn; according to Table 3, node 9 should then be selected as the representative node of partition D. When an entire partition is occluded, its representative is set to the position detected in the previous frame. Considering that a trajectory is a vector with velocity and direction, this paper calculates the change of direction and velocity of the mass point e_x for each frame. The velocity of the mass point e_x at frame T is represented in Equation (19): V_T = v_x − v_{x−1}, where v_x is the space coordinate of e_x at frame T and v_{x−1} its space coordinate at frame T − 1.
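The fallback rule above (take the highest-ranked visible node in the partition, else reuse the previous frame's position) can be sketched as follows; the helper function and its dictionary layout are hypothetical, not the paper's code:

```python
def partition_representative(ranked_nodes, detected, previous_positions):
    """Pick a partition's representative skeleton node.

    ranked_nodes: node ids sorted by feature rank (Table 3 order,
        e.g. [10, 9, ...] for partition D).
    detected: {node_id: (x, y, z)} for nodes found in the current frame.
    previous_positions: same mapping for the previous frame, used as a
        fallback when the whole partition is occluded.
    """
    for node in ranked_nodes:            # walk down the ranking...
        if node in detected:
            return node, detected[node]  # ...take the best visible node
    # whole partition occluded: reuse the previous frame's top node
    node = ranked_nodes[0]
    return node, previous_positions[node]
```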
Define the space coordinates at frames T − 2, T − 1, and T as v_{x−2}, v_{x−1}, and v_x. The angle of the mass point at frame T is the angle between the vectors from v_{x−2} to v_{x−1} and from v_{x−1} to v_x, as shown in Figure 10. The cosine of this angle is expressed as Equation (20): cos θ = ((v_{x−1} − v_{x−2}) · (v_x − v_{x−1})) / (|v_{x−1} − v_{x−2}| |v_x − v_{x−1}|). In the experiments, the angle for frames 1 and 2 is set to 0.
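Under the assumed forms of Equations (19) and (20), the per-frame speed and direction-change cosine of e_x can be computed as follows (function names are ours):

```python
import numpy as np

def speed(v_t, v_t1):
    """Per-frame speed of the mass point e_x (Equation (19), assumed form):
    the magnitude of the displacement between frames T-1 and T."""
    return float(np.linalg.norm(np.asarray(v_t, float) - np.asarray(v_t1, float)))

def turn_cosine(v_t2, v_t1, v_t):
    """Cosine of the direction-change angle at frame T (Equation (20),
    assumed form): the angle between (v_{x-1} - v_{x-2}) and (v_x - v_{x-1})."""
    u = np.asarray(v_t1, float) - np.asarray(v_t2, float)
    w = np.asarray(v_t, float) - np.asarray(v_t1, float)
    denom = np.linalg.norm(u) * np.linalg.norm(w)
    # degenerate case (no movement, or frames 1-2): treat the angle as 0
    return float(u @ w / denom) if denom else 1.0
```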
In addition, we need to calculate the absolute velocities of e_A, e_B, e_C, e_D, and e_E; their computation is analogous to the e_x velocity calculation. Once all the above quantities are calculated, we track the trajectory of e_x and extract its position and motion characteristics to predict the next trajectory. The inputs thus include the position information and absolute velocities of e_A, e_B, e_C, e_D, and e_E, and the angle change of e_x.

TPT Modeling Architecture
The TPT model is autoregressive: at each time step, it predicts the trajectory of future frames by taking as input the previous recurrent state together with features describing the earlier trajectory. The entire model forecasts the trajectory's state in the upcoming K frames using the current frame's data as input. The TPT network consists of two GRU layers, each containing 1000 hidden units, and a linear activation layer (Linear); Mul denotes the product of two matrices. Regularization and normalization are applied to avoid overfitting and to reduce the generalization error. Finally, after a Sigmoid activation, the predicted mass-point coordinates are generated (Figure 11).
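The autoregressive rollout described above might be sketched as follows; here step_fn is a hypothetical stand-in for the trained two-layer 1000-unit GRU stack and its linear head, which the paper does not list explicitly:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tpt_rollout(step_fn, features, state, k):
    """Autoregressive TPT rollout (sketch): for each of the next k frames,
    feed the previous recurrent state and trajectory features to step_fn,
    squash the raw coordinates with a sigmoid, and feed the prediction
    back in as the next step's input features.

    step_fn(features, state) -> (new_state, raw_coords)."""
    outputs = []
    for _ in range(k):
        state, raw = step_fn(features, state)
        coords = sigmoid(raw)        # normalized mass-point coordinates
        outputs.append(coords)
        features = coords            # autoregressive feedback
    return np.stack(outputs)         # shape (k, coord_dim)
```

With k = 15 this corresponds to the 15-frame short-term prediction unit used in the experiments below.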

Experiment and Results
The experiments were conducted on a Windows 10 system with an Intel(R) Xeon(R) E5-2640 v4 @ 2.40 GHz processor, 32 GB of RAM, and an NVIDIA GeForce RTX 2080Ti graphics card. The code was written in Python 3.7, and all training and testing were conducted in PyCharm.

Datasets
The experiment had three parts. First, we compared the time complexity and accuracy of the LSFE model and the RNN model. Second, to confirm the reasonableness of the L-MSFCF model, we compared it with the All-MSFCF model and 2s-AGCN. Lastly, we compared the TPT model with other models regarding the number of parameters and the final average prediction loss error.
The datasets we chose in this experiment were the Human3.6M dataset [30], the UCF-Crime dataset [31], and the ShanghaiTech Campus dataset [32], as detailed in Section 3.2. Based on these datasets, we organized eight categories: walking, running, stooping, fighting, vicious kicking, climbing over walls, throwing suspicious objects, and slashing devices. Each video is at most 10 s long, in AVI format. The entire dataset contains 3146 videos; we selected 314 videos as the test set and the rest as the training set.

LSFE Model Test
The purpose of lightweight skeleton nodes was to increase action recognition speed, so we first compared time complexity, as shown in Figure 12. Taking 15 frames as a recognition unit, the results confirmed that the LSFE model's time complexity was significantly lower than the RNN model's [34], with an average recognition speed about 86.5% higher. To further validate the effectiveness of the feature skeleton nodes, we compared the recognition accuracy of the LSFE model and the RNN model (Figure 13); the RNN model was more accurate on all eight actions. Table 4 compares the average accuracy and time in detail: although the LSFE model's average accuracy is 4.5% lower than the RNN model's, its average recognition speed is 86.5% higher. Thus, abnormal behavior recognition based on lightweight skeleton nodes has merit.

L-MSFCF Model Test
The L-MSFCF model performed well in accuracy and loss value after training; its loss rate is displayed in Figure 14 and its accuracy rate in Figure 15. From Figure 14, the initial loss of the L-MSFCF model was as high as about 2.2 at the start of training. It decreased rapidly within the first 500 iterations, after which convergence gradually slowed, and the final loss settled at a small value, indicating that the L-MSFCF model learned effectively.
From Figure 15, the accuracy curve of L-MSFCF converged rapidly within the first 1000 iterations, and after 1500 iterations the accuracy stayed basically stable at about 0.84 (Figure 16).
In addition, we assessed the algorithm's performance using a confusion matrix. The vertical coordinate represents the true value, the horizontal coordinate the predicted value, and the diagonal elements the percentages of predictions matching the true value. The confusion matrix shows that, among normal behaviors, running and walking had higher similarity, probably because their lightweight skeleton nodes are more similar during recognition; among abnormal behaviors, fighting and running had high similarity. Overall, normal behavior was recognized more accurately than abnormal behavior, and the overall recognition accuracy was 87.3%, which basically meets expectations.
The L-MSFCF model took lightweight skeleton nodes as inputs; we refer to the model with the same network architecture but full skeleton nodes as input as the All-MSFCF model.
We compared the L-MSFCF, All-MSFCF, 2s-AGCN [33], and LSFE models; Table 5 presents the findings. With lightweight skeleton nodes as input, the L-MSFCF model's recognition accuracy outperformed both the 2s-AGCN and LSFE models by a large margin, although it remained lower than the All-MSFCF model's.
We also compared the time complexity of the L-MSFCF, All-MSFCF, and 2s-AGCN models to further assess the feasibility of the L-MSFCF model (Table 6).
Taking 15 frames as a recognition unit, the L-MSFCF model's recognition speed was clearly higher than those of the All-MSFCF and 2s-AGCN models: more than twice as high as the All-MSFCF model's, and about 62.7% higher than the 2s-AGCN model's.
The L-MSFCF model was much more efficient than the 2s-AGCN model, surpassing it in both recognition speed and accuracy. Even though the L-MSFCF model's accuracy was 5.4% lower than the All-MSFCF model's, its recognition speed was nearly doubled. This shows the merit of the L-MSFCF model.
Finally, we demonstrate the recognition results on several abnormal behaviors. The lines in Figure 17 indicate the skeleton outline of the body in each frame; as can be seen, the method identifies abnormal behavior fairly accurately.

TPT Model Test
We contrasted the TPT model with the PIF [35] and S-GAN-P [36] models to highlight its advantages. Each model was configured with its own training parameters, trained on the same datasets, and used to predict trajectories 15, 30, and 45 frames into the future, with the final average loss errors taken as the evaluation indexes. The experimental results are displayed in Table 7.
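The "final average loss error" over a K-frame horizon is presumably a mean displacement between predicted and ground-truth mass-point positions; the exact definition is not given in the text, so the following metric is an assumption for illustration only:

```python
import numpy as np

def average_loss_error(pred, gt):
    """Mean per-frame Euclidean displacement between a predicted and a
    ground-truth mass-point trajectory (assumed evaluation metric).
    pred, gt: (K, 3) arrays covering K future frames."""
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    return float(np.linalg.norm(pred - gt, axis=1).mean())
```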
The findings demonstrate that the TPT model's predictions for the next 15 and 30 frames outperform those of the other two network models, but the gap between the TPT and S-GAN-P errors gradually narrows over time, and at 45 frames the S-GAN-P error becomes smaller than the TPT error. TPT therefore has an advantage in short-term prediction that diminishes as the horizon grows, so later research can focus on long-term prediction. Among the baselines, S-GAN-P is the smaller model at 46.3k parameters; however, the TPT model has only 17.6k parameters, about one-third as many. Regarding inference speed, the quicker baseline was likewise S-GAN-P, at 0.0968 s per inference step, whereas the TPT model needs 0.0235 s per step, about four times faster. Table 8 shows that the TPT model has a considerable advantage in both parameter count and prediction time: choosing lightweight skeleton nodes and a redesigned convolutional architecture nicely circumvented the problems of substantial data volume and the use of a recurrent architecture. Finally, this paper shows part of the visualization results of trajectory prediction tracking. To make the visualization clear, we process the mass point e_x by keeping the X-axis and Y-axis coordinates unchanged while subtracting half of the body height from the Z-axis coordinate.
First, Figures 18 and 19 provide visualization graphs of trajectory prediction results for some intact skeleton sequences. For complete skeletons, we use nodes 4, 10, 7, 13, and 15, representing the right hand, right foot, left hand, left foot, and head, as the basis points to calculate the mass point. Figure 18 shows the walking posture prediction results, and Figure 19 the fighting posture prediction results.

Discussion
This experiment verifies that lightweight skeleton node processing efficiently increases the timeliness of video action recognition. The LSFE method has a large advantage in time complexity. The L-MSFCF model improves abnormal behavior recognition accuracy by predicting occluded skeletons and using feature fusion. We propose a method that predicts skeleton data from the previous frame, which reduces the noise that would be introduced by predicting the current skeleton data from temporally distant skeleton data.
In this paper, 15 frames were selected as the most appropriate action sequence length. The advantages include a smaller amount of data and less noise, enabling more accurate capture of abnormal video behaviors. Accordingly, the TPT model achieves its highest prediction accuracy at 15 frames, indicating that it is most effective for short-term prediction. The model can predict the trajectory of abnormal behaviors efficiently and quickly, showing significant advantages for real-time applications in the computer vision field.
In terms of video abnormal behavior recognition and tracking, this research has achieved some milestones, but the applicability and reliability of the method in complex scenarios have not yet been fully discussed, nor have the stability of the algorithm across multiple datasets or scenarios and its applicability to different numbers of skeleton nodes. This paper studies only video, without combining audio, sensors, and other data sources for comprehensive analysis. Future research should focus on the algorithm's real-time performance, multimodal fusion, and interpretability.

Conclusions
In this paper, we addressed the problem that a large number of skeleton nodes, as well as behavioral occlusion between individuals, degrades abnormal behavior recognition speed and accuracy. We proposed a lightweight multi-stream feature cross-fusion (L-MSFCF) model. The model adopts lightweight skeleton node computation, which significantly improves recognition speed, and it improves recognition accuracy by predicting occluded skeleton nodes, effectively coping with the behavioral occlusion problem. Experiments show that our model achieves an average accuracy of 87.3% for abnormal behavior recognition. In addition, we proposed the Trajectory Prediction Tracking (TPT) model, which can predict movement positions in real time based on core skeleton nodes, with small short-term average prediction loss errors. In conclusion, our research effectively solves the behavioral occlusion problem while improving recognition speed and accuracy, providing new ideas and methods for the development of the video action recognition field. The proposed models are expected to find practical application in the fields of security and surveillance.

Figure 4. Adaptive sliding window intervals selection calculation example.

Figure 10. Mass point e_x angle schematic at frame T.

Figure 12. Time complexity comparison of RNN model and LSFE model.

Figure 13. Accuracy rate comparison of RNN model and LSFE model.

Figure 20. Walking posture trajectory prediction visualization results with occluded skeleton nodes.

Figure 21. Fighting posture trajectory prediction visualization results with occluded skeleton nodes.

Table 2. Lightweight feature skeleton node extraction results.

Table 3. Skeleton node sequences for each partition.

Table 4. Model prediction accuracy and time comparison.

Table 5. Behavior recognition rate accuracy comparison.

Table 7. The model's average loss errors in each frame prediction.

Table 8. Model prediction times comparison.