1. Introduction
Traffic police gesture recognition technology is key to addressing road congestion and traffic management issues [1]. Especially in autonomous driving systems, accurately interpreting traffic police command gestures is a critical component for enabling coordinated interaction between vehicles and traffic controllers. However, the meaning conveyed by gestures is often influenced by factors such as context, background, and clothing, making gesture recognition a challenging task [2].
In the field of deep learning, traffic police gesture recognition primarily relies on image-based visual methods [3,4]. These methods typically employ convolutional neural networks (CNNs) to extract features from single frames or sequential video frames [5] and combine them with temporal modeling networks (such as LSTMs and 3D-CNNs) to achieve gesture classification [6]. However, traffic police gestures exhibit distinct temporal dynamics, and relying solely on single-frame images or simple temporal modeling makes it difficult to fully capture the continuous changes and temporal dependencies of gestures. Therefore, some studies have begun to introduce skeleton-based gesture recognition methods. Compared with RGB image-based methods, this approach extracts sequences of human keypoints through pose estimation algorithms, mapping complex visual information into structured temporal data and reducing data dimensionality while preserving critical motion semantics [7]. Skeletal keypoint data describe the spatial positions of human joints and their variation over time, so the high-dimensional pixel information of raw images is replaced by a compact structured representation that retains the primary motion semantics of gestural actions [8,9]. Moreover, this method uses joint motion as the basic modeling unit, which offers good interpretability and matches the essential characteristic of traffic police gestures, namely that they are expressed mainly through changes in body movement [10]. As a result, skeleton-based gesture recognition has gradually become an important research direction in traffic police gesture recognition.
However, traffic police gestures consist of continuous motion sequences with prominent temporal dependency and phased features. When characterizing the overall dynamic evolution of gestures, a single temporal model fails to balance local motion variations and long-term temporal dependencies. Furthermore, key motions across different time segments differ in their significance for gesture discrimination. The absence of an effective weight allocation mechanism tends to weaken discriminative information, thus impairing recognition performance.
To address the aforementioned issues, this paper proposes a traffic police command gesture recognition method that integrates lightweight pose estimation and hybrid temporal modeling, with a collaborative design of skeleton representation and temporal modeling. First, YOLOv11m-Pose is employed to estimate poses of traffic police in video sequences, extract human keypoint sequences, and use them as input features for subsequent temporal modeling. Subsequently, a bidirectional long short-term memory network (BiLSTM) is adopted to model key-point sequences, fully exploring the forward and backward temporal dependencies of gesture motion in the time dimension, thereby enhancing the expressive ability to capture continuous motion variations. On this basis, the Transformer architecture is introduced, and the self-attention mechanism is used to model and weight key motion features across different time steps, thereby more effectively capturing long-term temporal dependencies and global discriminative information.
The main contributions of this paper are as follows:
(1) First, a skeletal temporal representation framework for traffic police gesture recognition is proposed, which integrates lightweight pose estimation and hybrid temporal modeling to effectively reduce background interference while preserving discriminative motion information.
(2) Second, a hybrid temporal modeling architecture combining BiLSTM and Transformer is designed: BiLSTM is responsible for capturing local temporal continuity and motion transition features, while the Transformer models long-range temporal dependencies via the self-attention mechanism, thus achieving multi-scale temporal feature representation.
(3) Extensive experiments on the traffic police gesture dataset demonstrate that the proposed method achieves excellent recognition performance while maintaining low computational complexity, verifying the effectiveness and practical value of the hybrid temporal modeling framework in real-world traffic gesture recognition applications.
The remainder of this manuscript is organized as follows. Section 2 summarizes existing methods for human skeletal keypoint detection and gesture recognition. Section 3 presents the proposed method and elaborates on its underlying principles. Section 4 provides a comprehensive evaluation of the proposed method and compares it with state-of-the-art methods. Section 5 discusses the experimental results, analyzes the advantages and limitations of the proposed method, and explores potential avenues for improvement. Finally, Section 6 summarizes the research findings and draws the corresponding conclusions.
2. Related Work
In recent years, deep learning-based human pose estimation models have evolved continuously. The OpenPose model enables real-time, reliable keypoint localization for multiple people and supports downstream motion and pose analysis tasks [11]. The HRNet model maintains high-resolution representations throughout the entire network, thereby achieving highly accurate keypoint localization [12]. The YOLO-Pose series of models performs end-to-end, real-time inference for object detection and keypoint localization, making it particularly suitable for real-time gesture recognition and deployment in traffic scenarios [13,14]. AlphaPose is a top-down multi-person pose estimation method that first detects each person and then estimates the keypoints within each detected region [15]. The MediaPipe model first roughly localizes the human body, then performs fine-grained regression of keypoints, and reduces redundant computation through temporal tracking [16]. These models achieve favorable keypoint localization accuracy and computational efficiency, thereby providing a reliable data foundation for keypoint-based action and gesture recognition.
Recurrent neural networks (RNNs), long short-term memory networks (LSTMs), gated recurrent units (GRUs), and their variants have been widely applied to gesture recognition and human action analysis tasks [17]. Verma et al. [18] proposed the GoogleNet+BGRU model, achieving a recognition accuracy of over 98% on most datasets. Xu et al. [19] proposed the CFF-RCNN model, which exhibits higher accuracy and faster convergence. Noorkholis et al. [20] proposed the 3DCNN+LSTM+FSM model, which maintains a recognition accuracy of 97.8%. Singh et al. [21] demonstrated that, by simultaneously modeling the forward and backward temporal information of sequences, BiLSTMs can more comprehensively capture the contextual dependencies of motions in the time dimension. However, because of their recursive computation, models based on recurrent structures often struggle to fully capture long-term dependencies between key motion segments when processing long sequences or complex motion structures, and their global temporal modeling capability remains limited.
To address the aforementioned limitations, researchers have found that Transformers can perform global modeling on features at different time steps at the sequence level. Through the self-attention mechanism, they explicitly model the correlations across time steps, thereby effectively capturing long-term temporal dependencies [22]. The CT-HGR framework proposed by Mansooreh et al. [23] is based on the Vision Transformer’s attention mechanism, which effectively models long-term dependencies among key motion segments. The STFTnet architecture proposed by Tang et al. [24] fuses convolutional neural networks with Transformers; it captures global temporal dependencies via multi-head self-attention, avoids the constraints of recursive computation, and thereby addresses the difficulty that recurrent models have with long-term dependencies in long-sequence and complex dynamic gesture recognition. Although Transformer-based methods offer significant advantages in long-term temporal action modeling and fine-grained action differentiation, they still have inherent limitations in characterizing local temporal dynamics when used alone. To address this, Gazis et al. [25] proposed a combined architecture of 3DCNN and Transformer, in which the 3DCNN extracts short-term spatiotemporal features and the Transformer captures long-term dependencies. Liu et al. [26] designed a CNN-Transformer hybrid architecture that extracts local features with ResNeXt50, models global relationships via the CAT branch, and further refines the feature representation by integrating AFB and MFA blocks. Wang et al. [27] proposed a TD-CNN-Transformer hybrid architecture that extracts local features from data cube sequences using TD-CNN and combines positional encoding with a Transformer to model global dependencies. Guo et al. [28] further proposed the MG-GCT architecture, which extracts local spatiotemporal features via a motion-guided two-stream GCN and uses a temporal Transformer to model global dependencies, improving the characterization of local temporal dynamics.
In summary, effectively integrating the advantages of recurrent neural networks for local temporal modeling with the global dependency-capture capability of Transformers to achieve a balance between local and global temporal features has become an important research direction in current gesture recognition [29,30]. Based on the aforementioned research background, this paper further explores an efficient modeling framework for traffic police command gesture recognition, focusing on the collaborative design of skeletal representations and hybrid temporal modeling.
3. Method
The traffic police gesture recognition method consists of three stages. In the first stage, the YOLOv11m-Pose algorithm extracts skeleton data for traffic police command gestures in the video sequence. In the second stage, data collection errors are eliminated through feature–label frame alignment, and temporal sequence features are constructed using sliding windows, converting single-frame static posture features into temporal features that capture the dynamic evolution of gestures. Finally, the data obtained in the second stage is used as the input of the BiLSTM-Transformer fusion model for action recognition.
This paper proposes a traffic police gesture recognition method based on skeletal keypoints and hybrid temporal modeling. YOLOv11m-Pose is employed for human keypoint extraction, rather than relying on appearance-dependent RGB features. Since skeletal features primarily capture human structural motion, they are less sensitive to illumination variations, background interference, and environmental changes, thereby providing stronger cross-scene robustness and generalization capability.
3.1. Skeleton Key Point Extraction
The YOLOv11m-Pose algorithm uses an end-to-end detection approach that jointly performs target detection and keypoint prediction. The network structure is shown in Figure 1. To extract the skeleton keypoints of traffic police gestures, the video is first parsed into a continuous stream of frame images and fed into the pretrained Backbone network for feature extraction, after which the Neck layer fuses multi-scale positional information. Finally, the Head layer performs classification and bounding-box regression and outputs the detection box together with the coordinates of the skeleton keypoints.
The YOLOv11m-Pose model outputs 17 COCO standard human keypoints. For frame t, the keypoint set is denoted as follows:
$$P_t = \{(x_t^k, y_t^k, c_t^k)\}_{k=1}^{17} \tag{1}$$
where $(x_t^k, y_t^k)$ represents the two-dimensional coordinates of the k-th keypoint, and $c_t^k$ denotes the confidence score.
In real-world traffic scenarios, issues such as occlusion, motion blur, and detection jitter frequently occur. To ensure feature quality, a keypoint confidence threshold $\tau$ was adopted: when the confidence score of a keypoint fell below $\tau$, that keypoint was regarded as missing. Directly filling missing keypoints with zeros is ambiguous, because a missing keypoint then cannot be distinguished from a valid keypoint located near the coordinate origin. To avoid this ambiguity, a visibility-mask-based temporal interpolation strategy was employed to handle missing keypoints.
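First, a visibility mask was generated for each frame of the skeleton sequence:
$$m_t^k = \begin{cases} 1, & c_t^k \ge \tau \\ 0, & c_t^k < \tau \end{cases} \tag{2}$$
where $c_t^k$ is the confidence score of the k-th keypoint in frame t and $\tau$ is the confidence threshold; $m_t^k = 1$ indicates that the keypoint is valid, while $m_t^k = 0$ indicates that the keypoint is missing. Subsequently, linear temporal interpolation is applied to restore the coordinates of missing keypoints, thereby ensuring the temporal continuity of the motion trajectory.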
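All keypoints within a single frame were concatenated in a fixed order to form a one-dimensional feature vector:
$$p_t = [x_t^1, y_t^1, x_t^2, y_t^2, \dots, x_t^{17}, y_t^{17}] \tag{3}$$
where $(x_t^k, y_t^k)$ are the coordinates of the k-th keypoint in frame t. Meanwhile, the visibility mask vector was defined as follows:
$$m_t = [m_t^1, m_t^2, \dots, m_t^{17}] \tag{4}$$
where $m_t^k$ is the visibility flag of the k-th keypoint. The mask vector was concatenated with the skeleton coordinate features to obtain the final input representation:
$$f_t = [p_t, m_t] \tag{5}$$
Finally, the frame-wise feature matrix $X = [f_1, f_2, \dots, f_N]^{\top}$ was constructed, where N denotes the total number of video frames.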
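The masking, interpolation, and concatenation steps above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions rather than the implementation used in this paper: the function names are hypothetical, the threshold is passed in as a parameter because its value is not fixed here, and filling gaps at the start or end of a sequence with the nearest valid value is an assumption, since only linear interpolation between valid frames is described.

```python
def visibility_mask(confidences, threshold):
    """Return 1 for keypoints whose confidence reaches the threshold, else 0."""
    return [1 if c >= threshold else 0 for c in confidences]


def interpolate_track(values, visible):
    """Linearly interpolate one coordinate track over frames marked as missing.

    values  : one coordinate (x or y) of a single keypoint, one entry per frame
    visible : 0/1 flags per frame for that keypoint
    Edge gaps are filled with the nearest valid value (an assumption).
    """
    n = len(values)
    valid = [t for t in range(n) if visible[t]]
    if not valid:
        return list(values)  # nothing to interpolate from
    out = list(values)
    for t in range(n):
        if visible[t]:
            continue
        prev = max((v for v in valid if v < t), default=None)
        nxt = min((v for v in valid if v > t), default=None)
        if prev is None:
            out[t] = values[nxt]
        elif nxt is None:
            out[t] = values[prev]
        else:
            w = (t - prev) / (nxt - prev)
            out[t] = values[prev] + w * (values[nxt] - values[prev])
    return out


def frame_feature(xy, mask):
    """Flatten (x, y) pairs in a fixed order and append the visibility mask."""
    flat = [v for point in xy for v in point]
    return flat + [float(m) for m in mask]
```

For example, a wrist x-coordinate track `[100.0, 0.0, 0.0, 0.0, 140.0]` whose three middle frames are flagged as missing is restored to evenly spaced values between 100 and 140, while the mask still records that those frames were originally missing.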
3.2. Data Processing
Gesture labels are provided in a frame-level annotation format. Because the number of exported feature frames may not match the number of labels, this paper adopts a length alignment strategy to ensure a one-to-one correspondence between features and labels. Let the number of feature frames be $N_f$ and the number of labels be $N_l$. The alignment length is taken as $N = \min(N_f, N_l)$, and both sequences are truncated to this length to obtain the aligned feature and label sequences $X_{1:N}$ and $Y_{1:N}$.
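The length alignment described above amounts to truncating both sequences to their common length. A minimal sketch is given below; the function name is hypothetical, and dropping the surplus entries from the end of the longer sequence is an assumption, since only the use of the minimum length is stated.

```python
def align_features_labels(features, labels):
    """Truncate features and labels to their common length (assumed: from the end)."""
    n = min(len(features), len(labels))
    return features[:n], labels[:n]


# Usage: five feature frames but only four labels were exported.
feats, labs = align_features_labels(["f0", "f1", "f2", "f3", "f4"], [0, 0, 1, 1])
```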
To handle potential null values and missing keypoints in the feature files, a time-series-based linear interpolation method is used for completion, and a visibility mask records each keypoint’s validity.
To adapt to the BiLSTM-Transformer network’s input format $(B, T, F)$ (batch size, sequence length, and feature dimension), the frame-wise features are further converted into fixed-length sequence samples. A sliding window strategy is adopted, and for a given sequence length $T$, the i-th sequence sample is denoted as follows:
$$S_i = [f_i, f_{i+1}, \dots, f_{i+T-1}] \tag{6}$$
where $f_t$ is the feature vector of frame t. The corresponding labels adopt the sequence-to-one supervision strategy, taking the label of the last frame in the sequence:
$$y_i = l_{i+T-1} \tag{7}$$
where $l_t$ is the label of frame t. With this strategy, the frame-level data of length N can generate approximately $N - T + 1$ sequence samples. This sequence construction method enables modeling local continuous motion variations in traffic police gestures while preserving sample size, thereby providing a stable input for subsequent temporal network learning. Before temporal modeling, the reference center joint coordinates are subtracted from all skeleton joint coordinates in each frame, converting the extracted 2D skeleton coordinates into relative coordinates:
$$\tilde{J}_{t,i} = J_{t,i} - J_{t,c} \tag{8}$$
where $J_{t,i}$ denotes the coordinate of the i-th skeleton joint in frame t, and $J_{t,c}$ represents the coordinate of the reference center joint. $\tilde{J}_{t,i}$ is the normalized relative coordinate.
This preprocessing preserves the inherent relative motion relationships between skeleton joints while reducing the impact of global positional variations.
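The sliding-window construction and the center-relative normalization can be sketched as follows. The function names are hypothetical, a stride of one frame is assumed (consistent with a sample count of roughly N minus the sequence length plus one), and the index of the reference center joint is left as a parameter because it is not specified in the text.

```python
def make_windows(features, labels, seq_len):
    """Build fixed-length windows; each window takes the label of its last frame."""
    samples, targets = [], []
    for start in range(len(features) - seq_len + 1):
        samples.append(features[start:start + seq_len])
        targets.append(labels[start + seq_len - 1])
    return samples, targets


def to_relative(frame_xy, center_index):
    """Subtract the reference joint's coordinates from every joint in one frame."""
    cx, cy = frame_xy[center_index]
    return [(x - cx, y - cy) for x, y in frame_xy]


# Usage: 10 frames with a window length of 4 yield 10 - 4 + 1 = 7 samples.
frames = list(range(10))
frame_labels = [10 * i for i in range(10)]
samples, targets = make_windows(frames, frame_labels, 4)
```

Because only differences between joints survive the subtraction, translating the whole skeleton within the image leaves the relative coordinates unchanged.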
3.3. Design of BiLSTM-Transformer Fusion Model
Traffic police gestures exhibit distinct temporal dynamics, including variations in motion duration, transition speed, temporal continuity, and long-range inter-frame dependencies. Therefore, a single temporal modeling mechanism is insufficient to comprehensively characterize diverse gesture dynamics. To address this issue, the proposed framework adopts a hierarchical hybrid temporal modeling strategy that integrates BiLSTM, Transformer, and attention pooling mechanisms to learn local and global temporal dependencies collaboratively.
3.3.1. Embedding Layer
First, the input pose feature $x_t$ is mapped from 34 to 128 dimensions via a linear embedding layer, and Layer Normalization is introduced to alleviate the vanishing gradient problem during training. It also helps suppress noise arising from pose estimation errors, thereby enhancing training stability.
$$e_t = \mathrm{LayerNorm}(W_e x_t + b_e) \tag{9}$$
where $e_t$ is the high-dimensional embedding feature of frame t, $W_e$ is the weight matrix of the embedding layer, $x_t$ is the original pose feature of frame t, and $b_e$ is the bias vector of the embedding layer.
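As an illustration of the embedding step, the sketch below applies a linear map followed by layer normalization to one frame. It is a simplified stand-in, not the trained layer: the dimensions are tiny for readability, the function names are hypothetical, and the learnable scale and shift that layer normalization usually carries are omitted (an assumption).

```python
import math


def linear(x, weight, bias):
    """Compute weight @ x + bias for a single feature vector."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def layer_norm(values, eps=1e-5):
    """Normalize a vector to zero mean and (approximately) unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]


def embed(x, weight, bias):
    """Linear embedding followed by layer normalization."""
    return layer_norm(linear(x, weight, bias))


# Usage: map a 2-dimensional toy feature to 3 dimensions.
e = embed([1.0, 2.0], [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [0.0, 0.0, 0.0])
```

Normalizing each frame's embedding keeps the scale of the features fed to the recurrent layer comparable across frames, regardless of how large the raw coordinates are.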
3.3.2. BiLSTM
To capture local temporal dependencies in traffic police gestures, this paper employs a BiLSTM to model embedded feature sequences. The LSTM model can effectively capture temporal dependencies in sequence data through the gating mechanisms of the forget gate, input gate, and output gate. The LSTM cell architecture is shown in Figure 2, and the gating steps are as follows.
(1) The forget gate determines which historical information to discard based on the current input and the previous hidden state. For instance, when a gesture transitions from Stop to Go Straight, the forget gate can discard the previous memory of wrist positions. The specific calculation is given by Equation (10), which outputs a value between 0 and 1 to control the proportion of historical information that is retained, with 0 indicating complete forgetting and 1 indicating full retention.
$$f_m = \sigma(W_f [H_{m-1}, x_m] + b_f) \tag{10}$$
where $x_m$ denotes the input data at time m, $H_{m-1}$ is the output state value of the hidden layer in the LSTM unit at time $m-1$, $W_f$ represents the corresponding weight, $b_f$ is the corresponding bias parameter, $f_m$ is the state value of the forget gate at time m, and $\sigma$ denotes the Sigmoid function.
(2) The LSTM integrates new information into the cell state via the input gate, with the specific calculations given by Equations (11) and (12).
$$i_m = \sigma(W_i [H_{m-1}, x_m] + b_i) \tag{11}$$
Candidate values:
$$\tilde{c}_m = \tanh(W_c [H_{m-1}, x_m] + b_c) \tag{12}$$
where $i_m$ is the state value of the input gate at time m, and $\tilde{c}_m$ is the candidate value of the memory unit at time m.
(3) The cell state is updated by combining the retained history with the new candidate information, as given by Equation (13).
$$C_m = f_m \odot C_{m-1} + i_m \odot \tilde{c}_m \tag{13}$$
All information transmission is centered on the cell state, and the updated cell state directly determines the output.
(4) Based on the updated cell state, the output at the current time step is controlled via the output gate, with the specific calculations given by Equations (14) and (15).
$$o_m = \sigma(W_o [H_{m-1}, x_m] + b_o) \tag{14}$$
$$H_m = o_m \odot \tanh(C_m) \tag{15}$$
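The gating steps above can be traced with a single LSTM time step written in plain Python. This is a didactic sketch of the standard LSTM cell, not the trained network: the parameter dictionary and its key names are hypothetical, and vectors are plain lists.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(weight, bias, vec):
    """Compute weight @ vec + bias."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weight, bias)]


def lstm_step(x, h_prev, c_prev, p):
    """One LSTM step: returns the new hidden state and the new cell state."""
    z = h_prev + x  # concatenation of previous hidden state and current input
    f = [sigmoid(a) for a in affine(p["W_f"], p["b_f"], z)]    # forget gate
    i = [sigmoid(a) for a in affine(p["W_i"], p["b_i"], z)]    # input gate
    g = [math.tanh(a) for a in affine(p["W_c"], p["b_c"], z)]  # candidate values
    o = [sigmoid(a) for a in affine(p["W_o"], p["b_o"], z)]    # output gate
    c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c_prev, i, g)]
    h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
    return h, c


# Usage: hidden size 1, input size 1, all parameters zero.
zero = {
    "W_f": [[0.0, 0.0]], "b_f": [0.0],
    "W_i": [[0.0, 0.0]], "b_i": [0.0],
    "W_c": [[0.0, 0.0]], "b_c": [0.0],
    "W_o": [[0.0, 0.0]], "b_o": [0.0],
}
h, c = lstm_step([1.0], [0.0], [2.0], zero)
```

With all parameters at zero, every gate evaluates to 0.5 and the candidate to 0, so half of the previous cell state is kept. Raising the forget-gate bias pushes the gate toward 1, and the previous cell state is then carried over almost unchanged.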
BiLSTM is a variant of LSTM that runs two LSTMs over the same time series, one processing the sequence from the first element to the last and the other processing it from the last element to the first. The hidden states of the two directions are concatenated to form the final bidirectional hidden state. The operation process of the BiLSTM model is shown in Figure 3.
Specifically, the BiLSTM module focuses on modeling local temporal continuity and short-term motion-transition characteristics among adjacent frames, enabling effective representation of gesture initiation, transition, and termination dynamics.
The high-dimensional embedding features of each frame are input into the bidirectional LSTM to obtain the forward and backward hidden states at frame t, which capture the temporal dependencies from the past to the present and from the future to the present, respectively, as shown in Equations (16) and (17).
$$\overrightarrow{h}_t = \overrightarrow{\mathrm{LSTM}}(e_t, \overrightarrow{h}_{t-1}) \tag{16}$$
$$\overleftarrow{h}_t = \overleftarrow{\mathrm{LSTM}}(e_t, \overleftarrow{h}_{t+1}) \tag{17}$$
Finally, the forward and backward hidden states are concatenated, as shown in Equation (18).
$$h_t = [\overrightarrow{h}_t; \overleftarrow{h}_t] \tag{18}$$
Here $e_t$ denotes the embedding feature of frame t. The bidirectional structure enables the model to use historical and future context simultaneously, helping alleviate recognition difficulties caused by the blurring of gesture start and end boundaries.
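The bidirectional pass can be illustrated independently of the cell type. In the sketch below, a toy recurrent step (a running sum) stands in for the LSTM cell so that the direction of information flow is easy to read; the function name and the toy step are assumptions made for illustration only.

```python
def run_bidirectional(sequence, step_fwd, step_bwd, init):
    """Run one recurrence left-to-right and another right-to-left, then pair the states."""
    forward, state = [], init
    for x in sequence:
        state = step_fwd(x, state)
        forward.append(state)
    backward, state = [], init
    for x in reversed(sequence):
        state = step_bwd(x, state)
        backward.append(state)
    backward.reverse()  # re-align backward states with frame order
    return list(zip(forward, backward))


def running_sum(x, state):
    """Toy recurrent step standing in for an LSTM cell."""
    return state + x


states = run_bidirectional([1, 2, 3], running_sum, running_sum, 0)
```

At the middle frame the forward state summarizes frames 1 to 2 and the backward state summarizes frames 2 to 3, which is why each frame's concatenated state reflects both past and future context.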
3.3.3. Transformer Encoder
Although BiLSTM performs well in short-term modeling, its ability to capture long-term dependencies is limited. To address this, this paper introduces a Transformer encoder on top of the BiLSTM output to enhance the modeling capability of global temporal relationships. The Transformer model architecture is shown in Figure 4.
The Transformer encoder primarily consists of a multi-head self-attention mechanism and a feedforward network. The attention computation is as follows:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V \tag{19}$$
Here, $Q$ is the query matrix, representing the feature vectors at the current time step, used to query relevant temporal features. $K$ is the key matrix, comprising the feature vectors across all time steps, used to match with matrix $Q$. $V$ is the value matrix, containing the feature vectors across all time steps, serving as the attention-weighted target. $QK^{\top}$ denotes the product of $Q$ and the transpose of $K$, calculating the similarity between each query and each key; higher scores indicate stronger relevance. $\sqrt{d_k}$, where $d_k$ is the key dimension, is the scaling factor that prevents vanishing gradients in the softmax layer caused by excessively large dot products at high dimensionality. The softmax($\cdot$) normalization function converts similarity scores into attention weights ranging from 0 to 1.
Unlike conventional recurrent modeling approaches that assume relatively uniform propagation of temporal dependencies, the Transformer encoder dynamically establishes long-range temporal associations through the self-attention mechanism. Different temporal frames are assigned adaptive correlation weights based on contextual relevance, enabling the network to capture distinct patterns of temporal dependencies across gesture actions.
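The scaled dot-product attention underlying the self-attention mechanism can be sketched in plain Python for a single head. This is a didactic version with matrices as lists of rows; the function names are hypothetical, and the learned projections that produce the query, key, and value matrices, as well as multiple heads, are omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(Q, K, V):
    """Return the attended outputs and the attention weights for each query row."""
    d_k = len(K[0])
    outputs, weights = [], []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        weights.append(w)
        outputs.append([sum(wi * v[j] for wi, v in zip(w, V)) for j in range(len(V[0]))])
    return outputs, weights


# Usage: one query attending over two time steps with identical keys.
out, attn = scaled_dot_product_attention(
    [[1.0, 0.0]], [[1.0, 0.0], [1.0, 0.0]], [[2.0, 0.0], [4.0, 2.0]]
)
```

When the keys are identical, the query cannot prefer either time step, so the weights are uniform and the output is the mean of the values; a key that matches the query more closely receives a larger weight.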
3.3.4. Attention Pooling Mechanism
To compress temporal features into a fixed-length global representation, this paper employs an attention pooling mechanism. The calculation method for its attention weights is shown in Equation (20):
$$\alpha_t = \frac{\exp\left(v^{\top}\tanh(W_a h_t)\right)}{\sum_{j=1}^{T}\exp\left(v^{\top}\tanh(W_a h_j)\right)} \tag{20}$$
where $\alpha_t$ is the attention weight for frame t; higher weights indicate greater importance of that frame for classification. The hyperbolic tangent activation function tanh($\cdot$) maps features to the interval [−1, 1], enhancing nonlinear representation. $W_a$ denotes the transformation matrix for attention weights, $h_t$ represents the bidirectional LSTM hidden state at frame t, $v$ is the weight vector for attention scores used to compute scalar scores, and $T$ is the number of frames in the sequence.
Equation (20) essentially adopts a Softmax-based normalization mechanism for attention-score computation. This normalization constrains all attention weights to the interval [0, 1] while ensuring that the temporal attention coefficients sum to 1. Therefore, the attention pooling mechanism forms a normalized probability distribution over temporal frames, effectively preventing uncontrolled weight amplification and imbalanced allocation of keyframe feature weights.
The final context vector is represented as follows:
$$c = \sum_{t=1}^{T} \alpha_t h_t \tag{21}$$
where $c$ represents the global context feature vector, which integrates the key features of the entire time series, $\alpha_t$ is the attention weight of frame t, and $h_t$ is the feature of frame t.
Finally, the context features obtained through attention pooling are fed into the classification head for gesture category prediction. The classification head consists of a fully connected layer, a ReLU activation function, and Dropout, and it outputs the final category probabilities via Softmax:
$$\hat{y} = \mathrm{softmax}(W_c c + b_c) \tag{22}$$
where $\hat{y}$ denotes the probability vector for the predicted class, $W_c$ represents the weight matrix of the classification head with dimensions [N × D], and $b_c$ indicates the bias vector of the classification head.
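Attention pooling can be sketched as follows: each frame feature receives a scalar score, the scores are normalized with softmax, and the context vector is the weighted sum of the frame features. The function and parameter names are hypothetical, and no bias term is used in the scoring function, which is an assumption.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(H, W_a, v):
    """Pool frame features H (one row per frame) into one context vector."""
    scores = []
    for h in H:
        hidden = [math.tanh(sum(w * x for w, x in zip(row, h))) for row in W_a]
        scores.append(sum(vi * u for vi, u in zip(v, hidden)))
    alphas = softmax(scores)
    context = [sum(a * h[j] for a, h in zip(alphas, H)) for j in range(len(H[0]))]
    return context, alphas


# Usage: three frames with 2-dimensional features and a 1-dimensional scoring space.
context, alphas = attention_pool(
    [[1.0, 0.0], [0.0, 1.0], [3.0, 0.0]], [[1.0, 0.0]], [2.0]
)
```

In this toy case the third frame has the largest response along the scored direction and therefore receives the largest weight, while the weights still sum to 1.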
3.4. BiLSTM-Transformer Model
Let the input sequence be $X \in \mathbb{R}^{B \times T \times F}$, where B is the batch size, T is the sequence length, and F is the feature dimension of a single frame. First, $X$ is fed into the BiLSTM layer to obtain the bidirectional hidden states $H$:
$$H = \mathrm{BiLSTM}(X) \in \mathbb{R}^{B \times T \times 2d_h} \tag{23}$$
In this equation, $d_h$ denotes the hidden state dimension of the LSTM. The bidirectional long short-term memory (BiLSTM) network captures local temporal dependencies and outputs a contextual feature representation for each frame.
Subsequently, $H$ is fed into the Transformer encoder to obtain the feature representation $Z$, which integrates global contextual information:
$$Z = \mathrm{TransformerEncoder}(H) \tag{24}$$
The Transformer models long-range global dependencies via the multi-head self-attention mechanism, effectively capturing contextual correlations across long temporal sequences. Finally, a temporal attention mechanism is introduced to generate the weighted context vector $c$:
$$\alpha_t = \frac{\exp\left(w^{\top}\tanh(W z_t)\right)}{\sum_{j=1}^{T}\exp\left(w^{\top}\tanh(W z_j)\right)} \tag{25}$$
$$c = \sum_{t=1}^{T} \alpha_t \odot z_t \tag{26}$$
where $W$ and $w$ are learnable network parameters, ⊙ denotes element-wise multiplication, $z_t$ is the Transformer output at frame t, and $\alpha_t$ represents the attention weight corresponding to the t-th frame. The weighted context vector $c$ is then fed into the classifier:
$$\hat{y} = \mathrm{softmax}(W_c c + b_c) \tag{27}$$
The BiLSTM extracts local temporal dynamic features, while the Transformer enhances global dependency modeling. Their fusion enables the model to learn discriminative temporal representations for traffic police gesture recognition. The BiLSTM-Transformer model framework is shown in Figure 5.
4. Experiments and Results
Figure 6 shows the eight types of Chinese traffic police gestures: Stop, Go straight, Wait for the left turn, Turn Left, Turn Right, Change lane, Slow down, and Pull over. The public traffic police gesture dataset [31] additionally includes a “Prepare” action, indicating that the traffic police officer is not performing any command gesture. The experimental dataset contains 20 videos with a resolution of 1024 × 768. The dataset partition is presented in Table 1, where the training and testing sets correspond to different but paired traffic scenarios. The dataset provides only frame-level ground-truth gesture labels for each video frame; no skeleton annotations are available for the gesture sequences.
Therefore, the YOLOv11m-Pose algorithm is used to extract skeleton data for each video frame, outputting 17 keypoints: the nose, left eye, right eye, left ear, right ear, left shoulder, right shoulder, left elbow, right elbow, left wrist, right wrist, left hip, right hip, left knee, right knee, left ankle, and right ankle. Taking the Stop category as an example, its skeleton sequence diagram is shown in Figure 7. The extracted skeleton sequences were then temporally aligned with the corresponding gesture category labels to construct temporally consistent input data for the gesture recognition model.
The experiments were conducted on the Windows 11 operating system with the following computer configuration: CPU: 13th Gen Intel® Core™ i7-13650HX @ 2.60 GHz, RAM: 16.0 GB, and GPU: NVIDIA GeForce RTX 4060 Laptop GPU. The deep learning platform was built on the PyTorch 2.0.1 framework, with program code written in Python 3.9. Computational acceleration was implemented using NVIDIA CUDA and NVIDIA cuDNN.
The Transformer encoder comprises four layers with eight attention heads. The embedding and hidden-state dimensions of the BiLSTM are both set to 64. The model is trained with the AdamW optimiser from PyTorch 2.0.1, an initial learning rate of , a batch size of 64, and 350 training epochs. The dropout rates for the Transformer encoder and the classifier are set to 0.1 and 0.3, respectively.
4.1. Evaluation Indicators
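The evaluation metrics primarily include accuracy, F1-score, number of parameters, and inference time. Accuracy and F1-score are calculated as shown in Equations (28) and (29).
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{28}$$
$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{29}$$
where $\mathrm{Precision} = \frac{TP}{TP + FP}$, and $\mathrm{Recall} = \frac{TP}{TP + FN}$. $TP$ denotes true positive samples (correctly detected), $TN$ denotes true negative samples, $FP$ denotes false positive samples (falsely detected), and $FN$ denotes false negative samples (missed detections).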
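The two metrics can be computed from frame-level predictions as sketched below. For a multi-class task, accuracy reduces to the fraction of correctly classified samples, and the F1-score is computed per class and then averaged. Macro averaging is an assumption here, since the averaging scheme is not stated, and the function names are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted class equals the true class."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def macro_f1(y_true, y_pred):
    """Per-class F1 from TP/FP/FN counts, averaged over classes (assumed macro)."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Usage: four samples, two classes, one misclassification.
acc = accuracy([0, 0, 1, 1], [0, 1, 1, 1])
f1 = macro_f1([0, 0, 1, 1], [0, 1, 1, 1])
```

In this example, class 0 has precision 1 and recall 0.5, while class 1 has precision 2/3 and recall 1, so the two per-class F1 values differ even though a single error was made.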
4.2. Selection of Skeletal Keypoint Detection Models
To assess how different skeleton extraction methods affect traffic-police gesture recognition performance, this study employed the BiLSTM + Transformer model with a temporal sequence length of 40 and an LSTM hidden-state dimension of 64. A total of nine mainstream keypoint detection models, including the YOLO series (YOLOv11m/n/l/x and YOLOv8n), OpenPose, AlphaPose, HRNet, and MediaPipe, were evaluated. The experimental results are presented in Table 2.
As shown in Table 2, significant performance differences exist among skeleton keypoint detection models on the traffic police hand gesture dataset. YOLOv11m-Pose achieved the best results across all models, with Accuracy and F1-Score both at 98.91%. Compared with OpenPose, AlphaPose, MediaPipe, and HRNet models, its Accuracy improved by 1.53%, 4.92%, 6.44%, and 3.72%, respectively, while its F1-Score increased by 1.56%, 5.05%, 6.51%, and 3.70%.
Among the YOLO-based methods, YOLOv11m-Pose outperforms YOLOv11n-Pose, YOLOv11l-Pose, YOLOv11x-Pose, and YOLOv8n-Pose, yielding accuracy gains of 0.42%, 0.72%, 13.67%, and 8.13%, respectively. The above results demonstrate that YOLOv11m-Pose possesses significantly enhanced keypoint localization accuracy and stability when handling complex dynamic gesture scenarios. In summary, the choice of skeleton extraction method directly impacts the performance of downstream temporal gesture recognition. Among the evaluated methods, YOLOv11m-Pose provides the most favorable trade-off between recognition accuracy, robustness, and computational efficiency.
For intuitive comparison, the skeletal keypoint detection results of all models are presented in Figure 8, which illustrates the keypoint detection quality in actual scenes.
Traffic police gestures are primarily conveyed through upper-body joint movements, particularly the coordinated motions of the shoulders, elbows, and wrists. The 17 standard COCO keypoints extracted by YOLOv11m-Pose are sufficient to capture the essential motion features for gesture discrimination while maintaining computational efficiency and enabling real-time detection.
4.3. Parameter Evaluation
The network architecture and input data structure of the model fundamentally influence the effectiveness of traffic police command action recognition. This section evaluates the impact of the temporal step size of the input sequence and the number of hidden states on the LSTM network’s accuracy.
The time step length serves as the “time window” through which the model perceives continuous actions in sequential tasks, directly determining both the quality of temporal information capture and computational efficiency. If the step length is too short, the model cannot capture the entire gesture, resulting in fragmented features and a significant drop in classification accuracy. If the step length is too long, it introduces redundant static frames after the gesture completes. This not only increases the computational load on the LSTM and Transformer but also causes the model to focus on noise rather than core actions, leading to overfitting. Furthermore, the effective action duration varies across gestures. A fixed step length forces zero-padding for short gestures and truncation for long ones, compromising the integrity of temporal features and ultimately reducing the model’s adaptability to diverse gestures.
The number of LSTM hidden states serves as the “container” for storing temporal features within the model, balancing feature representation capability with generalization performance. If the number of hidden states is too low, the model fails to capture fine-grained gesture features, leading to underfitting. Conversely, too many hidden states allow the model to fit more details but increase its susceptibility to noise in the training data, leading to overfitting.
Therefore, to evaluate the impact of input sequence time steps and LSTM hidden state counts on the model, testing was conducted using an LSTM model. The time step values ranged within [10, 15, 20, 25, 30, 35, 40, 45, 50]; hidden state counts ranged within [4, 8, 16, 32, 64, 128]. For action sequences with fewer than 10 frames, the estimated data were padded with all zeros at the end. For sequences exceeding 50 frames, a shortened frame sequence was generated by selecting random, non-repeating frames and arranging them in their original order [32].
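The padding and subsampling rule described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `fit_to_length` is hypothetical, and it assumes the same rule is applied for whichever target length is being tested.

```python
import random

def fit_to_length(frames, target_len, rng=None):
    """Return exactly target_len frames, zero-padding or subsampling as needed."""
    rng = rng or random.Random()
    n = len(frames)
    if n >= target_len:
        # Draw distinct frame indices at random, then sort them so the
        # selected frames keep their original temporal order.
        keep = sorted(rng.sample(range(n), target_len))
        return [list(frames[i]) for i in keep]
    # Short sequence: append all-zero frames of the same width at the end.
    width = len(frames[0]) if frames else 0
    padding = [[0.0] * width for _ in range(target_len - n)]
    return [list(f) for f in frames] + padding
```

Sorting the sampled indices is what preserves the gesture's direction of motion; sampling without replacement guarantees that no frame is repeated.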
As shown in Figure 9, performance generally improves as the number of hidden states increases, particularly at lower state counts such as 4–64. However, when the number of hidden states reaches 128, performance no longer shows significant gains at certain time steps and occasionally even declines slightly. Performance analysis for different hidden state configurations is as follows:
Hidden states = 4: Performance is relatively low and shows little variation across different time steps.
Hidden states = 8: Demonstrates significant performance improvement but stabilizes after time step 20.
Hidden states = 16: Performance continues to improve, reaching a high level of stability.
Hidden states = 32 and 64: Performance further improves, approaching optimal and stable levels.
Hidden states = 128: Performance is lower than that of the model with 64 hidden states.
For most hidden states, performance improves with increasing time steps, particularly between 10 and 35 steps. After approximately 35–40 steps, performance remains largely stable. Overall, the model performs best and most stably with Hidden states = 64 and a time step of 40.
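The parameter sweep described in this section amounts to a grid search over 9 time steps and 6 hidden-state counts (54 configurations). A minimal sketch follows; `grid_search` and `toy_score` are hypothetical names, and `toy_score` is a placeholder that does not reproduce any measured accuracy. In practice `evaluate` would train and validate a model for each configuration.

```python
from itertools import product

TIME_STEPS = [10, 15, 20, 25, 30, 35, 40, 45, 50]
HIDDEN_SIZES = [4, 8, 16, 32, 64, 128]

def grid_search(evaluate, time_steps=TIME_STEPS, hidden_sizes=HIDDEN_SIZES):
    """Score every (time_step, hidden_size) pair and return the best one."""
    scores = {(t, h): evaluate(t, h) for t, h in product(time_steps, hidden_sizes)}
    best = max(scores, key=scores.get)
    return best, scores

def toy_score(t, h):
    # Placeholder only, NOT the paper's measurements: an arbitrary function
    # that peaks at (40, 64) so the search has something to find.
    return -abs(t - 40) / 40 - abs(h - 64) / 64

best, scores = grid_search(toy_score)
print(best, len(scores))
```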
In summary, the time step determines the temporal receptive field of the model and directly affects the completeness of temporal dependency modeling [33]. Meanwhile, the hidden state dimension determines the temporal representation capability of the BiLSTM network [34]. Considering that the input skeleton feature dimension in this study is 34, the hidden state dimension was selected according to commonly adopted empirical principles in sequence modeling and further optimized through parameter evaluation experiments [35,36]. Experimental results demonstrate that when the time step is set to 40 and the hidden state dimension is set to 64, the model achieves the best balance among recognition accuracy, model stability, generalization ability, and computational efficiency. Therefore, in this study, the hidden state dimension was uniformly set to 64, and the time step length was set to 40. All model experiments were conducted based on these parameters.
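To make the resulting input shape concrete, the sketch below builds one 40 × 34 input window. It assumes that the 34-dimensional feature is the 17 COCO keypoints flattened as (x, y) pairs without confidence scores; the text gives the dimension but not the layout, and the helper names are hypothetical.

```python
NUM_KEYPOINTS = 17   # COCO keypoints per person
TIME_STEPS = 40      # frames per gesture window selected in this section

def flatten_frame(keypoints):
    """Turn 17 (x, y) pairs into one 34-value feature vector."""
    if len(keypoints) != NUM_KEYPOINTS:
        raise ValueError(f"expected {NUM_KEYPOINTS} keypoints, got {len(keypoints)}")
    return [coord for x, y in keypoints for coord in (x, y)]

def build_window(frames):
    """Stack per-frame vectors into a TIME_STEPS x 34 input window."""
    if len(frames) != TIME_STEPS:
        raise ValueError(f"expected {TIME_STEPS} frames, got {len(frames)}")
    return [flatten_frame(f) for f in frames]
```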
4.4. Ablation Experiment
This experiment treats traffic police gesture recognition as the task and compares the performance of a single model with a multi-model fusion approach. The reported inference time reflects only the inference latency of the proposed BiLSTM-Transformer recognition model and excludes the runtime consumed by YOLOv11m-Pose for skeleton keypoint extraction. The experimental results are shown in Table 3.
As shown in Table 3, among the single-model architectures, BiLSTM achieves the best baseline performance, with 96.48% Accuracy and 96.49% F1-Score, significantly outperforming the Transformer model, which attains 91.24% Accuracy and 91.25% F1-Score. From a computational efficiency perspective, the Transformer requires more parameters and longer training time than the BiLSTM, yet it exhibits a noticeable decline in recognition performance. The BiLSTM module mainly captures local temporal continuity and short-term motion-transition characteristics through recurrent dependency learning, making it more suitable for modeling adjacent-frame temporal dynamics. In contrast, Transformer encoder models capture global temporal dependencies via self-attention and can directly establish contextual associations between non-adjacent temporal frames.
From a theoretical perspective, the computational complexity of the BiLSTM module is mainly determined by the temporal sequence length and hidden-state dimension, which can be approximated as $O(T \cdot H^2)$, where T denotes the sequence length, and H represents the hidden-state dimension. In contrast, the Transformer encoder introduces a self-attention mechanism whose computational complexity is approximately $O(T^2 \cdot D)$, where D denotes the feature dimension. Therefore, compared with single temporal models, the proposed hybrid architecture inevitably increases the overall computational complexity due to the integration of recurrent temporal modeling and global self-attention mechanisms.
However, this additional complexity enables collaborative modeling of local temporal continuity and long-range temporal dependencies, thereby significantly enhancing the representation of temporal features. As shown in Table 3, although the proposed fusion model increases the parameter count to 86.41K, the recognition accuracy improves to 98.91%, which is 2.43 percentage points higher than that of the standalone BiLSTM model. Moreover, the temporal sequence length in this study is fixed at 40 frames, effectively constraining the quadratic complexity introduced by the Transformer self-attention mechanism. Therefore, the proposed model achieves a favorable trade-off between computational complexity, recognition accuracy, and inference efficiency.
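A rough sense of scale for the two cost terms at the chosen settings can be obtained with a short calculation. The sketch assumes the usual leading-order estimates, T·H² per direction for a recurrent layer and T²·D for self-attention, with constant factors (such as the four LSTM gates) dropped. The Transformer width D = 2H is an assumption (concatenated forward and backward BiLSTM outputs), not a value stated in the text.

```python
def recurrent_cost(T, H, bidirectional=True):
    """Leading-order multiply count for an LSTM layer: T * H^2 per direction."""
    return (2 if bidirectional else 1) * T * H * H

def attention_cost(T, D):
    """Leading-order multiply count for self-attention scores: T^2 * D."""
    return T * T * D

T, H = 40, 64   # sequence length and hidden size selected in Section 4.3
D = 2 * H       # assumed Transformer width (concatenated BiLSTM output)

print(recurrent_cost(T, H))   # 327680
print(attention_cost(T, D))   # 204800
```

Under these assumptions the two terms are of the same order of magnitude at T = 40, which is consistent with the claim that a fixed, short sequence keeps the quadratic attention term in check.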
The corresponding confusion matrices are presented in Figure 10, Figure 11 and Figure 12, which correspond to the ablation experimental data.
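Accuracy and F1-Score can both be read off a confusion matrix. The sketch below assumes rows are true classes and columns are predicted classes, and that the F1-Score is macro-averaged over classes; the averaging scheme is not stated in the text, so this is an assumption. The 2 × 2 matrix in the example is toy data, not taken from the figures.

```python
def accuracy(cm):
    """Fraction of samples on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total

def macro_f1(cm):
    """Unweighted mean of per-class F1 (rows = true, columns = predicted)."""
    n = len(cm)
    f1_scores = []
    for k in range(n):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp                        # class k predicted as another class
        fp = sum(cm[r][k] for r in range(n)) - tp   # other classes predicted as k
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / n

toy_cm = [[9, 1],
          [2, 8]]   # toy data for illustration only
print(accuracy(toy_cm), macro_f1(toy_cm))
```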
4.5. Comparative Experiments
To validate the effectiveness and practicality of the proposed method for traffic police gesture recognition, comparative experiments were conducted using multiple representative gesture recognition models, including graph convolutional networks, recurrent neural networks, and multi-feature fusion. The comparison results are shown in Table 4.
As shown in Table 4, the proposed BiLSTM + Transformer hybrid spatiotemporal model achieves a recognition accuracy of 98.91%, ranking first among all compared methods. It outperforms mainstream state-of-the-art graph convolutional network (GCN)-based approaches, including STIE-GCN and MD-GCN, thereby demonstrating the effectiveness of the proposed method. In contrast, the accuracies of conventional recurrent temporal models, such as LSTM, GRU, and RCNN, remain around 95%, indicating a clear performance bottleneck. This suggests that temporal modeling alone is insufficient to fully capture the spatial joint characteristics inherent in traffic police gestures. Furthermore, the pure object detection method YOLOv8-nano achieves an accuracy of only 78.70%, which is more than 16 percentage points lower than that of the weakest recurrent-based method. This result indicates that spatiotemporal features derived from human skeletal joints are substantially more effective for traffic police gesture recognition than direct RGB image-based object detection. Skeletal representations can effectively suppress interference caused by background clutter and illumination variations while focusing on the essential motion patterns of gesture execution.
For multi-feature fusion methods combining visual feature extraction with temporal modeling, such as VGGNet-SSD + KEN + LSTM and DenseNet Part Localizer + LSTM, accuracy remains within 96.3–96.9%, with inference times of 0.05–0.10 s, achieving a balance between precision and efficiency. However, such methods typically rely on relatively complex feature-extraction networks with redundant model structures, which demand higher computational resources and more complex deployment environments. Furthermore, PKEN + LSTM and PKEN + Bidirectional LSTM exhibit relatively low recognition performance on this dataset, achieving accuracies of 91.18% and 86.84%, respectively. This indicates that relying solely on local keypoint augmentation without effective global temporal modeling struggles to fully capture the discriminative action patterns inherent in traffic police hand signals.
Compared to the above models, the proposed BiLSTM + Transformer model achieves a more reasonable balance between recognition accuracy and inference efficiency. This approach maintains a low inference time (0.1025 s) while achieving an accuracy of 98.91%, outperforming most methods based on recurrent neural networks and multi-feature fusion. It also demonstrates significantly superior real-time performance compared to graph convolutional network models.
5. Discussion
Experimental results show that the proposed method achieves both high recognition accuracy and low inference latency in traffic police hand gesture recognition, resulting in excellent overall performance. This advantage primarily stems from the optimal selection of skeleton representation and the effective design of the hybrid temporal modeling architecture. First, the keypoint extraction based on YOLOv11m-Pose provides stable, discriminative input representations for subsequent temporal modeling. Compared to end-to-end modeling directly on RGB video, skeleton keypoints effectively mitigate the impact of complex backgrounds, lighting variations, and differences in pedestrian appearance. This allows the model to focus more on human body structure and motion patterns, delivering more stable and consistent input features for subsequent temporal modeling.
Secondly, the fusion architecture of BiLSTM and Transformer is a key factor in performance enhancement. BiLSTM excels at capturing local temporal continuity and action-transition features, while the Transformer enhances modeling of global temporal dependencies through self-attention mechanisms. The combination enables the model to simultaneously represent local details and overall action structures, thereby forming more discriminative temporal feature representations across complex gesture categories. Although the hybrid architecture increases theoretical computational complexity compared with standalone temporal models, the additional complexity remains controllable under the fixed sequence length setting and provides substantial improvements in temporal representation capability and recognition accuracy.
The proposed method primarily relies on skeleton-based temporal motion representations, which improve robustness to background interference and environmental variations. However, under severe occlusion or multi-person overlap, inaccurate keypoint extraction or incorrect skeleton association may propagate errors into subsequent temporal modeling stages and affect recognition performance. Future work will focus on incorporating multi-object modeling mechanisms and more efficient temporal feature modeling strategies, while also exploring multimodal information fusion to further enhance the model’s adaptability and generalization performance in real-world traffic scenarios.
6. Conclusions
(1) This paper proposes a traffic police hand gesture recognition method based on skeletal keypoints and hybrid temporal modeling. It employs YOLOv11m-Pose for human keypoint extraction and constructs a local–global collaborative temporal modeling framework by integrating BiLSTM and Transformer. The BiLSTM module captures short-term motion transitions, while the Transformer encoder models long-range temporal dependencies via self-attention. This complementary temporal modeling strategy improves the model’s adaptability to different gesture evolution patterns. Experimental results demonstrate that this method achieves 98.91% accuracy and F1-score on the traffic police command gesture dataset, with an average inference time of 1.3299 s per gesture sequence.
(2) The ablation study results indicate that, among the single temporal models, BiLSTM achieves 5.24 percentage points higher accuracy than Transformer on the traffic police hand gesture recognition task. After integrating BiLSTM with Transformer, the model improves by 2.43 percentage points in Accuracy and 2.42 percentage points in F1-Score compared with the single BiLSTM baseline.
(3) The comparison results indicate that the BiLSTM + Transformer hybrid spatiotemporal model outperforms GCN-based approaches, traditional recurrent networks, and image-based object detection methods. The use of skeleton representations effectively mitigates the effects of lighting variations and complex backgrounds, while the integration of spatial and temporal feature modeling substantially improves the recognition accuracy of fine-grained gestures.