A Traffic Police Gesture Recognition Method Based on BiLSTM-Transformer Architecture

Zhang, Xiaoyu; Guo, Baohua; Wang, Sen; Sigama, Anthony; Bassir, David

doi:10.3390/electronics15122578

by Xiaoyu Zhang ¹, Baohua Guo ¹,²,*, Sen Wang ¹, Anthony Sigama ¹,³ and David Bassir ⁴,⁵

¹ School of Energy Science and Engineering, Henan Polytechnic University, Jiaozuo 450043, China

² Jiaozuo Engineering Research Center of Road Traffic and Transportation, Henan Polytechnic University, Jiaozuo 450043, China

³ Faculty of Science, Engineering and Agriculture, University of Venda, Thohoyandou 0950, South Africa

⁴ Smart Structural Health Monitoring and Control Laboratory, DGUT-CNAM, Dongguan University of Technology, Dongguan 523808, China

⁵ Centre Borelli, UMR CNRS 9010, ENS Paris-Saclay University, 91190 Gif-sur-Yvette, France

* Author to whom correspondence should be addressed.

Electronics 2026, 15(12), 2578; https://doi.org/10.3390/electronics15122578

Submission received: 24 April 2026 / Revised: 8 June 2026 / Accepted: 9 June 2026 / Published: 11 June 2026

(This article belongs to the Special Issue AI Innovations in Smart Transportation)

Abstract

To address the issues of insufficient real-time performance and inadequate modeling of temporal features in traffic police gesture recognition, this paper proposes a method based on skeleton keypoints and hybrid temporal modeling. First, YOLOv11m-Pose is employed to detect human skeleton keypoints in video sequences, extracting reliable two-dimensional skeleton features. Second, this study designs a temporal modeling network that integrates a bidirectional long short-term memory (BiLSTM) with a Transformer. The BiLSTM models local temporal continuity and action transition features between adjacent frames, capturing short-term dynamic changes. The Transformer, through its self-attention mechanism, models global temporal dependencies and weights critical time steps to extract long-range discriminative information. Experimental results demonstrate that the proposed method achieved 98.91% for both Accuracy and F1-Score. In terms of Accuracy, it outperformed the BiLSTM and Transformer models by 2.43% and 7.67%, respectively. It outperforms most methods based on recurrent neural networks and feature fusion. Meanwhile, the model achieves an average inference time of just 1.3299 s per gesture sequence. Consequently, this approach strikes a favorable balance between recognition accuracy and real-time performance, demonstrating significant practical value.

Keywords: skeletal keypoints; sequence modeling; hybrid neural network; BiLSTM-transformer architecture

1. Introduction

Traffic police gesture recognition technology is key to addressing road congestion and traffic management issues [1]. Especially in autonomous driving systems, accurately interpreting traffic police command gestures is a critical component for enabling coordinated interaction between vehicles and traffic controllers. However, the meaning conveyed by gestures is often influenced by factors such as context, background, and clothing, making gesture recognition a challenging task [2].

In the field of deep learning, traffic police gesture recognition has primarily relied on image-based visual methods [3,4]. These methods typically employ convolutional neural networks (CNNs) to extract features from single frames or sequential video frames [5] and combine them with temporal modeling networks (such as LSTMs and 3D-CNNs) to achieve gesture classification [6]. However, traffic police gestures exhibit distinct temporal dynamics, and single-frame images or simple temporal models struggle to capture their continuous changes and temporal dependencies. Therefore, some studies have introduced skeleton-based gesture recognition methods. Compared with RGB image-based methods, this approach extracts sequences of human keypoints through pose estimation algorithms, mapping high-dimensional pixel information into structured temporal data that describes the spatial positions of human joints and their variation over time [7]. This significantly reduces data dimensionality while preserving the primary motion semantics of the gestures [8,9]. Moreover, this method uses joint motion as the basic modeling unit, which offers good interpretability and matches the nature of traffic police gestures, which are expressed mainly through body movement [10]. Consequently, skeleton-based gesture recognition has gradually become an important research direction in traffic police gesture recognition.

However, traffic police gestures consist of continuous motion sequences with prominent temporal dependency and phased features. When characterizing the overall dynamic evolution of gestures, a single temporal model fails to balance local motion variations and long-term temporal dependencies. Furthermore, key motions across different time segments differ in their significance for gesture discrimination. The absence of an effective weight allocation mechanism tends to weaken discriminative information, thus impairing recognition performance.

To address the aforementioned issues, this paper proposes a traffic police command gesture recognition method that integrates lightweight pose estimation and hybrid temporal modeling, with a collaborative design of skeleton representation and temporal modeling. First, YOLOv11m-Pose is employed to estimate poses of traffic police in video sequences, extract human keypoint sequences, and use them as input features for subsequent temporal modeling. Subsequently, a bidirectional long short-term memory network (BiLSTM) is adopted to model key-point sequences, fully exploring the forward and backward temporal dependencies of gesture motion in the time dimension, thereby enhancing the expressive ability to capture continuous motion variations. On this basis, the Transformer architecture is introduced, and the self-attention mechanism is used to model and weight key motion features across different time steps, thereby more effectively capturing long-term temporal dependencies and global discriminative information.

The main contributions of this paper are as follows:

(1) First, a skeletal temporal representation framework for traffic police gesture recognition is proposed, which integrates lightweight pose estimation and hybrid temporal modeling to effectively reduce background interference while preserving discriminative motion information.

(2) Second, a hybrid temporal modeling architecture combining BiLSTM and Transformer is designed: BiLSTM is responsible for capturing local temporal continuity and motion transition features, while the Transformer models long-range temporal dependencies via the self-attention mechanism, thus achieving multi-scale temporal feature representation.

(3) Extensive experiments on the traffic police gesture dataset demonstrate that the proposed method achieves excellent recognition performance while maintaining low computational complexity, verifying the effectiveness and practical value of the hybrid temporal modeling framework in real-world traffic gesture recognition applications.

The structure of this manuscript is as follows: Section 2 summarizes existing methods for human skeletal keypoint detection and gesture recognition. Section 3 proposes the research method of this paper and elaborates on its underlying principles in detail. Section 4 includes a comprehensive evaluation of the proposed method and compares it with state-of-the-art methods. Section 5 includes an in-depth discussion of the experimental results, analyzes the advantages and limitations of the proposed method, and explores potential avenues for improvement; finally, Section 6 summarizes the research findings and draws the corresponding conclusions.

2. Related Work

In recent years, deep learning-based human pose estimation models have evolved continuously. The OpenPose model enables real-time, reliable keypoint localization for multiple people and supports downstream motion and pose analysis tasks [11]. The HRNet model maintains high-resolution representations throughout the entire network, thereby achieving highly accurate keypoint localization [12]. The YOLO-Pose series of models performs end-to-end, real-time inference for object detection and keypoint localization, making them particularly suitable for real-time gesture recognition and deployment in traffic scenarios [13,14]. AlphaPose is a top-down multi-person pose estimation framework that first detects each person and then estimates the keypoints within each detected region [15]. The MediaPipe model first coarsely localizes the human body, then performs fine-grained keypoint regression, and reduces redundant computation through temporal tracking [16]. These models have achieved favorable keypoint localization accuracy and computational efficiency, thereby providing a reliable data foundation for keypoint-based action and gesture recognition.

Recurrent Neural Networks (RNNs), Long Short-Term Memory networks (LSTMs), Gated Recurrent Units (GRUs), and their variants have been widely applied to gesture recognition and human action analysis tasks [17]. Verma et al. [18] proposed the GoogleNet+BGRU model, achieving a recognition accuracy of over 98% on most datasets. Xu et al. [19] proposed the CFF-RCNN model, which exhibits higher accuracy and faster convergence speed. Noorkholis et al. [20] proposed the 3DCNN+LSTM+FSM model, which maintains a recognition accuracy of 97.8%. Singh et al. [21] demonstrated that by simultaneously modeling the forward and backward temporal information of sequences, BiLSTMs can more comprehensively capture the contextual dependencies of motions in the time dimension. However, these models are limited by their recursive computation; models based on recurrent structures often struggle to fully capture long-term dependencies between key motion segments when processing long sequences or complex motion structures, and their global temporal modeling capabilities remain limited.

To address the aforementioned limitations, researchers have found that Transformers can perform global modeling on features at different time steps at the sequence level. Through the self-attention mechanism, they explicitly model the correlations across time steps, thereby effectively capturing long-term temporal dependencies [22]. The CT-HGR framework proposed by Mansooreh et al. [23] is based on the Vision Transformer’s attention mechanism, which effectively models long-term dependencies among key motion segments. The STFTnet architecture proposed by Tang et al. [24] fuses convolutional neural networks with Transformers; it captures global temporal dependencies via the multi-head self-attention mechanism, breaks free from the constraints of recursive computation, and effectively addresses the problem that recurrent structure models struggle to model the long-term dependencies of key motion segments in long-sequence and complex dynamic gesture recognition. Although Transformer-based methods offer significant advantages in long-term temporal action modeling and fine-grained action differentiation, they still have inherent limitations for characterizing local temporal dynamics when used alone. To address this, Gazis et al. [25] proposed a combined architecture of 3DCNN and Transformer, where 3DCNN extracts short-term spatiotemporal features and Transformer captures long-term dependencies, thereby making up for the shortcomings of the standalone Transformer in characterizing local temporal dynamics. Liu et al. [26] designed a CNN-Transformer hybrid architecture, which extracts local features with ResNeXt50, models global relationships via the CAT branch, further optimizes feature representation by integrating AFB and MFA blocks, and enhances the capability of capturing local temporal dynamics. Wang et al. [27] proposed a TD-CNN-Transformer hybrid architecture that extracts local features from data cube sequences using TD-CNN, combines positional encoding with a Transformer to model global dependencies, and effectively compensates for the limitations of a standalone Transformer. Guo et al. [28] further proposed the MG-GCT architecture, which extracts spatiotemporal local features via a motion-guided two-stream GCN, combines a temporal Transformer to model global dependencies, and improves the characterization of local temporal dynamics.

In summary, effectively integrating the advantages of recurrent neural networks for local temporal modeling with the global dependency-capture capability of Transformers to achieve a balance between local and global temporal features has become an important research direction in current gesture recognition [29,30]. Based on the aforementioned research background, this paper further explores an efficient modeling framework for traffic police command gesture recognition, focusing on the collaborative design of skeletal representations and hybrid temporal modeling.

3. Method

The traffic police gesture recognition method consists of three stages. In the first stage, the YOLOv11m-Pose algorithm extracts skeleton data for traffic police command gestures in the video sequence. In the second stage, data collection errors are eliminated through feature–label frame alignment, and temporal sequence features are constructed using sliding windows, converting single-frame static posture features into temporal features that capture the dynamic evolution of gestures. Finally, the data obtained in the second stage is used as the input of the BiLSTM-Transformer fusion model for action recognition.

This paper proposes a traffic police gesture recognition method based on skeletal keypoints and hybrid temporal modeling. YOLOv11m-Pose is employed for human keypoint extraction, rather than relying on appearance-dependent RGB features. Since skeletal features primarily capture human structural motion, they are less sensitive to illumination variations, background interference, and environmental changes, thereby providing stronger cross-scene robustness and generalization capability.

3.1. Skeleton Key Point Extraction

The YOLOv11m-Pose algorithm uses an end-to-end detection approach that jointly performs target detection and keypoint prediction. The network structure is shown in Figure 1. To extract the skeleton keypoints of a traffic police gesture, the video is first parsed into a continuous stream of frames, which are fed into the pretrained Backbone network for feature extraction; the Neck layer then fuses multi-scale position information. Finally, the Head layer performs classification and bounding-box regression and outputs the detection box and the coordinates of the skeleton keypoints.

The YOLOv11m-Pose model outputs 17 COCO standard human keypoints. For frame t, the keypoint set is denoted as follows:

\begin{matrix} P_{t} = \{(x_{t}^{k}, y_{t}^{k}, c_{t}^{k}) ∣ k = 1, 2, \dots, 17\} \end{matrix}

(1)

where $(x_{t}^{k}, y_{t}^{k})$ represents the two-dimensional coordinates of the k-th keypoint, and $c_{t}^{k}$ denotes its confidence score.

In real-world traffic scenarios, issues such as occlusion, motion blur, and detection jitter frequently occur. To ensure feature quality, a keypoint confidence threshold ($τ = 0.5$) was adopted. When the confidence score satisfied $c_{t}^{k} < τ$, the corresponding keypoint was regarded as missing. Directly filling missing keypoints with zero values is ambiguous, because it may confuse missing keypoints with valid keypoints located near the coordinate origin; to avoid this, a visibility-mask-based temporal interpolation strategy was employed to handle missing keypoints.

First, a visibility mask was generated for each frame of the skeleton sequence:

\begin{matrix} M_{t} = \{m_{k}\}_{k = 1}^{17}, \quad m_{k} = \begin{cases} 1, & c_{t}^{k} \geq τ \\ 0, & c_{t}^{k} < τ \end{cases} \end{matrix}

(2)

where $m_{k} = 1$ indicates that the keypoint is valid, while $m_{k} = 0$ indicates that the keypoint is missing. Subsequently, linear temporal interpolation is applied to restore the coordinates of missing keypoints, thereby ensuring the temporal continuity of the motion trajectory.
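The masking and interpolation steps can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, each coordinate of each keypoint is treated as an independent time series, and gaps at the start or end of a sequence are filled with the nearest valid value, which is an assumption because the text does not specify edge handling.

```python
from typing import List

TAU = 0.5  # confidence threshold stated in the text


def visibility_mask(conf: List[float], tau: float = TAU) -> List[int]:
    """Equation (2): 1 where the confidence is at least tau, otherwise 0."""
    return [1 if c >= tau else 0 for c in conf]


def interpolate_track(values: List[float], mask: List[int]) -> List[float]:
    """Linearly interpolate one coordinate of one keypoint over time.

    values[t] is the coordinate in frame t and mask[t] its visibility.
    Leading and trailing gaps copy the nearest valid value (assumption).
    """
    valid = [t for t, m in enumerate(mask) if m == 1]
    if not valid:
        return list(values)  # no valid frame to interpolate from
    out = list(values)
    for t in range(len(values)):
        if mask[t] == 1:
            continue
        prev = max((v for v in valid if v < t), default=None)
        nxt = min((v for v in valid if v > t), default=None)
        if prev is None:
            out[t] = values[nxt]
        elif nxt is None:
            out[t] = values[prev]
        else:
            w = (t - prev) / (nxt - prev)  # position between the two valid frames
            out[t] = (1 - w) * values[prev] + w * values[nxt]
    return out


# Frames 1 and 2 fall below the threshold and are rebuilt from frames 0 and 3.
mask = visibility_mask([0.9, 0.2, 0.4, 0.8])
x_track = interpolate_track([10.0, 0.0, 0.0, 16.0], mask)
```

The mask is kept after interpolation so that downstream layers can still tell measured coordinates from reconstructed ones.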

All keypoints within a single frame were concatenated in a fixed order to form a one-dimensional feature vector:

\begin{matrix} P_{t} = [x_{1}, y_{1}, x_{2}, y_{2}, \dots, x_{17}, y_{17}] \end{matrix}

(3)

Meanwhile, the visibility mask vector was defined as follows:

\begin{matrix} M_{t} = [m_{1}, m_{2}, \dots, m_{17}] \end{matrix}

(4)

The mask vector was concatenated with the skeleton coordinate features to obtain the final input representation:

\begin{matrix} X_{t} = [P_{t}, M_{t}] \end{matrix}

(5)

Finally, the frame-wise feature matrix X was constructed, where N denotes the total number of video frames.

3.2. Data Processing

Gesture labels are provided in a frame-level annotation format. Because frame counts may be mismatched when data is exported from the original source, this paper adopts a length alignment strategy to ensure a one-to-one correspondence between features and labels. Let the number of feature frames be $N_{t}$ and the number of labels be $N_{l}$. The alignment length is $N = \min(N_{t}, N_{l})$, and truncating both sequences to this length yields the aligned feature and label sequences $X_{1:N}$ and $Y_{1:N}$.

To handle potential null values and missing keypoints in the feature files, a time-series-based linear interpolation method is used for completion, and a visibility mask records each keypoint’s validity.

To match the BiLSTM-Transformer network’s input format $(B, T, F)$, the frame-wise features are further converted into fixed-length sequence samples. A sliding window strategy is adopted, and for a given sequence length $T$ ($T = 30$), the i-th sequence sample is denoted as follows:

\begin{matrix} S_{i} = [X_{i}, X_{i + 1}, \dots, X_{i + T - 1}] \in R^{T \times 34} \end{matrix}

(6)

The corresponding labels adopt the sequence-to-one supervision strategy, taking the label of the last frame in the sequence:

\begin{matrix} {\hat{y}}_{i} = y_{i + T - 1} \end{matrix}

(7)
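The alignment and windowing steps of Equations (6) and (7) can be sketched as follows. This is a hedged illustration: the helper names are hypothetical and a stride of one frame is assumed, since the text does not state the stride explicitly.

```python
from typing import List, Sequence, Tuple


def align(features: Sequence, labels: Sequence) -> Tuple[list, list]:
    """Truncate both sequences to the common length N = min(N_t, N_l)."""
    n = min(len(features), len(labels))
    return list(features[:n]), list(labels[:n])


def sliding_windows(features: Sequence, labels: Sequence, T: int = 30) -> List[tuple]:
    """Build (S_i, y_i) pairs following Equations (6) and (7).

    Each sample holds T consecutive frames and takes the label of its
    last frame (sequence-to-one supervision). Stride 1 is an assumption.
    """
    feats, labs = align(features, labels)
    samples = []
    for i in range(len(feats) - T + 1):
        samples.append((feats[i:i + T], labs[i + T - 1]))
    return samples


# Toy data: 6 feature frames but only 5 labels, window length 3.
feats = [[f] for f in range(6)]
labs = ["a", "b", "c", "d", "e"]
samples = sliding_windows(feats, labs, T=3)
```

With a stride of one, an aligned sequence of N frames yields N − T + 1 windows, and a sequence shorter than T yields none.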

With this strategy, frame-level data of length N generates up to $N - T + 1$ (approximately $N - T$) sequence samples. This sequence construction method models local continuous motion variations in traffic police gestures while preserving sample size, thereby providing a stable input for subsequent temporal network learning. Before temporal modeling, the coordinates of the reference center joint are subtracted from all skeleton joint coordinates in each frame, converting the extracted 2D skeleton coordinates into relative coordinates:

\begin{matrix} {\hat{p}}_{i}^{t} = p_{i}^{t} - p_{c}^{t} \end{matrix}

(8)

where $p_{i}^{t}$ denotes the coordinate of the i-th skeleton joint in frame t, $p_{c}^{t}$ represents the coordinate of the reference center joint, and ${\hat{p}}_{i}^{t}$ is the normalized relative coordinate.

This preprocessing preserves the inherent relative motion relationships between skeleton joints while reducing the impact of global positional variations.
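Equation (8) amounts to a per-frame translation. The sketch below is illustrative only: the function name is hypothetical, and the index of the reference center joint is passed in as a parameter because the text does not say which joint serves as the reference.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def to_relative(frame: List[Point], center_index: int) -> List[Point]:
    """Subtract the reference joint's coordinates from every joint (Equation (8))."""
    cx, cy = frame[center_index]
    return [(x - cx, y - cy) for x, y in frame]


# Three joints, with the last one chosen as the reference for this example.
frame = [(2.0, 3.0), (5.0, 7.0), (4.0, 4.0)]
relative = to_relative(frame, center_index=2)
```

After the shift the reference joint sits at the origin, so translating the whole skeleton within the image leaves the features unchanged.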

3.3. Design of BiLSTM-Transformer Fusion Model

Traffic police gestures exhibit distinct temporal dynamics, including variations in motion duration, transition speed, temporal continuity, and long-range inter-frame dependencies. Therefore, a single temporal modeling mechanism is insufficient to comprehensively characterize diverse gesture dynamics. To address this issue, the proposed framework adopts a hierarchical hybrid temporal modeling strategy that integrates BiLSTM, Transformer, and attention pooling mechanisms to learn local and global temporal dependencies collaboratively.

3.3.1. Embedding Layer

First, the input pose feature $X$ is mapped from 34 to 128 dimensions via a linear embedding layer, and Layer Normalization is applied to alleviate the vanishing gradient problem during training. This also suppresses noise from pose estimation errors, thereby enhancing network training stability.

\begin{matrix} E_{t} = LayerNorm (W_{e} X_{t} + b_{e}) \end{matrix}

(9)

where $E_{t}$ is the high-dimensional embedding feature of frame t, $W_{e}$ is the weight matrix of the embedding layer, $X_{t}$ is the original pose feature of frame t, and $b_{e}$ is the bias vector of the embedding layer.
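To make Equation (9) concrete, the following sketch applies a linear map followed by layer normalization to a single frame. It is a simplified illustration with hypothetical function names: tiny dimensions replace the 34-to-128 mapping, and the learnable gain and bias that deep learning frameworks usually attach to layer normalization are omitted.

```python
import math
from typing import List


def linear(x: List[float], W: List[List[float]], b: List[float]) -> List[float]:
    """Compute W x + b, where W has one row per output dimension."""
    return [sum(w * v for w, v in zip(row, x)) + b_i for row, b_i in zip(W, b)]


def layer_norm(z: List[float], eps: float = 1e-5) -> List[float]:
    """Normalize one vector to zero mean and (nearly) unit variance."""
    mean = sum(z) / len(z)
    var = sum((v - mean) ** 2 for v in z) / len(z)
    return [(v - mean) / math.sqrt(var + eps) for v in z]


def embed(x: List[float], W: List[List[float]], b: List[float]) -> List[float]:
    """E_t = LayerNorm(W_e X_t + b_e), without learnable affine parameters."""
    return layer_norm(linear(x, W, b))


# Toy example: a 2-dimensional input mapped to 3 dimensions.
W_e = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b_e = [0.0, 0.0, 0.0]
E_t = embed([1.0, 2.0], W_e, b_e)
```

Because the normalization is computed per frame, a frame whose coordinates are uniformly scaled or offset by estimation noise still produces an embedding on the same scale as its neighbors.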

3.3.2. BiLSTM

To capture local temporal dependencies in traffic police gestures, this paper employs a BiLSTM to model embedded feature sequences. The LSTM model can effectively capture temporal dependencies in sequence data through the gating mechanisms of the forget gate, input gate, and output gate. The LSTM cell unit architecture is shown in Figure 2, and the three gate control mechanisms are as follows.

(1) The forget gate determines which historical information to discard based on the current input information. For instance, when a gesture transition from Stop to Go Straight is detected, the forget gate actively discards the previous memory of wrist positions. The specific calculation is given by Equation (10), which outputs a value between 0 and 1 to control the proportion of historical information to discard, with 0 indicating complete forgetting and 1 indicating full retention.

\begin{matrix} F_{m} = σ (X_{m} W_{x f} + H_{m - 1} W_{h f} + b_{f}) \end{matrix}

(10)

where $X_{m}$ denotes the input data at time m, $H_{m - 1}$ is the hidden-layer output state of the LSTM unit at time $m - 1$, $W$ and $b$ represent the corresponding weights and bias parameters, $F_{m}$ is the state value of the forget gate at time m, and $σ$ denotes the Sigmoid function.

(2) The LSTM integrates new information into the cell state via the input gate, with the specific calculations given by Equations (11) and (12).

Update weight:

\begin{matrix} I_{m} = σ (X_{m} W_{x i} + H_{m - 1} W_{h i} + b_{i}) \end{matrix}

(11)

Candidate values:

\begin{matrix} {\tilde{C}}_{m} = tanh (X_{m} W_{x c} + H_{m - 1} W_{h c} + b_{c}) \end{matrix}

(12)

where $I_{m}$ is the state value of the input gate at time m, and ${\tilde{C}}_{m}$ is the candidate value of the memory cell at time m.

(3) Cell state update

\begin{matrix} C_{m} = F_{m} ⊙ C_{m - 1} + I_{m} ⊙ {\tilde{C}}_{m} \end{matrix}

(13)

All information transmission is centered on the cell state, and the updated cell state directly determines the output.

(4) Based on the updated cell state, the output at the current time step is controlled via the output gate, with the specific calculations given by Equations (14) and (15).

\begin{matrix} O_{m} = σ (X_{m} W_{x o} + H_{m - 1} W_{h o} + b_{o}) \end{matrix}

(14)

\begin{matrix} H_{m} = O_{m} ⊙ tanh (C_{m}) \end{matrix}

(15)
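Equations (10) to (15) describe one time step of an LSTM cell, which can be written out directly. The sketch below uses plain Python lists and the row-vector convention $X_{m} W$ of Equations (11) to (14); the function names and the `params` dictionary layout are hypothetical choices for illustration, not the authors' code.

```python
import math
from typing import Dict, List, Tuple

Matrix = List[List[float]]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def affine(x: List[float], h: List[float], Wx: Matrix, Wh: Matrix, b: List[float]) -> List[float]:
    """Compute x Wx + h Wh + b for row vectors x and h."""
    return [
        sum(x[i] * Wx[i][j] for i in range(len(x)))
        + sum(h[i] * Wh[i][j] for i in range(len(h)))
        + b[j]
        for j in range(len(b))
    ]


def lstm_step(x, h_prev, c_prev, params: Dict[str, Tuple[Matrix, Matrix, List[float]]]):
    """One LSTM step; params maps 'f', 'i', 'c', 'o' to (Wx, Wh, b)."""
    f = [sigmoid(z) for z in affine(x, h_prev, *params["f"])]          # Eq. (10)
    i = [sigmoid(z) for z in affine(x, h_prev, *params["i"])]          # Eq. (11)
    c_tilde = [math.tanh(z) for z in affine(x, h_prev, *params["c"])]  # Eq. (12)
    c = [fk * cp + ik * ct for fk, cp, ik, ct in zip(f, c_prev, i, c_tilde)]  # Eq. (13)
    o = [sigmoid(z) for z in affine(x, h_prev, *params["o"])]          # Eq. (14)
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]                   # Eq. (15)
    return h, c


# With all-zero parameters every gate outputs 0.5, so half of the old
# cell state is kept and nothing new is written.
zero = ([[0.0]], [[0.0]], [0.0])
params = {gate: zero for gate in "fico"}
h, c = lstm_step([1.0], [0.0], [2.0], params)
```

In this zero-parameter case the cell state falls from 2.0 to 1.0, which shows the forget gate acting as a retention factor between 0 and 1.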

BiLSTM is a variant of LSTM that runs two LSTMs over the same time series: a forward LSTM that processes the sequence in order from the first element to the last, and a backward LSTM that processes it in the opposite direction. The two hidden states are concatenated to form the final bidirectional hidden state. The operation process of the BiLSTM model is shown in Figure 3.

Specifically, the BiLSTM module focuses on modeling local temporal continuity and short-term motion-transition characteristics among adjacent frames, enabling effective representation of gesture initiation, transition, and termination dynamics.

The high-dimensional embedding features of each frame are fed into the bidirectional LSTM to obtain the forward and backward hidden states at frame t, which capture temporal dependencies from the past to the present and from the future to the present, respectively, as shown in Equations (16) and (17).

\begin{matrix} {\vec{h}}_{t} = {LSTM}_{f} (E_{t}) \end{matrix}

(16)

\begin{matrix} {\overset{\leftarrow}{h}}_{t} = {LSTM}_{b} (E_{t}) \end{matrix}

(17)

Finally, the forward and backward hidden states are concatenated, as shown in Equation (18). The bidirectional structure enables the model to use past and future context simultaneously, which helps alleviate recognition difficulties caused by blurred gesture start and end boundaries.

\begin{matrix} H_{t} = [{\vec{h}}_{t}; {\overset{\leftarrow}{h}}_{t}] \end{matrix}

(18)
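The bidirectional pass of Equations (16) to (18) can be illustrated independently of the cell type. In the sketch below, `toy_step` is a deliberately simple stand-in recurrence (not an LSTM) with hypothetical fixed weights, so that the forward and backward traversal and the per-frame pairing are easy to follow.

```python
import math
from typing import Callable, List, Tuple


def toy_step(x: float, h_prev: float, w: float = 0.5, u: float = 0.5) -> float:
    """Stand-in recurrent cell (NOT an LSTM): h = tanh(w*x + u*h_prev)."""
    return math.tanh(w * x + u * h_prev)


def bidirectional(seq: List[float], step: Callable[[float, float], float] = toy_step) -> List[Tuple[float, float]]:
    """Return (forward state, backward state) for every frame, as in Equation (18)."""
    fwd, h = [], 0.0
    for x in seq:                # first frame -> last frame
        h = step(x, h)
        fwd.append(h)
    bwd, h = [], 0.0
    for x in reversed(seq):      # last frame -> first frame
        h = step(x, h)
        bwd.append(h)
    bwd.reverse()                # restore chronological order
    return list(zip(fwd, bwd))


# A single impulse in the first frame: only the forward state at the
# last frame still carries a trace of it.
states = bidirectional([1.0, 0.0, 0.0])
```

At the last frame the forward state is nonzero because it has seen the earlier impulse, while the backward state is zero because nothing follows it. Pairing the two gives each frame access to context on both sides.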

3.3.3. Transformer Encoder

Although BiLSTM performs well in short-term modeling, its ability to capture long-term dependencies is limited. To address this, this paper introduces a Transformer encoder on top of the BiLSTM output to enhance the modeling capability of global temporal relationships. The Transformer model architecture is shown in Figure 4.

The Transformer encoder primarily consists of a multi-head self-attention mechanism and a feedforward network. The calculation form of the multi-head self-attention is as follows:

\begin{matrix} Attention (Q, K, V) = softmax (\frac{Q K^{T}}{\sqrt{d}}) V \end{matrix}

(19)

Here, $Q$ is the query matrix, representing the feature vectors at the current time step, used to query relevant temporal features. $K$ is the key matrix, comprising the feature vectors across all time steps, used to match against $Q$. $V$ is the value matrix, containing the feature vectors across all time steps, serving as the attention-weighted target. $QK^{T}$ denotes the product of $Q$ and the transpose of $K$, which measures the similarity between each query and each key; higher scores indicate stronger relevance. $\sqrt{d}$ is the scaling factor that prevents vanishing gradients in the softmax layer caused by excessively high dimensionality. The softmax normalization function converts similarity scores into attention weights ranging from 0 to 1.
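Equation (19) can be traced by hand with a small pure-Python version. This sketch covers a single attention head and takes $Q$, $K$, and $V$ as given; the learned projection matrices and the multi-head split of a full Transformer encoder are left out, and the function names are hypothetical.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q: Matrix, K: Matrix, V: Matrix) -> Tuple[Matrix, Matrix]:
    """Scaled dot-product attention; returns (outputs, attention weights)."""
    d = len(K[0])
    outputs, weights = [], []
    for q in Q:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of the value vectors.
        outputs.append([sum(wt * v[j] for wt, v in zip(w, V)) for j in range(len(V[0]))])
    return outputs, weights


# A zero query matches every key equally, so the output is the mean of V.
out, w = attention([[0.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]], [[2.0, 0.0], [0.0, 4.0]])
```

Each row of the weight matrix sums to one, so every output frame is a convex combination of the value vectors from all time steps.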

Unlike conventional recurrent modeling approaches that assume relatively uniform propagation of temporal dependencies, the Transformer encoder dynamically establishes long-range temporal associations through the self-attention mechanism. Different temporal frames are assigned adaptive correlation weights based on contextual relevance, enabling the network to capture distinct patterns of temporal dependencies across gesture actions.

3.3.4. Attention Pooling Mechanism

To compress temporal features into a fixed-length global representation, this paper employs an attention pooling mechanism. The calculation method for its attention weights is shown in Equation (20):

\begin{matrix} α_{t} = \frac{exp (w^{T} tanh (W_{ff} H_{t}))}{\sum_{i = 1}^{T} exp (w^{T} tanh (W_{ff} H_{i}))} \end{matrix}

(20)

where $α_{t}$ is the attention weight for frame t; higher weights indicate greater importance of that frame for classification. The tanh hyperbolic tangent activation function maps features to the interval (−1, 1), enhancing nonlinear representation. $W_{ff}$ denotes the transformation matrix for attention weights, $H_{t}$ represents the bidirectional LSTM hidden state at frame t, and $w^{T}$ is the transposed weight vector used to compute scalar attention scores.

Equation (20) essentially adopts a Softmax-based normalization mechanism for attention-score computation. This normalization constrains all attention weights to the interval [0, 1] while ensuring that the summation of all temporal attention coefficients equals 1. Therefore, the attention pooling mechanism forms a normalized probabilistic distribution over temporal frames, effectively preventing uncontrolled weight amplification and imbalanced allocation of keyframe feature weights.

The final context vector is represented as follows:

\begin{matrix} c = \sum_{t = 1}^{T} α_{t} H_{t} \end{matrix}

(21)

where $c$ represents the global context feature vector, which integrates the key features of the entire time series.
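Equations (20) and (21) together form a small pooling routine, sketched here in plain Python. The function name and the toy dimensions are hypothetical; `W` plays the role of $W_{ff}$ and `w` the role of the scoring vector.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def attention_pool(H: Matrix, W: Matrix, w: List[float]) -> Tuple[List[float], List[float]]:
    """Return (alpha, c): per-frame weights and the pooled context vector."""
    scores = []
    for h in H:
        # u = tanh(W h), then a scalar score w^T u for this frame.
        u = [math.tanh(sum(Wij * hj for Wij, hj in zip(row, h))) for row in W]
        scores.append(sum(wi * ui for wi, ui in zip(w, u)))
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    alpha = [e / total for e in exps]                     # Eq. (20)
    c = [sum(a * h[j] for a, h in zip(alpha, H)) for j in range(len(H[0]))]  # Eq. (21)
    return alpha, c


# With a zero scoring vector all frames score equally, so the pooled
# vector reduces to the plain average of the frames.
H = [[1.0, 2.0], [3.0, 4.0]]
alpha, c = attention_pool(H, W=[[1.0, 0.0], [0.0, 1.0]], w=[0.0, 0.0])
```

Uniform weights recover mean pooling, and training the scoring parameters lets the model move weight toward the frames that matter most for the class decision.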

Finally, the context features obtained through attention pooling are fed into the classification head for gesture category prediction. The classification head consists of a fully connected layer, a ReLU activation function, and Dropout, and outputs the final category probabilities via Softmax:

\begin{matrix} \hat{y} = Softmax (W_{c} c + b_{c}) \end{matrix}

(22)

where $\hat{y}$ denotes the predicted class probability vector, $W_{c}$ represents the weight matrix of the classification head with dimensions [N × D], and $b_{c}$ indicates the bias vector of the classification head.

3.4. BiLSTM-Transformer Model

Let the input sequence be $X \in R^{B \times T \times F}$, where B is the batch size, T is the sequence length, and F is the feature dimension of a single frame. First, $X$ is fed into the BiLSTM layer to obtain the bidirectional hidden states $H_{LSTM} \in R^{B \times T \times 2 H}$:

\begin{matrix} H_{LSTM} = BiLSTM (X) \end{matrix}

(23)

In this equation, $H$ denotes the hidden state dimension of the LSTM. The bidirectional long short-term memory (BiLSTM) network captures local temporal dependencies and outputs a contextual feature representation for each frame.

Subsequently, $H_{LSTM}$ is fed into the Transformer encoder to obtain the feature representation $H_{Trans} \in R^{B \times T \times 2 H}$, which integrates global contextual information:

\begin{matrix} H_{Trans} = TransformerEncoder (H_{LSTM}) \end{matrix}

(24)

The Transformer models long-range global dependencies via the multi-head self-attention mechanism, effectively capturing contextual correlations across long temporal sequences. Finally, a temporal attention mechanism is introduced to generate the weighted context vector $C$:

\begin{matrix} α_{t} = softmax (tanh (H_{Trans} W_{1}) W_{2}) \end{matrix}

(25)

\begin{matrix} C = \sum_{t = 1}^{T} α_{t} ⊙ H_{Trans, t} \end{matrix}

(26)

where $W_{1} \in R^{2 H \times H}$ and $W_{2} \in R^{H \times 1}$ are learnable network parameters, ⊙ denotes element-wise multiplication, and $α_{t}$ represents the attention weight corresponding to the t-th frame. The weighted context vector $C$ is then fed into the classifier:

\begin{matrix} \hat{Y} = Classifier (C) \end{matrix}

(27)

The BiLSTM extracts local temporal dynamic features, while the Transformer enhances global dependency modeling. Their fusion enables the model to learn discriminative temporal representations for traffic police gesture recognition. The BiLSTM-Transformer model framework is shown in Figure 5.

4. Experiments and Results

Figure 6 shows that the Chinese traffic police gestures include eight types: Stop, Go straight, Wait for the left turn, Turn Left, Turn Right, Change lane, Slow down, and Pull over. The public traffic police gesture dataset [31] includes a “Prepare” action, indicating that the traffic police did not take any action. The experimental dataset contains 20 videos with a resolution of 1024 × 768. The dataset partition is presented in Table 1, where the training and testing sets correspond to different but paired traffic scenarios. The gesture dataset only provides frame-level ground-truth gesture labels for each video frame, while no skeleton annotations are available for the gesture sequences.

Therefore, the YOLOv11m-Pose algorithm is used to extract skeleton data for each video frame, outputting 17 keypoints: the nose, left eye, right eye, left ear, right ear, left shoulder, right shoulder, left elbow, right elbow, left wrist, right wrist, left hip, right hip, left knee, right knee, left ankle, and right ankle. Taking the Stop category as an example, its skeleton sequence diagram is shown in Figure 7. The extracted skeleton sequences are then temporally aligned with the corresponding gesture category labels to construct temporally consistent input data for the gesture recognition model.

The experiments were conducted on the Windows 11 operating system with the following computer configuration: CPU: 13th Gen Intel® Core™ i7-13650HX @ 2.60 GHz, RAM: 16.0 GB, and GPU: NVIDIA GeForce RTX 4060 Laptop GPU. The deep learning platform was built on the PyTorch 2.0.1 framework, with program code written in Python 3.9. Computational acceleration was implemented using NVIDIA CUDA and NVIDIA cuDNN.

The Transformer encoder comprises four layers with eight attention heads. The embedding and hidden-state dimensions of the BiLSTM are both set to 64. The model is trained with the AdamW optimiser from PyTorch 2.0.1, an initial learning rate of 3 \times 10^{-4}, a batch size of 64, and 350 training epochs. The dropout rates for the Transformer encoder and the classifier are set to 0.1 and 0.3, respectively.

4.1. Evaluation Indicators

The evaluation metrics for the gesture recognition task are accuracy, F1-score, number of parameters, and inference time. Accuracy and F1-score are calculated as shown in Equations (28) and (29).

\begin{matrix} Accuracy = \frac{TP + TN}{TP + TN + FP + FN} \end{matrix}

(28)

\begin{matrix} F1 = 2 \times \frac{Precision \times Recall}{Precision + Recall} \end{matrix}

(29)

where Precision = \frac{TP}{TP + FP} and Recall = \frac{TP}{TP + FN}. TP denotes true positive samples (correctly detected), TN denotes true negative samples, FP denotes false positive samples (falsely detected), and FN denotes false negative samples (missed detections).
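For a multi-class task, these quantities can be read off a confusion matrix. The sketch below is one plausible way to do so, under two assumptions that the text does not state: accuracy is taken as the fraction of correctly classified samples, and the per-class F1 values are macro-averaged. The function name and the two-class toy matrix are hypothetical.

```python
def accuracy_and_macro_f1(confusion):
    """Compute accuracy and macro-averaged F1 from a square confusion matrix.

    confusion[i][j] is the number of samples of true class i predicted as class j.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(n))
    f1_scores = []
    for k in range(n):
        tp = confusion[k][k]
        fp = sum(confusion[i][k] for i in range(n)) - tp  # predicted k, true class differs
        fn = sum(confusion[k]) - tp                       # true k, predicted otherwise
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f1_scores.append(2 * precision * recall / (precision + recall))
        else:
            f1_scores.append(0.0)
    return correct / total, sum(f1_scores) / n


# Toy two-class example: 17 of 20 samples are classified correctly.
toy = [[8, 2],
       [1, 9]]
acc, macro_f1 = accuracy_and_macro_f1(toy)
print(f"accuracy={acc:.4f}, macro F1={macro_f1:.4f}")
```

If a weighted average over classes were used instead of the macro average, only the final averaging line would change.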

4.2. Selection of Skeletal Keypoint Detection Models

To assess how different skeleton extraction methods affect traffic-police gesture recognition performance, this study employed the BiLSTM + Transformer model with a temporal sequence length of 40 and an LSTM hidden-state dimension of 64. A total of nine mainstream keypoint detection models, including the YOLO series (YOLOv11m/n/l/x and YOLOv8n), OpenPose, AlphaPose, HRNet, and MediaPipe, were evaluated. The experimental results are presented in Table 2.

As shown in Table 2, significant performance differences exist among skeleton keypoint detection models on the traffic police gesture dataset. YOLOv11m-Pose achieved the best results across all models, with Accuracy and F1-Score both at 98.91%. Compared with the OpenPose, MediaPipe, HRNet, and AlphaPose models, its Accuracy is higher by 1.53, 4.92, 6.44, and 3.72 percentage points, respectively, while its F1-Score is higher by 1.56, 5.05, 6.51, and 3.70 percentage points.

Among the YOLO-based methods, YOLOv11m-Pose outperforms YOLOv11n-Pose, YOLOv11l-Pose, YOLOv11x-Pose, and YOLOv8n-Pose, yielding accuracy gains of 0.42%, 0.72%, 13.67%, and 8.13%, respectively. The above results demonstrate that YOLOv11m-Pose possesses significantly enhanced keypoint localization accuracy and stability when handling complex dynamic gesture scenarios. In summary, the choice of skeleton extraction method directly impacts the performance of downstream temporal gesture recognition. Among the evaluated methods, YOLOv11m-Pose provides the most favorable trade-off between recognition accuracy, robustness, and computational efficiency.
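The gaps quoted above follow directly from the accuracy values in Table 2. The short sketch below recomputes them; the dictionary simply transcribes the table, and the helper name `gaps_to_best` is hypothetical.

```python
# Accuracy (%) per keypoint detector, transcribed from Table 2.
TABLE2_ACCURACY = {
    "YOLOv11m-Pose": 98.91,
    "YOLOv11n-Pose": 98.49,
    "YOLOv11l-Pose": 98.19,
    "YOLOv11x-Pose": 85.24,
    "YOLOv8n-Pose": 90.78,
    "OpenPose": 97.38,
    "MediaPipe": 93.99,
    "HRNet": 92.47,
    "AlphaPose": 95.19,
}


def gaps_to_best(scores, best="YOLOv11m-Pose"):
    """Return the accuracy gap, in percentage points, between `best` and every other model."""
    return {
        model: round(scores[best] - value, 2)
        for model, value in scores.items()
        if model != best
    }


gaps = gaps_to_best(TABLE2_ACCURACY)
for model, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(f"{model}: {gap:.2f} percentage points below YOLOv11m-Pose")
```

Sorting by gap lists the detectors from closest to furthest from YOLOv11m-Pose.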

For visual comparison, the skeletal keypoint detection results of all models are presented in Figure 8, which illustrates the keypoint detection accuracy in actual scenes.

Traffic police gestures are primarily conveyed through upper-body joint movements, particularly the coordinated motions of the shoulders, elbows, and wrists. The 17 standard COCO keypoints extracted by YOLOv11m-Pose are sufficient to capture the essential motion features for gesture discrimination while maintaining computational efficiency and enabling real-time detection.

4.3. Parameter Evaluation

The network architecture and input data structure of the model fundamentally influence the effectiveness of traffic police command action recognition. This section evaluates the impact of the temporal step size of the input sequence and the number of hidden states on the LSTM network’s accuracy.

The time step length serves as the “time window” through which models perceive continuous actions in sequential tasks, directly determining both the quality of temporal information capture and computational efficiency. If the step length is too short, the model cannot capture the entire gesture, resulting in fragmented features and a significant drop in classification accuracy. If the step length is too long, it introduces redundant static frames after the gesture completes. This not only increases the computational load on LSTMs and Transformers but also causes the model to focus on noise rather than core actions, leading to overfitting. Furthermore, the effective action duration varies across different gestures. A fixed step length forces padding with zeros for short gestures and truncation for long ones, compromising the integrity of temporal features and ultimately reducing the model’s adaptability to diverse gestures.

The number of LSTM hidden states serves as the “container” for storing temporal features within the model, balancing feature representation capability with generalization performance. If the number of hidden states is too low, the model fails to capture fine-grained gesture features, leading to underfitting. Conversely, an excessive number of hidden states allows the model to fit more details but increases its susceptibility to noise in the training data, leading to overfitting.

Therefore, to evaluate the impact of input sequence time steps and LSTM hidden state counts on the model, testing was conducted using an LSTM model. The time step values ranged within [10, 15, 20, 25, 30, 35, 40, 45, 50]; hidden state counts ranged within [4, 8, 16, 32, 64, 128]. For action sequences with fewer than 10 frames, the estimated data were padded with all zeros at the end. For sequences exceeding 50 frames, a shortened frame sequence was generated by selecting random, non-repeating frames and arranging them in their original order [32].
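The padding and shortening rules can be sketched as below. This is an illustration under an assumption: both rules are applied relative to a single target length (the chosen time step), which the text does not state explicitly. The function name `fit_to_length` and the toy sequences are hypothetical.

```python
import random


def fit_to_length(frames, target_len, feature_dim=34, rng=None):
    """Return a copy of `frames` with exactly `target_len` frames.

    Short sequences are padded with all-zero frames at the end. Long sequences
    are shortened by drawing distinct frame indices at random and keeping the
    selected frames in their original temporal order.
    """
    rng = rng or random.Random()
    if len(frames) < target_len:
        padding = [[0.0] * feature_dim for _ in range(target_len - len(frames))]
        return [list(f) for f in frames] + padding
    if len(frames) > target_len:
        # random.sample draws without replacement; sorting restores time order.
        keep = sorted(rng.sample(range(len(frames)), target_len))
        return [list(frames[i]) for i in keep]
    return [list(f) for f in frames]


# Hypothetical sequences: 6 frames (too short) and 60 frames (too long).
short_seq = [[1.0] * 34 for _ in range(6)]
long_seq = [[float(i)] * 34 for i in range(60)]
padded = fit_to_length(short_seq, 40)
sampled = fit_to_length(long_seq, 40, rng=random.Random(0))
```

Passing a seeded `random.Random` instance makes the frame selection reproducible between runs.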

As shown in Figure 9, performance generally improves as the number of hidden states increases, particularly from 4 to 64. However, when the number of hidden states reaches 128, performance no longer shows significant gains at certain time steps and occasionally even declines slightly. Performance analysis for different hidden state configurations is as follows:

Hidden states = 4: Performance is relatively low and shows little variation across different time steps.
Hidden states = 8: Demonstrates significant performance improvement but stabilizes after time step 20.
Hidden states = 16: Performance continues to improve, reaching a high level of stability.
Hidden states = 32 and 64: Performance further improves, approaching optimal and stable levels.
Hidden states = 128: Performance is lower than that of the model with 64 hidden states.

For most hidden states, performance improves with increasing time steps, particularly between 10 and 35 steps. After approximately 35–40 steps, performance remains largely stable. Overall, the model performs best and most stably with Hidden states = 64 and a time step of 40.

In summary, the time step determines the temporal receptive field of the model and directly affects the completeness of temporal dependency modeling [33]. Meanwhile, the hidden state dimension determines the temporal representation capability of the BiLSTM network [34]. Considering that the input skeleton feature dimension in this study is 34, the hidden state dimension was selected according to commonly adopted empirical principles in sequence modeling and further optimized through parameter evaluation experiments [35,36]. Experimental results demonstrate that when the time step is set to 40 and the hidden state dimension is set to 64, the model achieves the best balance among recognition accuracy, model stability, generalization ability, and computational efficiency. Therefore, in this study, the hidden state dimension was uniformly set to 64, and the time step length was set to 40. All model experiments were conducted based on these parameters.

4.4. Ablation Experiment

This experiment treats traffic police gesture recognition as the task and compares the performance of a single model with a multi-model fusion approach. The reported inference time reflects only the inference latency of the proposed BiLSTM-Transformer recognition model and excludes the runtime consumed by YOLOv11m-Pose for skeleton keypoint extraction. The experimental results are shown in Table 3.

As shown in Table 3, among the single-model architectures, BiLSTM achieves the best baseline performance, with an Accuracy of 96.48% and an F1-Score of 96.49%, significantly outperforming the Transformer model, which attains 91.24% Accuracy and 91.25% F1-Score. From a computational efficiency perspective, the Transformer requires more parameters (86.09K versus 11.34K) and a longer inference time (0.8428 s versus 0.7120 s) than the BiLSTM, yet it exhibits noticeably lower recognition performance. The BiLSTM module mainly captures local temporal continuity and short-term motion-transition characteristics through recurrent dependency learning, making it more suitable for modeling adjacent-frame temporal dynamics. In contrast, the Transformer encoder captures global temporal dependencies via self-attention and can directly establish contextual associations between non-adjacent temporal frames.

From a theoretical perspective, the computational complexity of the BiLSTM module is mainly determined by the temporal sequence length and hidden-state dimension, which can be approximated as O (T \cdot H^{2}), where T denotes the sequence length, and H represents the hidden-state dimension. In contrast, the Transformer encoder introduces a self-attention mechanism whose computational complexity is approximately O (T^{2} \cdot D), where D denotes the feature dimension. Therefore, compared with single temporal models, the proposed hybrid architecture inevitably increases the overall computational complexity due to the integration of recurrent temporal modeling and global self-attention mechanisms.
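To give a feel for the two terms, the sketch below evaluates them for the settings used in this study (T = 40, H = 64). The value D = 64 is an assumption chosen only for illustration, and the functions return order-of-magnitude proxies: constant factors, the two LSTM directions, and the number of encoder layers are all ignored.

```python
def bilstm_cost(seq_len, hidden):
    """Proxy for the O(T * H^2) recurrent term (constants ignored)."""
    return seq_len * hidden ** 2


def self_attention_cost(seq_len, feature_dim):
    """Proxy for the O(T^2 * D) self-attention term (constants ignored)."""
    return seq_len ** 2 * feature_dim


T, H = 40, 64  # time step and hidden-state dimension used in this study
D = 64         # feature dimension: assumed value, not taken from the paper

print(bilstm_cost(T, H))          # 163840
print(self_attention_cost(T, D))  # 102400

# Doubling the sequence length quadruples the attention term but only
# doubles the recurrent term.
print(self_attention_cost(2 * T, D) / self_attention_cost(T, D))  # 4.0
print(bilstm_cost(2 * T, H) / bilstm_cost(T, H))                  # 2.0
```

At a fixed length of 40 frames the quadratic term stays in the same range as the recurrent term, so the attention cost only becomes dominant for much longer sequences.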

However, this additional complexity enables collaborative modeling of local temporal continuity and long-range temporal dependencies, thereby significantly enhancing the representation of temporal features. As shown in Table 3, although the proposed fusion model increases the parameter count to 86.41K, the recognition accuracy improves to 98.91%, which is 2.43 percentage points higher than that of the standalone BiLSTM model. Moreover, the temporal sequence length in this study is fixed at 40 frames, effectively constraining the quadratic complexity introduced by the Transformer self-attention mechanism. Therefore, the proposed model achieves a favorable trade-off between computational complexity, recognition accuracy, and inference efficiency.

The confusion matrices corresponding to the three ablation configurations are presented in Figures 10–12.

4.5. Comparative Experiments

To validate the effectiveness and practicality of the proposed method for traffic police gesture recognition, comparative experiments were conducted using multiple representative gesture recognition models, including graph convolutional networks, recurrent neural networks, and multi-feature fusion. The comparison results are shown in Table 4.

As shown in Table 4, the proposed BiLSTM + Transformer hybrid spatiotemporal model achieves a recognition accuracy of 98.91%, ranking first among all compared methods. It outperforms mainstream state-of-the-art graph convolutional network (GCN)-based approaches, including STIE-GCN and MD-GCN, thereby demonstrating the effectiveness of the proposed method. In contrast, the accuracies of conventional recurrent temporal models, such as LSTM, GRU, and RCNN, remain around 95%, indicating a clear performance bottleneck. This suggests that temporal modeling alone is insufficient to fully capture the spatial joint characteristics inherent in traffic police gestures. Furthermore, the pure object detection method YOLOv8-nano achieves an accuracy of only 78.70%, which is more than 16 percentage points lower than that of the weakest of these conventional recurrent models (GRU, 95.10%). This result indicates that spatiotemporal features derived from human skeletal joints are substantially more effective for traffic police gesture recognition than direct RGB image-based object detection. Skeletal representations can effectively suppress interference caused by background clutter and illumination variations while focusing on the essential motion patterns of gesture execution.

For multi-feature fusion methods combining visual feature extraction with temporal modeling, such as VGGNet-SSD + KEN + LSTM and DenseNet Part Localizer + LSTM, accuracy remains within 96.3–96.9%, with inference times of 0.05–0.10 s, achieving a balance between precision and efficiency. However, such methods typically rely on relatively complex feature-extraction networks with redundant model structures, which demand higher computational resources and more complex deployment environments. Furthermore, PKEN + LSTM and PKEN + Bidirectional LSTM exhibit relatively low recognition performance on this dataset, achieving accuracies of 91.18% and 86.84%, respectively. This indicates that relying solely on local keypoint augmentation without effective global temporal modeling struggles to fully capture the discriminative action patterns inherent in traffic police hand signals.

Compared to the above models, the proposed BiLSTM + Transformer model achieves a more reasonable balance between recognition accuracy and inference efficiency. This approach achieves an accuracy of 98.91% with an average inference time of 1.3299 s per gesture sequence (Table 3), outperforming most methods based on recurrent neural networks and multi-feature fusion. It also demonstrates significantly superior real-time performance compared to graph convolutional network models.

5. Discussion

Experimental results show that the proposed method achieves both high recognition accuracy and low inference latency in traffic police hand gesture recognition, resulting in excellent overall performance. This advantage primarily stems from the optimal selection of skeleton representation and the effective design of the hybrid temporal modeling architecture. First, the keypoint extraction based on YOLOv11m-Pose provides stable, discriminative input representations for subsequent temporal modeling. Compared to end-to-end modeling directly on RGB video, skeleton keypoints effectively mitigate the impact of complex backgrounds, lighting variations, and differences in pedestrian appearance. This allows the model to focus more on human body structure and motion patterns, delivering more stable and consistent input features for subsequent temporal modeling.

Secondly, the fusion architecture of BiLSTM and Transformer is a key factor in performance enhancement. BiLSTM excels at capturing local temporal continuity and action-transition features, while the Transformer enhances modeling of global temporal dependencies through self-attention mechanisms. The combination enables the model to simultaneously represent local details and overall action structures, thereby forming more discriminative temporal feature representations across complex gesture categories. Although the hybrid architecture increases theoretical computational complexity compared with standalone temporal models, the additional complexity remains controllable under the fixed sequence length setting and provides substantial improvements in temporal representation capability and recognition accuracy.

The proposed method primarily relies on skeleton-based temporal motion representations, which improve robustness to background interference and environmental variations. However, under severe occlusion or multi-person overlap, inaccurate keypoint extraction or incorrect skeleton association may propagate errors into subsequent temporal modeling stages and affect recognition performance. Future work will focus on incorporating multi-object modeling mechanisms and more efficient temporal feature modeling strategies, while also exploring multimodal information fusion to further enhance the model’s adaptability and generalization performance in real-world traffic scenarios.

6. Conclusions

(1) This paper proposes a traffic police hand gesture recognition method based on skeletal keypoints and hybrid temporal modeling. It employs YOLOv11m-Pose for human keypoint extraction and constructs a local–global collaborative temporal modeling framework by integrating BiLSTM and Transformer. The BiLSTM module captures short-term motion transitions, while the Transformer encoder models long-range temporal dependencies via self-attention. This complementary temporal modeling strategy improves the model’s adaptability to different gesture evolution patterns. Experimental results demonstrate that this method achieves 98.91% for both accuracy and F1-score on the traffic police command gesture dataset, with an average inference time of 1.3299 s per gesture sequence.

(2) The ablation study results indicate that, among the single temporal models, BiLSTM achieves 5.24 percentage points higher accuracy than the Transformer on the traffic police hand gesture recognition task. After integrating BiLSTM with the Transformer, the model improves Accuracy by 2.43 percentage points and F1-Score by 2.42 percentage points compared with the single BiLSTM baseline.

(3) The comparison results indicate that the BiLSTM + Transformer hybrid spatiotemporal model outperforms GCN-based approaches, traditional recurrent networks, and image-based object detection methods. The use of skeleton representations effectively mitigates the effects of lighting variations and complex backgrounds, while the integration of spatial and temporal feature modeling substantially improves the recognition accuracy of fine-grained gestures.

Author Contributions

Conceptualization, X.Z. and B.G.; methodology, X.Z.; software, X.Z.; validation, X.Z., B.G. and S.W.; formal analysis, A.S.; investigation, D.B.; resources, X.Z.; data curation, X.Z.; writing—review and editing, X.Z.; visualization, B.G.; supervision, D.B.; project administration, B.G.; funding acquisition, B.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Science and Technology Plan Project of Henan Provincial Department of Transportation (Grant No. 2023-2-1), the Henan Provincial Science and Technology Decision-Making Consultation Project (Grant No. SKXJCZX-2026-29B) supported by Henan Association for Science and Technology, and the Henan Provincial Science and Technology Research Project supported by Henan Provincial Department of Science and Technology (Grant No. 262102521014).

Data Availability Statement

The data supporting the findings of this study are publicly available in the Chinese Traffic Police Gesture Dataset at https://www.heywhale.com/mw/dataset/5de75df5ca27f8002c4cf1bb (accessed on 17 December 2025), reference number dataset_207358. These data were derived from the following resource available in the public domain: https://github.com/zc402/ChineseTrafficPolicePose (accessed on 17 December 2025).

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of this study; in the collection, analyses, or interpretation of data; in the writing of this manuscript; or in the decision to publish the results.

Abbreviations

The following abbreviations are used in this manuscript:

LSTM	Long Short-Term Memory
BiLSTM	Bidirectional Long Short-Term Memory

References

1. Dang, X.; Ke, W.; Hao, Z.; Jin, P.; Deng, H.; Sheng, Y. mm-TPG: Traffic policemen gesture recognition based on millimeter wave radar point cloud. Sensors 2023, 23, 6816.
2. Xiao, J.; Li, H.; Zhao, J. A lightweight and efficient gesture recognizer for traffic police commands using spatiotemporal feature fusion. Sci. Rep. 2025, 15, 18256.
3. Baek, T.; Lee, Y.G. Traffic control hand signal recognition using convolution and recurrent neural networks. J. Comput. Des. Eng. 2022, 9, 296–309.
4. Bhushan, S.; Alshehri, M.; Keshta, I.; Chakraverti, A.K.; Rajpurohit, J.; Abugabah, A. An experimental analysis of various machine learning algorithms for hand gesture recognition. Electronics 2022, 11, 968.
5. Ma, C.; Zhang, Y.; Wang, A.; Wang, Y.; Chen, G. Traffic command gesture recognition for virtual urban scenes based on a spatiotemporal convolution neural network. ISPRS Int. J. Geo-Inf. 2018, 7, 37.
6. He, J.; Zhang, C.; He, X.; Dong, R. Visual recognition of traffic police gestures with convolutional pose machine and handcrafted features. Neurocomputing 2020, 390, 248–259.
7. Li, M.; Chen, S.; Chen, X.; Zhang, Y.; Wang, Y.; Tian, Q. Actional-structural graph convolutional networks for skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3595–3603.
8. Guo, Z.; Ying, S. Whole-Body Keypoint and Skeleton Augmented RGB Networks for Video Action Recognition. Appl. Sci. 2022, 12, 6215.
9. Lu, M.; Lu, X.; Liu, J. Skeleton-prompt: A cross-dataset transfer learning approach for skeleton action recognition. Pattern Recognit. 2026, 169, 111885.
10. Wang, H.; Yu, B.; Xia, K.; Li, J.; Zuo, X. Skeleton edge motion networks for human action recognition. Neurocomputing 2021, 423, 1–12.
11. Nie, Q.; Wang, J.; Wang, X.; Liu, Y. View-invariant human action recognition based on a 3D bio-constrained skeleton model. IEEE Trans. Image Process. 2019, 28, 3959–3972.
12. Sun, K.; Xiao, B.; Liu, D.; Wang, J. Deep high-resolution representation learning for human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 5693–5703.
13. Ding, J.; Niu, S.; Nie, Z.; Zhu, W. Research on human posture estimation algorithm based on YOLO-Pose. Sensors 2024, 24, 3036.
14. Zhang, Y.; Wang, Z.; Li, M.; Gao, P. SP-YOLO: An end-to-end lightweight network for real-time human pose estimation. Signal Image Video Process. 2024, 18, 863–876.
15. Fang, H.S.; Xie, S.; Tai, Y.W.; Lu, C. RMPE: Regional multi-person pose estimation. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2334–2343.
16. Habeeba, M.U.; Jayashree, P.; Poornima, M.L.; Saraswathi, D.; Muralimohan, G.; Jero, S.E. FusionPose: A MediaPipe and keypoint R-CNN fused model for enhanced cyclist pose estimation and injury risk prediction. Adv. Eng. Inform. 2026, 69, 104014.
17. Zhu, G.; Zhang, L.; Yang, L.; Mei, L.; Shah, S.A.A.; Bennamoun, M.; Shen, P. Redundancy and Attention in Convolutional LSTM for Gesture Recognition. IEEE Trans. Neural Netw. Learn. Syst. 2020, 31, 1323–1335.
18. Verma, B. A two stream convolutional neural network with bi-directional GRU model to classify dynamic hand gesture. J. Vis. Commun. Image Represent. 2022, 87, 103554.
19. Xu, P.; Li, F.; Wang, H. A novel concatenate feature fusion RCNN architecture for sEMG-based hand gesture recognition. PLoS ONE 2022, 17, e0262810.
20. Hakim, N.L.; Shih, T.K.; Kasthuri Arachchi, S.P.; Aditya, W.; Chen, Y.C.; Lin, C.Y. Dynamic hand gesture recognition using 3DCNN and LSTM with FSM context-aware model. Sensors 2019, 19, 5429.
21. Singh, R.P.; Singh, L.D. Dyhand: Dynamic hand gesture recognition using BiLSTM and soft attention methods. Vis. Comput. 2025, 41, 41–51.
22. Zhao, D.; Li, H.; Yan, S. Spatial–Temporal Synchronous Transformer for Skeleton-Based Hand Gesture Recognition. IEEE Trans. Circuits Syst. Video Technol. 2023, 34, 1403–1412.
23. Montazerin, M.; Rahimian, E.; Naderkhani, F.; Atashzar, S.F.; Yanushkevich, S.; Mohammadi, A. Transformer-based hand gesture recognition from instantaneous to fused neural decomposition of high-density EMG signals. Sci. Rep. 2023, 13, 11000.
24. Tang, Y.; Pan, M.; Li, H.; Cao, X. A convolutional-transformer-based approach for dynamic gesture recognition of data gloves. IEEE Trans. Instrum. Meas. 2024, 73, 1–13.
25. Gazis, A.; Karaiskos, P.; Loukas, C. Surgical gesture recognition in laparoscopic tasks based on the transformer network and self-supervised learning. Bioengineering 2022, 9, 737.
26. Liu, Y.; Li, X.; Yang, L.; Bian, G.; Yu, H. A CNN-transformer hybrid recognition approach for sEMG-based dynamic gesture prediction. IEEE Trans. Instrum. Meas. 2023, 72, 1–16.
27. Wang, C.; Zhao, X.; Li, Z. Dcs-ctn: Subtle gesture recognition based on td-cnn-transformer via millimeter-wave radar. IEEE Internet Things J. 2023, 10, 17680–17693.
28. Guo, X.; Zhu, Q.; Wang, Y.; Mo, Y. MG-GCT: A Motion-Guided Graph Convolutional Transformer for Traffic Gesture Recognition. IEEE Trans. Intell. Transp. Syst. 2024, 25, 14031–14039.
29. Wang, Z.; Ma, Y.; Liu, Z.; Tang, J. R-transformer: Recurrent neural network enhanced transformer. arXiv 2019, arXiv:1907.05572.
30. Alomar, K.; Aysel, H.I.; Cai, X. RNNs, CNNs and Transformers in Human Action Recognition: A Survey and a Hybrid Model. arXiv 2024, arXiv:2407.06162.
31. He, J.; Liao, J.; Zhang, C.; Wei, X.; Bai, J.; Wang, W. Visual gesture recognition technology based on long short term memory and deep neural network. J. Graph. 2020, 41, 372–381.
32. Zhang, Y.; Tian, Y.; Wu, P.; Chen, D. Application of skeleton data and long short-term memory in action recognition of children with autism spectrum disorder. Sensors 2021, 21, 411.
33. Aceituno, P.V.; Miller, J.W.; Marti, N.; Farag, Y.; Boussange, V. Temporal Horizons in Forecasting: A Performance-Learnability Trade-Off. Trans. Mach. Learn. Res. 2025.
34. Graves, A.; Schmidhuber, J. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Netw. 2005, 18, 602–610.
35. Murad, A.; Pyun, J.Y. Deep Recurrent Neural Networks for Human Activity Recognition. Sensors 2017, 17, 2556.
36. Liu, J.; Shahroudy, A.; Xu, D.; Kot, A.C.; Wang, G. Skeleton-Based Action Recognition Using Spatio-Temporal LSTM Network with Trust Gates. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 3007–3021.
37. Shi, P.; Zhang, Q.; Yang, A. Dual-module spatial temporal information enhancement graph convolutional network for recognizing traffic police command gestures. Signal Image Video Process. 2025, 19, 92.
38. Xiong, X.; Wu, H.; Min, W.; Xu, J.; Fu, Q.; Peng, C. Traffic police gesture recognition based on gesture skeleton extractor and multichannel dilated graph convolution network. Electronics 2021, 10, 551.
39. Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780.
40. Cho, K.; van Merrienboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. arXiv 2014, arXiv:1406.1078.
41. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587.
42. Gan, L.; Liu, Y.; Li, Y.; Zhang, R.; Huang, L.; Shi, C. Gesture Recognition System Using 24 GHz FMCW Radar Sensor Realized on Real-Time Edge Computing Platform. IEEE Sens. J. 2022, 22, 8904–8914.
43. Tang, J.; Zhao, L.; Wu, M.; Jiang, Z.; Cao, J.; Bao, X. A SE-DenseNet-LSTM model for locomotion mode recognition in lower limb exoskeleton. PeerJ Comput. Sci. 2024, 10, e1881.
44. Chang, M.; Xu, H.; Zhang, Y. Low light recognition of traffic police gestures based on lightweight extraction of skeleton features. Neurocomputing 2025, 617, 129042.
45. Ma, W.; Song, G.; Zeng, Q.; Zhang, H.; Zou, M.; Zhao, Z. FFCSLT: A Deep Learning Model for Traffic Police Hand Gesture Recognition Using Surface Electromyographic Signals. IEEE Sens. J. 2024, 24, 13640–13655.
46. M, S.R.; Mohamed Mansoor Roomi, S.; Sathyabama, B.; Senthilarasi, M. Hand Gesture Recognition System Using Transfer Learning. In Proceedings of the 2023 International Conference on Energy, Materials and Communication Engineering (ICEMCE), Madurai, India, 14–15 December 2023; pp. 1–5.

Figure 1. The network structure of YOLOv11m-Pose.

Figure 2. LSTM cell unit architecture.

Figure 3. The operation process of the BiLSTM model.

Figure 4. The structure of the Transformer model.

Figure 5. BiLSTM-Transformer model structural framework.

Figure 6. Decomposition actions of eight gestures of Chinese traffic police.

Figure 7. Stop gesture skeleton sequence.

Figure 8. Visual examples of skeletal keypoint detection results of different models in Table 2.

Figure 9. Recognition accuracy of the LSTM model at different time steps and hidden states.

Figure 10. Confusion matrix of the BiLSTM model.

Figure 11. Confusion matrix of the Transformer model.

Figure 12. Confusion matrix of the BiLSTM-Transformer model.

Table 1. Dataset partition.

Pair No.	Train	Test
1	001.mp4	002.mp4
2	003.mp4	004.mp4
3	005.mp4	008.mp4
4	007.mp4	010.mp4
5	009.mp4	012.mp4
6	011.mp4	014.mp4
7	013.mp4	016.mp4
8	015.mp4	018.mp4
9	017.mp4	102.mp4
10	103.mp4	104.mp4

Table 2. Performance comparison of different skeletal keypoint detection models.

Model	Accuracy (%)	F1-Score (%)
YOLOv11m-Pose	98.91	98.91
YOLOv11n-Pose	98.49	98.48
YOLOv11l-Pose	98.19	98.20
YOLOv11x-Pose	85.24	84.46
YOLOv8n-Pose	90.78	90.70
OpenPose	97.38	97.35
MediaPipe	93.99	93.86
HRNet	92.47	92.40
AlphaPose	95.19	95.21

Table 3. Ablation experimental results.

BiLSTM	Transformer	Accuracy (%)	F1-Score (%)	Parameters (K)	Inference Time (s)
√	–	96.48	96.49	11.34	0.7120
–	√	91.24	91.25	86.09	0.8428
√	√	98.91	98.91	86.41	1.3299

Table 4. Comparative test results.

Model	Accuracy (%)
STIE-GCN [37]	98.63
MD-GCN [38]	98.01
LSTM [39]	95.62
GRU [40]	95.10
RCNN [41]	95.58
PKEN + LSTM [6]	91.18
PKEN + Bidirectional LSTM [42]	86.84
NTPGR [2]	97.56
DenseNet + LSTM [43]	96.62
SA-MobileNet-IGRU [44]	94.75
FFCSLT [45]	98.89
YOLOv8-nano [46]	78.70
(ours) BiLSTM + Transformer	98.91

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

