Article

A Lightweight Spatiotemporal Skeleton Network for Abnormal Train Driver Action Detection

School of Urban Railway Transportation, Shanghai University of Engineering Science, Shanghai 201620, China
*
Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(24), 13152; https://doi.org/10.3390/app152413152
Submission received: 10 November 2025 / Revised: 9 December 2025 / Accepted: 12 December 2025 / Published: 15 December 2025
(This article belongs to the Section Transportation and Future Mobility)

Abstract

Abnormal behaviors of train drivers are a critical factor affecting the operational safety of urban rail transit. To achieve automated and efficient detection while meeting practical deployment requirements, this study proposes an end-to-end Temporal Action Detection network based on skeleton data. The network directly uses skeleton sequences as input, integrates a skeleton topology graph tailored to train driver actions for spatiotemporal feature extraction, and employs a non-shared feature propagation design to enhance classification and regression performance. Evaluated on a custom dataset of driver operations (including both standard and abnormal behaviors), the experimental results demonstrate favorable performance with high mean Average Precision (mAP) and strong accuracy. The findings show that the proposed network can accurately localize and classify driver operational behaviors, enabling precise detection of abnormal actions. Furthermore, its low parameter count and minimal storage requirements highlight strong potential for practical deployment in urban rail transit systems.

1. Introduction

As a backbone of public transportation, urban rail transit systems accommodate most of the daily travel demand. Therefore, the operational safety of urban rail transit is of critical significance. The level of train operation safety represents a fundamental measure of the overall reliability of urban rail transit systems. Given that train drivers play a central role in the operation process, their abnormal behavior may directly undermine the safety of train operations. Besides train driving, drivers are also responsible for managing boarding and alighting operations, including door operation, monitoring passenger movements via direct observation and closed circuit television (CCTV), and verifying the proper functioning of train and platform screen doors. They must further ensure that no passengers or objects are trapped, and confirm platform conditions and a clear departure signal before leaving the station. Because these tasks involve acquiring and verifying multiple critical items of information, any omission in signal confirmation can jeopardize operational safety. Accordingly, to reinforce drivers’ attention to information verification and to support supervisory monitoring, metro operators require drivers to conduct synchronized hand-pointing and verbal-calling actions, known as pointing and calling, during the process of signal acquisition and confirmation. Given individual differences among drivers, explicit requirements have been established regarding the prescribed sequence and duration of pointing and calling actions. Deviations from these requirements—such as incorrect sequence of task-related actions or action omissions—are classified as abnormal behaviors. Accordingly, rapid and accurate detection of such behaviors is critical, as it facilitates timely intervention and the prevention of potential risks [1], ultimately ensuring the safety of train operations and urban rail transit systems as a whole.
Current research on abnormal train driver behavior identification mainly targets train operation scenarios, typically applying online action detection methods [2,3,4,5,6,7] to provide real-time alerts for behaviors such as mobile phone use or dozing off [8,9,10,11,12]. However, limited studies address critical abnormal behaviors in specific tasks, such as incorrect sequencing or omissions during pointing and calling. Existing approaches depend on manual video sampling, suffering from low efficiency, oversights, and high labor costs, and are thus inadequate for practical needs [13]. Hence, efficient and accurate automated methods are urgently required for large-scale continuous video processing under real equipment constraints. Unlike detecting distracted or fatigued driving, identifying non-standard train driver actions requires evaluating the sequence, duration, and omissions of actions, necessitating temporal action detection (TAD) [14]. Current methods mainly use RGB video data [15,16,17,18] with behavior recognition algorithms [19,20,21,22]. However, RGB data exhibit limited robustness to occlusions, lighting, and similar actions, while convolutional neural network (CNN)-based models are parameter-heavy and computationally costly [23]. Moreover, storing large volumes of real-time RGB video imposes high demands on bandwidth and memory. Thus, practical equipment constraints, combined with video variations and action similarity, hinder the applicability of RGB-based temporal detection methods.
Recent advancements in human pose estimation algorithms [24,25] enable the extraction of skeleton data from videos, offering significant advantages over RGB data. Skeleton-based methods employ lightweight models with fewer parameters and smaller feature maps, reducing hardware requirements. Moreover, skeleton data are robust to background, lighting, and view-angle variations, while spatiotemporal joint dynamics effectively capture the essence of actions [23,26]. Accordingly, this study leverages skeleton data to optimize the speed, robustness, and practicality of the driver behavior detection algorithm under real equipment constraints.
Skeleton-based detection often requires depth cameras or high-quality videos to obtain accurate three-dimensional (3D) key-points [27,28,29,30]. In real-world scenarios such as train cabs, only two-dimensional (2D) skeletons can usually be acquired due to conventional cameras and occlusion, leading to challenges of action similarity and missing key-points. To address these issues, this study employs a graph convolutional network (GCN) to extract features from non-Euclidean skeleton graphs. Moreover, a tailored skeleton topology is designed to emphasize critical key-point variations, thereby enhancing the model’s ability to distinguish between similar driver actions under practical conditions.
Deep learning-based temporal action detection methods are generally divided into multi-stage and end-to-end approaches. End-to-end models take video sequences as input and directly output action detection results, enabling joint optimization of discriminative features while improving efficiency [31,32]. In view of this, the model designed in this study is based on an end-to-end region convolutional 3D network (R-C3D) framework, tailored to skeleton sequences extracted from train driver operation videos, with distinct designs for the feature forward propagation layers of the classification and regression branches. Performance is evaluated using mean average precision (mAP) at different time intersection over union (tIoU) thresholds, alongside precision, recall, and ablation studies.
In summary, the main contributions of this study can be summarized as follows:
  • A skeleton-based temporal action detection method is proposed for identifying abnormal behaviors of urban rail train drivers. The method utilizes 2D skeleton sequences and applies spatiotemporal graph convolution for feature extraction, thereby reducing model parameters and computational overhead while improving speed and equipment adaptability. Moreover, a tailored skeleton topology, designed according to the characteristics of driver actions, enhances the model’s ability to discriminate similar actions in 2D skeleton data.
  • An end-to-end detection framework adapted from R-C3D network is developed for skeleton data. The feature propagation layers are distinctly redesigned for classification and boundary regression, enabling the models to focus on task-relevant information and achieve efficient end-to-end detection with improved deployment feasibility.
  • An effective training and validation strategy is introduced. To overcome the absence of pretrained models and the limitations of small-scale datasets, a partial pre-training followed by joint optimization scheme is adopted. The proposed approach demonstrates strong performance on a custom dataset, and ablation experiments further validate the effectiveness of the task-specific feature layer design.

2. Related Work

2.1. Skeleton-Based Action Recognition

In recent years, skeleton-based action recognition has garnered significant interest due to its computational efficiency and lightweight parameterization. GCNs [33], which extend convolutional neural networks to non-Euclidean spaces, adeptly process skeleton data by leveraging generalized topological graph structures. GCNs have emerged as a preferred approach for modeling graph-structured data, such as skeleton graphs [34,35,36], demonstrating superior performance over CNN-based [37,38,39] and recurrent neural network (RNN)-based [40,41] methods in action recognition tasks [42,43].
Yan et al. [44] pioneered the application of GCNs to skeleton-based action recognition, constructing a topology graph with skeletal joints as vertices and bones as edges to capture spatial features, integrated with temporal convolution for action classification. This innovation spurred extensive research, with subsequent studies optimizing GCN performance through multi-stream data fusion [34,45,46,47], refined skeleton topology designs [35,48,49,50], and attention mechanisms [51,52,53,54], yielding robust methodological frameworks for practical applications.
In applied contexts, GCN-based algorithms have been effectively deployed in real-world scenarios [55,56]. In the transportation field, Li et al. [57] developed a multi-layer spatiotemporal graph, utilizing natural skeletal joint connections within and across frames as input for GCN-based driver action recognition. Lin et al. [58] enhanced the two-stream adaptive graph convolutional networks (2s-AGCN) model with an attention mechanism to improve driver behavior recognition. To address missing skeleton data in driver monitoring, Li et al. [59] introduced a Smooth Node strategy for data preprocessing, coupled with the multi-scale excitation graph convolution network (MSE-GCN) model to extract multi-scale spatiotemporal features, thereby enhancing recognition accuracy. Additionally, Wei et al. [60] proposed a lightweight multimodal feature GCN framework, employing singular value decomposition for model compression while integrating critical spatial, temporal, and motion features to improve driver action recognition accuracy and computational efficiency.
Following the introduction of the spatiotemporal GCN (ST-GCN) model [44], various improved models have been proposed. While GCNs with multi-stream data or attention mechanisms can enhance detection performance, they often introduce increased complexity in deployment. In contrast, the ST-GCN model, with its relatively simple structure, allows for efficient skeleton topology design, effectively addressing performance limitations [49,61]. Therefore, the ST-GCN model is selected as the feature extraction network in this study due to its balance of simplicity and effectiveness.

2.2. Temporal Action Detection

The main objective of the TAD process is to identify and classify action-containing segments within unedited videos by localizing their temporal boundaries. Regarding deep learning-based approaches, TAD methods can be broadly divided into multi-stage detection methods and end-to-end detection methods, depending on whether the method performs the mapping from input to output in a single pass or multiple passes.
Multi-stage methods [15,16,62,63] decompose the detection process into distinct tasks (proposal generation, classification, and boundary refinement), enhancing precision in complex scenarios. However, these methods require separate optimization for each task, leading to increased training and inference complexity, error accumulation from intermediate results, high computational memory demands, and challenges in engineering deployment. In contrast, end-to-end models [17,18,32,64,65] integrate feature extraction, action localization, and classification into a unified process, enabling efficient joint optimization. These models minimize redundant computations, offering simplicity and efficiency that make them well-suited for engineering applications and increasingly prevalent in recent research [65,66,67].
Lin et al. [68] developed an enhanced TAD algorithm for detecting abnormal behaviors in large-scale surveillance videos within the power industry, leveraging frame interpolation to improve detection accuracy. In addition, Lu et al. [69] proposed a lightweight end-to-end TAD model that employs frame-level temporal encoding to predict action progress, facilitating direct learning of action features from raw video data and enhancing detection accuracy through streamlined end-to-end training and inference.
Although existing studies have applied end-to-end temporal action detection algorithms to practical scenarios, methods utilizing RGB video frames and optical flow information as data sources still entail substantial network parameters and high GPU performance requirements. Additionally, performing temporal action detection for train drivers’ pointing and calling actions requires the storage of operational videos. However, the daily video streams from multiple bidirectional platforms across various stations generate massive amounts of data, and storing RGB videos demands significant memory capacity. In contrast, real-time detection and storage of human skeleton joint data, enabled by efficient human pose estimation algorithms, can substantially reduce memory requirements. To address these challenges and enable standardized assessment of continuous actions in train driver operations while improving computational efficiency, this study proposes a temporal action detection model based only on skeleton data sequences.

3. Proposed Method

Inspired by the R-C3D network architecture [17], which consists of three subnetworks (feature extraction, temporal proposal, and action classification), this section introduces the corresponding network structure. The preprocessed skeleton motion sequence data are fed into the network, which outputs the detected actions together with their temporal boundaries and activity scores. The processing flowchart is shown in Figure 1.

3.1. Feature Extraction Subnetwork

Regarding the design of the feature extraction subnetwork, taking into account the classic ST-GCN architecture and drawing on designs from prior pioneering works, this study sets the number of output channels of the first spatiotemporal convolution block to 64. This achieves a smooth transition from the low-dimensional raw coordinates to a medium-dimensional feature space while balancing model capacity and computational efficiency. In the classic ST-GCN architecture, the number of channels is doubled whenever the temporal dimension is downsampled by a factor of two. Since this study adopts an anchor segment generation strategy that treats 8 video frames as one temporal unit, the output channels are progressively increased from 64 to 512 dimensions. Thus, the 2D skeleton data are initially mapped to a 64-dimensional space and then processed through layers with output channel sizes of 64, 128, 256, and 512. Each set of layers with a given channel size consists of three layers, ultimately yielding a 512-dimensional spatial feature representation. Meanwhile, the temporal dimension undergoes three downsampling operations with a scale of two, resulting in a feature map $X \in \mathbb{R}^{BM \times 512 \times L/8 \times N}$, where $B$ is the batch size; $M$ is the number of individuals in a single frame, which always equals one; $L$ is the actual number of video frames; $N$ is the number of joints in the skeleton graph; and $L/8$ is the number of temporal nodes after downsampling the actual frame count by a factor of eight. Since the maximum number of frames varies across videos, $B = 1$ in this study.
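For concreteness, the following sketch traces this channel and temporal-stride plan (a hypothetical configuration consistent with the description above, not the authors' exact implementation) and verifies that an input of length $L$ yields a feature map of temporal length $L/8$.

```python
# Hypothetical layer plan for the feature extraction backbone (a sketch only):
# each tuple is (in_channels, out_channels, temporal_stride).
# Three stride-2 stages reduce the temporal length L to L/8.
STGCN_PLAN = [
    (2,   64, 1),                                   # raw 2D coordinates -> 64-dim features
    (64,  64, 1), (64,  64, 1), (64,  64, 1),
    (64, 128, 2), (128, 128, 1), (128, 128, 1),
    (128, 256, 2), (256, 256, 1), (256, 256, 1),
    (256, 512, 2), (512, 512, 1), (512, 512, 1),
]

def output_shape(L: int, N: int = 17, batch: int = 1, persons: int = 1):
    """Trace the temporal downsampling through the plan to get the feature-map shape."""
    t = L
    for _, _, stride in STGCN_PLAN:
        t = (t + stride - 1) // stride              # temporal length after each block
    c = STGCN_PLAN[-1][1]
    return (batch * persons, c, t, N)               # (B*M, 512, L/8, N)

print(output_shape(L=800))                          # -> (1, 512, 100, 17)
```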
To improve the GCN’s capability of extracting skeleton-specific features relevant to train driver actions, a skeleton topology graph tailored to these actions is constructed. The skeleton nodes and their connections are illustrated in Figure 2a.
Considering that train driver gestures mainly involve key-points related to the driver’s hands and torso, this study defines three adjacency matrices for node information updates, enabling the network to focus on aggregating information from action-relevant nodes. The first adjacency matrix $A_1$, corresponding to the topology graph in Figure 2b, represents the global connections of the human skeleton, capturing associations across the entire graph. Each key-point aggregates information from its directly adjacent nodes. The second adjacency matrix $A_2$ focuses on fine-grained action information from the torso and hand nodes, reflecting local connections of action-relevant nodes (shown in Figure 2c). In this matrix, nodes aggregate information only from their direct neighbors. The third adjacency matrix $A_3$ expands the receptive field for action-relevant nodes, allowing them to aggregate information from nodes at a distance of two, as shown in Figure 2d.
In the proposed design, self-loops $I$ are incorporated into each adjacency matrix based on the corresponding skeleton topology graph, resulting in matrices $\tilde{A}_n$ defined by Equation (1).
$\tilde{A}_n = A_n + I, \quad n = 1, 2, 3$ (1)
Each adjacency matrix is then symmetrically normalized, as defined by Equation (2):
$\hat{A} = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}, \quad \hat{A} \in \mathbb{R}^{N \times N}$ (2)
where $\tilde{D}$ is the degree matrix of $\tilde{A}$. Applying this normalization to each of the three adjacency matrices yields Equation (3):
$\hat{A}_n = \tilde{D}_n^{-1/2} \tilde{A}_n \tilde{D}_n^{-1/2}, \quad n = 1, 2, 3$ (3)
The node information aggregation formula for the $l$-th layer can be expressed by Equation (4).
$H^{(l)} = \sigma\left(\left(k_1 \hat{A}_1 + k_2 \hat{A}_2 + k_3 \hat{A}_3\right) H^{(l-1)} W^{(l)}\right), \quad H^{(l)} \in \mathbb{R}^{N \times C^{(l)}}$ (4)
where $k_n$ denotes the learnable weight coefficients used for the weighted aggregation of node information updated by each adjacency matrix, and $W^{(l)} \in \mathbb{R}^{m^{(l)} \times m^{(l+1)}}$ is the learnable parameter weight matrix.
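The adjacency construction and the aggregation of Equations (1)-(4) can be expressed in PyTorch roughly as follows; this is a minimal sketch that assumes the three binary adjacency matrices are supplied externally and uses ReLU as the activation $\sigma$.

```python
import torch
import torch.nn as nn

def normalize_adjacency(A: torch.Tensor) -> torch.Tensor:
    """Symmetric normalization with self-loops: D^{-1/2} (A + I) D^{-1/2} (Eqs. (1)-(3))."""
    A_tilde = A + torch.eye(A.size(0))
    deg = A_tilde.sum(dim=1)
    d_inv_sqrt = torch.diag(deg.pow(-0.5))
    return d_inv_sqrt @ A_tilde @ d_inv_sqrt

class MultiAdjacencyGraphConv(nn.Module):
    """Sketch of the weighted aggregation in Eq. (4); the three binary adjacency matrices
    (global, local action-relevant, two-hop action-relevant) are assumed inputs."""
    def __init__(self, in_channels: int, out_channels: int, adjacencies):
        super().__init__()
        # Pre-normalized adjacency matrices A_hat_1..3, stored as a buffer.
        self.register_buffer("A_hat", torch.stack([normalize_adjacency(A) for A in adjacencies]))
        # Learnable weights k_1..k_3 for combining the three graphs.
        self.k = nn.Parameter(torch.ones(len(adjacencies)))
        self.W = nn.Linear(in_channels, out_channels, bias=False)

    def forward(self, H: torch.Tensor) -> torch.Tensor:
        # H: (batch, N, C_in) node features for one frame.
        A_combined = torch.einsum("k,knm->nm", self.k, self.A_hat)
        return torch.relu(self.W(A_combined @ H))
```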

3.2. Temporal Proposal Subnetwork

An anchor-based method is adopted to generate anchor segments that cover actions of varying temporal durations. The feature extraction network produces a feature map whose temporal dimension is downsampled by a factor of eight. For each temporal node of the downsampled feature map, anchor segments are generated by extending symmetrically along the temporal dimension to form $K$ anchors of different lengths. In total, $(L/8) \times K$ anchor segments are generated, where $L/8$ is the number of temporal nodes after downsampling.
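A minimal sketch of this anchor generation step is given below; the anchor scales used here are hypothetical placeholders, since the exact value of $K$ and the anchor lengths are not listed in this section.

```python
import torch

def generate_anchor_segments(num_nodes: int, scales, downsample: int = 8):
    """Sketch of anchor generation: at every downsampled temporal node, K anchors of
    different lengths are centred symmetrically (scales are hypothetical and given
    in feature-map units)."""
    anchors = []
    for t in range(num_nodes):
        center = (t + 0.5) * downsample          # centre in original-frame coordinates
        for s in scales:
            length = s * downsample
            anchors.append((center, length))     # (L/8) * K anchors in total
    return torch.tensor(anchors)                 # shape: (num_nodes * K, 2) as (center, length)

anchors = generate_anchor_segments(num_nodes=100, scales=[2, 4, 6, 8, 10])
print(anchors.shape)                             # torch.Size([500, 2])
```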
The framework of the temporal proposal subnetwork is presented in Figure 3.
In the forward propagation of the classification branch, the task primarily focuses on spatial information. To extract spatial features effectively, the input feature map $X \in \mathbb{R}^{BM \times 512 \times L/8 \times N}$ is processed by two 2D convolutional layers with output channels of 512 and 512, a kernel size of 3 × 3, and a stride of 1 × 1, thereby expanding the spatiotemporal receptive field while preserving the temporal dimension. Average pooling is then applied along the spatial dimension, reducing it to one. As a result, a feature map $X_{cls} \in \mathbb{R}^{1 \times 512 \times L/8 \times 1}$ is obtained, given that $B = M = 1$, where each temporal node is represented by a 512-dimensional feature vector, which is used to predict foreground and background scores for the $K$ anchor segments.
In contrast, the regression branch focuses on temporal information during forward propagation. To reduce the computational load, the input feature map $X \in \mathbb{R}^{BM \times 512 \times L/8 \times N}$ is first processed by average pooling to reduce the spatial dimension to one. It is then passed through two 2D convolutional layers with output channels of 512 and 512, a kernel size of 5 × 1, and a stride of 1 × 1, thereby expanding the temporal receptive field while preserving the temporal dimension. This produces a feature map $X_{reg} \in \mathbb{R}^{1 \times 512 \times L/8 \times 1}$, where each temporal node is represented by a 512-dimensional feature vector, which is used to predict the offsets $\delta\hat{c}_i, \delta\hat{l}_i$, $i = 1, 2, \ldots, K$, of the center $c_i$ and length $l_i$ of each anchor segment, where $K$ is the number of anchor segments at each temporal node.
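The two non-shared forward-propagation branches can be sketched as follows. The convolution specifications follow the description above, while the final 1 × 1 prediction convolutions and the value of $K$ are assumptions for illustration.

```python
import torch
import torch.nn as nn

class ProposalHeads(nn.Module):
    """Sketch of the non-shared classification / regression branches of the temporal
    proposal subnetwork (K, the number of anchors per temporal node, is assumed)."""
    def __init__(self, num_anchors: int):
        super().__init__()
        # Classification branch: two 3x3 convs first, spatial pooling afterwards.
        self.cls_convs = nn.Sequential(
            nn.Conv2d(512, 512, kernel_size=3, stride=1, padding=1), nn.ReLU(),
            nn.Conv2d(512, 512, kernel_size=3, stride=1, padding=1), nn.ReLU(),
        )
        # Regression branch: spatial pooling first, then two 5x1 temporal convs.
        self.reg_convs = nn.Sequential(
            nn.Conv2d(512, 512, kernel_size=(5, 1), stride=1, padding=(2, 0)), nn.ReLU(),
            nn.Conv2d(512, 512, kernel_size=(5, 1), stride=1, padding=(2, 0)), nn.ReLU(),
        )
        self.cls_score = nn.Conv2d(512, 2 * num_anchors, kernel_size=1)   # fg/bg per anchor
        self.reg_offset = nn.Conv2d(512, 2 * num_anchors, kernel_size=1)  # (dc, dl) per anchor

    def forward(self, x: torch.Tensor):
        # x: (1, 512, L/8, N)
        x_cls = self.cls_convs(x).mean(dim=3, keepdim=True)   # -> (1, 512, L/8, 1)
        x_reg = self.reg_convs(x.mean(dim=3, keepdim=True))   # pool joints, then 5x1 convs
        return self.cls_score(x_cls), self.reg_offset(x_reg)
```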
The actual offsets $\delta c_i, \delta l_i$ of an anchor segment relative to its corresponding ground-truth activity segment are calculated by Equation (5).
$\delta c_i = (c_i^* - c_i) / l_i, \quad \delta l_i = \log(l_i^* / l_i)$ (5)
where $c_i^*$ and $l_i^*$ represent the center position and length of the ground-truth activity segment matched to the anchor segment, respectively.
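A small helper pair implementing Equation (5) and its inverse (used at inference to refine anchor boundaries) might look like the following sketch.

```python
import math

def encode_offsets(anchor_center, anchor_length, gt_center, gt_length):
    """Regression targets per Eq. (5)."""
    delta_c = (gt_center - anchor_center) / anchor_length
    delta_l = math.log(gt_length / anchor_length)
    return delta_c, delta_l

def decode_offsets(anchor_center, anchor_length, delta_c, delta_l):
    """Inverse transform used at inference to refine anchor boundaries."""
    center = anchor_center + delta_c * anchor_length
    length = anchor_length * math.exp(delta_l)
    return center, length
```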

3.3. Action Classification Subnetwork

The center and length offsets predicted by the temporal proposal subnetwork are applied to refine the boundaries of the predefined anchor segments, generating proposal segments. Non-maximum suppression (NMS) is then performed using the foreground scores to filter these proposal segments and obtain high-quality action proposals. These proposals, along with the feature map $X$ obtained from the feature extraction network, are subsequently fed into the action classification subnetwork.
The structure of the action classification subnetwork is presented in Figure 4.
Based on the temporal boundaries of the action proposals, feature subgraphs $X_i$, $i = 1, 2, \ldots, M$, are extracted from the feature map, where $M$ denotes the total number of proposals. The action classification subnetwork then performs both action classification prediction and boundary regression prediction on the skeleton sequence information contained in each subgraph $X_i$. Unlike conventional designs that employ fully connected layers for both tasks, the proposed framework differentiates the two prediction branches. Specifically, the classification branch follows the ST-GCN model, relying on the spatial feature modeling capability of GCNs, and outputs action category scores through a 2D convolutional layer. In contrast, the regression branch retains fully connected layers to predict boundary offsets. This task-specific design enhances network performance in both action classification and boundary regression.
Since the temporal dimensions of the action proposals vary, an adaptive average pooling layer is first applied to normalize the temporal length of each feature subgraph $X_i$ to a fixed size $l$, producing a feature map $X_i \in \mathbb{R}^{1 \times 512 \times l \times N}$, which is subsequently merged into $X_{cls} \in \mathbb{R}^{M \times 512 \times l \times N}$. The merged feature map is then processed by two simplified spatiotemporal graph convolutional layers with 512 input channels and 1024 output channels. By increasing the dimensionality of the feature data, the loss of spatiotemporal feature information caused by temporal downsampling is compensated for, while simultaneously enriching the feature representation to enhance the network’s performance in predicting actions within non-standard boundary temporal segments. Average pooling is further applied along both the spatial and temporal dimensions, generating a compact feature map $X_{cls} \in \mathbb{R}^{M \times 1024 \times 1 \times 1}$. Finally, a 2D convolutional layer is used to output the activity category scores for each action proposal.
In the boundary regression branch, each feature subgraph undergoes spatiotemporal 2D region of interest (ROI) pooling to generate $X_i \in \mathbb{R}^{1 \times 512 \times l \times 1}$, which is subsequently merged into $X_{reg} \in \mathbb{R}^{M \times 512 \times l \times 1}$. The resulting feature map is then processed by three fully connected layers to obtain $X_{reg} \in \mathbb{R}^{M \times 1024}$. The 1024-dimensional feature vector of each proposal is finally used to predict the relative offsets $\delta\hat{c}_i, \delta\hat{l}_i$, $i = 1, 2, \ldots, M$, of the proposal segments with respect to the ground-truth activity segments.
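The non-shared heads of the action classification subnetwork can be sketched as below. The classification branch here uses plain temporal convolutions as a stand-in for the simplified spatiotemporal graph convolution layers, and the fixed length $l$, the class count, and the pooling details are assumptions for illustration.

```python
import torch
import torch.nn as nn

class ActionHeads(nn.Module):
    """Sketch of the non-shared heads of the action classification subnetwork
    (fixed temporal length l, class count and graph-conv internals are assumptions)."""
    def __init__(self, num_classes: int, l: int = 16):
        super().__init__()
        self.l = l
        # Classification head: two layers raising 512 -> 1024 channels
        # (plain (3,1) convs stand in for the simplified spatiotemporal graph convs).
        self.cls_gcn = nn.Sequential(
            nn.Conv2d(512, 1024, kernel_size=(3, 1), padding=(1, 0)), nn.ReLU(),
            nn.Conv2d(1024, 1024, kernel_size=(3, 1), padding=(1, 0)), nn.ReLU(),
        )
        self.cls_out = nn.Conv2d(1024, num_classes + 1, kernel_size=1)   # +1 for background
        # Regression head: three fully connected layers on ROI-pooled features.
        self.reg_fc = nn.Sequential(
            nn.Linear(512 * l, 1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.ReLU(),
        )
        self.reg_out = nn.Linear(1024, 2)                                # (dc, dl) per proposal

    def forward(self, subgraphs):
        # subgraphs: list of M tensors of shape (1, 512, T_i, N) cropped per proposal.
        x = torch.cat([nn.functional.adaptive_avg_pool2d(g, (self.l, g.size(-1))) for g in subgraphs])
        cls = self.cls_out(self.cls_gcn(x).mean(dim=(2, 3), keepdim=True)).flatten(1)   # (M, C+1)
        roi = x.mean(dim=3)                                # pool joints -> (M, 512, l)
        reg = self.reg_out(self.reg_fc(roi.flatten(1)))    # (M, 2)
        return cls, reg
```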

3.4. Prediction Process

The temporal proposal subnetwork performs binary classification of the predefined anchor segments into foreground and background, along with boundary regression. The predicted regression offsets are applied to refine the anchor segment boundaries, and NMS is used to obtain high-quality action proposals. These proposals are then passed to the action classification subnetwork, which conducts action category prediction and boundary regression. The regression offsets further refine the proposal boundaries, followed by NMS filtering based on the category confidence scores. The final output of the network consists of the detected action segments with their temporal boundaries and predicted action categories.
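The NMS step applied after both stages is standard 1D non-maximum suppression over temporal segments; a minimal reference sketch is shown below.

```python
def temporal_nms(segments, scores, iou_threshold: float = 0.5):
    """Sketch of 1D non-maximum suppression over (start, end) segments,
    used after both the proposal and classification stages."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        i = order.pop(0)
        keep.append(i)
        remaining = []
        for j in order:
            inter = max(0.0, min(segments[i][1], segments[j][1]) - max(segments[i][0], segments[j][0]))
            union = (segments[i][1] - segments[i][0]) + (segments[j][1] - segments[j][0]) - inter
            if union <= 0 or inter / union <= iou_threshold:
                remaining.append(j)          # keep segments that overlap little with i
        order = remaining
    return keep                              # indices of retained segments
```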

4. Experiments

This section provides an overview of the experimental work, including the setup, evaluation metrics and results. The experiments are designed to validate the feasibility of using skeleton data for temporal action detection, examine the performance of the proposed network in detecting abnormal train driver behaviors and evaluate the effectiveness of employing distinct forward-propagation layers for classification and regression tasks.

4.1. Experimental Setup

The experiments were conducted on a workstation running Windows 10, using Python 3.11.8 and the PyTorch 2.2.2 framework with CUDA 12.3 and cuDNN 8.9.2. The hardware configuration included an Intel Xeon Gold 6142 CPU (Intel Corporation, Santa Clara, CA, USA) and an NVIDIA RTX 3080 GPU (ASUS, Taipei, China).

4.1.1. Data

In the laboratory, an operational environment was simulated to record train driver operations. A total of 1130 edited video clips, each containing a single action, and 485 unedited long videos with four or five actions were collected. Edited clips covered four action categories lasting 2–7 s, while unedited videos contained continuous sequences of these categories, each lasting 20–40 s. Among the unedited videos, 276 videos contained standard actions, and 209 contained non-standard actions with random repetitions, omissions, or overlaps to simulate real-world conditions.
Edited clips were used to train the feature extraction network, while unedited long videos were used to train the full network. Edited clips were split into training, validation, and testing sets at a 7:1:2 ratio. Standard action videos were divided at a 6:2:2 ratio, while non-standard videos were split 7:3 into training and testing sets. Only standard action videos were used for validation to fine-tune parameters. In total, the training set comprised 312 videos, the testing set 118 videos, and the validation set 55 videos, with no overlap between sets.
The types of train driver actions, standard behaviors, and abnormal behaviors included in the dataset are shown in Figure 5.

4.1.2. Loss Function

The temporal proposal subnetwork and the action classification subnetwork perform action classification and boundary regression on predefined anchor segments or action proposal segments. To enable joint optimization of the entire network, the classification task adopts the softmax loss function, while the regression task employs the L1 smooth loss function. The overall loss function was defined by Equation (6).
$Loss = \frac{1}{N_{cls}} \sum_i L_{cls}(a_i, a_i^*) + \lambda \frac{1}{N_{reg}} \sum_i a_i^* L_{reg}(t_i, t_i^*)$ (6)
where $N_{cls}$ denotes the total number of samples used in the computation, corresponding to the number of randomly selected anchor segments or action proposals; $N_{reg}$ is the number of positive samples; $L_{cls}$ is the cross-entropy loss function; $L_{reg}$ is the L1 smooth loss function; $a_i^*$ is the category label assigned after matching with the ground-truth segment (for the temporal proposal subnetwork, background = 0 and action = 1; for the action classification subnetwork, background = 0 and action labels > 0); $a_i$ is the predicted category; $i$ is the sample index in the batch; $t_i^*$ indicates the center and length offsets $\delta c_i, \delta l_i$ of the anchor segment or action proposal segment relative to the ground-truth activity segment; and $t_i$ is the predicted offset $\delta\hat{c}_i, \delta\hat{l}_i$. It should be noted that only positive samples contribute to the regression loss, since $a_i^* = 0$ for background segments.
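A compact sketch of Equation (6) in PyTorch, assuming the classification logits, integer labels, and regression targets for the sampled segments have already been gathered, is given below.

```python
import torch
import torch.nn.functional as F

def joint_loss(cls_logits, cls_labels, reg_pred, reg_targets, lam: float = 1.0):
    """Sketch of Eq. (6): cross-entropy over the sampled segments plus smooth-L1 regression
    restricted to positive samples (labels > 0). `lam` is the balancing weight lambda."""
    loss_cls = F.cross_entropy(cls_logits, cls_labels)                    # averaged over N_cls samples
    pos = cls_labels > 0
    if pos.any():
        loss_reg = F.smooth_l1_loss(reg_pred[pos], reg_targets[pos])      # averaged over N_reg positives
    else:
        loss_reg = torch.zeros((), device=cls_logits.device)
    return loss_cls + lam * loss_reg
```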
In the temporal proposal subnetwork, an anchor segment was assigned as a positive sample if its temporal intersection over union (tIoU) with a ground-truth activity segment was ≥0.7, or if it had the highest tIoU with a given ground-truth segment. Anchor segments with tIoU values in the range $[0, 0.3)$ were assigned as negative samples, while the rest were ignored. The sampling ratio of positive to negative samples was set to 1:1.
In the action classification subnetwork, the tIoU between each action proposal and the ground-truth activity segments was computed. Proposals with a maximum tIoU ≥ 0.5 were assigned the corresponding ground-truth label and classified as positive samples, whereas those with all tIoU values < 0.5 were considered negative. The sampling ratio of positive to negative samples was set to 3:7.
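The proposal-stage assignment rule can be sketched as follows; the exact handling of the boundary values is an assumption, since only the thresholds 0.7 and 0.3 are specified above.

```python
import torch

def assign_anchor_labels(tiou: torch.Tensor, pos_thresh: float = 0.7, neg_thresh: float = 0.3):
    """Sketch of the proposal-stage sampling rule: `tiou` is an (A, G) matrix of tIoU values
    between A anchors and G ground-truth segments. Returns 1 / 0 / -1 for positive,
    negative and ignored anchors."""
    labels = torch.full((tiou.size(0),), -1, dtype=torch.long)   # -1 = ignored
    max_tiou, _ = tiou.max(dim=1)
    labels[max_tiou < neg_thresh] = 0                            # negatives: low overlap
    labels[max_tiou >= pos_thresh] = 1                           # positives: tIoU >= 0.7
    labels[tiou.argmax(dim=0)] = 1                               # best anchor per ground truth
    return labels
```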

4.1.3. Training Setup and Procedure

The temporal proposal subnetwork and the action classification subnetwork relied on the feature extraction network to encode the spatiotemporal features of actions. Therefore, the 1130 edited video clips were used to train the feature extraction network (ST-GCN). During this stage, the initial learning rate was set to $1 \times 10^{-3}$, reduced by half every 50 epochs, and training was performed for 150 epochs using the Adam optimizer.
To initialize the temporal proposal subnetwork, the ST-GCN was first pretrained on the edited clips and then jointly trained with the temporal proposal subnetwork using the standard action videos. In this stage, the learning rate for the ST-GCN was fixed at $1 \times 10^{-4}$, while the temporal proposal subnetwork used a learning rate of $1 \times 10^{-2}$. The Adam optimizer with L2 regularization was employed, and a cosine decay learning rate schedule with a minimum value of $1 \times 10^{-5}$ was applied. This training stage was conducted for a total of 100 epochs.
After initializing the action classification subnetwork, the pretrained ST-GCN, the temporal proposal subnetwork, and the action classification subnetwork were jointly trained on the entire training dataset. The learning rates for the ST-GCN model and the temporal proposal subnetwork were both set to $1 \times 10^{-4}$, while the action classification subnetwork adopted a learning rate of $1 \times 10^{-2}$. Training used the Adam optimizer with L2 regularization and a cosine annealing learning rate schedule with a minimum value of $1 \times 10^{-5}$. This training stage was conducted for a total of 150 epochs.
Since the maximum frame number varied across videos, the batch size was set to one for training both the temporal proposal and action classification subnetworks.
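A sketch of the staged optimizer setup for this joint-training stage is shown below; the module objects are placeholders and the L2 regularization strength is an assumed value.

```python
import torch
import torch.nn as nn

# Placeholder modules standing in for the three subnetworks (a sketch of the optimizer
# setup only; learning rates follow Section 4.1.3, weight decay is assumed).
st_gcn = nn.Conv2d(2, 64, 1)
proposal_net = nn.Conv2d(512, 512, 1)
classification_net = nn.Linear(1024, 5)

optimizer = torch.optim.Adam(
    [
        {"params": st_gcn.parameters(), "lr": 1e-4},
        {"params": proposal_net.parameters(), "lr": 1e-4},
        {"params": classification_net.parameters(), "lr": 1e-2},
    ],
    weight_decay=1e-4,   # L2 regularization strength (assumed value)
)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=150, eta_min=1e-5)

for epoch in range(150):
    # ... one pass over the unedited training videos with batch size 1 ...
    scheduler.step()
```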

4.2. Evaluation Metrics

Following the standard temporal action detection protocol, the mAP metric was adopted to evaluate the proposed network in terms of temporal boundary localization and action classification. The mAP under different tIoU thresholds was used as the primary indicator of the network’s ability to localize driver actions in videos. Specifically, mAP values were computed at multiple tIoU thresholds, and their average was taken as the final evaluation metric. The tIoU was calculated by Equation (7).
$\mathrm{tIoU} = \frac{prediction \cap GT}{prediction \cup GT}$ (7)
where $prediction$ indicates the proposed model’s detected actions, and $GT$ refers to the pre-annotated ground-truth activity segments.
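For reference, Equation (7) reduces to the following small helper for two (start, end) segments.

```python
def tiou(pred, gt):
    """Temporal IoU between two (start, end) segments, per Eq. (7)."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

print(tiou((10.0, 18.0), (12.0, 20.0)))   # 0.6
```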
The mAP value was calculated by Equation (8).
$\mathrm{mAP} = \frac{1}{C} \sum_{j=1}^{C} \mathrm{AP}(j)$ (8)
where $C$ is the number of action categories, and $\mathrm{AP}(j)$ denotes the area under the precision-recall curve for category $j$.
In addition, the Precision and Recall metrics were also used for the performance evaluation, and they were calculated using Equations (9) and (10), respectively.
$\mathrm{Precision} = \frac{TP}{TP + FP}$ (9)
$\mathrm{Recall} = \frac{TP}{TP + FN}$ (10)
where TP represents true positives, FP indicates false positives, and FN represents false negatives.
The decision logic for evaluation was defined as follows: when the tIoU between a detected action and the ground-truth segment exceeded the predefined threshold and the category was correctly identified, the detection was recorded as a true positive (TP). If the tIoU was below the threshold, the same ground-truth activity segment was detected multiple times, or the category was incorrect, the detection was recorded as a false positive (FP). Ground-truth segments that were not detected were recorded as false negatives (FN). The total number of TPs and FNs was equal to the number of ground-truth segments.
In addition to mAP, further metrics were employed to reflect the practical performance of the proposed network, including the average recall across all ground-truth segments, as well as the mean tIoU, recall, and precision of the highest-confidence detection for each ground-truth segment.
The specific calculation steps were as follows:
  • Compute the tIoU values between all detected actions and ground-truth segments according to Equation (7), forming an $[M, N]$ matrix, where $M$ and $N$ denote the numbers of detected actions and ground-truth segments, respectively.
  • For each detected action, identify the maximum tIoU value and the corresponding ground-truth index, thereby generating an $[M, 2]$ matrix that stores the matched ground-truth index and the associated tIoU value.
For evaluation, only the first detected action with the highest confidence is retained for each ground-truth segment. The mean tIoU, recall, and precision are then calculated based on these matches. For example, if the first detected action corresponds to the third ground-truth segment, only this detection is retained, and subsequent detections for the same segment are ignored. The relevant data calculation formulas were defined by Equation (11).
$\mathrm{mtIoU} = \frac{1}{N}\sum_{x=1}^{N} \mathrm{tIoU}(x), \quad \mathrm{avg\_mtIoU} = \frac{1}{M}\sum_{y=1}^{M} \mathrm{mtIoU}(y), \quad \mathrm{top1\_recall} = \frac{positive\_GT}{all\_GT}, \quad \mathrm{top1\_precision} = \frac{positive\_proposals}{all\_proposals}$ (11)
where $N$ is the number of ground-truth activity segments in each video, and $M$ is the total number of videos; $\mathrm{mtIoU}$ is the mean tIoU of the highest-confidence detected actions over all ground-truth segments in a sample; $\mathrm{avg\_mtIoU}$ is the average mtIoU over all samples; $positive\_GT$ is the number of detected ground-truth segments; $all\_GT$ is the total number of ground-truth segments; $positive\_proposals$ is the number of highest-confidence detected actions with correct classification predictions; and $all\_proposals$ is the total number of highest-confidence detected actions.
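A sketch of this highest-confidence matching procedure, assuming detections are already sorted by confidence and the per-video tIoU matrix of Equation (7) has been computed, is given below.

```python
import numpy as np

def top1_match_metrics(tiou_matrix: np.ndarray, pred_correct: np.ndarray):
    """Sketch of the per-video evaluation in Eq. (11): rows of `tiou_matrix` are detections
    sorted by confidence (highest first), columns are ground-truth segments;
    `pred_correct[i]` is True when detection i predicts the right category."""
    matched = {}                                    # gt index -> (tIoU, correct) of first match
    for i in range(tiou_matrix.shape[0]):
        g = int(tiou_matrix[i].argmax())
        if g not in matched:                        # keep only the highest-confidence detection
            matched[g] = (float(tiou_matrix[i, g]), bool(pred_correct[i]))
    n_gt = tiou_matrix.shape[1]
    mtiou = sum(v[0] for v in matched.values()) / n_gt
    top1_recall = len(matched) / n_gt
    top1_precision = sum(v[1] for v in matched.values()) / max(len(matched), 1)
    return mtiou, top1_recall, top1_precision
```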
The detection rate was used to evaluate the real-time performance of the proposed model, and the detection speed was calculated by Equation (12).
$\mathrm{Detect\_Speed} = \frac{Total\_FrameNum}{Total\_Time}$ (12)
where $Total\_FrameNum$ is the total number of frames processed, and $Total\_Time$ is the total time used for prediction.

4.3. Experimental Results

In the experiment, the tIoU threshold was set in the range [0.5, 0.95] with a step of 0.05, and the mAP values were calculated using Equation (8), as presented in Table 1.
The results in Table 1 show that the average mAP values across the three test sets were highly consistent within the tIoU threshold range, demonstrating the strong generalization capability of the proposed network in driver action detection. Furthermore, to illustrate the ability of the network to detect continuous driver operation actions, recall was calculated under different tIoU thresholds using Equation (10), and the recall-tIoU variation curve was plotted, as shown in Figure 6.
The recall curve showed that when the tIoU threshold was below 0.8, the recall of detected actions decreased gradually from 97.79% to 80.91%. At a tIoU threshold of 0.85, the recall dropped sharply to 55.59%, consistent with the trend observed for mAP. These results indicate that most detected actions matched ground-truth segments with tIoU values in the range of 0.75–0.8, which is sufficient for localizing and classifying most actions in continuous driver operation videos, thereby ensuring accurate identification of abnormal behaviors. However, at higher thresholds (≥0.85), the proposed network exhibited limited performance in boundary regression. This may be attributed to the inherent limitations of anchor-based methods in precise temporal localization [69], as well as the ambiguity introduced by manually annotated action boundaries in the dataset.
To further demonstrate the practical applicability of the proposed model, the average tIoU, recall and precision of the highest-confidence detections matched to each ground-truth segment were calculated, as presented in Table 2.
The results in Table 2 show that, when only the highest-confidence detected action for each ground-truth segment was considered, the proposed model achieved an average tIoU of about 80%, with precision exceeding 96% and recall approaching 100%. These findings demonstrate the strong capability of the proposed model in accurately localizing and classifying train driver actions in practical scenarios.
The performance of the network in localizing and classifying train driver actions is illustrated in Figure 7.
The determination of standard and abnormal operational behaviors of train drivers (missing necessary actions or incorrect action sequences) relies on action sequence information. That is, the network must fully localize all action categories in continuous-action videos while correctly predicting their categories. Therefore, the average recall of 99.66% and average precision of 96.72% among the highest-confidence detected actions directly reflect the network’s excellent performance in detecting abnormal behaviors. Furthermore, the network achieves high-precision temporal localization of various train driver actions in the video, with an average tIoU of 81.17% for the highest-confidence detected actions matched to each ground-truth activity segment. This provides a solid foundation for calculating action similarity between detected actions and standard actions in the standard sequence, thereby enabling further judgment of abnormal behaviors from the perspective of whether actions conform to the standard.
For RGB data-based networks, the input consisted of raw RGB videos, where a typical frame size of 1280 × 720 was resized to 112 × 112 after data processing. In contrast, the input to the proposed network was skeleton data, represented by the 2D coordinates of 17 human keypoints per frame, with a size of 17 × 2. For the same temporal length, the compressed RGB image data were approximately 1106 times larger than the skeleton data, indicating a substantial reduction in storage requirements. Furthermore, the proposed network contained about 22.6 million parameters, fewer than those of commonly used RGB-based backbones, such as 3D Convolutional Networks (38 million) and RGB-stream Inflated 3D Convolutional Network (25 million).
Finally, the processing speed was evaluated. On the test set of 118 videos containing 176,649 frames, the proposed method achieved a computational efficiency of 5326.09 frames per second. The overall speed was mainly constrained by the efficiency of the human pose estimation network.
The experimental results demonstrated the feasibility of using skeleton data for temporal action detection, with the proposed network achieving favorable performance on the custom dataset. Based on the results of the highest-confidence detections matched to each ground-truth activity segment across the three evaluation metrics, the proposed network met the requirements for detecting abnormal train driver behaviors. Furthermore, the proposed network employed the foundational R-C3D framework and used the ST-GCN model to extract skeleton features. Recent research has proposed numerous improved networks, mainly focusing on multi-stream data fusion and enhanced temporal feature modeling capabilities. Therefore, this study did not perform a comparison with other networks on the custom dataset.

4.4. Ablation Experiment

This section experimentally validates the following contributions of this study: the proposed skeleton topology, the segmented-then-joint training strategy adopted for small-scale datasets, and the non-shared feature propagation layer design.

4.4.1. Effectiveness of the Proposed Skeleton Topology

To validate the effectiveness of the skeleton topology graph designed based on the characteristics of driver actions proposed in Section 3.1, this section selects two topology graph design methods from the original ST-GCN [44] paper that partition nodes into subsets according to joint positions. The first method divides nodes into two subsets: one consisting of the node itself and another comprising all nodes at a distance of 1 from it. The second method divides the nodes into three subsets: the node itself, centripetal nodes at a distance of 1 from the center, and centrifugal nodes at a distance of 2 from the center. In this section, the adjacency matrices corresponding to the former and latter methods are referred to as Matrix A and Matrix B, respectively.
These two adjacency matrices are separately applied to the ST-GCN for training, using the same pre-training strategy for ST-GCN as described in Section 4.1.3. Subsequently, the performance of the models corresponding to the three adjacency matrices is evaluated on the driver action classification task, with the results presented in Table 3.
The results in the table indicate that, compared with the two skeleton topology construction methods proposed for ST-GCN, the skeleton topology proposed in Section 3.1 offers advantages in distinguishing similar actions, improving the network’s performance on the train driver action classification task.

4.4.2. Contributions of Training Strategy

To verify the contribution of the segmented training strategy to model convergence, this section conducts an experiment in which the pre-trained ST-GCN, the initialized temporal proposal subnetwork, and the action classification subnetwork are directly jointly trained on the training set. The training strategy is identical to that described in Section 4.1.3, except that the initial learning rate of the temporal proposal subnetwork is set to $1 \times 10^{-2}$.
The loss function curves of the segmented training strategy are compared with those of direct joint training in Figure 8a, while the loss curves of the two subnetworks under direct joint training are shown in Figure 8b.
The results demonstrate that pre-training the temporal proposal subnetwork first enables faster overall convergence of the network. In contrast, direct joint training exhibits significantly slower convergence. This is because effective learning in the action classification subnetwork heavily depends on the quality of temporal candidate boxes provided by the upstream temporal proposal subnetwork. Under direct joint training, the temporal proposal subnetwork must first learn to generate high-quality temporal candidate boxes before the action classification subnetwork can learn effectively, thereby slowing down the convergence of the latter.

4.4.3. Ablation on Non-Shared Feature Propagation Design

In conventional temporal action detection networks using RGB video data, the feature propagation process is typically shared between the classification and regression heads. In contrast, the proposed model employs skeleton sequences as input and separates the processing of features for the classification and regression heads within the action classification subnetwork. Specifically, the classification head incorporates a spatiotemporal graph convolution module without batch normalization layers, ensuring that the two heads no longer share features.
Considering all the aforementioned, a comparative action classification subnetwork with shared features was designed to validate the effectiveness of the proposed method. In this baseline design, the classification head excluded spatiotemporal graph convolution layers. Instead, the input feature map was processed by temporal ROI 2D pooling, followed by three fully connected layers to upscale the feature dimensions, and finally fed to two independent fully connected layers for classification and regression prediction.
For clarity, the network employing spatiotemporal graph convolutional layers for feature forward propagation in the classification task is referred to as the Non-shared network, whereas the comparative baseline is denoted as the Shared network. Both networks are trained under the same training scheme, with identical optimization strategies, training datasets, and training parameters.
The two network models were compared in terms of mAP, recall, and the data from the highest-confidence detected actions matched to each ground-truth activity segment, as shown in Table 4.
The comparison of average mAP values presented in Table 4 indicated that the Non-shared network achieved approximately +5.5% mAP on average, with larger gains at high tIoU thresholds (0.75–0.85), indicating more precise temporal localization than the Shared baseline. This could be attributed to the limited ability of fully connected layers to effectively model spatial features, resulting in inaccuracies in the classification task. These inaccuracies could reduce prediction precision and consequently lower the mAP.
The recall comparison plot of the two models (Non-shared network and Shared network) is shown in Figure 9.
The recall comparison curve showed that at lower tIoU thresholds, the recall rates of the two networks were nearly identical. However, within the range of 0.7 to 0.85, the Non-shared network achieved higher recall, indicating that separating the feature map forward propagation improved the model’s capability in boundary regression.
The comparison results of the highest-confidence detected action data for each ground-truth activity segment are shown in Table 5.
The results in Table 5 show that the Non-shared network outperformed the Shared network on the test set containing all actions, with gains of +1.8% in average tIoU, +3.31% in precision, and +2.35% in recall. However, on the standard action test set, the Shared network achieved higher average tIoU and precision than the Non-shared network; nevertheless, its performance was inferior on the non-standard action test set, with the absolute differences in its average tIoU and precision between the two sub-test sets exceeding 10%. This indicated that fully connected layers performed well in classifying simpler standard action sequences but were less effective in complex action environments. In contrast, spatiotemporal graph convolutional layers could effectively model the spatiotemporal features of skeleton data sequences, allowing the Non-shared network to achieve similar performance on both sub-test sets.
Furthermore, segmented optimization strategies were employed to train the Non-shared and Shared networks separately. This was done to further demonstrate the effectiveness of designing distinct feature propagation layers for classification and boundary regression in skeleton-based temporal action detection, to verify the superior performance of spatiotemporal graph convolutional modules over fully connected layers in classification tasks, and to compare joint optimization models against segmented optimization models.
Model 3 involved freezing the parameters of the backbone network and the temporal proposal subnetwork while training only the action classification subnetwork. The action classification subnetwork in this model was designed according to the Non-shared network. Model 4 involved freezing the parameters of the backbone network and temporal proposal subnetwork while training only the action classification subnetwork; here, the action classification subnetwork was designed according to the Shared network. The pre-training of the backbone network, the joint optimization process of the backbone network and temporal proposal subnetwork, and the training parameters have been presented in Section 4.1.3. The parameter settings for the separate training of the action classification subnetwork were consistent with those used in the joint optimization training of the three networks, as shown in Section 4.1.3. For clarity, the previously jointly trained Non-shared network was referred to as Model 1, and the Shared network was referred to as Model 2. The optimization loss function curves for the four networks are depicted in Figure 10.
The comparison of the loss function variation curves of Models 3 and 4 showed that under the condition of unchanged backbone network parameters, the fully connected network struggled to effectively learn the spatiotemporal features of skeleton data for the classification task. This was evidenced by the slowly changing loss curve that consistently remained at high values. In contrast, the spatiotemporal graph convolutional network could effectively learn the spatiotemporal features of skeleton data.
Finally, the comparison of the loss function variation curves of Models 1 and 3 revealed that Model 1 had significantly lower loss function values during network fitting than Model 3, which demonstrated a substantial performance advantage of the end-to-end joint optimization model over the segmented optimization training model.

5. Discussion and Conclusions

This study addresses the detection of abnormal behaviors in train driver operations by proposing an end-to-end temporal action detection network based on skeleton data. The network employs ST-GCN to extract spatiotemporal features and incorporates a skeleton topology graph tailored to driver actions to enhance feature representation. A non-shared feature propagation design was introduced in the action classification subnetwork, where the classification branch utilizes spatiotemporal graph convolutional layers instead of fully connected layers, thereby improving classification performance.
Experimental results on the custom dataset demonstrate strong performance, with the proposed network achieving an mAP of 79.39% at a tIoU threshold of 0.8, an average tIoU of 81.17%, an average precision of 97.32%, and an average recall of 99.82% for the highest-confidence detections matched to each ground-truth segment. In addition, the network showed advantages in parameter efficiency and data storage while maintaining high detection accuracy, highlighting its potential for practical deployment in urban rail transit systems.

5.1. Limitations

The network proposed in this study aims to provide a fast, accurate, and practical automated detection method for identifying abnormal behaviors of urban rail transit train drivers, taking into account real-world engineering deployment requirements. The network, trained and tested on a train driver dataset, achieved favorable performance in relevant evaluation metrics (mAP and metrics related to the highest-confidence detected actions). However, these results pertain specifically to train driver operational actions, and the network’s temporal detection performance for other human actions or behaviors has not been systematically tested or validated. Thus, the current experimental results do not demonstrate the network’s generalization performance.
In addition, the experimental results indicate that using skeleton data alone can effectively perform temporal action detection for short-duration (2–7 s) actions, such as specific train driver operations, with detection results sufficient to support the identification of abnormal behaviors. However, the temporal modeling capability of the feature extraction network (ST-GCN) is relatively limited, resulting in constraints on the network’s temporal detection performance for long-duration actions. This study also did not include performance testing for long-duration actions.

5.2. Future Work

Considering the limitations of this study and its primary objectives, the proposed network was not compared with state-of-the-art temporal action detection networks. Nevertheless, the non-shared feature propagation design tailored to the characteristics of skeleton data was proven to effectively enhance network performance. Future work will focus on improving the network’s temporal modeling capabilities and exploring multimodal data fusion models primarily based on skeleton data, aiming to further enhance the detection performance of abnormal train driver behaviors while keeping increases in computational load and parameter count manageable, and pursuing improvements in the network’s generalization performance.

Author Contributions

Conceptualization, F.W. and K.T.; methodology, K.T. and Z.L.; investigation, K.T.; writing—original draft preparation, K.T.; writing—review and editing, F.W.; supervision, F.W.; project administration, Y.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research was partially supported by National Natural Science Foundation of China (NSFC) under grant No. 62576205.

Data Availability Statement

Restrictions apply to the datasets presented in this article. The real-scene driver operation videos are confidential and cannot be shared publicly due to data protection agreements with the metro operator. The experimental training data recorded in the laboratory can be made available upon reasonable request. Requests to access the datasets should be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
CCTV: Closed Circuit Television
CNN: Convolutional Neural Network
GCN: Graph Convolutional Network
mAP: Mean Average Precision
tIoU: Time Intersection over Union
RNN: Recurrent Neural Network
TAD: Temporal Action Detection
NMS: Non-maximum Suppression

References

  1. Li, X. Detection of Power System Personnel’s Abnormal Behavior Based on Machine Vision. In Proceedings of the 2024 Boao New Power System International Forum—Power System and New Energy Technology Innovation Forum (NPSIF), Qionghai, China, 8–10 December 2024; pp. 783–786. [Google Scholar] [CrossRef]
  2. De Geest, R.; Gavves, E.; Ghodrati, A.; Li, Z.; Snoek, C.; Tuytelaars, T. Online Action Detection. In Computer Vision—ECCV 2016; Leibe, B., Matas, J., Sebe, N., Welling, M., Eds.; Springer International Publishing: Cham, Switzerland, 2016; pp. 269–284. [Google Scholar] [CrossRef]
  3. An, J.; Kang, H.; Han, S.H.; Yang, M.-H.; Kim, S.J. MiniROAD: Minimal RNN Framework for Online Action Detection. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–6 October 2023; pp. 10307–10316. [Google Scholar] [CrossRef]
  4. Xu, M.; Gao, M.; Chen, Y.-T.; Davis, L.; Crandall, D. Temporal Recurrent Networks for Online Action Detection. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 29 October–2 November 2019; pp. 5531–5540. [Google Scholar] [CrossRef]
  5. Chen, J.; Mittal, G.; Yu, Y.; Kong, Y.; Chen, M. GateHUB: Gated History Unit with Background Suppression for Online Action Detection. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–24 June 2022; pp. 19893–19902. [Google Scholar] [CrossRef]
  6. Guo, H.; Ren, Z.; Wu, Y.; Hua, G.; Ji, Q. Uncertainty-Based Spatial-Temporal Attention for Online Action Detection. In Computer Vision—ECCV 2022; Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T., Eds.; Springer Nature Switzerland: Cham, Switzerland, 2022; pp. 69–86. [Google Scholar] [CrossRef]
  7. Ying, B.; Xiang, J.; Zheng, W.; Wang, Z.; Ren, W.; Luo, S.; Liu, H. Skeleton-Based Online Action Detection with Temporal Enhancement. In Emotional Intelligence; Huang, X., Mao, Q., Eds.; Springer Nature: Singapore, 2025; pp. 145–156. [Google Scholar] [CrossRef]
  8. Kang, H.; Zhang, C.; Jiang, H. Advancing Driver Behavior Recognition: An Intelligent Approach Utilizing ResNet. Autom. Control Comput. Sci. 2024, 58, 555–568. [Google Scholar] [CrossRef]
  9. Rajkar, A.; Kulkarni, N.; Raut, A. Driver Drowsiness Detection Using Deep Learning. In Applied Information Processing Systems; Iyer, B., Ghosh, D., Balas, V.E., Eds.; Springer: Singapore, 2022; pp. 73–82. [Google Scholar] [CrossRef]
  10. Darapaneni, N.; Arora, J.; Hazra, M.; Vig, N.; Gandhi, S.S.; Gupta, S.; Paduri, A.R. Detection of Distracted Driver Using Convolution Neural Network. arXiv 2022. [Google Scholar] [CrossRef]
  11. Nguyen, D.-L.; Putro, M.D.; Jo, K.-H. Driver Behaviors Recognizer Based on Light-Weight Convolutional Neural Network Architecture and Attention Mechanism. IEEE Access 2022, 10, 71019–71029. [Google Scholar] [CrossRef]
  12. Huang, W.; Liu, X.; Luo, M.; Zhang, P.; Wang, W.; Wang, J. Video-Based Abnormal Driving Behavior Detection via Deep Learning Fusions. IEEE Access 2019, 7, 64571–64582. [Google Scholar] [CrossRef]
  13. Rai, M.; Asim Husain, A.; Maity, T.; Kumar Yadav, R. Advance Intelligent Video Surveillance System (AIVSS): A Future Aspect. In Intelligent Video Surveillance; Neves, A.J.R., Ed.; IntechOpen: London, UK, 2019. [Google Scholar] [CrossRef]
  14. Vahdani, E.; Tian, Y. Deep Learning-Based Action Detection in Untrimmed Videos: A Survey. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 4302–4320. [Google Scholar] [CrossRef]
  15. Shou, Z.; Wang, D.; Chang, S.-F. Temporal Action Localization in Untrimmed Videos via Multi-Stage CNNs. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 1049–1058. [Google Scholar] [CrossRef]
  16. Lin, T.; Liu, X.; Li, X.; Ding, E.; Wen, S. BMN: Boundary-Matching Network for Temporal Action Proposal Generation. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 3888–3897. [Google Scholar] [CrossRef]
  17. Xu, H.; Das, A.; Saenko, K. R-C3D: Region Convolutional 3D Network for Temporal Activity Detection. In Proceedings of the 2017 IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, 22–29 October 2017; pp. 5794–5803. [Google Scholar] [CrossRef]
  18. Pan, X.; Zhang, N.; Xie, H.; Li, S.; Feng, T. MBGNet: Multi-Branch Boundary Generation Network with Temporal Context Aggregation for Temporal Action Detection. Appl. Intell. 2024, 54, 9045–9066. [Google Scholar] [CrossRef]
  19. Carreira, J.; Zisserman, A. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 4724–4733. [Google Scholar] [CrossRef]
  20. Simonyan, K.; Zisserman, A. Two-Stream Convolutional Networks for Action Recognition in Videos. In Proceedings of the 28th International Conference on Neural Information Processing Systems, 1st ed.; MIT Press: Montreal, QC, Canada; Cambridge, MA, USA, 2014; pp. 568–576. [Google Scholar]
  21. Tran, D.; Bourdev, L.; Fergus, R.; Torresani, L.; Paluri, M. Learning Spatiotemporal Features with 3D Convolutional Networks. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 4489–4497. [Google Scholar] [CrossRef]
  22. Qiu, Z.; Yao, T.; Mei, T. Learning Spatio-Temporal Representation with Pseudo-3D Residual Networks. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 5534–5542. [Google Scholar] [CrossRef]
  23. Sun, Z.; Ke, Q.; Rahmani, H.; Bennamoun, M.; Wang, G.; Liu, J. Human Action Recognition From Various Data Modalities: A Review. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 3200–3225. [Google Scholar] [CrossRef]
  24. Cao, Z.; Simon, T.; Wei, S.-E.; Sheikh, Y. Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 1302–1310. [Google Scholar] [CrossRef]
  25. Fang, H.-S.; Li, J.; Tang, H.; Xu, C.; Zhu, H.; Xiu, Y.; Li, Y.-L.; Lu, C. AlphaPose: Whole-Body Regional Multi-Person Pose Estimation and Tracking in Real-Time. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 7157–7173. [Google Scholar] [CrossRef]
  26. Shi, L.; Zhang, Y.; Cheng, J.; Lu, H. Skeleton-Based Action Recognition With Directed Graph Neural Networks. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 7904–7913. [Google Scholar] [CrossRef]
  27. Chi, S.; Chi, H.-G.; Huang, Q.; Ramani, K. InfoGCN++: Learning Representation by Predicting the Future for Online Skeleton-Based Action Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 47, 514–528. [Google Scholar] [CrossRef]
  28. Li, B.; Chen, H.; Chen, Y.; Dai, Y.; He, M. Skeleton Boxes: Solving Skeleton Based Action Detection with a Single Deep Convolutional Neural Network. In Proceedings of the 2017 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), Hong Kong, China, 10–14 July 2017; pp. 613–616. [Google Scholar] [CrossRef]
  29. Yin, J.; Han, J.; Xie, R.; Wang, C.; Duan, X.; Rong, Y.; Zeng, X.; Tao, J. MC-LSTM: Real-Time 3D Human Action Detection System for Intelligent Healthcare Applications. IEEE Trans. Biomed. Circuits Syst. 2021, 15, 259–269. [Google Scholar] [CrossRef]
  30. Chen, Y.-T.; Fang, W.-H.; Dai, S.-T.; Lu, C.-C. Skeleton Moving Pose-Based Human Fall Detection with Sparse Coding and Temporal Pyramid Pooling. In Proceedings of the 2021 7th International Conference on Applied System Innovation (ICASI), Chiayi, Taiwan, 24–25 September 2021; pp. 91–96. [Google Scholar] [CrossRef]
  31. Li, Y.; Lin, W.; See, J.; Xu, N.; Xu, S.; Yan, K.; Yang, C. CFAD: Coarse-to-Fine Action Detector for Spatiotemporal Action Localization. In Computer Vision—ECCV 2020; Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M., Eds.; Springer International Publishing: Cham, Switzerland, 2020; pp. 510–527. [Google Scholar] [CrossRef]
  32. Wang, Q.; Zhang, Y.; Zheng, Y.; Pan, P. RCL: Recurrent Continuous Localization for Temporal Action Detection. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–24 June 2022; pp. 13556–13565. [Google Scholar] [CrossRef]
  33. Jiang, B.; Zhang, Z.; Lin, D.; Tang, J.; Luo, B. Semi-Supervised Learning With Graph Learning-Convolutional Networks. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 11305–11312. [Google Scholar] [CrossRef]
  34. Shi, L.; Zhang, Y.; Cheng, J.; Lu, H. Two-Stream Adaptive Graph Convolutional Networks for Skeleton-Based Action Recognition. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 12018–12027. [Google Scholar] [CrossRef]
  35. Cheng, K.; Zhang, Y.; He, X.; Chen, W.; Cheng, J.; Lu, H. Skeleton-Based Action Recognition With Shift Graph Convolutional Network. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 180–189. [Google Scholar] [CrossRef]
  36. Martin, M.; Voit, M.; Stiefelhagen, R. Dynamic Interaction Graphs for Driver Activity Recognition. In Proceedings of the 2020 IEEE 23rd International Conference on Intelligent Transportation Systems (ITSC), Rhodes, Greece, 20–23 September 2020; pp. 1–7. [Google Scholar] [CrossRef]
  37. Caetano, C.; Sena, J.; Brémond, F.; Dos Santos, J.A.; Schwartz, W.R. SkeleMotion: A New Representation of Skeleton Joint Sequences Based on Motion Information for 3D Action Recognition. In Proceedings of the 2019 16th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), Taipei, Taiwan, 18–21 September 2019; pp. 1–8. [Google Scholar] [CrossRef]
  38. Le, T.M.; Inoue, N.; Shinoda, K. A Fine-to-Coarse Convolutional Neural Network for 3D Human Action Recognition. arXiv 2018. [Google Scholar] [CrossRef]
  39. Xu, K.; Ye, F.; Zhong, Q.; Xie, D. Topology-Aware Convolutional Neural Network for Efficient Skeleton-Based Action Recognition. Proc. AAAI Conf. Artif. Intell. 2022, 36, 2866–2874. [Google Scholar] [CrossRef]
  40. Lee, I.; Kim, D.; Kang, S.; Lee, S. Ensemble Deep Learning for Skeleton-Based Action Recognition Using Temporal Sliding LSTM Networks. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 1012–1020. [Google Scholar] [CrossRef]
  41. Liu, J.; Wang, G.; Hu, P.; Duan, L.-Y.; Kot, A.C. Global Context-Aware Attention LSTM Networks for 3D Action Recognition. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 3671–3680. [Google Scholar] [CrossRef]
  42. Li, Z.; Yan, L.; Li, H.; Wang, Y. Environmental Factors-Aware Two-Stream GCN for Skeleton-Based Behavior Recognition. Mach. Vis. Appl. 2025, 36, 42. [Google Scholar] [CrossRef]
  43. Zang, Y.; Yang, D.; Liu, T.; Li, H.; Zhao, S.; Liu, Q. SparseShift-GCN: High Precision Skeleton-Based Action Recognition. Pattern Recognit. Lett. 2022, 153, 136–143. [Google Scholar] [CrossRef]
  44. Yan, S.; Xiong, Y.; Lin, D. Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition. Proc. AAAI Conf. Artif. Intell. 2018, 32, 7444–7452. [Google Scholar] [CrossRef]
  45. Shi, L.; Zhang, Y.; Cheng, J.; Lu, H. Skeleton-Based Action Recognition With Multi-Stream Adaptive Graph Convolutional Networks. IEEE Trans. Image Process. 2020, 29, 9532–9545. [Google Scholar] [CrossRef]
  46. Abdelfattah, M.; Hassan, M.; Alahi, A. MaskCLR: Attention-Guided Contrastive Learning for Robust Action Representation Learning. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; pp. 18678–18687. [Google Scholar] [CrossRef]
  47. Liu, F.; Wang, C.; Tian, Z.; Du, S.; Zeng, W. Advancing Skeleton-Based Human Behavior Recognition: Multi-Stream Fusion Spatiotemporal Graph Convolutional Networks. Complex Intell. Syst. 2024, 11, 94. [Google Scholar] [CrossRef]
  48. Wang, W.; Xie, W.; Tu, Z.; Li, W.; Jin, L. Multi-Part Adaptive Graph Convolutional Network for Skeleton-Based Action Recognition. In Proceedings of the 2022 International Joint Conference on Neural Networks (IJCNN), Padua, Italy, 18–23 July 2022; pp. 1–7. [Google Scholar] [CrossRef]
  49. Wu, L.; Zhang, C.; Zou, Y. SpatioTemporal Focus for Skeleton-Based Action Recognition. Pattern Recognit. 2023, 136, 109231. [Google Scholar] [CrossRef]
  50. Li, M.; Chen, S.; Chen, X.; Zhang, Y.; Wang, Y.; Tian, Q. Actional-Structural Graph Convolutional Networks for Skeleton-Based Action Recognition. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 3590–3598. [Google Scholar] [CrossRef]
  51. Peng, Z.; Liu, H.; Jia, Y.; Hou, J. Attention-Driven Graph Clustering Network. In Proceedings of the 29th ACM International Conference on Multimedia, ACM Conferences, Chengdu, China, 20–24 October 2021; pp. 935–943. [Google Scholar] [CrossRef]
  52. Chi, H.-G.; Ha, M.H.; Chi, S.; Lee, S.W.; Huang, Q.; Ramani, K. InfoGCN: Representation Learning for Human Skeleton-Based Action Recognition. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–24 June 2022; pp. 20154–20164. [Google Scholar] [CrossRef]
  53. Lee, J.; Lee, M.; Lee, D.; Lee, S. Hierarchically Decomposed Graph Convolutional Networks for Skeleton-Based Action Recognition. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–6 October 2023; pp. 10410–10419. [Google Scholar] [CrossRef]
  54. Chen, C.; Chai, L. Multi-Attention Graph Convolutional Network for Skeleton-Based Action Recognition. In Proceedings of the 2024 36th Chinese Control and Decision Conference (CCDC), Xi'an, China, 25–27 May 2024; pp. 6190–6195. [Google Scholar] [CrossRef]
  55. Yang, C.; Hou, L.; Aktar, M.M. Recognition of Miner Action and Violation Behavior Based on the ANODE-GCN Model. Multimed. Syst. 2024, 30, 357. [Google Scholar] [CrossRef]
  56. Wang, B.; Ma, F.; Jia, R.; Luo, P.; Dong, X. Skeleton-Based Violation Action Recognition Method for Safety Supervision in Operation Field of Distribution Network Based on Graph Convolutional Network. CSEE J. Power Energy Syst. 2023, 9, 2179–2187. [Google Scholar] [CrossRef]
  57. Li, P.; Lu, M.; Zhang, Z.; Shan, D.; Yang, Y. A Novel Spatial-Temporal Graph for Skeleton-Based Driver Action Recognition. In Proceedings of the 2019 IEEE Intelligent Transportation Systems Conference (ITSC), Auckland, New Zealand, 27–30 October 2019; pp. 3243–3248. [Google Scholar] [CrossRef]
  58. Lin, Z.; Liu, Y.; Zhang, X. Driver-Skeleton: A Dataset for Driver Action Recognition. In Proceedings of the 2021 IEEE International Intelligent Transportation Systems Conference (ITSC), Indianapolis, IN, USA, 19–22 September 2021; pp. 1509–1514. [Google Scholar] [CrossRef]
  59. Li, T.; Li, X.; Ren, B.; Guo, G. An Effective Multi-Scale Framework for Driver Behavior Recognition With Incomplete Skeletons. IEEE Trans. Veh. Technol. 2024, 73, 295–309. [Google Scholar] [CrossRef]
  60. Wei, X.; Yao, S.; Zhao, C.; Hu, D.; Luo, H.; Lu, Y. Lightweight Multimodal Feature Graph Convolutional Network for Dangerous Driving Behavior Detection. J. Real-Time Image Proc. 2023, 20, 15. [Google Scholar] [CrossRef]
  61. Cheng, Q.; Cheng, J.; Ren, Z.; Zhang, Q.; Liu, J. Multi-Scale Spatial–Temporal Convolutional Neural Network for Skeleton-Based Action Recognition. Pattern Anal. Appl. 2023, 26, 1303–1315. [Google Scholar] [CrossRef]
  62. Chao, Y.-W.; Vijayanarasimhan, S.; Seybold, B.; Ross, D.A.; Deng, J.; Sukthankar, R. Rethinking the Faster R-CNN Architecture for Temporal Action Localization. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 1130–1139. [Google Scholar] [CrossRef]
  63. Chen, Y.; Guo, B.; Shen, Y.; Wang, W.; Lu, W.; Suo, X. Boundary Graph Convolutional Network for Temporal Action Detection. Image Vis. Comput. 2021, 109, 104144. [Google Scholar] [CrossRef]
  64. Zhao, Y.; Zhang, H.; Gao, Z.; Guan, W.; Nie, J.; Liu, A.; Wang, M.; Chen, S. A Temporal-Aware Relation and Attention Network for Temporal Action Localization. IEEE Trans. Image Process. 2022, 31, 4746–4760. [Google Scholar] [CrossRef]
  65. Liu, X.; Bai, S.; Bai, X. An Empirical Study of End-to-End Temporal Action Detection. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–24 June 2022; pp. 19978–19987. [Google Scholar] [CrossRef]
  66. Hu, K.; Shen, C.; Wang, T.; Xu, K.; Xia, Q.; Xia, M.; Cai, C. Overview of Temporal Action Detection Based on Deep Learning. Artif. Intell. Rev. 2024, 57, 26. [Google Scholar] [CrossRef]
  67. Sooksatra, S.; Watcharapinchai, S. A Comprehensive Review on Temporal-Action Proposal Generation. J. Imaging 2022, 8, 207. [Google Scholar] [CrossRef]
  68. Lin, C.; Ma, T.; Wu, F.; Qian, J.; Liao, F.; Huang, J. Application of Temporal Action Detection Technology in Abnormal Event Detection of Surveillance Video. IEEE Access 2025, 13, 26958–26972. [Google Scholar] [CrossRef]
  69. Lu, C.-K.; Mak, M.-W.; Li, R.; Chi, Z.; Fu, H. Action Progression Networks for Temporal Action Detection in Videos. IEEE Access 2024, 12, 126829–126844. [Google Scholar] [CrossRef]
Figure 1. Network processing.
Figure 2. Illustration of the skeleton topology: (a) human skeleton joint distribution; (b) global connections; (c) local connections of action-relevant nodes; (d) expanded receptive field of local connections.
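The topology sketched in Figure 2 enters the graph convolution through an adjacency matrix. As a minimal illustration (the joint indices and edge list below are assumptions for a generic 2D skeleton, not the paper's exact topology), such a matrix can be built from an edge list and normalized in the way commonly used by ST-GCN-style models:

```python
import numpy as np

# Minimal sketch (not the paper's exact topology): build a symmetric skeleton
# adjacency matrix from an edge list and normalize it as commonly done in
# ST-GCN-style models, A_hat = D^(-1/2) (A + I) D^(-1/2).
NUM_JOINTS = 17                              # e.g., a COCO-style 2D skeleton (assumed)
EDGES = [(5, 7), (7, 9), (6, 8), (8, 10),    # arm chains relevant to pointing actions
         (5, 6), (0, 5), (0, 6)]             # shoulder and head links (illustrative)

def normalized_adjacency(num_joints, edges):
    A = np.eye(num_joints)                   # A + I: self-loops keep each joint's own feature
    for i, j in edges:
        A[i, j] = A[j, i] = 1.0
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
    return d_inv_sqrt @ A @ d_inv_sqrt

A_hat = normalized_adjacency(NUM_JOINTS, EDGES)   # shape (17, 17), used by the graph convolution
```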
Figure 3. The structure of the temporal proposal subnetwork.
Figure 4. The structure of the action classification subnetwork.
Figure 5. Schematic diagram of the driver's actions and behaviors. (1) Action A: point to the gap between the platform screen door and the train door at the station and confirm that no foreign objects or people are trapped between the two doors. (2) Action B: point to the signal indicator lights on the station platform and confirm that there are no abnormalities on the platform. (3) Action C: point to the train departure signal and confirm that no train occupies the operating section ahead, so the train can safely exit the station. (4) Action D: point to the turnout signal and confirm that the turnout position is correct. (5) Abnormal behaviors: omitting a critical information-confirmation action (e.g., producing the action sequence "A-C-D") and performing actions in an incorrect order (e.g., producing the action sequence "A-D-B-C"). The former jeopardizes train operation safety, while the latter violates the driver's operational regulations.
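The two abnormal-behavior rules described in the Figure 5 caption amount to a simple check of the detected label sequence against the prescribed order A-B-C-D. A minimal sketch (a hypothetical helper for illustration, not the paper's implementation):

```python
# Minimal sketch (hypothetical helper, not the paper's implementation): flag
# abnormal behavior by comparing the time-ordered detected action labels with
# the prescribed pointing-and-calling sequence A -> B -> C -> D.
PRESCRIBED = ["A", "B", "C", "D"]

def check_sequence(detected):
    """Classify a detected label sequence as standard or abnormal."""
    if detected == PRESCRIBED:
        return "standard"
    if set(detected) != set(PRESCRIBED):
        return "abnormal: action omission"            # e.g., A-C-D
    return "abnormal: incorrect action sequence"      # e.g., A-D-B-C

print(check_sequence(["A", "B", "C", "D"]))   # standard
print(check_sequence(["A", "C", "D"]))        # abnormal: action omission
print(check_sequence(["A", "D", "B", "C"]))   # abnormal: incorrect action sequence
```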
Figure 6. Proposal Recall under Different tIoU Thresholds.
Figure 7. Schematic diagram of the driver action recognition and localization results.
Figure 8. Loss function curves. (a) Loss curves for the two training strategies. (b) Loss curves of the two subnetworks under the joint training strategy.
Figure 9. The recall comparison plot.
Figure 10. Loss function variation curves. For Models 1 and 2, the plotted loss is the sum of the losses of the temporal proposal subnetwork and the action classification subnetwork; for Models 3 and 4, it is the loss of the action classification subnetwork alone.
Table 1. The activity detection results, showing the mAP values of the proposed network across three test sets under different tIoU thresholds. The average mAP values across thresholds are highlighted in bold.
                      mAP @ tIoU (%)
Test Set Type   0.5     0.55    0.6     0.65    0.7     0.75    0.8     0.85    0.9     0.95    Avg.
All             96.86   96.86   95.97   94.52   92.99   89.10   79.39   54.64   18.82   2.45    72.16
Standard        95.43   95.43   93.84   92.46   91.66   89.48   75.27   52.61   17.57   0.66    70.44
Abnormal        96.91   96.91   96.91   95.57   93.34   87.58   76.55   49.52   21.39   3.25    71.79
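As a reading aid for Table 1, the temporal IoU between a predicted and a ground-truth segment, and the averaging over the ten tIoU thresholds that yields the "Avg." column, can be sketched as follows (a simplified illustration of the metric, not the paper's evaluation code):

```python
# Simplified sketch of the evaluation quantities behind Table 1.
def tiou(pred, gt):
    """Temporal IoU of two segments given as (start, end) pairs."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

# mAP is evaluated per threshold: a detection counts as a true positive only if
# its tIoU with a matched ground-truth segment of the same class reaches the
# threshold. The "Avg." column is the mean over the ten thresholds below.
THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]    # 0.5, 0.55, ..., 0.95

def average_map(map_at_threshold):
    """map_at_threshold: dict {tIoU threshold: mAP in %}, e.g., one row of Table 1."""
    return sum(map_at_threshold[t] for t in THRESHOLDS) / len(THRESHOLDS)
```

For example, averaging the ten mAP values of the "All" row reproduces its Avg. value of 72.16.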
Table 2. Detection results of the highest-confidence detected actions, showing the average tIoU, precision, and recall of the highest-confidence detected actions matched to each ground-truth activity segment for the three test sets.
Test Set Type   Avg. tIoU (%)   Precision (%)   Recall (%)
All             81.17           96.72           99.66
Standard        81.96           96.19           99.36
Abnormal        80.28           97.32           99.82
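The "Avg. tIoU" column of Table 2 averages, over all ground-truth segments, the tIoU of the single highest-confidence detection of the matching class; precision and recall then follow the usual detection definitions over these matched detections. A minimal sketch of the averaging step (the tuple layout is an assumption and `tiou` is the function from the previous snippet; this is not the paper's evaluation code):

```python
# Minimal sketch (assumptions: detections are (start, end, label, score) tuples;
# tiou() is defined in the previous snippet).
def avg_tiou_of_top_detections(ground_truths, detections):
    values = []
    for gt_start, gt_end, gt_label in ground_truths:
        same_class = [d for d in detections if d[2] == gt_label]
        if not same_class:
            continue                                   # ground truth with no candidate detection
        best = max(same_class, key=lambda d: d[3])     # highest-confidence detection
        values.append(tiou((best[0], best[1]), (gt_start, gt_end)))
    return sum(values) / len(values) if values else 0.0
```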
Table 3. Results of ST-GCN using different adjacency matrices on the train driver action classification task.
                     Classification Precision (%)
Type of Matrix    Action A   Action B   Action C   Action D
Matrix A [44]     99.52      99.36      91.43      94.95
Matrix B [44]     98.26      98.67      96.87      93.11
Proposed          99.68      99.31      98.79      99.07
Table 4. Detected mAP results of the two networks under different tIoU thresholds.
                       mAP @ tIoU (%)
Network Design   0.5     0.55    0.6     0.65    0.7     0.75    0.8     0.85    0.9     0.95    Avg.
Shared           94.55   94.34   94.21   93.81   91.87   78.44   64.99   40.26   12.82   1.28    66.66
Non-shared       96.86   96.86   95.97   94.52   92.99   89.10   79.39   54.64   18.82   2.45    72.16
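Table 4 compares a shared head with the non-shared design, in which classification and boundary regression receive features through separate propagation branches. The structural difference can be illustrated with a small PyTorch-style sketch (channel sizes and layer counts are illustrative assumptions, not the paper's configuration):

```python
import torch.nn as nn

class SharedHead(nn.Module):
    """Shared design: one feature-propagation branch feeds both outputs."""
    def __init__(self, in_dim=256, num_classes=4):
        super().__init__()
        self.stem = nn.Sequential(nn.Conv1d(in_dim, in_dim, 3, padding=1), nn.ReLU())
        self.cls = nn.Conv1d(in_dim, num_classes, 1)   # per-step class scores
        self.reg = nn.Conv1d(in_dim, 2, 1)             # per-step start/end offsets

    def forward(self, x):                              # x: (batch, channels, time)
        h = self.stem(x)
        return self.cls(h), self.reg(h)

class NonSharedHead(nn.Module):
    """Non-shared design: separate branches for classification and regression."""
    def __init__(self, in_dim=256, num_classes=4):
        super().__init__()
        self.cls_branch = nn.Sequential(
            nn.Conv1d(in_dim, in_dim, 3, padding=1), nn.ReLU(),
            nn.Conv1d(in_dim, num_classes, 1))
        self.reg_branch = nn.Sequential(
            nn.Conv1d(in_dim, in_dim, 3, padding=1), nn.ReLU(),
            nn.Conv1d(in_dim, 2, 1))

    def forward(self, x):
        return self.cls_branch(x), self.reg_branch(x)
```

Decoupling the two branches lets the classification and regression outputs shape their own features, which is consistent with the higher average mAP of the non-shared variant in Table 4.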
Table 5. Detection results of the highest-confidence detected actions for the two networks.
Test Set Type   Network Design   Avg. tIoU (%)   Precision (%)   Recall (%)
All             Shared           79.37           93.41           97.31
All             Non-shared       81.17           96.72           99.66
Standard        Shared           85.89           97.58           98.55
Standard        Non-shared       81.96           96.19           99.36
Abnormal        Shared           75.10           87.15           93.27
Abnormal        Non-shared       80.28           97.32           99.82
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
