Article

Safety Behavior Recognition for Substation Operations Based on a Dual-Path Spatiotemporal Network

1 College of Electrical Engineering and New Energy, China Three Gorges University, Yichang 443002, China
2 College of Electrical Engineering, Xi’an University of Technology, Xi’an 710048, China
3 Electric Power Research Institute of State Grid Shanxi Electric Power Co., Ltd., Taiyuan 030001, China
* Author to whom correspondence should be addressed.
Processes 2026, 14(1), 133; https://doi.org/10.3390/pr14010133
Submission received: 25 November 2025 / Revised: 25 December 2025 / Accepted: 27 December 2025 / Published: 30 December 2025

Abstract

The integration of large-scale renewable energy sources has increased the complexity of operation and maintenance in modern power systems, causing on-site substation operation and maintenance activities to exhibit stronger continuity and dynamics, and thereby placing higher demands on real-time operational perception and safety judgment. However, existing behavior recognition methods have difficulty accurately identifying operational states in complex scenarios involving continuous actions, partial occlusions, and fine-grained manipulations. To address these challenges, this paper proposes a safety behavior recognition method for substation operations based on a dual-path spatiotemporal network. Personnel localization is achieved using YOLOv8, while behavior classification is performed through the SlowFast framework. In the Slow pathway, an ECA attention mechanism is integrated with residual structures to enhance the representation of sustained operational postures. In the Fast pathway, a multi-path excitation residual network is introduced to fuse temporal, channel, and motion information, improving the multi-scale representation of local action variations. Furthermore, to mitigate the issue of class imbalance in substation operation data, Focal Loss based on binary cross-entropy is incorporated to adaptively down-weight easily classified samples. Experimental results demonstrate that the proposed method achieves a recognition accuracy of 87.77% and an F1-score of 85.56% across multiple operation scenarios. The results further indicate improved recognition stability and adaptability, supporting safe substation operation and maintenance in renewable energy-integrated power systems.

1. Introduction

With the rapid integration of large-scale renewable energy, the complexity of operation and dispatch in modern power systems has increased significantly. As a result, substation operation and maintenance activities exhibit more frequent task switching and enhanced process continuity, placing higher demands on on-site operational safety as well as the standardization and stability of operational behaviors. Traditional manual inspection and monitoring approaches have difficulty achieving real-time and accurate identification of operator behaviors in complex operational environments, thereby constraining further improvements in risk prevention and control capabilities [1,2]. In recent years, incidents such as the 2023 Yuanjiang “11.15” accident, the 2024 Nanjing “7–27” accident, and the 2024 Lanzhou “5.14” and “9.12” accidents have resulted in severe casualties and substantial negative societal impacts [3,4]. Therefore, the development of efficient behavior recognition methods tailored to substation operational scenarios is essential for enhancing on-site safety awareness and risk prevention capabilities.
Current research on behavior recognition algorithms mainly falls into three categories: pose-estimation-based methods, object-detection-based methods, and video action recognition methods [5,6,7,8]. For example, Liu et al. [9] proposed a lightweight pose estimation method combined with the YOLO framework and employed a spatiotemporal graph convolutional network (ST-GCN) to classify skeleton sequences, achieving behavior recognition for live working activities in distribution networks. Wang et al. [10] constructed a skeleton-sequence-driven violation detection model based on a graph convolutional network, achieving detection of irregular operator actions. However, pose-estimation-based methods usually rely on complete visibility and accurate detection of key points. When operators are partially occluded or undergo large pose variations, skeleton key points are prone to deviation or loss, which significantly limits the overall recognition performance. Li et al. [11] constructed a safety generative pre-trained Transformer model (SafetyGPT), which encodes visual and textual features and uses a cross-attention mechanism to achieve multimodal feature alignment and fusion, thereby completing automatic recognition and response generation for unsafe behaviors. Chandramouli et al. [12] fused CNN and BiLSTM to construct a downsampled action recognition model for early warning of abnormal behaviors, but it remains limited under occlusion, motion blur, and small-sample conditions. In contrast, video action recognition methods can extract complementary spatial and temporal features through two-stream structures and exhibit stronger capability in representing dynamic behaviors. For example, Wei et al. [13] proposed a dual-attention feature fusion module, CMDA, which enhances spatiotemporal information interaction between the two pathways of the SlowFast network and improves video action recognition performance. Chen et al. [14] proposed the action granularity pyramid network (AGPN), which builds a hierarchical action granularity pyramid module and fuses multi-granularity spatiotemporal features to improve feature representation in complex spatiotemporal scenarios. However, the recognition objectives of the aforementioned improved methods remain primarily focused on clip-level behavior classification, and their performance gains are mainly attributed to enhanced inter-path feature interactions. These approaches do not explicitly address the recognition stability challenges arising from behavior state transitions in continuous operation scenarios.
In typical substation environments, Ma et al. [15] addressed the complex multi-factor interactions involved in distribution network operations by proposing a ladder-climbing behavior recognition method based on semantic feature analysis. By extracting and analyzing semantic features within the operational scene, their approach enabled accurate identification of interactions between personnel and ladders. Feng et al. [16] employed pose sequence quantization features in combination with spatiotemporal graph convolutional networks (ST-GCNs) for spatial feature extraction and further incorporated object detection tools to identify equipment usage states, thereby achieving precise recognition of operational behaviors. Zhang et al. [17] proposed a vision-based method for recognizing high-altitude work behaviors that integrates human pose estimation with a one-dimensional convolutional neural network (1D CNN). The method extracts action features through a human key points encoding structure, enabling automatic recognition of high-altitude operations even under partial occlusion. As shown in Table 1, although existing methods can achieve relatively high recognition accuracy in specific scenarios, they are mostly designed for single task types. Such methods inadequately model the sequential structure of continuous posture variations and fine-grained local actions that are prevalent in substation operations, and their adaptability remains limited under complex conditions, such as partial personnel occlusion and multi-person collaboration.
In summary, this study investigates the temporal characteristics of continuous operational behaviors involving the alternating coupling of long-term posture states and short-term local actions and extends the recognition objective to the temporal perception and cooperative discrimination of continuous operation behavior transitions. Accordingly, taking power maintenance scenarios as the application context, this study proposes a dual-path spatiotemporal network-based method for recognizing safety behaviors in substation operations. The proposed method aims to simultaneously capture long-duration continuous actions and critical fine-grained local variations of operators, thereby meeting the dual requirements of high accuracy and high efficiency in practical applications. The main contributions of this work are as follows:
  • To model the temporal structure formed by continuous posture variations and the alternation of fine-grained local actions, this paper proposes a behavior recognition method based on a dual-path spatiotemporal network. By combining static target localization from YOLOv8 with dynamic behavior recognition from SlowFast, the method achieves accurate detection and continuous action analysis of operator behaviors in substation operation scenarios for both rapid movements and long-duration sequences, thereby improving multi-scenario adaptability.
  • To address the problems of partial occlusion of operators and limited computational efficiency, an attention mechanism is integrated with the original residual block in the Slow Pathway to form a lightweight ECA-Res module. This design enables efficient channel weight allocation, thereby strengthening the temporal representation capability of long-duration operational postures, while effectively mitigating occlusion-induced interference in dynamic scenes and reducing overall network complexity.
  • Considering that operator behaviors exhibit continuity and similarity, a multi-path excitation residual network (SCM-Res) is introduced in the Fast Pathway. By performing weighted fusion of spatial–temporal, channel, and motion information, this module effectively captures fine-grained temporal variations of rapid behaviors between adjacent frames and enhances multi-scale feature fusion capability for fast motions.
  • To address the class imbalance problem in substation operation datasets, a binary cross-entropy-based Focal Loss function is designed to effectively reduce the loss weight of easily classified samples and improve the attention and recognition ability for infrequent classes.

2. Basic Theory of Behavior Recognition

2.1. YOLOv8 Network

During behavior detection, target detection results directly influence the extraction of spatiotemporal behavioral features, thereby affecting subsequent operator behavior classification. In addition, the detection speed of the target detection algorithm affects the efficiency of overall video behavior detection. SlowFast, a 3D convolutional neural network model for video behavior recognition and detection tasks, usually adopts Faster R-CNN as the object detector [18,19]. However, Faster R-CNN has limitations in capturing rich spatiotemporal information from video data and requires more fine-grained features to further improve classification accuracy. Compared with Faster R-CNN, YOLOv8 is superior in terms of detection speed, accuracy, and target tracking [20,21]. Therefore, YOLOv8 is selected as the object detector in this study and integrated with the improved SlowFast model to provide accurate and efficient target localization for operation behavior detection.
Figure 1 illustrates the network architecture of YOLOv8. The backbone of YOLOv8 consists of Conv, C2f, and SPPF modules. The C2f module mainly consists of two convolutional layers and multiple bottleneck layers, and it adjusts the number of channels for models of different scales, allowing YOLOv8 to maintain a lightweight structure while capturing richer gradient-flow information. The SPPF module uses a serial structure of MaxPool2d layers, which reduces the number of parameters while preserving the ability to aggregate multi-scale features [22]. The head adopts a decoupled structure that separates the classification and localization branches and replaces the traditional anchor-based mechanism with an anchor-free strategy that directly predicts object locations and sizes. This design enables YOLOv8 to capture information from targets at various scales, thereby improving the accuracy of detecting power operation personnel within bounding boxes.
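As a minimal illustration of this detection stage, the following sketch uses the ultralytics YOLOv8 Python API to extract person bounding boxes from a single frame. The checkpoint name and COCO person-class filtering are illustrative assumptions; the detector used in this work is fine-tuned on substation imagery rather than the generic pretrained model shown here.

```python
from ultralytics import YOLO

detector = YOLO("yolov8n.pt")  # illustrative weights; swap in the fine-tuned detector

def detect_operators(frame):
    """Return [x1, y1, x2, y2] boxes for persons detected in one video frame."""
    result = detector(frame, verbose=False)[0]
    boxes = []
    for box in result.boxes:
        if int(box.cls) == 0:  # class 0 = "person" for COCO-pretrained YOLOv8
            boxes.append(box.xyxy[0].tolist())
    return boxes
```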

2.2. SlowFast Network

In most cases, human actions are completed within a short duration, whereas the surrounding environment remains largely stable or undergoes only slow variations during the action period. Based on this observation, Feichtenhofer et al. proposed the two-stream 3D convolutional network SlowFast, which models slow spatial semantic information (such as color and shape) and fast temporal dynamic information separately to mimic the human visual system’s selective attention to motion cues of different speeds during action recognition [23].
The structure of this network is shown in Figure 2. The Slow Pathway operates at low temporal and high spatial resolution, mainly capturing spatial semantic information in the video. The Fast Pathway operates at high temporal and low spatial resolution, mainly capturing the dynamic information of rapidly changing behaviors, and adopts a smaller convolutional width to lower the channel capacity and reduce computation. The SlowFast network processes τ × T frames of video at a time (τ denotes the temporal stride, and T denotes the number of sampled frames). In the Slow Pathway, frames are sampled with a stride of τ (usually 16), feeding T frames into the network. The Fast Pathway samples αT frames with a stride of τ/α (α is usually 8). Meanwhile, to reduce computation, the Fast Pathway adopts a smaller convolution width, usually with a frame rate ratio of 1:α (where α > 1) and a channel number ratio of 1:β (where β > 1). The network fuses the features of the two pathways through multiple lateral connections, and finally the fused output is globally average-pooled and fed into a fully connected classifier.
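The sampling scheme can be made concrete with a short sketch. Assuming a raw clip of τ × T decoded frames stored as an array, the Slow pathway takes every τ-th frame and the Fast pathway takes frames at stride τ/α; the function below is an illustrative reading of this description, not the authors' implementation.

```python
import numpy as np

def sample_dual_pathway(clip, tau=16, T=4, alpha=8):
    """clip: array of shape (tau * T, H, W, 3); returns (slow, fast) frame stacks."""
    slow_idx = np.arange(0, tau * T, tau)           # T frames at stride tau
    fast_idx = np.arange(0, tau * T, tau // alpha)  # alpha * T frames at stride tau/alpha
    return clip[slow_idx], clip[fast_idx]
```

With τ = 16, T = 4, and α = 8, the Slow pathway receives 4 frames and the Fast pathway receives 32 frames from the same 64-frame clip.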
For the action detection of on-site personnel, the extraction of global spatial information by the SlowFast network helps determine the interaction between workers and surrounding equipment [24]. The SlowFast network structure parameters are listed in Table 2, with ResNet50 as the backbone network. Here, T is the number of video frames along the temporal dimension, S is the spatial size of the video frames, and Res2–Res5 denote the residual modules.

3. Methods

3.1. Method Overview

To address the need for identifying safe work behaviors in substation field operations, this paper proposes a video-aware dual-path spatiotemporal network. Using video frames captured by surveillance cameras as input, the proposed method consists of three stages: worker region extraction, temporal feature learning, and behavior classification. In complex substation operation scenarios, operation behaviors exhibit diversity in temporal scale and action patterns. Operation processes typically involve long-duration postural states as well as short-duration, localized operational behaviors. To accommodate these characteristics, this study enhances the SlowFast network through three key modifications, with the overall network architecture illustrated in Figure 3.
Specifically, when a video is input into the model, it is decoded at 30 frames per second, and frames are extracted at two rates: 1 frame per second and 30 frames per second. At the data layer, frame sequences with different temporal resolutions are generated using two distinct temporal sampling intervals and are routed to the Slow Pathway and the Fast Pathway, respectively. The Slow Pathway adopts the lower sampling rate of 1 frame per second (FPS), corresponding to a temporal sampling interval of τ = 16; under a 30 FPS video setting, one frame is sampled every 16 frames, i.e., a temporal sampling interval of approximately 0.5 s. This pathway is used to model operational postural states that require sustained observation over longer time windows for reliable recognition, such as standing and pole-climbing operations. In contrast, the Fast Pathway processes video frames at the full frame rate of 30 FPS, enabling the capture of rapid and localized motion details that occur within short-duration intervals, such as squatting movements. The frames then enter the 3D residual network, where the Slow Pathway and Fast Pathway use ECA-Res and SCM-Res modules, respectively, to extract motion and spatial information from the operator’s actions. Meanwhile, the frame at position (τ × T)/2 is selected as the key frame within each time window and is input into the YOLOv8 model to detect the operator’s location. The results are fed into the RoIAlign module, which projects the bounding boxes onto the 3D feature map to obtain the corresponding feature matrices; by expanding each 2D RoI along the time dimension, a 3D RoI is formed. Finally, after pooling and fully connected layers, the feature maps of uniform dimensions are classified using a Sigmoid classifier to recognize the behavior of on-site personnel.
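The RoI expansion step can be sketched as follows. Assuming a fused spatiotemporal feature map of shape (N, C, T, H, W) and keyframe person boxes already scaled to feature-map coordinates, the same 2D boxes are applied to every temporal slice via torchvision’s roi_align, forming the 3D RoI described above; this is an illustrative sketch, not the exact pipeline code.

```python
import torch
from torchvision.ops import roi_align

def extract_3d_roi(feat, boxes, output_size=7):
    """feat: (N, C, T, H, W); boxes: (K, 5) rows of [batch_idx, x1, y1, x2, y2]
    in feature-map coordinates. Returns (K, C, T, output_size, output_size)."""
    pooled = []
    for t in range(feat.shape[2]):  # replicate the keyframe boxes along time
        pooled.append(roi_align(feat[:, :, t], boxes, output_size, spatial_scale=1.0))
    return torch.stack(pooled, dim=2)
```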

3.2. Improvements to the SlowFast Network Architecture

3.2.1. Multi-Path Excitation Residual Network

The Fast Pathway captures operators’ rapid and fine-grained behaviors at a higher frame rate, and its feature variations are more dependent on short-term correlations within local spatial regions. However, in substation maintenance scenarios, different task categories exhibit certain similarities in local motion patterns. Under these conditions, directly adopting a standard 3D residual structure in the Fast Pathway tends to over-smooth rapidly changing local behaviors, thereby reducing its ability to distinguish fine-grained behavioral differences. Therefore, this study utilizes three complementary attention modules from ACTION—namely spatial-temporal excitation (STE), channel excitation (CE), and motion excitation residual (ME) [25]—combined with the original residual block to form the multi-path excitation residual network (SCM-Res). By strengthening the representation of spatial context, this design enables the network to better focus on localized differences in fast behaviors, thereby improving the accuracy of recognition. The structure of SCM-Res is illustrated in Figure 4. In this study, the SCM-Res block consists of three convolutions and a residual edge, with the ACTION module added before each convolution and residual connection. This allows the network to capture multi-dimensional spatiotemporal patterns, channel information, and motion data, which are then convolved to obtain finer-grained features.
The structure of the STE module is shown in Figure 5a. The STE module effectively captures the spatiotemporal dynamic features of power operation personnel behaviors, enhancing the spatial and temporal relationships in the video and improving the model’s ability to recognize behaviors at different scales. It primarily generates a spatiotemporal mask M to produce a spatiotemporal attention map, which is then used to extract spatiotemporal features from the video.
Traditional spatiotemporal feature extraction typically uses 3D convolutions. However, directly applying 3D convolutions significantly increases the computational load of the model. Therefore, for the input features $X \in \mathbb{R}^{N \times T \times C \times H \times W}$, the STE module first performs global pooling along the channel dimension to obtain a compact channel-level spatiotemporal representation. Here, N, T, C, H, and W denote the batch size, temporal length, number of channels, height, and width, respectively. The features are then rearranged to ensure compatibility with 3D convolution operations. The processed features are subsequently fed into a 3D convolutional layer with a kernel size of 3 × 3 × 3 to capture local spatiotemporal correlations, thereby generating an enhanced spatiotemporal feature map. The mathematical expression is given by Formula (1).
$$F_{o1}^{*} = K \ast F_1^{*} \tag{1}$$
Here, $K$ denotes the three-dimensional convolution kernel and $F_1^{*}$ represents the intermediate spatiotemporal feature representation obtained from the input features after global pooling along the channel dimension and subsequent dimension rearrangement.
The resulting spatiotemporal feature map is activated using a Sigmoid function to generate a spatiotemporal attention mask, which is then fused with the original input features through element-wise weighting. In addition, a residual connection is introduced to preserve the original information, yielding the final output Y1 of the STE module, as defined in Equation (2).
$$Y_1 = X + X \odot M_1 \tag{2}$$
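A minimal PyTorch sketch of the STE branch, following Equations (1) and (2) and the shape conventions above (channel-wise pooling, a 3 × 3 × 3 convolution, and a Sigmoid mask with a residual connection); layer names and hyperparameters are illustrative.

```python
import torch
import torch.nn as nn

class STE(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv3d = nn.Conv3d(1, 1, kernel_size=3, padding=1)  # kernel K in Eq. (1)

    def forward(self, x):                  # x: (N, T, C, H, W)
        f = x.mean(dim=2, keepdim=True)    # global pooling over channels -> (N, T, 1, H, W)
        f = f.permute(0, 2, 1, 3, 4)       # rearrange to (N, 1, T, H, W) for Conv3d
        m = torch.sigmoid(self.conv3d(f))  # spatiotemporal mask M1
        m = m.permute(0, 2, 1, 3, 4)       # back to (N, T, 1, H, W)
        return x + x * m                   # Y1 = X + X (.) M1, Eq. (2)
```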
The CE attention module is used to extract appropriate channel-level features to emphasize information from different channels within the network. The CE module is similar to the squeeze-and-excitation (SE) attention mechanism [26]. The key difference lies in the incorporation of temporal information inherent in video actions. Specifically, a one-dimensional convolution layer is introduced between two fully connected layers to capture temporal variations in channel features, thereby enhancing the temporal interdependence among behavior-related feature representations. The structure of the CE module is shown in Figure 5b.
For a given input $X \in \mathbb{R}^{N \times T \times C \times H \times W}$, a global spatial information tensor $F_2 \in \mathbb{R}^{N \times T \times C \times 1 \times 1}$ is obtained through average pooling, defined as follows:
$$F_2 = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} X_{[:,:,:,i,j]} \tag{3}$$
Subsequently, a lightweight convolution is applied to the channel features F2 for feature mapping and compression, where a channel reduction ratio r is used to reduce the feature dimensionality and obtain a new channel representation Fr. The representation Fr is then reshaped to incorporate temporal information, and a one-dimensional convolution is applied to generate an intermediate feature representation Ftemp.
Finally, the intermediate feature representation Ftemp is projected using a two-dimensional convolution kernel K1 of size 1 for feature expansion, followed by a Sigmoid activation to generate an action mask for the operator. The resulting mask is then multiplied element-wise with the original features, and the outputs are fused via a residual connection to produce the final output Y2 of the CE module.
$$Y_2 = X + X \odot \delta(K_1 \ast F_{temp}) \tag{4}$$
Here, $K_1$ denotes a two-dimensional convolution kernel of size 1 × 1, $\delta(\cdot)$ denotes the Sigmoid activation function, and $\odot$ denotes element-wise multiplication.
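The CE branch can be sketched in the same style, following Equations (3) and (4): spatial average pooling, channel reduction by ratio r, a 1D temporal convolution, channel expansion with K1, and a Sigmoid mask fused residually. The reduction ratio r = 16 is an illustrative choice, not a value fixed by this section.

```python
import torch
import torch.nn as nn

class CE(nn.Module):
    def __init__(self, channels, r=16):
        super().__init__()
        self.reduce = nn.Conv2d(channels, channels // r, kernel_size=1)
        self.temporal = nn.Conv1d(channels // r, channels // r, kernel_size=3, padding=1)
        self.expand = nn.Conv2d(channels // r, channels, kernel_size=1)  # K1 in Eq. (4)

    def forward(self, x):                            # x: (N, T, C, H, W)
        n, t, c, h, w = x.shape
        f = x.mean(dim=(3, 4))                       # F2: (N, T, C), Eq. (3)
        f = self.reduce(f.reshape(n * t, c, 1, 1))   # Fr: (N*T, C/r, 1, 1)
        f = f.reshape(n, t, -1).transpose(1, 2)      # (N, C/r, T) for the temporal conv
        f = self.temporal(f).transpose(1, 2)         # Ftemp: (N, T, C/r)
        f = self.expand(f.reshape(n * t, -1, 1, 1))  # expand back to C channels
        m = torch.sigmoid(f.reshape(n, t, c, 1, 1))  # channel mask
        return x + x * m                             # Y2 = X + X (.) delta(K1 * Ftemp)
```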
The ME attention module primarily focuses on extracting motion-related information between adjacent frames in order to better capture the motion features of operators’ behaviors during power operations. Its structure is shown in Figure 5c. For a given input, the dimensionality is first reduced using a 1 × 1 convolutional kernel $K_2$, and the motion features between adjacent frames $F_m$ are then calculated using Equation (5):
$$F_m = K_2 \ast F_r[:, t+1, :, :, :] - F_r[:, t, :, :, :] \tag{5}$$
In the formula, $F_m$ represents the motion feature map, and $F_r[:, t, :, :, :]$ and $F_r[:, t+1, :, :, :]$ represent the features of the current frame and the next frame, respectively.
Subsequently, the differences between each pair of adjacent frames are aggregated along the temporal dimension to obtain FM. The resulting features are then processed through spatial average pooling and lightweight convolutional mapping to generate an intermediate response Fo3 for motion attention. A Sigmoid activation is applied to Fo3 to produce a motion mask, which is used to reweight the input features. Finally, a residual connection is employed to yield the output Y3, as shown in Equation (6).
$$Y_3 = X + X \odot \delta(F_{o3}) \tag{6}$$
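A corresponding sketch of the ME branch follows Equations (5) and (6). For simplicity, the K2 channel reduction is applied once to all frames before differencing, and the final time step is zero-padded so the mask keeps T steps; these are illustrative simplifications of the module described above.

```python
import torch
import torch.nn as nn

class ME(nn.Module):
    def __init__(self, channels, r=16):
        super().__init__()
        self.reduce = nn.Conv2d(channels, channels // r, kernel_size=1)  # K2 in Eq. (5)
        self.expand = nn.Conv2d(channels // r, channels, kernel_size=1)

    def forward(self, x):                            # x: (N, T, C, H, W)
        n, t, c, h, w = x.shape
        fr = self.reduce(x.reshape(n * t, c, h, w)).reshape(n, t, -1, h, w)
        fm = fr[:, 1:] - fr[:, :-1]                  # Eq. (5): adjacent-frame differences
        fm = torch.cat([fm, torch.zeros_like(fm[:, :1])], dim=1)  # pad back to T steps
        fm = fm.mean(dim=(3, 4), keepdim=True)       # spatial average pooling
        fo3 = self.expand(fm.reshape(n * t, -1, 1, 1)).reshape(n, t, c, 1, 1)
        return x + x * torch.sigmoid(fo3)            # Y3 = X + X (.) delta(Fo3), Eq. (6)
```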
The feature information of operation behaviors is processed through the STE, CE, and ME attention mechanisms. The three generated excitation features are then element-wise summed to obtain the final output Y of the multi-path excitation residual network. The calculation process is shown in Formula (7):
$$Y = Y_1 + Y_2 + Y_3 \tag{7}$$

3.2.2. ECA-Res

In maintenance scenarios, the working environment is often complex and variable, with potential occlusions caused by surrounding objects or background clutter, which can hinder the accurate capture of changes in the operator’s movements. Such problems may affect the accuracy of behavior detection, and therefore effective mechanisms need to be employed to enhance the ability to capture critical channel information in response to the interference of environmental factors. In this context, the ECANet attention mechanism was proposed based on the SENet attention mechanism [27]. ECANet adopts a cross-channel interaction strategy without dimensionality reduction, enabling more effective information exchange among channels. Compared with traditional attention mechanisms, it introduces only a small number of parameters while maintaining computational efficiency, thereby simplifying the model structure and reducing overall complexity.
Because the Slow Pathway operates at a lower frame rate to capture pose variations over extended time intervals, its global channel features play a more critical role. Accordingly, integrating the ECA module into the Slow Pathway enhances the effectiveness of channel-wise information interaction, enabling more accurate modeling of fine-grained spatial and channel characteristics—particularly in complex operational environments—while mitigating interference from background noise. Its network structure is shown in Figure 6.
The network structure diagram of the ECA attention module is shown in Figure 7. The model is able to adapt itself to the size of the convolution kernel during training. The specific approach is as follows:
  • The feature map is first compressed through convolution to obtain a new feature map χ, which is then processed by global average pooling to produce a 1 × 1 × C vector.
  • Calculate the size of the adaptive one-dimensional convolution kernel as shown in Equation (8).
    $$k = \varphi(C) = \left| \frac{\log_2 C}{\gamma} + \frac{b}{\gamma} \right|_{odd} \tag{8}$$
    where γ and b are hyperparameters, with γ = 2 and b = 1; k represents the kernel size, and C denotes the channel size. The parameters γ and b govern the scaling relationship between the convolution kernel size and the number of channels. This design maintains a small and stable odd-sized convolution kernel across different channel scales, thereby achieving a good balance between channel-wise coverage and computational efficiency.
  • Apply a one-dimensional convolution with the adaptive kernel size to obtain the weight of each channel; layers with a larger number of channels thus exhibit more interaction among neighboring channels.
  • The generated weights σ are multiplied channel-by-channel with the original feature map χ to obtain the final channel-weighted feature map χ ˜ . This process enhances the response of important channels while suppressing background noise or unimportant information.
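The four steps above can be condensed into a short PyTorch sketch of the ECA module for 2D feature maps of shape (N, C, H, W), with the adaptive odd kernel size computed from Equation (8) using γ = 2 and b = 1; this is an illustrative rendering rather than the exact implementation used in the ECA-Res block.

```python
import math
import torch
import torch.nn as nn

class ECA(nn.Module):
    def __init__(self, channels, gamma=2, b=1):
        super().__init__()
        k = int(abs(math.log2(channels) / gamma + b / gamma))
        k = k if k % 2 == 1 else k + 1  # keep the kernel size odd, per Eq. (8)
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x):                         # x: (N, C, H, W)
        y = x.mean(dim=(2, 3))                    # global average pool -> (N, C)
        y = self.conv(y.unsqueeze(1)).squeeze(1)  # cross-channel 1D convolution
        w = torch.sigmoid(y).unsqueeze(-1).unsqueeze(-1)
        return x * w                              # channel-weighted feature map
```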

3.2.3. Focal Loss Function

The dataset contains many more electricity-checking and standing behaviors than behaviors such as ladder climbing and squatting, which leads to an obvious class imbalance problem. As a result, the original cross-entropy (CE) loss function cannot guarantee recognition accuracy for operator behaviors with few samples [28,29]. To mitigate this problem, this paper replaces the original loss function with Focal Loss, which adds a modulating factor to the standard cross-entropy criterion to reduce the relative loss of well-classified samples and improve classification performance for minority classes, thereby helping the model handle class imbalance [30,31] and obtain accurate training results quickly. Focal Loss is formulated as follows.
$$FL(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t) \tag{9}$$
where $p_t$ is the model’s predicted probability score, $\alpha_t \in [0, 1]$ is the weighting factor, $(1 - p_t)^{\gamma}$ is the modulating factor, and $\gamma$ is the focusing parameter. When $p_t$ approaches 1, the sample is easy to classify and the modulating factor approaches 0, indicating a smaller contribution to the loss; that is, the loss proportion of easily classified samples is reduced. When $p_t$ approaches 0, the sample is misclassified, the modulating factor approaches 1, and the loss proportion of hard-to-classify samples is increased.
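A minimal sketch of the binary cross-entropy-based Focal Loss of Equation (9), written for multi-label Sigmoid outputs as used by the classifier in Section 3.1; the defaults α = 0.25 and γ = 2 follow the original Focal Loss paper [28] and are illustrative here.

```python
import torch

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """logits, targets: tensors of the same shape; targets in {0, 1}."""
    p = torch.sigmoid(logits)
    p_t = torch.where(targets == 1, p, 1 - p)  # predicted prob. of the true label
    alpha_t = torch.where(targets == 1, torch.full_like(p, alpha),
                          torch.full_like(p, 1 - alpha))
    loss = -alpha_t * (1 - p_t).pow(gamma) * torch.log(p_t.clamp(min=1e-8))
    return loss.mean()
```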
However, Focal Loss only addresses the class imbalance problem and does not explicitly capture behavior states across consecutive video frames. Therefore, this paper introduces a Temporal Smoothing Loss (TSL) based on Focal Loss, aiming to enhance the model’s ability to maintain continuity in behavior prediction by penalizing discrepancies between predictions of adjacent frames. The Temporal Smoothing Loss is defined as follows:
$$L_{smooth} = \frac{1}{T-1} \sum_{t=1}^{T-1} \left\| f_t - f_{t+1} \right\|_1 \tag{10}$$
where $T$ is the total number of frames in a video segment, $\|\cdot\|_1$ denotes the L1 norm, and $f_t$ and $f_{t+1}$ represent the output feature vectors of adjacent frames.
The total loss function is composed of the weighted sum of Focal Loss and Temporal Smoothing Loss:
$$L_{Loss} = FL(p_t) + \lambda L_{smooth} \tag{11}$$
where λ is the weighting coefficient for the smoothing term.
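Combining the two terms, Equation (11) can be sketched as follows, reusing the focal_loss function above; feats holds the per-frame feature vectors $f_t$ as a (T, D) tensor, and the λ value shown is an illustrative placeholder since this section does not fix it.

```python
import torch

def temporal_smoothing_loss(feats):
    """feats: (T, D); Eq. (10), mean L1 distance between adjacent-frame features."""
    return (feats[1:] - feats[:-1]).abs().sum(dim=1).mean()

def total_loss(logits, targets, feats, lam=0.1):  # lam: illustrative value of lambda
    return focal_loss(logits, targets) + lam * temporal_smoothing_loss(feats)
```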

4. Experimental Design and Result Analysis

4.1. Dataset Construction

In this paper, a dataset was constructed based on videos collected from power maintenance scenarios involving electric power operators at a substation in a specific city. The dataset contains 1520 video clips with a total of 10,139 behavior annotations, covering six typical operation behaviors. It includes both indoor and outdoor scenes, different lighting conditions, as well as single-operator and multi-operator cooperative scenarios. The dataset composition is shown in Table 3. Regarding the class distribution, stand and check electricity account for 33.4% and 25.3% of the dataset, respectively, indicating relatively sufficient samples. In contrast, ladder climb and squat represent only 5.3% and 6.3%, respectively, and thus have comparatively fewer samples.
The data preprocessing pipeline is as follows: (a) segment the video at 30 frames per second; (b) extract key frames from the video with a step size of 1 s and input them into the YOLOv8 detector to automatically annotate the positions of power operators in each image; (c) manually refine the detected bounding boxes of power workers using the VIA annotation tool, and then assign behavior category labels to generate AVA-format annotation files [32]. Behavior annotations in this dataset were performed by a single annotator. To reduce potential subjectivity associated with individual annotation, label quality was constrained through standardized behavior definitions, a unified annotation workflow, and a combination of automated pre-detection and manual refinement, with the corresponding behavior discrimination criteria summarized in Table 3. In addition, under the same annotation guidelines, a second annotator independently conducted frame-level behavior annotation. Inter-annotator agreement was evaluated using Cohen’s Kappa coefficient, yielding an agreement rate of 84.95% with κ = 0.72.
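As an illustration of the agreement check, Cohen’s Kappa can be computed from the two annotators’ frame-level labels with scikit-learn; the label lists below are toy data, not the actual annotations.

```python
from sklearn.metrics import cohen_kappa_score

labels_a = ["stand", "squat", "stand", "check electricity"]  # annotator 1 (toy data)
labels_b = ["stand", "squat", "stand", "ladder climb"]       # annotator 2 (toy data)

agreement = sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)
kappa = cohen_kappa_score(labels_a, labels_b)  # chance-corrected agreement
print(f"raw agreement = {agreement:.2%}, kappa = {kappa:.2f}")
```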
Regarding data partitioning, the dataset is randomly split at the video level into training and testing sets, with 80% of the videos used for model training and the remaining 20% for evaluation. Due to privacy and security management constraints, the dataset is not publicly available at this time; however, this paper provides detailed descriptions of the data construction process and behavior definitions to support research reproducibility.

4.2. Experimental Environment

This study primarily focuses on algorithm design and performance evaluation, and all experiments were conducted in a PC computing environment. The hardware platform was built on a general-purpose computing system to support model training and inference on video data for the proposed method. The experimental platform ran a 64-bit Windows 11 operating system and was equipped with an NVIDIA GeForce RTX 4060 Ti GPU with 8 GB of memory. Regarding the software environment, Python 3.8 was used together with the PyTorch 1.10.0 deep learning framework, with GPU acceleration provided by CUDA 11.3 and cuDNN 8.2.0. The model was trained for a total of 300 epochs with a batch size of 8, using the SGD optimizer. Based on the variation in the loss curve during the initial training phase and balancing convergence speed against overfitting prevention, the initial learning rate and weight decay coefficient were set to 0.1125 and 0.00001, respectively. For the learning rate schedule, a cosine annealing scheme was applied to achieve smooth decay, thereby maintaining stable convergence in the later stages of training.
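The optimizer and schedule settings above correspond to the following PyTorch configuration sketch; the model object is a placeholder for the dual-path network, and the training-step body is omitted.

```python
import torch

model = torch.nn.Linear(10, 6)  # placeholder for the dual-path spatiotemporal network
optimizer = torch.optim.SGD(model.parameters(), lr=0.1125, weight_decay=1e-5)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=300)

for epoch in range(300):
    # ... one training epoch over the behavior dataset (batch size 8) ...
    scheduler.step()  # smooth cosine decay of the learning rate
```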

4.3. Experimental Results and Analysis

4.3.1. Ablation Studies

To evaluate the contributions of the SCM-Res and ECA-Res modules, ablation experiments are conducted. The detection results for the four behavior categories are shown in Table 4, where √ indicates that the module was used, and × indicates that the module was not used.
From Table 4, it can be seen that the original model (i.e., Model 1) achieves an average recognition accuracy of 80.86% for the behaviors “check electricity,” “squat,” “ladder climb,” and “live working,” showing only moderate performance. In Model 2, introducing the ECA-Res module leads to a 1.43% improvement in mAP@0.5, with notable gains in the “check electricity” and “pole climb” categories, indicating enhanced channel-wise feature discrimination. In Model 3, the SCM-Res module is introduced into the Fast Pathway, which improves the mAP@0.5 by 4.88% compared to the original model; in particular, the average accuracy for “squat” increases by 5.22%, demonstrating its effectiveness in capturing spatiotemporal dependencies. Model 4 combines both the SCM-Res and ECA-Res modules, leveraging the enhanced capabilities of both pathways to achieve superior overall performance. Compared to the original model, the mAP@0.5 increases by 6.91%, with the average precision for each behavior category exceeding that of the previous models. The synergistic and complementary effects of these two attention mechanisms lead to cumulative performance improvements. From the comparison of model parameters (Params) and GFLOPs, it can be observed that introducing the SCM-Res and ECA-Res modules increases both the parameter count and the computational cost by less than 1%, indicating that the proposed model achieves effective performance improvement while keeping model size and computational complexity under control.
The overall mean average precision variation curves are shown in Figure 8, where the horizontal axis represents the number of iterations and the vertical axis indicates the overall mAP value. Each curve corresponds to the performance of the respective model. From the trend of the curves, SlowFast-SCM-ECA converges faster than both SlowFast-SCM and SlowFast-ECA, and achieves slightly better performance in mAP@0.5 compared to other models. This indicates that the joint application of SCM and ECA modules provides complementary enhancement of spatial and channel features, demonstrating stronger adaptability and discriminative power in complex action recognition tasks. Consequently, the model achieves notable improvements in both feature extraction and recognition capabilities.

4.3.2. Stability and Robustness Analysis

To quantitatively evaluate the recognition stability of the proposed model across different operational behaviors, the bootstrap resampling method was employed to analyze the recognition results for each behavior category, as summarized in Table 5. The results indicate that the standard deviations of recognition performance for all behavior categories are constrained within a narrow range of 1–2%, and the corresponding 95% confidence intervals are relatively concentrated, without exhibiting pronounced fluctuations.
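A minimal sketch of the bootstrap procedure behind Table 5, assuming a 0/1 array of per-sample recognition outcomes for one behavior category; the number of resamples (1000) is an illustrative choice.

```python
import numpy as np

def bootstrap_ci(correct, n_boot=1000, seed=0):
    """correct: 0/1 array of per-sample recognition outcomes for one category."""
    rng = np.random.default_rng(seed)
    n = len(correct)
    accs = [rng.choice(correct, size=n, replace=True).mean() for _ in range(n_boot)]
    lo, hi = np.percentile(accs, [2.5, 97.5])  # 95% confidence interval
    return np.std(accs), (lo, hi)
```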
Building upon the above analysis, Figure 9 presents the confusion matrix results of the proposed method for six substation operation behaviors. The observed confusion primarily occurs between behaviors with similar short-duration motion patterns, such as squatting. These misclassifications are mainly attributed to transitional postures and overlapping motion characteristics.
Additionally, this paper constructs a subset of the dataset under different illumination levels, camera viewpoints, and motion intensity conditions. Specifically, illumination enhancement was implemented using linear brightness scaling, and a controlled brightness enhancement factor of 1.3 was applied to video frames during the testing phase to simulate potential lighting variations encountered in real substation environments. Camera viewpoints were divided into frontal and rear views. Motion conditions were categorized based on the continuity of behaviors during operation into continuous motion and intermittent motion. With the model architecture and parameters kept unchanged, quantitative evaluations were conducted on these condition-specific subsets, and the results are summarized in Table 6.

4.3.3. Analysis of Experimental Results on Public Datasets

Additional comparative experiments were conducted on two widely used public behavior recognition benchmarks, HMDB51 and UCF101, as summarized in Table 7. The results show that the recognition accuracies of all compared models are generally higher on UCF101 than on HMDB51, which is consistent with the advantages of UCF101 in terms of video quality and dataset scale. The proposed method achieves superior recognition accuracy among all models, indicating that it is not limited to specific substation operation scenarios and can maintain stable performance under different data distribution conditions.

4.3.4. Comparative Experiments

In order to verify the superiority of the improved SlowFast model proposed in this study for the safety behavior recognition effect of electric power operators, this paper selects different behavioral detection techniques and model architectures to conduct comparative experiments with the proposed improved model. Figure 10 presents the visualization results of various models, while Table 8 summarizes the comparative experimental results of recognition accuracy across different models.
The comparison models include mainstream architectures such as Slow Only, TimeSformer, and AMP-HOI, each demonstrating their respective strengths and limitations in the behavior recognition task. Slow Only is a model that performs behavior recognition using only slow-frame sequences, emphasizing the extraction of fine-grained features from high-quality images. However, by neglecting temporal dynamic features, it struggles to provide a comprehensive understanding of complex behaviors, resulting in an mAP@0.5 value of 78.92%. TimeSformer employs a global attention mechanism to capture long-range temporal dynamics, demonstrating superior recognition capability in tasks involving complex temporal dynamics, such as “squat” and “ladder climb”. Its mAP@0.5 value reaches 81.59%. AMP-HOI integrates human–object interaction detection with temporal segment prediction, explicitly modeling the interactions between humans and objects to enhance the understanding of fine-grained behaviors. This approach attains an mAP@0.5 of 84.82%, yet it demonstrates limited effectiveness in recognizing fast dynamic behaviors such as “squat”. In contrast, the method proposed in this study exhibits notable advantages in key feature representation as well as in capturing long-term temporal dependencies and fine-grained local dynamics, achieving the highest mAP@0.5 of 87.77% among all compared methods.
Table 9 presents the performance of the proposed method and the comparison models in terms of Precision, Recall, and F1-Score. The experimental results show that the proposed method achieves Precision, Recall, and F1-Score values of 89.07%, 82.47%, and 85.56%, respectively, attaining a good balance between precision and recall. Meanwhile, the proposed method outperforms all comparison models across all metrics, fully demonstrating its superior target-capturing capability and recognition accuracy.
Additionally, as shown by the computational efficiency comparison results in Table 8, the main issues with the TimeSformer and AMP-HOI models lie in their high computational complexity, large memory consumption, and slow inference speed, making them less suitable for scenarios with high real-time requirements. The memory usage of each trained model weight file is approximately three times that of the proposed model, significantly increasing the computational and storage burden during deployment and inference. In contrast, the proposed method reduces computational and storage overhead while maintaining recognition performance.
In summary, the proposed substation operation safety behavior recognition method based on a dual-path spatiotemporal network is capable of jointly modeling spatiotemporal information and dynamic behavioral features, achieving an mAP@0.5 of 87.77%. Both the recognition accuracy and computational efficiency outperform all comparative models mentioned above, demonstrating the effectiveness and superiority of the proposed approach in complex substation operation scenarios.

4.3.5. Operational Scenario Testing

The proposed method combines features from both slow and fast frame sequences and incorporates the ACTION and ECA attention mechanisms. Figure 11 presents the detection results for different operation types. It can be observed that the proposed method accurately identifies both the personnel and their behaviors across various operation categories and maintains stable performance under different scene conditions, confirming the effectiveness of the method.
Meanwhile, to thoroughly evaluate the effectiveness of the proposed method under various challenging conditions, multiple experiments were conducted on our self-constructed dataset. As shown in Figure 12a, for fast-motion categories such as “check electricity,” “ladder climb,” and “pole climb,” video segments containing drastic motion changes were selected for evaluation, highlighting the model’s capability to capture high-dynamic behaviors and subtle motion variations. As shown in Figure 12b, for partial-occlusion cases involving “squat” and “live working,” test data containing localized obstructions such as buildings or trees were employed, reflecting the model’s robustness when parts of the target are missing. As shown in Figure 12c, for complex scene interference, datasets incorporating diverse backgrounds and long-range targets were used to validate the model’s discriminative ability. As shown in Figure 12d, to examine class imbalance, experiments were conducted on video segments of “ladder climb” and “pole climb” with highly skewed class distributions, demonstrating the model’s stability in recognizing small-sample categories. In summary, the experimental results across these challenging scenarios intuitively verify that the proposed method maintains prediction accuracy and adaptability when addressing practical issues such as rapid motion, partial occlusion, complex backgrounds, and class imbalance.

5. Conclusions

In the context of deep integration of renewable energy and the new power system, power system operation and control have become increasingly complex, resulting in substation operation and maintenance scenarios characterized by strong action continuity, partial occlusions, and multi-operator collaboration. Therefore, this paper proposes a substation operation safety behavior recognition method based on a dual-path spatiotemporal network, which effectively improves behavior detection accuracy by processing continuous action changes and fine-grained differences. The main conclusions are as follows:
  • YOLOv8 is first used to replace the original Faster R-CNN region detection head, enabling precise localization of operators in video frames and providing reliable spatial information for subsequent behavior feature extraction and recognition.
  • By incorporating the ECA attention mechanism into the Slow Pathway and the multi-path excitation residual network into the Fast Pathway, the two pathways form a complementary structure during multi-scale spatiotemporal feature fusion, thereby enabling effective representation of long-term postural states and rapid dynamic behaviors in complex operation scenarios.
  • A binary cross-entropy-based Focal Loss is designed to balance the weight distribution of positive and negative samples across different categories, supporting more stable recognition performance for categories with limited samples.
The experimental results indicate that the proposed method maintains stable recognition performance under representative substation operating modes, with an overall mAP of 87.77% and an F1-score of 85.56%, while operating at approximately 118 frames per second with moderate computational complexity. The results validate the effectiveness of the proposed method in achieving stable recognition of complex substation operation and maintenance behaviors.

6. Limitations

Although the proposed substation operation safety behavior recognition method based on a dual-path spatiotemporal network achieves promising experimental performance, several aspects remain to be improved in the context of deep integration of renewable energy and the new power system. First, the current model mainly relies on human motion features and does not explicitly model human–object interactions between operators and tools, which are important for accurate safety risk assessment. Second, while the model shows relatively high inference efficiency on GPU platforms, real-world deployment on resource-constrained edge devices may still be limited by computational and power constraints. Future work will focus on incorporating model compression techniques, such as pruning and quantization, to improve deployment efficiency.
Building upon the above considerations, future work is expected to proactively identify and intervene in potential hazardous situations at an early stage, thereby enabling proactive risk prevention and meeting the growing demand for intelligent safety operation and maintenance under the integrated operation of renewable energy and the new power system.

Author Contributions

Conceptualization, X.Z., F.M., G.C. and S.L.; Methodology, X.Z. and F.M.; Software, X.Z. and Q.L.; Formal analysis, X.Z. and S.L.; Investigation, F.M. and G.C.; Data curation, X.Z., F.M. and S.L.; Writing—original draft, X.Z.; Writing—review and editing, X.Z., F.M. and G.C.; Visualization, X.Z., F.M. and Q.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by National Natural Science Foundation of China, grant number 52407143.

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.

Acknowledgments

The authors sincerely acknowledge the invaluable support and guidance provided by Fuqi Ma throughout this research and the authors thank Ge Cao for his assistance.

Conflicts of Interest

Author Shixuan Lv was employed by the company Electric Power Research Institute of State Grid Shanxi Electric Power Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Ma, F.; Wang, B.; Dong, X. Scene understanding method utilizing global visual and spatial interaction features for safety production. Inf. Fusion 2025, 114, 102668. [Google Scholar] [CrossRef]
  2. Ramirez-Bettoni, E.; Eblen, M.L.; Nemeth, B. Analysis of live work accidents in overhead power lines and other electrical systems between 2010–2022. In Proceedings of the 2024 IEEE IAS Electrical Safety Workshop (ESW), Tucson, AZ, USA, 4–8 March 2024; pp. 1–5. [Google Scholar] [CrossRef]
  3. Lin, C.; Xu, Q.; Huang, Y. Pro-control method of power safety accidents based on event evolutionary graph. J. Saf. Sci. Technol. 2021, 17, 39–45. [Google Scholar] [CrossRef]
  4. Guan, C.; Yan, Y.; Zhang, S. Research on behavioral causes of casualties in electric power enterprises based on 24Model. J. Saf. Environ. 2023, 23, 2788–2793. [Google Scholar] [CrossRef]
  5. Vukicevic, A.M.; Petrovic, M.; Milosevic, P.; Peulic, A.; Jovanovic, K.; Novakovic, A. A systematic review of computer vision based personal protective equipment compliance in industry practice: Advancements, challenges and future directions. Artif. Intell. Rev. 2024, 57, 319. [Google Scholar] [CrossRef]
  6. Meng, L.; He, D.; Ban, G. Incremental spatio-temporal augmented sampling for power grid operation behavior recognition. Electronics 2025, 14, 3579. [Google Scholar] [CrossRef]
  7. Meng, L.; He, D.; Ban, G. Active hard sample learning for violation action recognition in power grid operation. Information 2025, 16, 67. [Google Scholar] [CrossRef]
  8. Kulsoom, F.; Narejo, S.; Mehmood, Z. A review of machine learning-based human activity recognition for diverse applications. Neural Comput. Appl. 2022, 34, 18289–18324. [Google Scholar] [CrossRef]
  9. Liu, K.; Zhao, H.; Wu, T.; Wu, C.; Wan, Y. Lightweight human pose estimation for safety identification in live power distribution network operations. J. Saf. Environ. 2025, 25, 3445–3455. [Google Scholar] [CrossRef]
  10. Wang, B.; Ma, F.; Jia, R. Skeleton-based violation action recognition method for safety supervision in operation field of distribution network based on graph convolutional network. CSEE J. Power Energy 2021, 9, 2179–2187. [Google Scholar] [CrossRef]
  11. Li, W.; Ma, F.; Zuo, Z. SafetyGPT: An autonomous agent of electrical safety risks for monitoring workers’ unsafe behaviors. Int. J. Electr. Power Energy Syst. 2025, 168, 110672. [Google Scholar] [CrossRef]
  12. Chandramouli, A.N.; Natarajan, S.; Alharbi, H.A. Enhanced human activity recognition in medical emergencies using a hybrid deep CNN and bi-directional LSTM model with wearable sensors. Sci. Rep. 2024, 14, 30979. [Google Scholar] [CrossRef]
  13. Wei, D.; Tian, Y.; Wei, L. Efficient dual attention SlowFast networks for video action recognition. Comput. Vis. Image Underst. 2022, 222, 103484. [Google Scholar] [CrossRef]
  14. Chen, Y.; Ge, H.; Liu, Y. AGPN: Action granularity pyramid network for video action recognition. IEEE Trans. Circuits Syst. Video Technol. 2023, 33, 3912–3923. [Google Scholar] [CrossRef]
  15. Ma, F.; Liu, Y.; Wang, B. Research on intelligent identification method of distribution grid operation safety risk based on semantic feature parsing. Int. J. Electr. Power Energy Syst. 2024, 160, 110139. [Google Scholar] [CrossRef]
  16. Feng, X.; Wu, T.; Wan, Y. Behavior recognition method of live working personnel based on human–object interaction detection. J. Saf. Sci. Technol. 2024, 20, 205–211. [Google Scholar] [CrossRef]
  17. Zhang, Z.; Zhang, Q.; Xu, X. Unsafe behavior recognition model of high-climbing workers based on vision. China Saf. Sci. J. 2025, 35, 144–151. [Google Scholar] [CrossRef]
  18. Ren, S.; He, K.; Girshick, R. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
  19. Zou, M.; Zhou, Y.; Jiang, X. Spatio-temporal behavior detection in field manual labor based on improved SlowFast architecture. Appl. Sci. 2024, 14, 2976. [Google Scholar] [CrossRef]
  20. Li, J.; Xie, S.; Zhou, X. Real-time detection of coal mine safety helmet based on improved YOLOv8. J. Real-Time Image Process. 2024, 22, 26. [Google Scholar] [CrossRef]
  21. Chen, H.; Zhou, G.; Jiang, H. Student behavior detection in the classroom based on improved YOLOv8. Sensors 2023, 23, 8385. [Google Scholar] [CrossRef] [PubMed]
  22. Zou, H.; Yang, J.; Sun, J. Detection method of external damage hazards in transmission line corridors based on YOLO-LSDW. Energies 2024, 17, 4483. [Google Scholar] [CrossRef]
  23. Feichtenhofer, C.; Fan, H.; Malik, J.; He, K. SlowFast networks for video recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV 2019), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6202–6211. [Google Scholar]
  24. Sun, N.; Leng, L.; Liu, J. Multi-stream slowfast graph convolutional networks for skeleton-based action recognition. Image Vis. Comput. 2021, 109, 104141. [Google Scholar] [CrossRef]
  25. Wang, Z.; She, Q.; Smolic, A. ACTION-net: Multipath excitation for action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), Nashville, TN, USA, 19–25 June 2021; pp. 13214–13223. [Google Scholar]
  26. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2018), Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
  27. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2020), Seattle, WA, USA, 13–19 June 2020; pp. 11531–11539. [Google Scholar]
  28. Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV 2017), Venice, Italy, 22–29 October 2017; pp. 2999–3007. [Google Scholar]
  29. Cui, Y.; Jia, M.; Lin, T.-Y.; Song, Y.; Belongie, S. Class-balanced loss based on effective number of samples. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019), Long Beach, CA, USA, 15–20 June 2019; pp. 9268–9277. [Google Scholar]
  30. Li, Y.; Zou, G.; Zou, H. Insulators and defect detection based on the improved focal loss function. Appl. Sci. 2022, 12, 10529. [Google Scholar] [CrossRef]
  31. Li, Y.; Shi, F.; Hou, S. Feature pyramid attention model and multi-label focal loss for pedestrian attribute recognition. IEEE Access 2020, 8, 164570–164579. [Google Scholar] [CrossRef]
  32. Yang, F. A multi-person video dataset annotation method of spatio-temporally actions. arXiv 2022, arXiv:2204.10160. [Google Scholar]
Figure 1. YOLOv8 network architecture diagram.
Figure 2. SlowFast network structure.
Figure 3. Architecture of the proposed dual-path spatiotemporal network.
Figure 4. Structure of SCM-Res.
Figure 5. Structural components of the SCM module. The module consists of (a) spatiotemporal excitation (STE) module, (b) channel excitation (CE) module, and (c) motion excitation residual (ME) module.
Figure 6. ECA-Res structure.
Figure 7. Structure of ECA network.
Figure 8. Comparison of model val_mAP.
Figure 9. Confusion matrix of six substation operation behaviors.
Figure 10. Visualization of behavior detection results obtained by different methods under the same substation operation scenario. The displayed results correspond to (a) Slow Only, (b) TimeSformer, (c) AMP-HOI, and (d) the proposed method.
Figure 11. Qualitative visualization of behavior detection results of the proposed method across different power operation types. The operation types include (a) electricity checking, (b) standing, (c) squatting, (d) ladder climbing, (e) live-line working, and (f) pole climbing.
Figure 12. Qualitative visualization of behavior detection results produced by the proposed method under representative challenging substation operation scenarios, including (a) fast-motion behaviors, (b) partial occlusion, (c) complex background interference, and (d) rare-class behaviors.
Table 1. Performance comparison of behavior recognition models in power operation scenarios.

| Reference | Model | Recognized Behaviors | AP/% | Parameters/M | GFLOPs | Limitations |
|---|---|---|---|---|---|---|
| [9] | YOLOv8n-Pose + ST-GCN | Live working | 88.00 | 6.23 | 4.50 | Not optimized or validated under complex background clutter and occlusion conditions |
| [15] | ResNet101 + Transformer | Ladder climbing | 91.21 | 21.40 | 202.30 | Limited behavior categories and insufficient generalization capability |
| [16] | OpenPose + ST-GCN + YOLOv5s | Live working | 88.90 | — | — | Human pose estimation is constrained under occlusion conditions |
| [17] | PCT-1DCNN | High-altitude operation | 90.30 | — | — | Limited generalization to other behavior types |
Table 2. SlowFast network architecture parameters.

| Stage | Slow Pathway | Fast Pathway | Output size T × S² (Slow/Fast) |
|---|---|---|---|
| Raw clip | — | — | 64 × 224² |
| Data layer | stride 16, 1² | stride 2, 1² | 4/32 × 224² |
| Conv1 | 1 × 7², 64 | 5 × 7², 8 | 4/32 × 112² |
| Pool1 | 1 × 3², max | 1 × 3², max | 4/32 × 56² |
| Res2 | [1 × 1², 64; 1 × 3², 64; 1 × 1², 256] × 3 | [3 × 1², 8; 1 × 3², 8; 1 × 1², 32] × 3 | 4/32 × 56² |
| Res3 | [1 × 1², 128; 1 × 3², 128; 1 × 1², 512] × 4 | [3 × 1², 16; 1 × 3², 16; 1 × 1², 64] × 4 | 4/32 × 28² |
| Res4 | [3 × 1², 256; 1 × 3², 256; 1 × 1², 1024] × 6 | [3 × 1², 32; 1 × 3², 32; 1 × 1², 128] × 6 | 4/32 × 14² |
| Res5 | [3 × 1², 512; 1 × 3², 512; 1 × 1², 2048] × 3 | [3 × 1², 64; 1 × 3², 64; 1 × 1², 256] × 3 | 4/32 × 7² |
| Head | global average pool, concatenate, fully connected | — | # classes |
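As a reading aid for Table 2, the sketch below implements only the dual-pathway sampling scheme: a temporal stride of 16 produces the 4-frame Slow input and a stride of 2 the 32-frame Fast input, and pathway features are globally pooled and concatenated before the classifier, as in the final table row. The single-convolution stems (Fast width 8, per the canonical SlowFast instantiation [23]) merely stand in for the residual stages and are not the trained model.

```python
import torch
import torch.nn as nn

class DualPathStub(nn.Module):
    """Illustrative sketch of the Table 2 sampling scheme (not the full backbone).
    Slow: temporal stride 16 (64 -> 4 frames); Fast: temporal stride 2 (64 -> 32 frames)."""
    def __init__(self, num_classes: int = 6):
        super().__init__()
        # single-conv stand-ins for the Slow/Fast stages of Table 2
        self.slow_stem = nn.Conv3d(3, 64, kernel_size=(1, 7, 7),
                                   stride=(1, 2, 2), padding=(0, 3, 3))
        self.fast_stem = nn.Conv3d(3, 8, kernel_size=(5, 7, 7),
                                   stride=(1, 2, 2), padding=(2, 3, 3))
        self.pool = nn.AdaptiveAvgPool3d(1)
        self.fc = nn.Linear(64 + 8, num_classes)  # global pool -> concat -> FC

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (N, 3, 64, 224, 224), as in the Table 2 "Raw clip" row
        slow = clip[:, :, ::16]   # (N, 3, 4, 224, 224)
        fast = clip[:, :, ::2]    # (N, 3, 32, 224, 224)
        f_slow = self.pool(self.slow_stem(slow)).flatten(1)  # (N, 64)
        f_fast = self.pool(self.fast_stem(fast)).flatten(1)  # (N, 8)
        return self.fc(torch.cat([f_slow, f_fast], dim=1))

# usage: logits = DualPathStub()(torch.randn(1, 3, 64, 224, 224))
```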
Table 3. Composition of the behavior dataset.

| Behavior Category | Train Videos | Test Videos | Labels | Annotation Strategy |
|---|---|---|---|---|
| Check electricity | 307 | 77 | 2296 | Operator holds a voltage detector and performs electricity-checking behaviors near equipment. |
| Stand | 406 | 102 | 3615 | Operator holds a voltage detector and checks for the absence of voltage to ground near the equipment. |
| Squat | 77 | 19 | 554 | Operator adopts a crouched posture with a lowered center of gravity. |
| Ladder climb | 64 | 16 | 520 | Operator maintains contact with a ladder while moving upward or downward, or remaining stationary. |
| Live working | 283 | 71 | 2580 | Operator performs maintenance, testing, or operational tasks in close proximity to energized equipment. |
| Pole climb | 78 | 20 | 574 | Operator climbs along a utility pole or performs tasks while in contact with the pole. |
| Total | 1215 | 305 | 10,139 | — |
Table 4. Ablation study results. Behavior columns report per-class accuracy under operating conditions (%).

| Model | SCM-Res | ECA-Res | Check Electricity | Stand | Squat | Ladder Climb | Live Working | Pole Climb | mAP/% | Params/M | GFLOPs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | × | × | 81.03 | 83.71 | 75.03 | 79.91 | 83.27 | 82.19 | 80.86 | 33.568 | 53.10 |
| 2 | × | ✓ | 83.57 | 84.93 | 75.43 | 82.01 | 83.68 | 84.13 | 82.29 | 33.572 | 53.12 |
| 3 | ✓ | × | 86.97 | 88.65 | 80.25 | 84.99 | 85.77 | 87.79 | 85.74 | 33.631 | 53.30 |
| 4 | ✓ | ✓ | 88.92 | 90.83 | 82.44 | 87.90 | 87.36 | 89.15 | 87.77 | 33.634 | 53.37 |
Table 5. Stability of recognition accuracy for different behaviors.

| Behavior Category | Variance | Std/% | 95% CI/% |
|---|---|---|---|
| Check electricity | 0.00021 | 1.45 | [86.10, 91.63] |
| Stand | 0.00018 | 1.34 | [88.42, 93.10] |
| Squat | 0.00039 | 1.97 | [78.66, 85.98] |
| Ladder climb | 0.00026 | 1.61 | [84.82, 90.77] |
| Live working | 0.00015 | 1.22 | [85.01, 89.74] |
| Pole climb | 0.00031 | 1.76 | [85.90, 92.11] |
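The spread statistics in Table 5 follow the usual recipe over repeated evaluation runs. Since the number of runs is not stated, the sketch below uses hypothetical per-run accuracies and a normal-approximation 95% interval (mean ± 1.96·s/√n); note the convention implied by the table, with variance computed on fractional accuracies and std/CI reported in percent.

```python
import math
import statistics

def spread_stats(acc_percent: list, z: float = 1.96):
    """Variance/std/95% CI in the style of Table 5.
    Variance is reported on fractional accuracies (acc/100);
    std and the CI bounds are in percent; normal approximation assumed."""
    n = len(acc_percent)
    mean = statistics.fmean(acc_percent)
    std = statistics.stdev(acc_percent)                      # sample std, in %
    var_frac = statistics.variance([a / 100 for a in acc_percent])
    half = z * std / math.sqrt(n)                            # half-width of the 95% CI
    return var_frac, std, (mean - half, mean + half)

# illustrative per-run accuracies (%) for one behavior class -- not the paper's data
runs = [88.1, 90.3, 87.6, 89.4, 88.9]
print(spread_stats(runs))
```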
Table 6. Robustness evaluation under different operating conditions. Behavior columns report per-class accuracy (%).

| Scenario Factor | Condition | Check Electricity | Stand | Squat | Ladder Climb | Live Working | Pole Climb |
|---|---|---|---|---|---|---|---|
| Lighting | Original | 90.24 | 92.39 | 83.70 | 87.95 | 87.65 | 90.31 |
| Lighting | Brightness enhancement | 89.44 | 91.10 | 83.50 | 87.44 | 87.40 | 89.85 |
| Camera viewpoint | Front view | 89.15 | 92.72 | 84.15 | 86.83 | 87.46 | 90.21 |
| Camera viewpoint | Back view | 88.91 | 91.21 | 82.14 | 84.42 | 85.43 | 89.96 |
| Motion pattern | Continuous motion | 90.51 | 91.45 | 83.81 | 88.58 | 87.34 | 89.42 |
| Motion pattern | Intermittent motion | 89.32 | 91.70 | 83.23 | 87.75 | 86.78 | 89.10 |
Table 7. Experimental results of different behavior recognition models on public datasets.

| Model | HMDB51 Accuracy/% | UCF101 Accuracy/% |
|---|---|---|
| Slow Only | 66.8 | 80.3 |
| TimeSformer | 69.5 | 82.5 |
| AMP-HOI | 76.8 | 89.9 |
| Proposed method | 77.6 | 91.6 |
Table 8. Comparison results with other methods. Behavior columns report per-class accuracy (%).

| Model | Check Electricity | Stand | Squat | Ladder Climb | Live Working | Pole Climb | mAP/% |
|---|---|---|---|---|---|---|---|
| Slow Only | 78.92 | 87.77 | 74.36 | 76.75 | 84.67 | 71.03 | 78.92 |
| TimeSformer | 84.17 | 85.98 | 77.33 | 80.29 | 85.37 | 76.42 | 81.59 |
| AMP-HOI | 87.30 | 88.71 | 79.50 | 84.31 | 86.59 | 82.52 | 84.82 |
| Proposed method | 88.92 | 90.83 | 82.44 | 87.90 | 87.36 | 89.15 | 87.77 |
Table 9. Performance of different models on multiple metrics.

| Model | Precision/% | Recall/% | F1-Score/% | Params/M | GFLOPs | FPS (frames/s) |
|---|---|---|---|---|---|---|
| Slow Only | 78.64 | 76.68 | 77.64 | 31.63 | 63.92 | 101.50 |
| TimeSformer | 81.72 | 77.82 | 79.72 | 115.76 | 146.54 | 75.20 |
| AMP-HOI | 82.91 | 80.00 | 81.43 | 265.16 | 129.83 | 82.60 |
| Proposed method | 89.07 | 82.47 | 85.56 | 33.63 | 53.37 | 117.90 |
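The precision, recall, and F1 columns of Table 9 follow the standard definitions (F1 = 2PR/(P + R)). A minimal scikit-learn sketch for the six behavior classes is given below; the macro averaging mode and the toy label lists are assumptions for illustration, as the paper does not specify how per-class scores are aggregated.

```python
from sklearn.metrics import precision_recall_fscore_support

# y_true / y_pred: per-detection behavior labels over the six classes
# (toy values for illustration only)
labels = ["check", "stand", "squat", "ladder", "live", "pole"]
y_true = ["check", "stand", "squat", "ladder", "live", "pole", "stand", "check"]
y_pred = ["check", "stand", "squat", "ladder", "live", "stand", "stand", "check"]

# macro average: each class weighted equally; per class, F1 = 2PR / (P + R)
p, r, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, labels=labels, average="macro", zero_division=0
)
print(f"Precision {p:.2%}  Recall {r:.2%}  F1 {f1:.2%}")
```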
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
