Substation Personnel Fall Detection Based on Improved YOLOX

Abstract: With the continuous promotion of smart substations, staff fall detection has become a key issue in the automatic monitoring of substations. Falls among substation personnel cause numerous injuries and safety hazards; if a fall is responded to promptly, the resulting injuries can be reduced. To address the low accuracy and poor real-time performance of human fall detection in complex substation scenarios, this paper proposes an improved algorithm based on YOLOX. A customized feature extraction module is introduced into the YOLOX feature fusion network to extract diverse multiscale features, and a recursive gated convolution module is added to the head to enhance the expressive power of the features. Meanwhile, the SIoU (soft intersection over union) loss function is used to provide more accurate position information for bounding boxes, thereby improving model accuracy. Experimental results show that the improved algorithm achieves an mAP of 78.45%, a 1.31% improvement over the original YOLOX. Compared with similar algorithms, the proposed method predicts human falls with high accuracy using fewer parameters, demonstrating its effectiveness.


Introduction
Substations play a vital role in the stable operation of the power system, serving as the hub connecting the transmission and distribution networks. Intelligent detection has been applied in multiple fields [1][2][3]. With the continuous promotion of smart grids, intelligent monitoring of substations has become a trend [4]. In an industrial environment like a substation, personnel are required to independently perform various tasks, including equipment maintenance, inspection, and troubleshooting. Due to the presence of complex power systems, high-voltage equipment, and various facilities in substations, personnel may face the risk of falling during their operations, which can result in personal injury, equipment failure, or even power outages. Intelligent monitoring of substations should cover not only the detection of equipment operation but also the personal safety of workers, as it is an essential part of ensuring the safe operation of substations [5]. The study of personnel fall detection in substations can improve worker safety by enabling timely rescue measures, and the analysis of fall event data can also support risk evaluation, improvement of the work environment, and cultivation of safety awareness. This can reduce accident risks and provide a more reliable means of responding quickly to potential hazards.
Currently, there are two main types of fall detection methods: sensor-based methods [6][7][8] and computer-vision-based methods [9][10][11]. Sensor-based methods were the first applied in the field of fall detection and have been widely researched and adopted due to their low cost, scalability, and flexibility. Sensors can be classified as wearable sensors and environmental sensors. Wearable sensors are devices carried on the body that detect fall events by monitoring changes in body motion and posture; this approach requires individuals to continuously wear the devices, which can lead to a poor user experience. Environmental sensors, on the other hand, are installed in the surroundings and detect falls by monitoring physical changes in the environment; this approach has limitations such as restrictions on installation position and complex data interpretation. With the rapid development of artificial intelligence technology, deep learning-based methods such as convolutional neural networks have made significant progress in image and video analysis tasks. These methods provide better real-time performance, do not interfere with the daily activities of workers, and offer higher accuracy and reliability for fall detection. Vision-based algorithms first capture images through cameras and then extract relevant human features using object detection models to determine whether a person has fallen [12].
Chen et al. utilized the Mask R-CNN method to detect moving objects against complex backgrounds and proposed an attention-guided bidirectional LSTM model for final fall event detection [13]. Cai et al. designed a vision-based multitask mechanism, achieving accurate fall detection by assigning a secondary task of frame reconstruction alongside the primary task of fall detection [14]. García et al. employed an LSTM model for time series classification combined with data augmentation and developed a robust and accurate fall detection model [15].
However, most current research is based on experiments conducted in ideal environments, and the robustness of existing models against complex backgrounds like substations is generally poor. Moreover, these models have large weights and complex network structures, which fail to meet real-time requirements. Therefore, this paper proposes an improved fall detection model based on YOLOX [16] to address the issues of low detection accuracy and poor real-time performance in the complex scenarios of substations. In the feature fusion part of YOLOX, a custom feature extraction module is implemented to enhance the neck's feature extraction capability, and a convolutional module is added to the head to improve detection speed, achieving accurate detection of falls in substation environments.

YOLOX
YOLOX is a new-generation object detection algorithm proposed by Megvii Technology in 2021. It shows significant performance improvements over its predecessors, YOLOv3 [17], YOLOv4 [18], and YOLOv5. Whereas YOLOv7 [19] further improves target regression through its anchor box mechanism, YOLOX uses an anchor-free mechanism to improve computational speed while maintaining detection accuracy. The overall network structure of YOLOX is depicted in Figure 1 [20] and consists of three parts: the backbone network, the feature fusion network, and the prediction heads.

The Backbone Network
The backbone network of YOLOX adopts the CSPDarknet53 architecture, which is responsible for extracting features from the input image and utilizing these features for subsequent object detection tasks. The basic idea behind CSPDarknet53 is to split the input features into two parts: one part is processed directly through a series of convolutional layers, while the other is processed after being connected through a CSP block. This approach helps alleviate the gradient-vanishing problem and improves the efficiency of feature propagation.
The input image for detection is resized to a uniform size of 640 × 640 × 3 and fed into the Focus network structure. In this structure, every alternate pixel is selected, which divides the input feature map into four subfeature maps. These four subfeature maps are transposed and concatenated following certain rules to obtain a 320 × 320 × 12 feature map, which is then input into the backbone network for feature extraction.
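The pixel-slicing step above can be sketched with NumPy as follows; the ordering of the four slices before concatenation is an assumption, as implementations vary on this detail:

```python
import numpy as np

def focus_slice(x: np.ndarray) -> np.ndarray:
    """Focus slicing as described above: pick every alternate pixel to form
    four sub-feature maps, then stack them along the channel axis.
    Input is H x W x C; output is (H/2) x (W/2) x 4C. No pixel is lost."""
    top_left = x[::2, ::2, :]
    top_right = x[::2, 1::2, :]
    bottom_left = x[1::2, ::2, :]
    bottom_right = x[1::2, 1::2, :]
    return np.concatenate([top_left, bottom_left, top_right, bottom_right], axis=2)

image = np.random.rand(640, 640, 3)
out = focus_slice(image)
print(out.shape)  # (320, 320, 12)
```

Note that the rearrangement is lossless: the 640 × 640 × 3 input and the 320 × 320 × 12 output contain exactly the same values, only regrouped so that spatial resolution is traded for channel depth.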

The Feature Fusion Network
To better fuse multiscale feature information, YOLOX incorporates the FPN (Feature Pyramid Network) [21] and PAN (Path Aggregation Network) [22] algorithms as the upsampling and downsampling paths, respectively, in the feature fusion network. In the upsampling path, high-level feature maps extracted from the backbone network are upsampled and added element-wise to adjacent low-level feature maps to achieve cross-level feature fusion. In the downsampling path, low-level feature maps obtained from the upsampling path are downsampled and added element-wise to adjacent high-level feature maps. After passing through the feature fusion network, feature maps of different resolutions contain rich semantic and positional information, enabling better object detection and localization.
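The cross-level fusion in the upsampling path can be illustrated with a minimal NumPy sketch (nearest-neighbor upsampling followed by element-wise addition; real implementations also apply convolutions between these steps, which are omitted here):

```python
import numpy as np

def upsample2x(feat: np.ndarray) -> np.ndarray:
    """Nearest-neighbor 2x upsampling of an H x W x C feature map."""
    return feat.repeat(2, axis=0).repeat(2, axis=1)

# A high-level (coarse) feature map and its adjacent low-level (fine) map,
# with illustrative sizes and a shared channel count.
high = np.random.rand(10, 10, 64)
low = np.random.rand(20, 20, 64)

# Upsampling path: upsample the coarse map, then add it element-wise.
fused = upsample2x(high) + low
print(fused.shape)  # (20, 20, 64)
```

The downsampling path mirrors this with strided pooling or convolution in place of the upsampling step.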

The Prediction Head
To address the conflict between the classification and regression objectives in traditional object detection algorithms, YOLOX introduces a decoupled head structure. The decoupled head consists of two subheads: a classification subhead, responsible for predicting the class probabilities of the objects, and a regression subhead, responsible for predicting the bounding box positions and sizes. By separating object classification and bounding box regression into independent subhead networks, the decoupled head allows the two tasks to be learned and optimized independently; finally, their outputs are fused through concatenation. This design improves the convergence performance and detection accuracy of the model.
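A minimal PyTorch sketch of such a decoupled head is shown below; the channel widths, activation choices, and the two-class setting are illustrative assumptions, not the exact YOLOX configuration:

```python
import torch
import torch.nn as nn

class DecoupledHead(nn.Module):
    """Sketch of a YOLOX-style decoupled head (layer sizes illustrative)."""
    def __init__(self, in_channels: int = 256, num_classes: int = 2):
        super().__init__()
        self.stem = nn.Conv2d(in_channels, in_channels, 1)
        # Classification subhead: per-location class scores.
        self.cls_branch = nn.Sequential(
            nn.Conv2d(in_channels, in_channels, 3, padding=1), nn.SiLU(),
            nn.Conv2d(in_channels, num_classes, 1),
        )
        # Regression subhead: box (x, y, w, h) plus objectness.
        self.reg_branch = nn.Sequential(
            nn.Conv2d(in_channels, in_channels, 3, padding=1), nn.SiLU(),
        )
        self.reg_pred = nn.Conv2d(in_channels, 4, 1)
        self.obj_pred = nn.Conv2d(in_channels, 1, 1)

    def forward(self, x):
        x = self.stem(x)
        cls_out = self.cls_branch(x)
        reg_feat = self.reg_branch(x)
        # Fuse the independent outputs by channel-wise concatenation.
        return torch.cat(
            [self.reg_pred(reg_feat), self.obj_pred(reg_feat), cls_out], dim=1)

head = DecoupledHead()
out = head(torch.randn(1, 256, 20, 20))
print(out.shape)  # torch.Size([1, 7, 20, 20])
```

The concatenated output carries 4 box values, 1 objectness score, and one score per class at every spatial location.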

Improved YOLOX
The complexity of outdoor environments typically found in substations [23] can have a negative impact on image quality and the effectiveness of pedestrian detection algorithms.To improve the accuracy of personnel fall detection, this paper proposes an enhanced YOLOX network structure, as shown in Figure 2. Specific improvements include the addition of a custom feature extraction module, TModule, to the feature fusion network to enhance the network's feature extraction capability; the addition of recursive gated convolution, gnConv, to the head to facilitate context information fusion and improve detection capability; and the replacement of the original IoU loss function with the SIoU loss function to enhance target localization accuracy.
The main contributions of this paper are as follows: 1. In order to extract rich multiscale features, a feature extraction module is designed in the feature fusion part of YOLOX. This module enhances the neck's feature extraction capability while reducing computational complexity and parameter count, extracting semantic information that captures the diverse characteristics of substation personnel.

TModule
In the substation scenario, where background complexity is high, this paper proposes a redesigned feature extraction module, shown in Figure 3, to better capture the local features and contextual information of personnel falls. The input of the customized module is first split, and each branch compresses the number of channels by half using a 1 × 1 convolution. The upper branch is further split and processed by a 3 × 3 convolution with a stride of 1, which preserves the spatial dimensions of the features, and is then stacked with the lower branch. The stacked features are integrated through a 3 × 3 convolution and a 1 × 1 convolution before being merged with the original branch. Finally, the features are output through a 1 × 1 convolution. This module is placed in the neck of the network, enhancing the feature extraction capability of the convolutional neural network while reducing model complexity.
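One possible reading of the TModule description is sketched below in PyTorch; since the module is only described textually, the exact branch arrangement, layer order, and names here are speculative and may differ from the authors' implementation:

```python
import torch
import torch.nn as nn

class TModule(nn.Module):
    """Speculative sketch of the TModule based on the textual description;
    the authors' exact layer arrangement may differ."""
    def __init__(self, channels: int):
        super().__init__()
        half = channels // 2
        self.reduce_a = nn.Conv2d(channels, half, 1)      # upper branch, 1x1
        self.reduce_b = nn.Conv2d(channels, half, 1)      # lower branch, 1x1
        self.conv3 = nn.Conv2d(half, half, 3, padding=1)  # stride 1 keeps H x W
        self.fuse = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),  # integrate stacked maps
            nn.Conv2d(channels, channels, 1),
        )
        self.out = nn.Conv2d(2 * channels, channels, 1)   # final 1x1 projection

    def forward(self, x):
        a = self.conv3(self.reduce_a(x))              # upper branch
        b = self.reduce_b(x)                          # lower branch
        fused = self.fuse(torch.cat([a, b], dim=1))   # stack, then integrate
        # Merge with the original (identity) branch, then project back.
        return self.out(torch.cat([fused, x], dim=1))

m = TModule(64)
print(m(torch.randn(1, 64, 40, 40)).shape)  # torch.Size([1, 64, 40, 40])
```

Halving channels in each branch before the 3 × 3 convolutions is what keeps the parameter count below that of a full-width block.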

Recursive Gated Convolution
gnConv (recursive gated convolution) [24] combines gated convolution with a recursive design to capture the contextual relationships in image data and achieve high-order feature interactions. A schematic diagram of the gnConv structure is shown in Figure 4. The input of gnConv is a feature map with C channels; after the first convolutional layer, the number of channels is doubled. The output of this first layer is divided into two parts: the first part is used by the next layer, and the second part is fed into a depth-wise separable convolution, whose outputs serve as inputs for the subsequent layers. This enhances the feature representation without introducing significant additional computational complexity.
The input feature map is denoted as x (with dimensions H × W × C). After passing through a linear layer, we obtain two feature maps, p0 and q0 (each with dimensions H × W × C). Feature map q0 undergoes a depth-wise convolution and is then multiplied element-wise with p0, producing feature map p1. Finally, p1 is processed through a linear layer to produce the output feature map y. The output of the recursive gated convolution can be represented as follows:

p1 = f(q0) ⊙ p0,  y = φ_out(p1)

where f represents the depth-wise convolution, ⊙ denotes the element-wise (dot) product, and φ_out denotes the output linear layer.
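The first-order gating described above can be sketched as follows; the layer names and the use of 1 × 1 convolutions as the linear projections are assumptions based on common gnConv implementations, and higher orders would repeat the gating recursively:

```python
import torch
import torch.nn as nn

class GnConvOrder1(nn.Module):
    """First-order sketch of gnConv following the text: project x to
    (p0, q0), gate f(q0) element-wise with p0, then project out."""
    def __init__(self, channels: int):
        super().__init__()
        self.proj_in = nn.Conv2d(channels, 2 * channels, 1)  # doubles channels
        self.dwconv = nn.Conv2d(channels, channels, 3,
                                padding=1, groups=channels)  # depth-wise f
        self.proj_out = nn.Conv2d(channels, channels, 1)     # output linear layer

    def forward(self, x):
        p0, q0 = self.proj_in(x).chunk(2, dim=1)  # split doubled channels
        p1 = self.dwconv(q0) * p0                 # element-wise gating
        return self.proj_out(p1)

g = GnConvOrder1(32)
print(g(torch.randn(1, 32, 16, 16)).shape)  # torch.Size([1, 32, 16, 16])
```

The multiplicative gate is what gives the interaction its higher-than-linear order: each output value depends on products of input features rather than sums alone.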
In the YOLOX head, after the feature map passes through convolutional normalization and activation functions, the recursive gated convolution is introduced to further extract crucial information from the feature layers. This improves the accuracy and speed of model detection.

Improvement of Loss Function
In object detection, the definition of the loss function has a significant impact on the final performance of the model [25][26][27]. In YOLOX, the GIoU (generalized intersection over union) [28] loss function is used as the localization loss function. However, GIoU only considers position and shape information and does not account for the angle between the predicted and ground truth bounding boxes. To effectively improve the regression accuracy of the predicted boxes, in this paper, we replace GIoU with SIoU (soft intersection over union) [29].
The SIoU loss function consists of four components: 1. An angle cost, which penalizes misalignment between the line connecting the centers of the predicted and ground truth boxes and the coordinate axes; 2. A distance cost, which measures the normalized distance between the box centers and is modulated by the angle cost; 3. A shape cost, which penalizes differences in width and height between the predicted box and the ground truth box; 4. An IoU cost, which measures the overlap between the predicted box and the ground truth box using the standard IoU (intersection over union) ratio.
Given a predicted box P and a ground truth box G, the SIoU loss function can be defined as follows:

L_SIoU = 1 − IoU(P, G) + (∆ + Ω) / 2

where w and w^gt represent the widths of the predicted box and the ground truth box, respectively, and h and h^gt represent their heights.
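A sketch of this loss under the standard SIoU formulation of [29] (angle, distance, shape, and IoU costs, with shape exponent θ = 4) is shown below; boxes are given as (cx, cy, w, h), and the helper name siou_loss is our own:

```python
import math

def siou_loss(pred, gt, theta=4):
    """Sketch of the SIoU loss; boxes are (center_x, center_y, w, h)."""
    (px, py, pw, ph), (gx, gy, gw, gh) = pred, gt
    # --- IoU term ---
    ix1, iy1 = max(px - pw/2, gx - gw/2), max(py - ph/2, gy - gh/2)
    ix2, iy2 = min(px + pw/2, gx + gw/2), min(py + ph/2, gy + gh/2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    iou = inter / (pw*ph + gw*gh - inter)
    # --- angle cost (Lambda): sin(alpha) = vertical offset / center distance ---
    sigma = math.hypot(gx - px, gy - py)
    sin_alpha = abs(gy - py) / sigma if sigma > 0 else 0.0
    lam = 1 - 2 * math.sin(math.asin(sin_alpha) - math.pi/4) ** 2
    # --- distance cost (Delta), normalized by the minimum enclosing box ---
    cw = max(px + pw/2, gx + gw/2) - min(px - pw/2, gx - gw/2)
    ch = max(py + ph/2, gy + gh/2) - min(py - ph/2, gy - gh/2)
    gamma = 2 - lam
    delta = sum(1 - math.exp(-gamma * r)
                for r in (((gx - px)/cw)**2, ((gy - py)/ch)**2))
    # --- shape cost (Omega) ---
    omega = sum((1 - math.exp(-w))**theta
                for w in (abs(pw - gw)/max(pw, gw), abs(ph - gh)/max(ph, gh)))
    return 1 - iou + (delta + omega) / 2

print(round(siou_loss((10, 10, 4, 4), (10, 10, 4, 4)), 6))  # 0.0
```

Identical boxes give zero loss, and any offset in position, angle, or shape increases it, which is the behavior the regression branch is trained against.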

Dataset and Experimental Platform
The person falling dataset is a crucial component for training, evaluating, and improving the fall detection model. It provides the model with learning material and is used to validate and optimize the model during the training process [30,31], enabling the fall detection model to better learn the features of the target. To address the limited scale of current fall detection datasets in the substation field and their inability to cover diverse situations and changes, we combined an open-source fall detection dataset from Baidu AIStudio (https://aistudio.baidu.com/aistudio/datasetdetail/94809, accessed on 25 November 2021) with self-collected substation-scene fall data. In the experiment, the data were augmented using methods such as horizontal flipping, random cropping, and angle rotation, yielding over 7000 images covering various indoor and outdoor scenes, as shown in Figure 5; some typical fall scenarios are presented as references for evaluating the performance of the model in real situations.
The targets in the dataset were annotated using the LabelImg tool; the annotations were saved as XML files and converted into the VOC2007 format. To improve the generalization ability of the network model, 90% of the images were used for model training and the remaining 10% were reserved to verify model performance. The images used for model training were further divided into a training set and a validation set in a 9:1 ratio. The training set is used to fit the parameters of the classifiers and regressors of the fall detection algorithm; the validation set is used to select the weight parameters with the highest recognition accuracy among the trained models; and the test set is used to evaluate the optimal model obtained from training and validation and to measure its effectiveness.
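The two-level split described above can be sketched as follows (file names and the random seed are illustrative):

```python
import random

def split_dataset(items, holdout_ratio=0.1, val_ratio=0.1, seed=0):
    """Illustrative split following the text: 10% of the images held out
    for testing, and the remaining training pool divided 9:1 into
    training and validation sets."""
    rng = random.Random(seed)
    items = items[:]
    rng.shuffle(items)
    n_trainval = int(len(items) * (1 - holdout_ratio))
    trainval, test = items[:n_trainval], items[n_trainval:]
    n_train = int(len(trainval) * (1 - val_ratio))
    return trainval[:n_train], trainval[n_train:], test

train, val, test = split_dataset([f"img_{i:04d}.jpg" for i in range(7000)])
print(len(train), len(val), len(test))  # 5670 630 700
```

Shuffling before splitting keeps the indoor/outdoor scene mix roughly balanced across the three subsets.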
We used the PyTorch framework to train the network model, with a total of 300 epochs trained.In the network model, the input size is 640 × 640.The server configuration used in this study is presented in Table 1.

Evaluation Metrics
To further evaluate the detection accuracy of the model presented in this article, precision (P), recall (R), average precision (AP), and mean average precision (mAP) were selected as evaluation indicators. AP and mAP avoid the impact of unequal confidence levels across different models on evaluation and can be applied to the vast majority of object detection models. The mAP is the average value of AP across all classes; we used it to calculate the mean accuracy of each category at the specified intersection-over-union threshold in the fall detection model. The mAP value ranges from 0 to 1, with a higher value indicating better performance of the object detection algorithm across multiple classes. The calculation formulas are as follows:

P = TP / (TP + FP)
R = TP / (TP + FN)
AP = ∫₀¹ P(R) dR
mAP = (1/k) Σ_{i=1}^{k} AP_i

where TP (true positives) represents targets originally labeled as positive samples and predicted as positive by the model, FP (false positives) represents targets originally labeled as negative samples but predicted as positive, and FN (false negatives) represents targets originally labeled as positive samples but predicted as negative. P denotes precision and R denotes recall; AP is obtained by calculating the area under the precision-recall curve; k represents the total number of categories; and mAP is derived by averaging the AP values across all classes.
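These metrics can be illustrated with a toy example; the step-wise area summation below is one common way to approximate the integral under the precision-recall curve:

```python
def precision(tp, fp):
    """P = TP / (TP + FP)."""
    return tp / (tp + fp)

def recall(tp, fn):
    """R = TP / (TP + FN)."""
    return tp / (tp + fn)

def average_precision(recalls, precisions):
    """AP as the area under the precision-recall curve, accumulated
    step-wise over increasing recall values."""
    ap, prev_r = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_r) * p
        prev_r = r
    return ap

def mean_average_precision(ap_per_class):
    """mAP = average of the per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)

# Toy precision/recall pairs for one class (values are illustrative).
recalls = [0.2, 0.4, 0.6, 0.8, 1.0]
precisions = [1.0, 0.9, 0.8, 0.7, 0.6]
ap = average_precision(recalls, precisions)
print(round(ap, 2))  # 0.8
```

With a single "fall" class, mAP reduces to the AP of that class; with k classes it averages the k per-class values.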

Model Training
The process of model training is essentially the process of fitting model parameters. In this study, the model utilizes Adam (adaptive moment estimation) as the optimizer. A total of 300 epochs were trained, divided into two steps. In the first step, the parameters of the backbone network are frozen to expedite the training process; the learning rate is set to 0.001, and the batch size is set to 32.
In the second step, the parameters of the backbone network are unfrozen to fully learn the features of the detection targets and achieve better convergence.The learning rate is set to 0.0001, and the batch size is set to 16.To prevent the model from getting stuck in local optima, a cosine annealing decay schedule is employed for learning rate adjustment.
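The cosine annealing schedule used for the learning rate can be sketched as follows; the per-stage epoch counts are illustrative, since the text does not specify how the 300 epochs are divided between the two steps:

```python
import math

def cosine_annealing_lr(epoch, total_epochs, lr_max, lr_min=0.0):
    """Cosine annealing: the learning rate decays from lr_max to lr_min
    along half a cosine period over the stage's epochs."""
    return lr_min + 0.5 * (lr_max - lr_min) * (
        1 + math.cos(math.pi * epoch / total_epochs))

# Two-stage schedule from the text: frozen backbone at lr 1e-3, then
# unfrozen fine-tuning at lr 1e-4 (epoch split here is an assumption).
for stage_lr, epochs in ((1e-3, 50), (1e-4, 250)):
    lrs = [cosine_annealing_lr(e, epochs, stage_lr) for e in range(epochs)]
    print(f"start {lrs[0]:.6f}, end {lrs[-1]:.6f}")
```

The smooth decay avoids the abrupt drops of step schedules, which is what helps the optimizer escape shallow local optima late in training.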
The changes in the loss function throughout the entire training process are shown in Figure 6. It can be seen that around the 140th epoch, the network tends to converge: the loss curve becomes smooth and its fluctuations are small, indicating that the improved model trains effectively.

Test Results
To demonstrate the effectiveness of the proposed method, it was experimentally compared with four classic object detection methods: Faster R-CNN [32], YOLOv5, YOLOv7, and YOLOX. Faster R-CNN is a classic two-stage detection algorithm that generates candidate boxes through a region proposal network and then performs target classification and bounding box regression. YOLOv5, YOLOv7, and YOLOX are classic single-stage detection algorithms. The comparative experimental results are shown in Table 2. The results demonstrate that the mAP of the improved algorithm presented in this paper reaches 78.45%, an improvement of 9.32% over Faster R-CNN, 3.98% over YOLOv5, and 0.27% over YOLOv7. Moreover, the improved algorithm significantly enhances the detection speed of the model. Compared to the original baseline model, detection accuracy is increased by 1.31% while adding only an additional 0.1 M parameters. The effectiveness of the proposed algorithm is thus verified through comparison with mainstream object detection algorithms.

Ablation Experiments
To evaluate the impact of each improvement strategy adopted in this paper on the detection performance of YOLOX, ablation experiments were conducted on the dataset, as shown in Table 3. In the table, Model A represents the addition of SIoU to the base model, Model B represents the addition of TModule on top of Model A, and Model C represents the addition of SIoU and gnConv on top of the base model. From the data in the table, it can be observed that the SIoU loss function alone leads to a 0.16% increase in mAP. Building upon this, the addition of the designed TModule further improves mAP by 0.96%. Finally, the inclusion of the gnConv improvement strategy yields a further increase in mAP. The ablation results demonstrate that the improved algorithm achieves an overall mAP enhancement of 1.31% over the original baseline model and validate that the proposed improvement strategies effectively enhance the detection accuracy of pedestrian falls in substation scenarios.

Visualization of Detection Results
In this study, the detection performance of the original YOLOX model and that of the algorithm proposed in this paper were visually compared, as shown in Figure 7. The improved algorithm performs better than the baseline model, addressing the false negatives and false positives of the original algorithm. The benchmark model often fails to detect small-target personnel, such as those located at a distance, mainly due to insufficient feature extraction of the target, insufficient attention to the target against complex backgrounds, and an inability to handle the large differences in target scale among substation personnel. The improved algorithm enhances the localization and feature extraction of multiscale targets in complex backgrounds, thereby generating more accurate detection boxes and alleviating missed detections to a certain extent.

Conclusions
This article proposes an improved algorithm based on YOLOX to address the low detection accuracy of personnel fall detection in actual substation working environments. The algorithm introduces a feature extraction module in the YOLOX feature fusion section, enhancing neck feature extraction. By optimizing the loss function and adding a recursive gated convolution module to the head, detection is improved, yielding better model convergence and regression performance during training, as well as accurate detection of personnel falls in substation scenarios. The experimental results show that, compared with the original algorithm, the improved algorithm increases mAP by 1.31%. The improved algorithm offers a better balance between parameter count and accuracy: it adds few parameters yet achieves the best mAP, showing that it can improve the detection accuracy of multiscale targets in substations while meeting real-time detection requirements at the expense of some detection speed. In future research, we will attempt to prune and quantize the algorithm and port it to a development board to achieve real-time detection on on-site terminals.

Figure 2. The network structure of the improved YOLOX.

Figure 3. The network structure of TModule.

Figure 4. The network structure of gnConv.

γ = 2 − Λ, where IoU(P, G) represents the intersection-over-union ratio between the predicted box P and the ground truth box G; ∆ represents the distance loss; C_w and C_h represent the width and height of the minimum bounding rectangle enclosing the ground truth box and the predicted box; Λ represents the angle loss, with x = c_h / σ, where c_h is the vertical distance between the centers of the ground truth box and the predicted box and σ is the distance between these centers; and Ω represents the shape loss, with ω_w = |w − w^gt| / max(w, w^gt) and ω_h = |h − h^gt| / max(h, h^gt).

Figure 5. Partial dataset images of personnel falling in substations.

Figure 6. Changes in the total loss function.

Figure 7. Comparison of detection results before and after improvement.
2. In the YOLOX head, after the feature map undergoes convolutional normalization and activation functions, gnConv (recursive gated convolution) is introduced. This recursive convolution captures key information from the feature layers, improving the accuracy and speed of model detection without introducing significant additional parameters. 3. The SIoU loss function is used to address the problem that the IoU (intersection over union) loss function does not consider the angle information of the bounding boxes. By fully accounting for the influence of angle on model training, the SIoU loss function allows the model to adapt better to targets with different angles and shapes; it provides more accurate position information for bounding boxes and improves the model's regression capability.

Table 3. The results of the ablation experiments.