1. Introduction
The coal mine air door is typically installed in vehicle transportation and pedestrian roadways, serving the dual purpose of regulating the passage of individuals and vehicles while also managing airflow [1]. Currently, some coal mines continue to employ manual or semi-manual methods for controlling these air doors. This approach not only results in poor timeliness and low operational efficiency but also increases the risk of unsafe incidents. Consequently, it is essential to investigate real-time monitoring systems for mine air doors, as this research holds significant implications for advancing intelligent construction practices and ensuring safe production within the coal mining industry.
In the traditional air door scenario, a dispatcher monitors the conditions at both the inlet and outlet of the air door in real time and then operates the door manually, or various sensors are used to facilitate this process [2]. However, these methods are associated with several drawbacks, including high costs, low monitoring efficiency, and slow closure times for the air door. With advancements in artificial intelligence technology, object detection techniques based on deep learning have found extensive application in previously underserved underground scenarios, including miner safety equipment (e.g., helmets), monkey cars (overhead man-riding cableway cars), coal gangue management, and other related fields [3,4,5,6]. Cui Lizhen et al. [7] employed a Convolutional Neural Network (CNN) model for feature extraction, considering the various movement states of underground personnel, and classified their behavior patterns using a CNN-LSTM model. This model significantly improves accuracy and resilience against interference in personnel positioning within complex environments; however, its effectiveness depends on the quality of the sensor data, and its adaptability to other devices is currently limited. Xiao Zhenjiu et al. [8] proposed a helmet detection algorithm based on YOLOv8n, which effectively monitored small target objects, such as helmets, by integrating modifications to the feature fusion network, feature pyramid, and partial convolution techniques. The model adapts well to the complex underground environment while maintaining a low computational cost; however, model complexity and parameter count increase slightly compared to the original version, and there is still room for optimization in monitoring small and occluded targets. Xie Beijing et al. [9] developed a monkey car detection algorithm based on YOLOv8n that accounts for the varying states of personnel riding the monkey cars. The approach incorporates several strategies: an adaptive histogram equalization strategy to enhance image quality, deformable convolution to enlarge the target receptive field within the C2f module, and a coordinate attention mechanism to capture global key information. Consequently, the method significantly improves the detection accuracy of monkey cars in complex backgrounds such as dimly lit and occluded scenes, although detection speed still leaves room for improvement. Qin Yulong et al. [10] addressed the transportation challenges associated with large lump coal on belt conveyors by proposing a detection method based on YOLOv5s. This approach replaces specific standard convolution layers in the backbone network with parallel dilated (atrous) convolutions and incorporates a joint attention module, thereby enhancing the monitoring of large lump coal and reducing missed and incorrect detections. However, the computational complexity of the model increases, and its adaptability in extreme scenarios requires further validation. Chenxi Liu et al. [11] proposed an object detection algorithm based on YOLOv8n that incorporates the Multi-Head Self-Attention (MHSA) mechanism and a modified loss function, significantly enhancing multi-object detection capability and positioning accuracy.
Currently, there is a limited amount of research focused on the intelligent recognition of air door scenarios. Therefore, it is essential to conduct research and apply target detection technology within the context of air door scenarios in coal mines.
For the monitoring task associated with air door scenarios, the primary detection targets are vehicles and personnel passing through the air door. In these scenarios, the vehicle in question is an explosion-proof trackless rubber-wheeled vehicle. To further regulate the types of vehicles traversing the air door and to ensure the safety of mining operations, this paper identifies the specific type of explosion-proof trackless rubber-wheeled vehicle involved. In conventional intelligent transportation settings, license plate recognition is the primary means of managing vehicle entry and exit [12]. However, this approach demonstrates low detection efficiency in complex mining environments, which are characterized by dim lighting, monitoring lenses obscured by construction activities, and coal dust covering license plates. Zhu Jinrong et al. [13] employed morphological operations in conjunction with Fourier descriptors to extract vehicle contour curves and thereby recognize vehicles. This method is well suited to high-speed traffic environments and can significantly reduce both false positives and missed detections; however, the model's adaptability to complex traffic scenarios remains limited. To tackle the challenges associated with the hybrid detection and recognition of both large and small targets, Yu Jie et al. [14] proposed a vehicle recognition algorithm based on a cascaded multi-task deep neural network. This approach integrates an enhanced YOLO algorithm with the DeepSort algorithm to achieve effective vehicle recognition, localization, and target tracking. The system is well suited to complex environments and performs robustly when license plates are ambiguous or occluded, although its tracking accuracy remains vulnerable to occlusion. To address the challenges posed by complex traffic scenarios, Yang Rening et al. [15] improved YOLOv5s by substituting the upsampling module with the CARAFE operator to broaden the target receptive field, replacing the loss function with EIoU, incorporating a small target detection layer, and modifying the decoupled head to improve detection of small targets; channel pruning was also employed to reduce the model size. These modifications enable effective monitoring of both small and occluded targets, and the model shows promise in enhancing detection accuracy while reducing its size, though considerable room remains for improvement in practical deployment. In air door scenarios, personnel detection typically relies on manual or infrared methods, whereas machine vision-based monitoring has become the predominant approach for detecting personnel in underground environments. Li Xianguo et al. [16] proposed a detection algorithm for underground personnel using the Single Shot MultiBox Detector (SSD) network. By incorporating dense connection networks and residual networks, they improved both detection accuracy and real-time monitoring speed, reaching a detection speed of 48 frames per second, which is still relatively slow; furthermore, missed detections remain common in scenes with dense pedestrians or substantial occlusion. Zou Sheng et al. [17] introduced an enhanced CornerNet-Squeeze method for detecting personnel in coal mines, effectively addressing the challenges posed by complex underground environments and the limited detail at the edges of personnel targets. They implemented OctConv to strengthen feature extraction and employed a dual-scale image fusion algorithm to improve image quality, increasing detection accuracy while preserving the original model's detection speed. However, the computational complexity of the model complicates practical deployment, and the detection accuracy for large and medium-sized targets is somewhat lower than that of the original model. The research by Moawiah Alhulayil et al. [18] on relay networks demonstrates that model performance in complex environments can be significantly enhanced through various advanced algorithms, providing a theoretical foundation for using improved models to enhance target detection in mine-related scenarios. Unlike vehicle detection or personnel detection alone, identification in air door scenarios requires detecting both large and small targets. Although previous research has advanced various aspects, it has not achieved an optimal balance among lightweight design, detection accuracy, and detection speed. With the advancement of intelligent construction in coal mines, the detection of personnel and vehicles within mine safety monitoring systems has garnered increasing attention. The mining environment presents considerable challenges for target detection due to low illumination and complex backgrounds. Although current target detection methods have made significant progress, notable shortcomings remain in complex environments, including multi-target detection, difficulty in identifying small targets, and the low-visibility conditions specific to mine air door scenarios.
In light of this, the authors enhance the YOLOv8n model and introduce a lightweight CGSW-YOLO model specifically designed for multi-target monitoring in air door scenarios. The primary contributions of this paper are as follows: (1) the FasterNet block is integrated into the C2f module of the backbone network to form the C2f-Faster module, enabling the model to remain lightweight without sacrificing accuracy; (2) to minimize redundant information generated during feature extraction, GhostConv is introduced as a substitute for specific standard convolutions within the backbone network; (3) a slim-neck structure is designed by integrating GSConv and the cross-stage partial network module VoV-GSCSP into the neck, effectively reducing both computational load and parameter count while preserving accuracy; and (4) the loss function is replaced with WIoUv3 to improve the quality of bounding boxes, significantly enhancing the model's ability to locate and detect multiple targets in air door scenarios. This approach effectively meets the deployment requirements for video surveillance in coal mines, particularly concerning lightweight design and real-time performance, and aims to provide an efficient monitoring solution for man-vehicle interactions in intelligent coal mine air door scenarios.
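To make the C2f-Faster design concrete, below is a minimal PyTorch sketch of the FasterNet partial convolution (PConv) that C2f-Faster builds on: only a fraction of the input channels pass through a 3 × 3 convolution, while the rest are forwarded unchanged. The class name, the 1/4 channel fraction, and the shape check are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class PConv(nn.Module):
    """Partial convolution in the FasterNet style: convolve only a subset
    of channels and pass the rest through unchanged, cutting FLOPs and
    memory access. The 1/4 fraction is an assumption for illustration."""
    def __init__(self, channels: int, fraction: float = 0.25):
        super().__init__()
        self.conv_ch = max(1, int(channels * fraction))  # channels actually convolved
        self.conv = nn.Conv2d(self.conv_ch, self.conv_ch, kernel_size=3,
                              padding=1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Split: the first part is convolved, the second is an identity shortcut.
        x1, x2 = torch.split(x, [self.conv_ch, x.size(1) - self.conv_ch], dim=1)
        return torch.cat((self.conv(x1), x2), dim=1)

# Quick shape check: the output keeps the input's channel count.
if __name__ == "__main__":
    y = PConv(64)(torch.randn(1, 64, 80, 80))
    print(y.shape)  # torch.Size([1, 64, 80, 80])
```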
3. Experiment and Analysis
3.1. Experimental Environment Configuration
The experiments were conducted on a computer running the Windows 10 operating system, equipped with an Intel® Core™ i5-12400F CPU (Intel, Santa Clara, CA, USA) operating at 2.5 GHz, an RTX 4060 Ti graphics card (NVIDIA, Santa Clara, CA, USA), and 32 GB of RAM. The software environment includes Python 3.8.0, PyTorch 1.8.1, and CUDA 11.1.
The model parameters are configured as follows: the input image size is set to 640 × 640 pixels; training runs for 300 epochs; the batch size is 8; the initial learning rate is 0.01; the momentum parameter is 0.937; the weight decay factor is 0.0005; and early stopping is applied with a patience of 50 epochs without improvement.
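For reference, these settings map onto a training invocation along the following lines. This is a hedged sketch assuming the Ultralytics YOLOv8 Python API; the dataset configuration file air_door.yaml is a hypothetical name, and the authors' actual training script may differ.

```python
from ultralytics import YOLO

# Load the YOLOv8n baseline; the modified CGSW-YOLO would use a custom model YAML.
model = YOLO("yolov8n.yaml")

# Hyperparameters mirror Section 3.1; "air_door.yaml" is a hypothetical dataset config.
model.train(
    data="air_door.yaml",
    imgsz=640,            # input image size 640 x 640
    epochs=300,           # training epochs
    batch=8,              # batch size
    lr0=0.01,             # initial learning rate
    momentum=0.937,       # SGD momentum
    weight_decay=0.0005,  # weight decay factor
    patience=50,          # early stopping after 50 epochs without improvement
)
```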
3.2. Experimental Dataset
Due to the presence of numerous vehicles and individuals traversing the mine air door, it is essential to enhance the model's generalization capabilities. To achieve this, comprehensive information on all vehicles and individuals that may pass through the air door must be gathered during data collection. The image dataset is composed of various air door scenes captured on-site at the Shanxi Jinshen Ciyao Coal Industry in Xinzhou, China, as well as surveillance videos recorded at the Jinneng Holding Tashan Coal Industry in Datong, China. A total of 861 effective images depicting explosion-proof diesel trackless rubber-wheeled vehicles and personnel movement across various scenarios, including air doors, adits, and rubber conveyor heads, were collected through frame extraction from video surveillance and recordings. The dataset was annotated using the LabelImg software (version 1.8.6) and categorized into the following eight categories.
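Frame extraction of the kind described above is commonly done with OpenCV; the snippet below is a minimal sketch in which the file names and the sampling interval are assumptions, not the authors' actual collection pipeline.

```python
import cv2

def extract_frames(video_path: str, out_dir: str, every_n: int = 25) -> int:
    """Save every n-th frame of a surveillance video as a JPEG image.
    The output directory must already exist."""
    cap = cv2.VideoCapture(video_path)
    saved, idx = 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % every_n == 0:  # sample one frame per `every_n` frames
            cv2.imwrite(f"{out_dir}/frame_{idx:06d}.jpg", frame)
            saved += 1
        idx += 1
    cap.release()
    return saved

# Example: roughly one frame per second from 25 fps footage (names are hypothetical).
# extract_frames("air_door_cam01.mp4", "dataset/raw", every_n=25)
```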
Due to the variety of vehicles operating in different scenarios, along with the various categories of miners navigating the coal mine, it is essential to improve safety measures for both personnel and company property in air door situations. Furthermore, there is a pressing need to enhance safety monitoring and inspection efforts within the coal mine. The various types of vehicles and personnel operating through the air door are categorized into eight distinct groups: WC9R explosion-proof trackless rubber-wheeled vehicles (WC9R), WC3S explosion-proof trackless rubber-wheeled vehicles (WC3S), WCJ8E explosion-proof trackless rubber-wheeled vehicles (WCJ8E), WC28E explosion-proof trackless rubber-wheeled vehicles (WC28E), WC60Y(A) explosion-proof trackless rubber-wheeled vehicles (WC60Y(A)), maintenance workers (miner_yellow), and coal workers (miner_blue). Samples of the different vehicle types and job categories are presented in Figure 8.
To mitigate the risk of overfitting during the training process and enhance the model’s performance, the dataset was augmented to include 2400 images using various data augmentation techniques. These techniques included flipping, translation, contrast enhancement, adaptive histogram equalization, and random occlusion. Subsequently, the dataset was partitioned into a training set, validation set, and test set in a ratio of 7:2:1.
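To illustrate the augmentations listed above, the following sketch applies flipping, translation, CLAHE-based adaptive histogram equalization, and random occlusion using OpenCV and NumPy. The probabilities, shift ranges, and occlusion size are illustrative assumptions; in practice, the geometric transforms must also be applied to the bounding box labels.

```python
import cv2
import numpy as np

rng = np.random.default_rng(0)

def augment(img: np.ndarray) -> np.ndarray:
    """Apply one pass of the augmentations described in Section 3.2 (sketch)."""
    h, w = img.shape[:2]
    # Horizontal flip with probability 0.5
    if rng.random() < 0.5:
        img = cv2.flip(img, 1)
    # Random translation of up to ~10% of the image size
    tx = int(rng.integers(-w // 10, w // 10))
    ty = int(rng.integers(-h // 10, h // 10))
    M = np.float32([[1, 0, tx], [0, 1, ty]])
    img = cv2.warpAffine(img, M, (w, h))
    # Adaptive histogram equalization (CLAHE) on the luminance channel
    lab = cv2.cvtColor(img, cv2.COLOR_BGR2LAB)
    lab[..., 0] = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8)).apply(lab[..., 0])
    img = cv2.cvtColor(lab, cv2.COLOR_LAB2BGR)
    # Random occlusion: black out a small rectangle
    x0 = int(rng.integers(0, w - w // 8))
    y0 = int(rng.integers(0, h - h // 8))
    img[y0:y0 + h // 8, x0:x0 + w // 8] = 0
    return img
```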
3.3. Model Evaluation Indicators
In order to objectively assess the multi-objective intelligent recognition method for coal mine air door scenes, this experiment utilizes several performance evaluation metrics: Precision (P), Recall (R), Average Precision (AP), mean Average Precision (mAP), Parameters (Params), and Frames Per Second (FPS).
Params denotes the number of trainable model parameters; the larger the parameter count, the heavier the computational burden during training and inference. FPS refers to the number of images that the model can process per second. The formulas for calculating the remaining indicators are presented in Equations (20)–(23).
In the formula, TP represents the number of samples that have been correctly identified, while FP denotes the number of samples that have been incorrectly identified. AP refers to the area under the Precision–Recall (P–R) curve, and n indicates the total number of categories for the identified samples. Based on the dataset utilized in this study, n is equal to 8.
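Equations (20)–(23) are not reproduced in this excerpt; for reference, the standard definitions that the description above matches are given below, where FN denotes the number of targets that were missed:

```latex
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}, \qquad
AP = \int_{0}^{1} P(R)\,\mathrm{d}R, \qquad
mAP = \frac{1}{n} \sum_{i=1}^{n} AP_i
```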
3.4. Ablation Experiment
To assess the effectiveness of the enhanced methods presented in this paper, YOLOv8n was utilized as the benchmark model for conducting ablation experiments. The results are displayed in Table 1 below. First, the C2f module of the backbone network is replaced with the C2f-Faster module to improve feature fusion. Second, standard convolution in the backbone network is substituted with ghost convolution. Subsequently, the lightweight Slim-neck structure is introduced for multi-scale feature fusion. Finally, the loss function is changed to WIoUv3.
As illustrated in Table 1, the incorporation of the C2f-Faster module in Model ① resulted in a 2.7% increase in Precision (P) and a 1.3% increase in mean Average Precision at IoU = 0.5 (mAP@0.5) compared to the original model, along with a 12.3% reduction in the number of parameters. These findings suggest that integrating the C2f-Faster module into the backbone network significantly enhances the effectiveness of feature fusion. Model ② employs lightweight ghost convolution. While maintaining mAP@0.5, it achieves a 2.6% increase in Precision (P); the number of parameters decreases by 6.3%, and the Frames Per Second (FPS) improves from 100 to 102.04. These results indicate that the model is made lighter while preserving detection accuracy.

In Model ③, the Slim-neck structure is introduced independently, resulting in a 2.9% increase in Recall (R), a 7.0% reduction in parameters, and a 2.04% increase in FPS, together with a 1.1% improvement in mAP@0.5. This indicates that the accuracy gains come with only a marginal change in detection speed. After substituting the loss function in Model ④, Precision (P), Recall (R), and mAP@0.5 increased by 0.4%, 2.6%, and 0.3%, respectively, while the FPS decreased by 28.6%; the loss function substitution thus improves detection performance at some cost in speed.

Model ⑤, which combines C2f-Faster with ghost convolution, demonstrated improvements in P, R, and mAP@0.5 of 1.0%, 0.3%, and 0.1%, respectively, while the number of parameters decreased by 18.3% and the FPS increased by 19%, indicating that the model achieved a lightweight design while enhancing both detection accuracy and processing speed. Model ⑥ builds upon Model ⑤ by incorporating the Slim-neck structure, yielding consistent improvements in P, R, and mAP@0.5 while further reducing the number of parameters; this indicates that introducing Slim-neck enhances detection accuracy and significantly improves detection speed. Model ⑦ is the final enhanced version presented in this paper, integrating C2f-Faster, GhostConv, Slim-neck, and WIoUv3. It achieves the best overall results, with P, R, mAP@0.5, and FPS reaching 88.4%, 93.9%, 98.0%, and 135.14 frames per second, respectively.
3.5. Comparative Experiments
3.5.1. Comparative Analysis of Various Models
To further validate the superiority of the CGSW-YOLO model, we conducted comparative experiments with several current mainstream algorithms, including YOLOv3-Tiny, YOLOv5s, YOLOv7-Tiny, YOLOv8s, YOLOv8n, Faster R-CNN, EfficientDet [32], RT-DETR [33], and YOLOv11n [34]. Under consistent experimental parameters and identical software and hardware environments, we compared various metrics, including Precision (P), Recall (R), mean Average Precision (mAP), the number of parameters, Frames Per Second (FPS), floating-point operations (FLOPs), and model size. The results are presented in Table 2 below.
Table 2 shows that the Precision (P) of the YOLOv3-Tiny model surpasses that of the CGSW-YOLO model by 4.7%, and that its mean Average Precision (mAP) is greater than that of YOLOv5s, YOLOv7-Tiny, YOLOv8s, and YOLOv8n by 1.1%, 6.6%, 1.8%, and 1.3%, respectively. However, the mAP of the YOLOv3-Tiny model is slightly lower than that of the CGSW-YOLO model, and its parameters, FLOPs, and model size compare unfavorably with the improved algorithm, which presents challenges for the practical deployment of equipment in underground coal mines. The Precision (P) of the YOLOv5s model is 4.0% higher than that of the CGSW-YOLO model, but its mAP and FPS are lower than those of the CGSW-YOLO model by 1.5% and 48.1%, respectively, and its parameters, FLOPs, and model size are all greater than those of the improved algorithm. The mAP of the CGSW-YOLO model is 7.0%, 2.2%, and 1.7% higher than that of the YOLOv7-Tiny, YOLOv8s, and YOLOv8n models, respectively; its parameters are lower by 66.7%, 78.7%, and 21.6%, its FLOPs by 53.0%, 78.2%, and 23.5%, and its model size by 67.5%, 77.9%, and 20.6%, while its FPS is higher by 64.5%, 40.5%, and 26.0%, respectively.
As illustrated in Table 2, the mean Average Precision (mAP) of the Faster R-CNN model and the RT-DETR model is only 0.4% and 0.6% lower, respectively, than that of the CGSW-YOLO model. However, both models have significantly more parameters, higher computational demands, and larger model sizes than the CGSW-YOLO model, and their FPS is lower by 65.7% and 64.3%, respectively. Although both models demonstrate high detection accuracy, they do not meet the lightweight requirements necessary for practical deployment. The mAP of the EfficientDet model and the YOLOv11n model is 1.8% and 2.4% lower, respectively, than that of the CGSW-YOLO model, and their FPS is lower by 28.9% and 29.4%, respectively. Moreover, the EfficientDet model requires 21.6% more FLOPs than the CGSW-YOLO model and contains a significantly larger number of parameters, which prevents it from maintaining high FPS while ensuring high detection accuracy. Compared with the benchmark model YOLOv8n, the YOLOv11n model is smaller, but this reduction in parameters and FLOPs compromises both detection accuracy and real-time performance. For practical applications in mining environments, using YOLOv8n as the benchmark model therefore offers a more effective balance between model accuracy and FPS.
Therefore, the comparative experiments presented above demonstrate that the CGSW-YOLO model effectively balances lightweight design with high precision. This further emphasizes the advantages of the improved method proposed in this paper.
3.5.2. Comparison of Different Convolutions
To evaluate the effectiveness of the GhostConv module, we used YOLOv8n as the reference model and replaced its standard convolutions with the ODConv [35], DWConv, and CondConv [36] modules in turn, keeping the experimental conditions consistent for comparative analysis. The results of these experiments are presented in Table 3 below.
As shown in Table 3, the accuracy of all convolution variants improved compared to the YOLOv8n model. Among them, GhostConv achieves the highest mean Average Precision (mAP) at 97.4%. Although ODConv exhibits a lower average accuracy than GhostConv, it surpasses it by 2.0% in FPS. CondConv reaches up to 232.56 FPS, albeit at the expense of mAP. DWConv incurs a modest reduction of 0.4% in mAP and a 2.0% decrease in FPS while keeping parameters, FLOPs, and model size minimal. Of the four convolution variants integrated into YOLOv8n, GhostConv shows the most favorable overall performance: Precision (P), mAP, and FPS increase by 2.6%, 1.1%, and 2.0%, respectively, compared to the original model, while the number of parameters, FLOPs, and model size are reduced by 6.3%, 2.5%, and 6.3%, respectively, relative to the baseline. In summary, GhostConv not only reduces parameters and FLOPs but also improves FPS during the feature extraction stage, offering clear and comprehensive advantages that meet the monitoring requirements of mine air door scenarios.
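As a reference for the mechanism being compared here, below is a minimal PyTorch sketch of a GhostConv block in the style used by YOLOv8-family code: a standard convolution produces half of the output channels, and a cheap depthwise convolution generates the rest. The kernel sizes and SiLU activation are assumptions aligned with common practice, not necessarily the exact configuration used in this paper.

```python
import torch
import torch.nn as nn

class GhostConv(nn.Module):
    """Ghost convolution: a standard conv yields half the output channels;
    a cheap 5x5 depthwise conv over those features yields the other half."""
    def __init__(self, c_in: int, c_out: int, k: int = 1, s: int = 1):
        super().__init__()
        c_hidden = c_out // 2
        self.primary = nn.Sequential(
            nn.Conv2d(c_in, c_hidden, k, s, k // 2, bias=False),
            nn.BatchNorm2d(c_hidden), nn.SiLU(),
        )
        # Cheap operation: depthwise conv, one filter per channel.
        self.cheap = nn.Sequential(
            nn.Conv2d(c_hidden, c_hidden, 5, 1, 2, groups=c_hidden, bias=False),
            nn.BatchNorm2d(c_hidden), nn.SiLU(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.primary(x)
        return torch.cat((y, self.cheap(y)), dim=1)
```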
3.5.3. Comparison of Different Loss Functions
To evaluate the effectiveness of the WIoU loss function, comparative experiments were conducted using the GIoU, EIoU, and SIoU loss functions. The results of these experiments are presented in Table 4 below.
As illustrated in Table 4, the WIoU loss function employed in this study demonstrates a significant improvement over the CIoU loss function utilized by the benchmark model: it enhances the mean Average Precision (mAP) by 0.3%, increases Precision (P) by 0.4%, boosts Recall (R) by 2.6%, and raises FPS by 31.58%.
The mAP of Wise-IoU (WIoU) is slightly lower than that of the Generalized IoU (GIoU) and Efficient IoU (EIoU) loss functions, by 0.1% and 0.4%, respectively. However, the FPS of WIoU exceeds that of the other three loss functions, suggesting that WIoU is better suited to real-time monitoring on mobile devices.
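For context, a sketch of the Wise-IoU formulation, following our reading of the original Wise-IoU paper: the v1 loss couples the IoU loss with a distance-based attention term, and v3 adds a dynamic non-monotonic focusing coefficient. Here the superscript * denotes detachment from the computation graph, and α and δ are hyperparameters.

```latex
\mathcal{L}_{\mathrm{WIoUv1}} = \mathcal{R}_{\mathrm{WIoU}}\,\mathcal{L}_{\mathrm{IoU}},
\qquad
\mathcal{R}_{\mathrm{WIoU}} = \exp\!\left(\frac{(x - x_{gt})^2 + (y - y_{gt})^2}{\bigl(W_g^2 + H_g^2\bigr)^{*}}\right),
\qquad
\mathcal{L}_{\mathrm{WIoUv3}} = r\,\mathcal{L}_{\mathrm{WIoUv1}},
\quad
r = \frac{\beta}{\delta\,\alpha^{\beta - \delta}},
\quad
\beta = \frac{\mathcal{L}_{\mathrm{IoU}}^{*}}{\overline{\mathcal{L}_{\mathrm{IoU}}}}
```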
3.5.4. Visualization Experiment
The visualization results of the final improved model across all categories of the dataset are presented in Figure 9. To effectively illustrate the detection performance of the enhanced algorithm, a visual comparison was conducted between the YOLOv8n model and the CGSW-YOLO model, with the corresponding results displayed in Figure 10.
In Figure 10, scene A represents noise, scene B illustrates occlusion, and scene C depicts contrast enhancement. The figure presents the detection results for various categories of vehicles and personnel.
As illustrated in Figure 10a, both models are capable of performing the detection task; however, the YOLOv8n model exhibits missed detections and false positives. For example, it fails to detect WC3S and WC28E vehicles, misclassifies WC9R vehicles as WC60Y(A) vehicles, and produces erroneous detections involving miner_yellow personnel, among others. The detection performance of the YOLOv8n model is suboptimal under noisy conditions, and its efficacy remains inadequate in scenarios involving occlusion and contrast enhancement.
As illustrated in Figure 9 and Figure 10, the CGSW-YOLO model effectively detected the various vehicles and personnel, with no instances of missed or incorrect detections. Furthermore, its detection confidence under conditions of noise, occlusion, and contrast enhancement is slightly higher than that of the YOLOv8n model. Therefore, the CGSW-YOLO model is more proficient at accurately detecting vehicles and individuals in mine air door scenarios.