Article

ECViST: Mine Intelligent Monitoring Based on Edge Computing and Vision Swin Transformer-YOLOv5

1 School of Mechanical Electronic & Information Engineering, China University of Mining and Technology, Beijing 100083, China
2 Key Laboratory of Intelligent Mining and Robotics, Ministry of Emergency Management of China, Beijing 100083, China
3 Institute of Intelligent Mining and Robotics, China University of Mining and Technology, Beijing 100083, China
* Author to whom correspondence should be addressed.
These authors contributed equally to the manuscript.
Energies 2022, 15(23), 9015; https://doi.org/10.3390/en15239015
Submission received: 26 October 2022 / Revised: 24 November 2022 / Accepted: 25 November 2022 / Published: 29 November 2022
(This article belongs to the Special Issue Intelligent Coal Mining Technology)

Abstract

Mine video surveillance plays a key role in ensuring the production safety of intelligent mining. However, existing mine intelligent monitoring technology mainly processes video data in the cloud, which suffers from network congestion, large memory consumption, and untimely response to regional emergencies. In this paper, we address these limitations with an edge-cloud collaborative optimization framework. First, we obtained a coarse model using the edge-cloud collaborative architecture and updated it to realize continuous improvement of the detection model. Second, we proposed a target detection model based on the Vision Swin Transformer-YOLOv5 (ViST-YOLOv5) algorithm and improved the model for edge device deployment. The experimental results showed that the ViST-YOLOv5 object detection model, with a model size of only 27.057 MB, improved average detection accuracy by 25% compared to the state-of-the-art model, making it suitable for edge-end deployment at the mining workface. For actual mine surveillance video, the edge-cloud collaborative architecture achieves better performance and robustness in typical application scenarios, such as weak lighting and occlusion, which verifies the feasibility of the designed architecture.

1. Introduction

In recent years, video surveillance has played an important role in ensuring the safety of coal mine production and workers' lives, and many coal mining companies process large volumes of surveillance video [1]. Relying on the subjective judgment of surveillance personnel inevitably creates problems such as inefficiency, untimely response, and human fatigue [2]. With the development of artificial intelligence technology, intelligent video surveillance in coal mines is undoubtedly a major future trend. Compared with manual monitoring, intelligent video surveillance not only processes video faster but also greatly reduces costs for coal mining enterprises, and machine vision technology is gradually and effectively replacing manual identification methods.
However, modern video surveillance for intelligent mines is widely distributed and generates a large amount of data that, without proper pre-training, fails to yield satisfactory results; it therefore requires substantial storage and computing resources. Artificial intelligence (AI) models for video processing are usually deployed on cloud servers with abundant computing and storage resources. However, centralized cloud processing of mine surveillance video is time-consuming, the mine network environment is constrained, and emergencies within the monitoring range demand timely handling; hence, the high latency of cloud computing, network congestion, and related problems seriously affect the safety of coal production. To solve these problems, traditional intelligent video surveillance frameworks and artificial intelligence algorithms must be improved. Shi et al. [3,4] defined edge computing as a new computing model that implements computing functions at the edge of the network. With the development of smart mine theory [5,6], applying edge computing to smart mines can effectively relieve the computational pressure of mine video surveillance and solve the problem of video data transmission in a restricted network environment [7,8]. Therefore, an edge-computing-oriented architectural scheme is proposed for current mine video surveillance requirements, in which target detection and identification tasks with high real-time requirements are performed at the edge. In the proposed scheme, the cloud updates the algorithm parameters and regularly pushes them to the edge to facilitate collaborative work and effectively improve the efficiency of mine video surveillance.
To meet the real-time and intelligence requirements of mine surveillance, neural network models need to be deployed at the edge. Edge devices have limited computing power and memory and cannot support neural network models with large resource consumption, so models with small memory and computation footprints must be deployed while still meeting accuracy and speed requirements. Lightweight neural networks, with few parameters and low computational complexity, can be applied to edge devices after suitable improvement and can be trained with minimal cross-server communication. The YOLO series of one-stage detection algorithms, featuring fast detection speed and high detection accuracy, is the most widely used family of target detection algorithms [9]. YOLO, first presented at CVPR 2016, transforms the target detection task from a classification problem into a regression problem, using a single end-to-end network that takes image data as input and outputs the locations and categories of objects in the image. YOLOv5 improves on YOLOv4 and YOLOv4-tiny: it has fewer parameters and occupies less memory, which enables high real-time performance, and it can now serve as a relatively complete lightweight neural network model for deployment on edge devices. Considering the actual needs of mine monitoring, and given the problems of low light and difficult underground model deployment, the generalization ability of the YOLOv5 detection model needs to be improved; such an improvement could achieve real-time, accurate identification of mine workers and other targets under the limited environmental conditions at the edge.
Recently, ViT (Vision Transformer) demonstrated that the Transformer model, originally developed for natural language processing (NLP), has remarkable ability on computer vision tasks [10,11]. With the development of the transformer structure in the image field, ViT achieves better results than the traditional convolutional neural network (CNN) model in image classification, detection, denoising recovery, and similar tasks; it extracts shallow image features that contain global information and reduces image-specific inductive bias. Attention mechanisms originating from the transformer in NLP, such as the SE, CBAM, and CAM modules, can effectively improve network performance without relying on a CNN structure.
A transformer faces two problems when transferred from natural language processing to vision: (a) it cannot recognize multiple scales of the same target in an image, which semantics would categorize as the same object, and (b) for high-resolution images, the sequence formed by taking pixels as the basic unit is excessively long, making computation intensive and exhausting video memory resources. To solve these problems, the Swin Transformer reduces the sequence length by dividing the input image into multiple windows of the same size, restricting the global attention mechanism to each window, and uses a shifted-window approach to capture people or objects of different scales in the foreground and background [12]. In this work, the Swin Transformer is applied to the YOLOv5 target detection model, and the Swin Transformer-YOLOv5 detection model is constructed to improve detection accuracy and achieve multi-target, multi-scale detection under the different environmental conditions of actual mine monitoring scenarios.
In this paper, we present a method that synthesizes a highly detailed ECViST-YOLOv5 model from the Swin Transformer and YOLOv5 with reduced computation time. Specifically, we propose a coarse-to-fine optimization approach that uses edge-cloud collaboration and a Vision Swin Transformer with YOLOv5 to optimize the object detection model, achieving both better object detection performance and more robustness in typical monitoring scenarios.
Our contribution can be summarized as follows:
  • We present a coarse model using edge-cloud collaborative architecture and update it to realize the continuous improvement of the detection model.
  • We propose a target detection model based on a Vision Swin Transformer-YOLOv5 algorithm and improve the model for edge device deployment at the mining workface.
  • We highlight the effectiveness of our approach by combining it with the latest developed video backbones and achieve significant improvements over the state-of-the-art results on the dataset.

2. Overall Architecture

Currently, the coal mining industry mostly adopts manual video surveillance based on the cloud computing paradigm: video data collected at the front end is transmitted to the cloud for processing, and some of the processed data is transmitted to the monitor's client through the network [13]. The increasing scale of front-end camera deployment in mine monitoring systems, especially in risky monitoring areas, puts growing pressure on the network and the cloud as video data is transmitted, so delays in video processing are inevitable. When an emergency occurs in the detection area, immediate decisions cannot be made, endangering production safety.
The mine edge devices are deployed industrially to constitute the underground ring network; each edge node communicates with the others, and the collected information is displayed through the edge data collection gateway, the monitoring terminal, the ground base station and, finally, the network of the dispatching center's monitoring platform [14,15,16], as shown in Figure 1. For different scenarios and emergency response levels, the information collected by the devices is offloaded as tasks at the edge gateway [17,18]; this detection approach ensures the efficiency of mine production tasks, improves production safety, makes effective use of mine edge resources, and reduces task processing delay. Therefore, an edge architecture and improved target detection algorithms need to be designed to ensure real-time mine monitoring effectiveness and network utilization at the edge nodes, thereby improving monitoring task efficiency and reducing task resource occupancy.

2.1. Implementation Details

The system architecture designed in this paper is shown in Figure 2. The terminal mine surveillance devices collect video data, which is passed to the edge devices, and the cloud receives the data transmitted by the edge to produce a dataset for training the target detection model. After the dataset is produced, the cloud trains the established model. The updated network model trained in the cloud is transmitted to the edge side, and the edge devices analyze the video data to respond in a timely manner. SFTP is used to transmit data between the edge devices and the server to improve the real-time performance of the system. The cloud is responsible for storing the detection data completed at the edge, optimizing the detection model according to the field data transmitted from the edge, and then transmitting the optimized model back to the edge. This realizes continuous improvement of the system's target detection network model, forming a virtuous cycle and improving the quality of video surveillance. Finally, the surveillance video processed by the optimized network model is output; detection indexes of mine video surveillance, such as detection accuracy, detection speed, and the ability to cope with different environmental conditions, are reported; and the detection capability of the model is assessed in real-time as the surveillance results are obtained.
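As an illustration of the model-transfer step, the following is a minimal sketch of pushing updated detection-model weights from the cloud server to an edge device over SFTP. The host address, credentials, and file paths are placeholder assumptions, and the paramiko library is one possible SFTP client; the paper does not specify the implementation.

```python
import paramiko

# Hypothetical connection details; replace with the actual edge-device address.
EDGE_HOST = "192.168.1.20"
EDGE_USER = "edge"
EDGE_PASS = "password"

def push_model_to_edge(local_weights: str, remote_weights: str) -> None:
    """Transmit updated detection-model weights from the cloud to the edge over SFTP."""
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(EDGE_HOST, username=EDGE_USER, password=EDGE_PASS)
    try:
        sftp = client.open_sftp()
        sftp.put(local_weights, remote_weights)  # upload the new weights file
        sftp.close()
    finally:
        client.close()

# Example call (paths are illustrative):
# push_model_to_edge("runs/train/best.pt", "/opt/monitor/weights/best.pt")
```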

2.2. Edge Computing Based on Task Offloading

The mine intelligent monitoring system adopts a cloud-edge-terminal collaborative architecture comprising the terminal device layer, the edge computing layer, and the cloud service layer. The deployment strategy for the mine surveillance target tracking algorithm places the feature extraction and depth estimation modules on the edge server and the algorithm update module on the cloud server. The system uses task offloading technology to deploy video processing algorithms with cloud-edge collaboration, which speeds up the processing of unstructured data, such as video images, by offloading computational tasks to the edge server or cloud server.
In this paper, the EdgeCloudSim virtual edge simulation platform was used. The task-offloading edge computing algorithm model is shown in Figure 3. The mine monitoring system has n terminal devices, and the mine target monitoring algorithm P decomposes into m task submodules, denoted as P = {p1, p2, p3, …, pm}, with k edge servers on the edge computing side providing distributed computing and data access services for the monitoring terminal devices.
The implementation steps of the mine monitoring target edge computing algorithm are as follows:
(1) Video sequence initialization: divide the video processing task into modules;
(2) Video processing task offloading: perform feature extraction, depth estimation, and model updating for each target image in the video sequence, and offload the video processing tasks to the cloud computing task queue and the edge computing task queue;
(3) Edge computing: upload the feature and scale-value data produced by each completed task module to the edge and cloud servers and perform the computation (a simplified sketch of this offloading flow follows the list).
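The following is a minimal sketch of the offloading step above: each submodule of P is assigned to an edge or cloud task queue. The capacity-threshold rule, cost values, and module names are illustrative assumptions, not the paper's exact scheduling policy.

```python
from dataclasses import dataclass, field

@dataclass
class TaskModule:
    name: str
    compute_cost: float   # estimated processing cost (e.g., GFLOPs)
    latency_critical: bool

@dataclass
class Queues:
    edge: list = field(default_factory=list)
    cloud: list = field(default_factory=list)

def offload(modules, edge_capacity: float) -> Queues:
    """Assign each submodule of P = {p1, ..., pm} to the edge or cloud queue.
    Latency-critical modules go to the edge while capacity remains; the rest go to the cloud."""
    queues = Queues()
    used = 0.0
    for p in sorted(modules, key=lambda p: not p.latency_critical):
        if p.latency_critical and used + p.compute_cost <= edge_capacity:
            queues.edge.append(p.name)
            used += p.compute_cost
        else:
            queues.cloud.append(p.name)
    return queues

# Example: feature extraction and depth estimation at the edge, model update in the cloud,
# matching the deployment strategy described in Section 2.2.
P = [TaskModule("feature_extraction", 2.0, True),
     TaskModule("depth_estimation", 3.0, True),
     TaskModule("model_update", 8.0, False)]
print(offload(P, edge_capacity=6.0))
```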
The experimental platform used a Dell Precision 7920 workstation (Dell (China) Co., Ltd. Beijing Branch) as the cloud side and the PC side as the edge side for video surveillance with the technical specifications shown in Table 1.

3. Method

3.1. Mine Intelligent Monitoring Model

In this paper, a mine intelligent monitoring model was proposed based on the YOLOv5 algorithm. The feature extraction network structure is shown in Figure 4, where the input image data resolution is 640 × 640 and the combination of bracketed numbers indicates the output feature map resolution and the number of channels.
The backbone network used for feature extraction is CSPDarkNet, which adds CSPNet modules to each residual module of DarkNet53. Compared with CSPDarknet53, the activation function in the network structure is changed to the faster SiLU, an improved combination of the Sigmoid and ReLU functions that is non-monotonic, smooth, and self-stabilizing; its global minimum with zero derivative suppresses the learning of excessively large weights and has an equalizing effect. The CSPDarkNet feature network consists of a DBS (DarknetConv2D_BN_SigRelu) module and an easily optimized residual module (Resblock_body), which alleviates the vanishing-gradient problem caused by increasing network depth.

3.2. Data Input

In actual target detection experiments, images have different aspect ratios, and the traditional method pads an image up to the model's required 1:1 input aspect ratio, which can lose image content and retain incomplete feature information. Therefore, an adaptive image scaling method is used: the shrink ratio is calculated first, and the pixels to be padded are then computed from the scaled image dimensions, achieving aspect adjustment of the image.
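A minimal sketch of this adaptive scaling (the "letterbox" resize used in YOLOv5-style pipelines) is shown below. The 640 × 640 target size follows the paper; the gray fill value 114 is the common YOLOv5 convention and is an assumption here.

```python
import cv2
import numpy as np

def letterbox(img: np.ndarray, new_size: int = 640, fill: int = 114) -> np.ndarray:
    """Adaptively scale an image: shrink by the limiting ratio, then pad only the
    remaining pixels so that content is neither cropped nor distorted."""
    h, w = img.shape[:2]
    r = min(new_size / h, new_size / w)          # shrink ratio
    nh, nw = round(h * r), round(w * r)          # scaled size
    resized = cv2.resize(img, (nw, nh), interpolation=cv2.INTER_LINEAR)
    top = (new_size - nh) // 2                   # padding computed from scaled size
    left = (new_size - nw) // 2
    out = np.full((new_size, new_size, 3), fill, dtype=img.dtype)
    out[top:top + nh, left:left + nw] = resized  # center the scaled image
    return out
```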
To improve detection accuracy, this paper draws on the CutMix data enhancement method and adopts the Mosaic data enhancement method [19]. Mosaic stitches images together: four images are collected at a time, operations such as scaling, flipping, and color gamut change are applied, and the four images are placed into four positions and combined. This enriches the dataset and allows the data of four images to be normalized simultaneously, which benefits model training; the effect is shown in Figure 5. The method also increases the number of small targets, enhancing the robustness of the detection model against the small-target detection problem of one-stage YOLO-series algorithms.
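Below is a simplified sketch of the Mosaic operation described above: four images are stitched into the four quadrants of one canvas, with a random flip applied. Color-gamut jitter, random scaling ranges, and bounding-box label remapping are omitted for brevity, so this is a sketch rather than the paper's full augmentation pipeline.

```python
import random
import numpy as np
import cv2

def mosaic(imgs, out_size: int = 640) -> np.ndarray:
    """Stitch four images into one training sample (label remapping omitted)."""
    assert len(imgs) == 4
    half = out_size // 2
    canvas = np.full((out_size, out_size, 3), 114, dtype=np.uint8)
    corners = [(0, 0), (0, half), (half, 0), (half, half)]
    for img, (y, x) in zip(imgs, corners):
        if random.random() < 0.5:                      # random horizontal flip
            img = np.ascontiguousarray(img[:, ::-1])
        patch = cv2.resize(img, (half, half))          # scale into one quadrant
        canvas[y:y + half, x:x + half] = patch
    return canvas
```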

3.3. Backbone Improvements

As illustrated in Figure 6, the structure of the Swin Transformer is similar to a CNN, and it can be added to other models as a backbone network. The Swin Transformer is divided into four main stages that progressively reduce the resolution of the input image and expand the receptive field.
The model backbone network Swin Transformer Block, shown in Figure 7, mainly consists of a multi-layer perceptron (MLP), a window-based multi-head self-attention (W-MSA) layer, a shifted-window multi-head self-attention (SW-MSA) layer, and layer normalization (LN) [20]. The outputs of the successive components of the Swin Transformer backbone can therefore be written as:
$$\hat{Z}^{l} = \text{W-MSA}(\text{LN}(Z^{l-1})) + Z^{l-1}$$
$$Z^{l} = \text{MLP}(\text{LN}(\hat{Z}^{l})) + \hat{Z}^{l}$$
$$\hat{Z}^{l+1} = \text{SW-MSA}(\text{LN}(Z^{l})) + Z^{l}$$
$$Z^{l+1} = \text{MLP}(\text{LN}(\hat{Z}^{l+1})) + \hat{Z}^{l+1}$$
The input mine image is divided into 8 × 8 blocks, with each window covering 4 × 4 blocks; the division result is shown in Figure 8a. Windows within a layer do not overlap, and each window computes self-attention independently. Owing to the lack of information exchange between windows, the model's representational ability would otherwise be poor; SW-MSA therefore introduces sliding windows that shift and fill across regions, realizing information interaction between windows, with the effect shown in Figure 8b.
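Below is a minimal PyTorch sketch of the W-MSA window partition and the SW-MSA cyclic shift, following the Swin Transformer design [12]. The 8 × 8 token grid matches the example above; the channel count of 96 and the use of torch.roll for the shift are illustrative assumptions.

```python
import torch

def window_partition(x: torch.Tensor, ws: int) -> torch.Tensor:
    """Split a (B, H, W, C) feature map into non-overlapping (ws x ws) windows,
    so self-attention is computed only inside each window."""
    B, H, W, C = x.shape
    x = x.view(B, H // ws, ws, W // ws, ws, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)

def shifted(x: torch.Tensor, ws: int) -> torch.Tensor:
    """SW-MSA: cyclically shift the feature map by half a window before
    partitioning, so that adjacent windows exchange information."""
    return torch.roll(x, shifts=(-ws // 2, -ws // 2), dims=(1, 2))

feat = torch.randn(1, 8, 8, 96)         # 8 x 8 blocks, as in the example above
wins = window_partition(feat, ws=4)     # -> (4 windows, 16 tokens, 96 channels)
shift_wins = window_partition(shifted(feat, ws=4), ws=4)
```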
The Swin Transformer network structure is added to the YOLOv5 enhanced feature extraction network to form the enhanced feature extraction network of the ST-YOLOv5 detection model. The backbone feature extraction network outputs three effective feature layers, which are predicted by the output of the enhanced feature extraction network, and the structure is illustrated in Figure 9.

3.4. Training Prediction

The input image goes through the feature extraction network and outputs three feature layers, which are passed into the YOLO head to obtain the prediction results. The YOLO head is essentially a 3 × 3 convolution followed by a 1 × 1 convolution: the 3 × 3 convolution integrates the features, and the 1 × 1 convolution adjusts the number of channels.
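A minimal PyTorch sketch of such a head is given below: a 3 × 3 convolution for feature integration followed by a 1 × 1 convolution for channel adjustment. The channel counts, the BN + SiLU pairing, and the 20-class default are illustrative assumptions.

```python
import torch
import torch.nn as nn

def yolo_head(in_ch: int, num_anchors: int = 3, num_classes: int = 20) -> nn.Sequential:
    """3x3 conv integrates features; 1x1 conv maps to the prediction channels
    (4 box-regression values + 1 objectness + class scores, per anchor)."""
    out_ch = num_anchors * (5 + num_classes)
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1, bias=False),
        nn.BatchNorm2d(in_ch),
        nn.SiLU(),
        nn.Conv2d(in_ch, out_ch, kernel_size=1),
    )

head = yolo_head(256)
pred = head(torch.randn(1, 256, 80, 80))  # -> (1, 75, 80, 80) for 20 classes
```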
The prediction output of YOLOv5 involves a loss function, an optimization function, and non-maximum suppression (NMS). Intersection over Union (IoU) is a common loss measure of the overlap between the prediction box generated by the model and the ground-truth box [21]. Commonly used optimization functions are the SGD (stochastic gradient descent) and Adam (adaptive moment estimation) algorithms. The prediction results are sorted by score and filtered by non-maximum suppression, which is used in the post-processing stage of target detection, where many overlapping candidate boxes appear at the same target location. The GIoU loss solves the bounding-box regression problem when the ground-truth and predicted boxes do not overlap; the DIoU loss improves convergence speed by directly minimizing the distance between the centers of the ground-truth and predicted boxes; and the CIoU loss achieves the best performance by simultaneously considering three geometric measures, including the center distance between the ground-truth and predicted boxes. However, CIoU degrades to DIoU when the aspect ratio of the ground-truth box equals that of the predicted box. Therefore, based on the CIoU algorithm, the weight exponent α is introduced, and alpha-IoU [22] replaces CIoU as the bounding-box loss function, calculated as
$$I' = I^{\alpha} - \frac{\rho^{2\alpha}(b, b^{gt})}{d^{2\alpha}} - (\beta\gamma)^{\alpha}$$
$$l = 1 - I'$$
where I′ and I are the alpha-IoU and IoU values, respectively; ρ(b, b^gt) is the Euclidean distance between the center point b of the predicted box and the center point b^gt of the ground-truth box; d denotes the diagonal length of the smallest closed region containing the predicted and ground-truth boxes; β is the trade-off parameter; γ measures the consistency of the boxes' aspect ratios; and l is the loss value. By adaptively adjusting the weight exponent α according to the target's IoU value, the YOLO detector learns the IoU target faster and improves detection accuracy.
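The following is a minimal PyTorch sketch of this alpha-IoU bounding-box loss. Boxes are assumed to be (x1, y1, x2, y2); the βγ aspect-ratio terms follow the CIoU definitions, and α = 3 (the value suggested in [22]) is an assumption here, since the paper does not state its α.

```python
import math
import torch

def alpha_ciou_loss(pred: torch.Tensor, gt: torch.Tensor,
                    alpha: float = 3.0, eps: float = 1e-7) -> torch.Tensor:
    """l = 1 - [ I^a - rho^{2a}/d^{2a} - (beta*gamma)^a ]; boxes as (N, 4) xyxy."""
    # Intersection and union -> IoU
    iw = (torch.min(pred[:, 2], gt[:, 2]) - torch.max(pred[:, 0], gt[:, 0])).clamp(0)
    ih = (torch.min(pred[:, 3], gt[:, 3]) - torch.max(pred[:, 1], gt[:, 1])).clamp(0)
    inter = iw * ih
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_g = (gt[:, 2] - gt[:, 0]) * (gt[:, 3] - gt[:, 1])
    iou = inter / (area_p + area_g - inter + eps)
    # Squared center distance rho^2 and enclosing-box diagonal d^2
    rho2 = ((pred[:, 0] + pred[:, 2] - gt[:, 0] - gt[:, 2]) ** 2 +
            (pred[:, 1] + pred[:, 3] - gt[:, 1] - gt[:, 3]) ** 2) / 4
    cw = torch.max(pred[:, 2], gt[:, 2]) - torch.min(pred[:, 0], gt[:, 0])
    ch = torch.max(pred[:, 3], gt[:, 3]) - torch.min(pred[:, 1], gt[:, 1])
    d2 = cw ** 2 + ch ** 2 + eps
    # Aspect-ratio consistency gamma and trade-off beta (CIoU definitions)
    wp, hp = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
    wg, hg = gt[:, 2] - gt[:, 0], gt[:, 3] - gt[:, 1]
    gamma = (4 / math.pi ** 2) * (torch.atan(wg / (hg + eps)) -
                                  torch.atan(wp / (hp + eps))) ** 2
    with torch.no_grad():
        beta = gamma / (1 - iou + gamma + eps)
    i_alpha = iou ** alpha - (rho2 / d2) ** alpha - (beta * gamma) ** alpha
    return 1 - i_alpha
```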
The SGD optimization algorithm converges slowly; to accelerate convergence, updates should be enlarged along gradient directions that change little across iterations and damped along directions that change greatly. In this paper, the Adam optimization algorithm [23], which converges well, was used. Specifically, the algorithm is formulated by
$$g_i = \nabla_{\theta} J(\theta_{i-1})$$
$$m_i = \beta_1 m_{i-1} + (1 - \beta_1) g_i$$
$$v_i = \beta_2 v_{i-1} + (1 - \beta_2) g_i^2$$
where gi is the gradient; β1 denotes the exponential decay rate of the first-order moment, set to 0.9; and β2 denotes the exponential decay rate of the second-order moment (the weighted mean of the squared gradient), defaulting to 0.99. mi and vi are the first-order and second-order moments of the gradient gi, respectively.
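As a worked illustration of these update rules, here is a minimal NumPy sketch of one Adam step. The bias-correction terms and parameter update are part of the original Adam algorithm [23] even though they are not shown above; the step index i starts at 1. In practice, torch.optim.Adam implements this.

```python
import numpy as np

def adam_step(theta, grad, m, v, i, lr=1e-3, beta1=0.9, beta2=0.99, eps=1e-8):
    """One Adam update: exponential moving averages of the gradient (m) and
    squared gradient (v), bias correction, then a scaled parameter step."""
    m = beta1 * m + (1 - beta1) * grad          # first-order moment m_i
    v = beta2 * v + (1 - beta2) * grad ** 2     # second-order moment v_i
    m_hat = m / (1 - beta1 ** i)                # bias-corrected moments (i >= 1)
    v_hat = v / (1 - beta2 ** i)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```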

4. Experiments

The flow of object detection based on the ST-YOLOv5 model is shown in Figure 10. The input is a mine surveillance image or video (video is extracted frame by frame), and the input image size is adjusted to prevent distortion. Each feature layer provides three prior boxes for each feature point, and the regression parameters of each feature point must be evaluated. The regression parameters comprise the center coordinates and the width and height of the generated a priori box; the remaining parameters are the judgment of whether each feature point contains a detected object (the object confidence) and the judgment of the kind of object contained (the class confidence). All parameters of the a priori box were normalized to the range 0–1 by the sigmoid function to facilitate the subsequent matching operation. The data were fed into the network for prediction, and the prediction results were decoded to obtain the position information of the prediction box: the coordinates of its top-left and bottom-right points, the object confidence, and the class confidence. Finally, non-maximum suppression filtering was performed, and the final results were obtained by drawing the retained boxes.
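A simplified sketch of this post-processing is shown below: scores are formed from objectness × class confidence, low-confidence candidates are filtered, and torchvision's NMS keeps the best box per target. The confidence and IoU thresholds are illustrative assumptions.

```python
import torch
from torchvision.ops import nms

def postprocess(boxes, obj_conf, cls_conf, conf_thr=0.25, iou_thr=0.45):
    """boxes: (N, 4) as (x1, y1, x2, y2); obj_conf: (N,); cls_conf: (N, C)."""
    cls_score, cls_id = cls_conf.max(dim=1)
    scores = obj_conf * cls_score              # combined confidence per box
    keep = scores > conf_thr                   # drop low-confidence candidates
    boxes, scores, cls_id = boxes[keep], scores[keep], cls_id[keep]
    idx = nms(boxes, scores, iou_thr)          # suppress overlapping duplicates
    return boxes[idx], scores[idx], cls_id[idx]
```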

4.1. Data Acquisition

Considering the many different scenes in coal mines, it is necessary to collect and build a dataset suitable for multiple scenes, applicable both to daily coal mine monitoring and to the edge equipment underground. The dataset combined part of the VOC2012 + 2007 dataset with coal mine image data: the input video was read frame by frame and converted to .jpg format, and a total of 5000 images were collected (3500 VOC + 1500 mine images). The image data were classified and annotated using the makesense.ai dataset creation tool and divided into training, validation, and test sets in an 8:1:1 ratio.
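A minimal sketch of the 8:1:1 split is given below; the file layout and random seed are illustrative assumptions.

```python
import random
from pathlib import Path

def split_dataset(image_dir: str, seed: int = 0):
    """Shuffle image paths and split them 8:1:1 into train/val/test lists."""
    paths = sorted(Path(image_dir).glob("*.jpg"))
    random.Random(seed).shuffle(paths)
    n = len(paths)
    n_train, n_val = int(0.8 * n), int(0.1 * n)
    return (paths[:n_train],
            paths[n_train:n_train + n_val],
            paths[n_train + n_val:])

# Example call (directory is illustrative):
# train, val, test = split_dataset("dataset/images")
```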

4.2. Testing and Analysis

In this paper, the improved ST-YOLOv5 model and the YOLOv5 model were deployed in the cloud for training; the algorithms were implemented in the deep learning framework PyTorch with NVIDIA CUDA version 11.3. The training batch size was set to 16, the learning rate to 1 × 10−3, and the number of epochs to 300. To improve the generalization performance of the model, avoid overfitting, and improve detection accuracy, label smoothing was used during training to reduce absolute trust in the dataset's classification labels. The models were evaluated during training, and the loss function curves are shown in Figure 11a,b. The models were evaluated after every 10 rounds of training; the decrease in loss value leveled off after 210 rounds and the loss curves converged, with a final loss value of about 0.035 for the ST-YOLOv5 model and 0.089 for the YOLOv5 model.
With the IoU threshold set to 0.5, the mAP curves after training are shown in Figure 12. The ST-YOLOv5 curve flattens out after 100 rounds of training with a final value of about 0.878, while the YOLOv5 curve flattens out after 250 rounds with a final value of about 0.704.
To further verify the performance of the algorithm, the loss values, and detection classification effects of the improved ST-YOLOv5 and YOLOv5, YOLOv4, and YOLOv4-tiny were compared, as shown in Figure 13.
From Figure 13, it can be seen that the algorithm proposed in this paper performs better compared to the original YOLOv5 algorithm, the previous generation YOLOv4 algorithm, and the YOLOv4-tiny algorithm applicable to edge computing.
The trained ST-YOLOv5 model and the base YOLOv5 model were deployed on the PC side, and the detection results are compared in Figure 14. The detection accuracy of the ST-YOLOv5 target detection algorithm was 0.84 and 0.86 for the underground and ground scenes, respectively. In contrast, the initial YOLOv5 algorithm had low detection accuracy for the same scenes: owing to the light intensity, the target was not detected underground, and the ground accuracy was 0.77. This comparison shows that the ST-YOLOv5 model has higher detection accuracy and better detection performance.
Based on the YOLOv4-tiny model, two improved algorithms, ECA-YOLOv4-tiny and CBAM-YOLOv4-tiny, were formed by adding different attention mechanisms to the improved part of the backbone network, and their performance was compared with the detection model in this paper. The different network models were trained separately, and the trained models were deployed on the PC. Network performance was measured by the mean average precision (mAP), the frame rate (FPS) of video detection, and the single-frame image detection time (speed). The experiments used public videos with high frame rates, and the results are shown in Table 2.
The table shows that the ST-YOLOv5 model leads in average detection accuracy, frame rate, and detection speed. Compared with the original YOLOv5 algorithm, average detection accuracy improved by 19.8%, the video detection frame rate by 11.7%, and the detection speed by 16.9%. Combining the three measures, when the differences in average detection accuracy are not large, the ST-YOLOv5 detection model works best once the detection speed of edge devices is taken into account, improving the real-time responsiveness of the system.
The size of the whole model is 27.057 MB, which suits memory-constrained edge-side deployment. Test data were collected on the PC side, and each frame of surveillance video was detected by the trained network model transmitted from the workstation to the edge side. To verify the edge-cloud collaboration architecture, frame-rate analysis experiments on video data detection were conducted. The video data of different scenes were transmitted to the cloud and detected using the trained network model, with the results saved; meanwhile, surveillance videos were processed scene by scene in real-time at the edge. The target detection results under the same network environment are shown in Figure 15.
The detection results show that, for monitoring areas with good lighting and large detection targets, both the edge and the cloud detect well. In surveillance areas with poor lighting, occlusion, and smaller detection targets, cloud detection and recognition remain good, but the edge fails to detect; it can be inferred that the cloud-side model can solve the missed detections caused by lighting and occlusion. The cloud detection frame rate is 49.02 fps and the edge detection frame rate is 29.39 fps, a modest difference between real-time video detection deployed at the edge and in the cloud. The simulation compared the frame rate (FPS) of the same video detection and the processing-task latency under edge-cloud collaboration, cloud-only, and edge-only processing; the comparison results are shown in Table 3.
The table shows that the detection frame rate of edge-cloud collaborative processing differs little from that of cloud processing, and both are better than the edge-only frame rate. In terms of task processing latency, the edge-cloud collaborative architecture reduces surveillance video processing latency significantly.

5. Conclusions

We present a novel ECViST model for mine intelligent monitoring that employs edge computing and machine learning and is jointly trained with a YOLOv5 network. The model utilizes the edge-cloud collaborative architecture: the edge is responsible for real-time detection and analysis, reducing video transmission delay while ensuring detection accuracy and speed, and the cloud is responsible for transmitting the model to the edge, continuously updating it, and storing the detection data from the edge. This solves the high latency, network congestion, memory, and computing pressure problems of traditional cloud-based manual video surveillance in typical application scenarios, such as weak lighting and occlusion, which verifies the feasibility of the designed architecture.
Based on the YOLOv5 model, the Swin Transformer network is added to the feature extraction part while the overall network modules are improved; this reduces the model's computation while improving its information extraction capability, solving the problems of large computation and low detection accuracy of transformers in the field of target detection. For the training datasets, mosaic enhancement is performed to enrich the data. Overall, the model mitigates the small-target detection problem of the YOLO series, obtains more accurate location information and underlying feature information, and improves the detection accuracy and robustness of the edge computing detection model. The experiments demonstrate that ViST-YOLOv5 is superior to the state-of-the-art object detection algorithms in average detection accuracy, video detection frame rate, and single-frame detection speed; in particular, average detection accuracy is improved by 25%.

Author Contributions

Methodology, J.T. and F.Z.; software, J.T. and F.Z.; validation, J.T. and F.Z.; formal analysis, Y.L.; investigation, J.W.; data curation, G.L.; writing—original draft, J.T. and F.Z.; writing—review & editing, J.T. and F.Z.; visualization, J.T. and F.Z.; Project administration, J.T. and F.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by National Key R&D Program of China (2022YFC3004600) and the NSFC project No.52121003.

Data Availability Statement

The data are available on request, subject to restrictions (e.g., privacy or ethical restrictions).

Acknowledgments

The study was approved by the China University of Mining and Technology (Beijing).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Li, X.; Li, B.; Liu, Z.P. Underground video pedestrian detection method. J. Mine Autom. 2020, 46, 54–58. [Google Scholar] [CrossRef]
  2. Xu, Z.; Li, J.; Zhang, M. A Surveillance Video Real-Time Analysis System Based on Edge-Cloud and FL-YOLO Cooperation in Coal Mine. IEEE Access 2021, 9, 68482–68497. [Google Scholar] [CrossRef]
  3. Shi, W.S.; Zhang, X.Z.; Wang, Y.F.; Zhang, Q.Y. Edge Computing: State-of-the-Art and Future Directions. J. Comput. Res. Dev. 2019, 56, 69–89. [Google Scholar]
  4. Zhang, F.; Xu, Z.; Chen, W.; Zhang, Z.; Zhong, H.; Luan, J.; Li, C. An Image Compression Method for Video Surveillance System in Underground Mines Based on Residual Networks and Discrete Wavelet Transform. Electronics 2019, 8, 1559. [Google Scholar] [CrossRef]
  5. Tan, Z.L.; Wu, Q.; Xiao, Y.X. Research on information visualization of smart mine. J. Mine Autom. 2020, 46, 26–31. [Google Scholar] [CrossRef]
  6. Zhang, F.; Ge, S.R. Construction method and evolution mechanism of mine digital twins. J. China Coal Soc. 2022, 1, 1–13. [Google Scholar] [CrossRef]
  7. Qu, S.J.; Wu, F.S.; He, Y. Research on edge computing mode in coal mine safety monitoring and control system. Coal Sci. Technol. 2022, 1, 1–8. [Google Scholar]
  8. Wang, H.; He, M.; Zhang, Z.; Zhu, J. Determination of the constant mi in the Hoek-Brown criterion of rock based on drilling parameters. Int. J. Min. Sci. Technol. 2022, 32, 747–759. [Google Scholar] [CrossRef]
  9. Xu, Y.J.; Li, C. Light-weight Object Detection Network Optimized Based on YOLO Family. Comput. Sci. 2021, 48, 265–269. [Google Scholar]
  10. Li, T.; Wang, H.; Li, G.; Liu, S.; Tang, L. SwinF: Swin Transformer with feature fusion in target detection. J. Physics Conf. Ser. 2022, 2284, 012027. [Google Scholar] [CrossRef]
  11. He, M.; Zhou, J.; Li, P.; Yang, B.; Wang, H.; Wang, J. Novel approach to predicting the spatial distribution of the hydraulic conductivity of a rock mass using convolutional neural networks. Q. J. Eng. Geol. Hydrogeol. 2022, 1, 17–25. [Google Scholar] [CrossRef]
  12. Liang, J.; Cao, J.; Sun, G.; Zhang, K.; van Gool, L.; Timofte, R. SwinIR: Image Restoration Using Swin Transformer. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), Montreal, BC, Canada, 11–17 October 2021; Volume 17, pp. 1833–1844. [Google Scholar] [CrossRef]
  13. Qu, S.J.; Wu, F.S. Study on Environmental Safety Monitoring and Early Warning Method of Intelligent Working Face for Coal Mines. Saf. Coal Mines 2020, 51, 132–135. [Google Scholar] [CrossRef]
  14. Wang, L.; Wu, C.K.; Fan, W.H. A Survey of Edge Computing Resource Allocation and Task Scheduling Optimization. J. Syst. Simul. 2021, 33, 509–520. [Google Scholar] [CrossRef]
  15. Zhou, H.; Wan, W.G. Task scheduling strategy of edge computing system. Electr. Meas. Technol. 2020, 43, 99–103. [Google Scholar] [CrossRef]
  16. Zhu, X.J.; Zhang, H. Research on task allocation of edge computing in intelligent coal mine. J. Mine Autom. 2021, 47, 32–39. [Google Scholar] [CrossRef]
  17. Fang, L.; Ge, C.; Zu, G.; Wang, X.; Ding, W.; Xiao, C.; Zhao, L. A Mobile Edge Computing Architecture for Safety in Mining Industry. In Proceedings of the 2019 IEEE SmartWorld, Ubiquitous Intelligence & Computing, Advanced & Trusted Computing, Scalable Computing & Communications, Cloud & Big Data Computing, Internet of People and Smart City Innovation (SmartWorld/SCALCOM/UIC/ATC/CBDCom/IOP/SCI), Leicester, UK, 19–23 August 2019; Volume 25, pp. 1494–1498. [Google Scholar] [CrossRef]
  18. Sun, L.; Li, Z.; Lv, J.; Wang, C.; Wang, Y.; Chen, L.; He, D. Edge computing task scheduling strategy based on load balancing. MATEC Web Conf. 2020, 309, 03025. [Google Scholar] [CrossRef]
  19. Hu, J.P.; Li, Z.; Huang, H.Q.; Hong, T.S.; Jiang, S.; Zeng, J.Y. Citrus psyllid detection based on improved YOLOv4-Tiny model. Trans. Chin. Soc. Agric. Eng. 2021, 37, 197–203. [Google Scholar] [CrossRef]
  20. Ye, M.L.; Zhou, H.Y.; Wang, F. Forest fire detection algorithm based on an improved Swin Transformer. J. Cent. South Univers. For. Technol. 2022, 42, 101–110. [Google Scholar] [CrossRef]
  21. Jiang, B.; Luo, R.; Mao, J.; Xiao, T.; Jiang, Y. Acquisition of Localization Confidence for Accurate Object Detection. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 816–832. [Google Scholar] [CrossRef]
  22. He, J.B.; Erfani, S.; Ma, X.; Bailey, J.; Chi, Y.; Hua, X.-S. Alpha-IoU: A Family of Power Intersection over Union Losses for Bounding Box Regression. Adv. Neural Inf. Process. Syst. 2021, 1, 28–35. [Google Scholar]
  23. Kingma, D.P.; Ba, J.L. Adam: A Method for Stochastic Optimization. In Proceedings of the 3rd International Conference for Learning Representations, San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
Figure 1. Overall architecture of mine monitoring system.
Figure 2. System architecture based on edge computing.
Figure 3. The task-offloading edge computing algorithm model.
Figure 4. YOLOv5 backbone feature extraction network.
Figure 5. Mosaic data enhancement effect.
Figure 6. Swin Transformer network structure.
Figure 7. Swin Transformer Block.
Figure 8. The division window effect. (a) W-MSA division window. (b) SW-MSA division window.
Figure 9. ST-YOLOv5 enhanced feature extraction network.
Figure 10. ST-YOLOv5 model object detection process.
Figure 11. Comparison of different models' training loss functions. (a) ST-YOLOv5 model training loss function. (b) YOLOv5 model training loss function.
Figure 12. Comparison of mAP between ST-YOLOv5 and YOLOv5. (a) ST-YOLOv5 model mAP curve. (b) YOLOv5 model mAP curve.
Figure 13. Comparison of different algorithms. (a) Comparison of loss values. (b) Comparison of detection and classification effects.
Figure 14. Comparison of results between ST-YOLOv5 and YOLOv5. (a) ST-YOLOv5 model detection. (b) YOLOv5 model detection.
Figure 15. Cloud and edge-side detection results. (a) Cloud detection. (b) Edge-side detection.
Table 1. Parameter specifications for cloud-based and edge-based devices.

Equipment Name      | Role      | CPU                     | GPU     | Memory | Hard Drive
Dell Precision 7920 | Cloud     | Intel Xeon Silver 4215R | RTX3080 | 64 GB  | 1 TB
PC                  | Edge side | Intel Core i7           | GTX1080 | 16 GB  | 512 GB

Table 2. Training and testing results of the improved model.

Model            | mAP/% | FPS/(frame/s) | Speed/ms
ST-YOLOv5 (ours) | 87.8  | 113.56        | 8.55
CBAM-YOLOv4-tiny | 73.1  | 107.55        | 9.96
ECA-YOLOv4-tiny  | 72.4  | 115.56        | 9.70
YOLOv5           | 70.4  | 100.28        | 10.29
YOLOv4-tiny      | 65.8  | 80.86         | 12.37

Table 3. Detection frame rates for different processing architectures.

Processing Architecture         | FPS/(frame/s) | Time Delay/ms
Edge-Cloud Collaboration (ours) | 100.58        | 59.68
Cloud                           | 103.55        | 103.25
Edge                            | 89.52         | -
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

