Intelligent Mining Road Object Detection Based on Multiscale Feature Fusion in Multi-UAV Networks

Xu, Xinkai; Zhao, Shuaihe; Xu, Cheng; Wang, Zhuang; Zheng, Ying; Qian, Xu; Bao, Hong

doi:10.3390/drones7040250

Open AccessArticle

Intelligent Mining Road Object Detection Based on Multiscale Feature Fusion in Multi-UAV Networks

by

Xinkai Xu

^1,2

,

Shuaihe Zhao

^3,4

,

Cheng Xu

^2,*

,

Zhuang Wang

²,

Ying Zheng

¹,

Xu Qian

¹ and

Hong Bao

²

¹

School of Mechanical Electronic & Information Engineering, China University of Mining & Technology-Beijing, Beijing 100083, China

²

Beijing Key Laboratory of Information Service Engineering, Beijing Union University, Beijing 100101, China

³

The School of Automation, Beijing Institute of Technology, Beijing 100081, China

⁴

Aerospace Shenzhou Aerial Vehicle Ltd., Tianjin 300301, China

^*

Author to whom correspondence should be addressed.

Drones 2023, 7(4), 250; https://doi.org/10.3390/drones7040250

Submission received: 28 February 2023 / Revised: 31 March 2023 / Accepted: 3 April 2023 / Published: 5 April 2023

(This article belongs to the Special Issue Multi-UAV Networks)

Download

Browse Figures

Versions Notes

Abstract

:

In complex mining environments, driverless mining trucks are required to cooperate with multiple intelligent systems. They must perform obstacle avoidance based on factors such as the site road width, obstacle type, vehicle body movement state, and ground concavity-convexity. Targeting the open-pit mining area, this paper proposes an intelligent mining road object detection (IMOD) model developed using a 5G-multi-UAV and a deep learning approach. The IMOD model employs data sensors to monitor surface data in real time within a multisystem collaborative 5G network. The model transmits data to various intelligent systems and edge devices in real time, and the unmanned mining card constructs the driving area on the fly. The IMOD model utilizes a convolutional neural network to identify obstacles in front of driverless mining trucks in real time, optimizing multisystem collaborative control and driverless mining truck scheduling based on obstacle data. Multiple systems cooperate to maneuver around obstacles, including avoiding static obstacles, such as standing and lying dummies, empty oil drums, and vehicles; continuously avoiding multiple obstacles; and avoiding dynamic obstacles such as walking people and moving vehicles. For this study, we independently collected and constructed an obstacle image dataset specific to the mining area, and experimental tests and analyses reveal that the IMOD model maintains a smooth route and stable vehicle movement attitude, ensuring the safety of driverless mining trucks as well as of personnel and equipment in the mining area. The ablation and robustness experiments demonstrate that the IMOD model outperforms the unmodified YOLOv5 model, with an average improvement of approximately 9.4% across multiple performance measures. Additionally, compared with other algorithms, this model shows significant performance improvements.

Keywords:

multisystem collaboration; 5G-multi-UAV systems; multiscale feature fusion; pyramid model

1. Introduction

Smart mines integrate modern information, control technology, and mining technology to achieve the goals of efficiency, safety, and environmental protection. The use of automatic driving applications in mining trucks reduces manual demand in key production links at open-pit mines while promoting efficient collaboration. This development is supported by 5G and other networking technologies that enable all-around communication between vehicles, roads, and management platforms, with the ultimate goal of optimizing mine operations [1]. The four scenarios for 5G automatic driving in mine operations are loading, transportation, unloading, and operational support. To effectively realize these processes while improving accuracy and efficiency, there are requirements of remote control driving systems for mining trucks as well as cooperation among various construction machinery through the integration of cutting-edge multisystems aimed at resolving the travel path planning issues specifically associated with mining truck movements.

Internationally, there is continuing growth in the demand for industrial-grade drones, which are widely used in numerous industries. With the high growth in demand for logistics and transport expected in the coming years, the drone transport market holds great promise but also faces risks and challenges. Unmanned aerial systems (UASs) are versatile, advanced high-tech equipment with extensive scientific, social, and strategic applications that have the potential to trigger transformative industrial and societal change. UAS equipment is widely used in a variety of scenarios, such as pesticide spraying, courier transport delivery, video filming, and aerial inspection and monitoring, resulting in greater convenience in terms of human labor.

Worldwide, the development of civil drones is still in its early stages, but countries around the globe recognize the potential of drones in both military and civilian applications, leading to the issuance of numerous policies and regulatory documents [2]. In the realm of intelligent mining, advancements have been made from truck intelligence to collaborative intelligence through the integration of vehicle-road-cloud superagents [3]. This innovative approach allows for autonomous networked intelligent integration, unlocking limitless possibilities for networked systems to overcome traditional constraints on time and space interaction within open-pit mining areas.

Currently, 5G-multi-UAV applications remain at a nascent stage within open-pit mining areas, including auxiliary driving and early warning services, among others. The next phase will involve more complex applications serving not only L4 automatic driving but also manned driving for different levels of automatic driving scenarios [4].

In Figure 1, the 5G network is shown as an integrated vehicle-road-cloud scenario. This paper proposes the IMOD model, which combines 5G-multi-UAV and deep learning methods for open-pit mining environments. The IMOD model uses data sensors to monitor surface data in real time and transmit them to driverless mining trucks. The unmanned mine truck utilizes a deep learning algorithm to identify obstacles in real time and build a driving area that avoids static, dynamic, and multiple continuous obstacles. The IMOD model ensures the safety of driverless vehicles, personnel, and equipment by ensuring the smooth running of vehicles while avoiding obstacles.

The contributions of this paper include: (1) proposing an IMOD automatic driving model based on 5G-multi-UAV for enhancing safety in mining areas; (2) constructing an obstacle image dataset through field collection and manual marking; and (3) improving multiscale obstacle detection capabilities through cross-modal data fusion.

The second section presents related work, followed by a presentation of the IMOD autopilot model based on 5G-multi-UAV in Section 3. In Section 4, we provide experimental analysis results, followed by our conclusions in Section 5.

2. Related Work

2.1. Multisystem Collaboration Scenarios and Applications in Open-Pit Mines

Automated driving in open-pit mines continues to adhere to the standard production workflow of drilling, blasting, mining, transportation, and discharging [5]. Considering the process of mining transport operations and platooning, automatic driving scenarios in mines can be classified into three categories: loading, transporting, and unloading. Additionally, there are maintenance support scenarios, such as refueling and water replenishment, that facilitate the aforementioned operational processes. To implement intelligent networked automatic driving applications effectively, the realization of remote control driving capabilities for mining trucks is required. Moreover, there is a demand for seamless operation coordination between these trucks and other construction machinery as well as accurate planning of their travel path to ensure safe autonomous operation.

In this scenario, an unmanned mining truck travels to a loading point, where it receives payloads from excavators, shovels, and other equipment. The entire process involves communication between the mining trucks, equipment, and cloud platforms during entry, loading, and transportation to the designated destination along with providing updates such as on position/speed/direction/acceleration. Loading equipment also sends positional and directional information for efficient operations.

During abnormal conditions, triggering an emergency brake feature that initiates remote control mode provides increased safety measures by triggering alarms alerting excavators about any danger. Cloud-based systems provide troubleshooting assistance, resolving issues experienced during material transport.

For autonomous driving in mining trucks, a cloud platform is utilized for planning paths while integrating environmental information. The vehicle interacts with other vehicles/equipment/cloud platforms, ensuring safe driving via functions such as forward collision warning and over-the-horizon perception while being capable of emergency braking followed by remote takeover, if required.

The unloading process requires communication/cooperation among various pieces of equipment (bulldozers/loaders/cloud platforms). In using planned tasks/paths, assistance is provided by the surrounding environment perception resulting in real-time status/information exchange between the mining truck/unloading equipment, leading to efficient cooperation. Again, it should be capable of emergency braking followed by a remote takeover in case of any abnormal situations.

Finally, refueling/water replenishment/maintenance tasks require organizing maintenance or overhaul tasks by the cloud platform when necessary, with support task execution planning via coordination between the truck/platform detecting faults or insufficient oil/water using planned paths periodically broadcasting real-time status/task information with availability for remote takeover in case of abnormalities.

2.2. 5G-Based Multi-UAV Collaboration Technology in Mining Areas

Using complex networks such as 5G/4G/MEC and V2X, along with cloud computing, big data, and artificial intelligence technologies, we can achieve ubiquitous network connectivity between vehicles, roads, people, and cloud service platforms. This enables environmental awareness and integrated computing while allowing for decision control across end-users, road management systems, and cloud architectures. These advancements provide safer, more efficient, and intelligent solutions for automatic vehicle driving as well as traffic optimization services that promote green practices [6].

The integration of 5G and multi-UAV facilitates business operations by primarily focusing on three major points: network convergence to connect vehicles and roads to the 5G network, data fusion through MEC for processing interactive or roadside sensing data, and business integration by syncing roadside sensing results with the cloud while supporting on-demand, multicast, and roadside broadcast for both 5G and multi-UAV. This collaboration between vehicles, roads, and clouds is a crucial part of the overall business process.

To enhance infrastructure services as well as network connections, this method proposes utilizing human vehicle roads as service objects while incorporating edge clouds for digital infrastructure support. The mining area multisystem cooperation approach based on 5G-multi-UAV provides redundant information service across multiple channels along with high-speed slicing interconnection and collaborative perception for the vehicle-road-cloud integration; this involves end-to-end communication protection paired with high-precision positioning services, resulting in a unified service system including one network platform that caters to various terminals in different scenarios.

Platform: Establishing a 5G vehicle-road collaborative service platform to accelerate scene innovation through technological advancements and open standard interfaces.

Map: Integrating high-precision maps with applications to improve driving safety, offering unified basic coordinates and supporting environmental perception assistance, path planning, and decision-making with real-time updates on the edge map information.

Network: Creating a 5G-multi-UAV wireless scene library for exploring optimal access experience for various scenarios. Intelligent network connection hierarchical services cater to the differentiated needs of different IoV businesses based on demand awareness [7].

Roadside: Building multisource fusion sensing platforms across all-weather self-developed vehicle-roadside-cloud systems facilitating low delay/high-efficiency transmission of roadside sensor data from vehicles, enabling supercomputing fusion processing at MEC-edge cloud nodes via collaborative communication computing methodologies.

Application: Utilizing cloud-side collaborations via 5G + edge cloud technology providing early vehicle-road collaboration warning services to realize multi-UAV collaboration defined under IoV’s scene service logo defining unicast/multicast broadcast services.

The implementation of drones as a flying platform poses difficulties due to their unique operating characteristics. Communication and safety risks are also high. However, various overhead angles for aerial photography can be achieved through drone usage. In the field of security, drones have been employed in dispatching security communications, commanding assisted patrols, and performing line planning in cases involving electricity. Drones are also widely used in consulting, resource exploration, urban planning, and other areas.

When it comes to secure communication with drones, it is important to pay attention to physical layer security technology, as conventional techniques do not guarantee secure transmission against brute-force cracking attempts [8]. Physical layer security techniques that can be adopted for UAVs include beamforming, artificial noise (AN), power allocation, and cooperative interference, among others.

2.3. Obstacle Detection for Unmanned Mine Trucks

With the remarkable advancements in GPU performance, many scholars worldwide have recently utilized deep learning methods for target detection. The design principle of convolutional neural networks is to simulate synapses of brain neurons by establishing different characteristic neuron links. With regard to target detection, these networks focus on developing different levels of receptive fields and high-dimensional feature information through multiscale feature extraction and classification via a multilevel cascade design. Region-CNN is the first algorithm that successfully applied deep learning to target detection [9]. It uses CNNs to extract features from region proposals and then performs SVM classification and bbox regression, resulting in greater accuracy than traditional methods. Its successor, fast R-CNN, utilizes shared convolution features for improved detection efficiency while also reducing computational complexity with high accuracy [10]. Another study [11] proposed an engineering vehicle detection algorithm based on faster R-CNN, which adjusts the position of ROI pooling layers and adds a convolution layer in the feature classification part, thereby enhancing model accuracy. Furthermore, ref. [12] introduces several differently sized RPNs into traditional faster R-CNN structures, allowing for larger vehicle detection, whilst [13], building upon faster R-CNN, improved object feature extraction by combining multilayered feedforwarding alongside the output of each context layer, enriching robustness against smaller or occluded targets that may confound other models. In the literature [14], there is a suggestion for improving domain adaptive fast R-CNN algorithms by refining their respective region proposal network (RPN) configuration; multiscale training helps mine difficult samples during secondary training, leading toward expanded small-target capability-albeit at a considerable computational expense.

Given the slow execution speed of the R-CNN algorithm, the one-stage detection method pioneered in [15] creatively integrated candidate frame extraction and feature classification, developing several versions [16]. The you only look once (YOLO) target detection algorithm enables higher accuracy and faster detection. Another example [17] improved YOLOv4 by adjusting the size of the detection layer for smaller objects and replacing the backbone with CSPLocknet-19, effectively achieving a good average accuracy (mAP) and frames per second (FPS) on low-cost edge hardware. In [18], an improved vehicle detection method using the YOLOv5 network under different traffic scenarios was proposed, utilizing flip-mosaic to enhance the perception of small targets, thereby increasing the accuracy of detection while reducing false positives. By adding an SSH module after YOLOv7’s pafpn structure to merge context information, the small object detection ability was improved. In recent years, the one-stage target detection algorithm has gained popularity across various fields due to its strong generalization performance and fast processing.

The open-pit mining environment is complex and dynamic, necessitating the use of a target detection algorithm based on convolutional neural networks. The selected algorithm must satisfy the real-time and high-precision requirements for obstacle detection by unmanned mining trucks. After analyzing the existing algorithms, we selected the YOLOv5 network based on its suitability for detecting obstacles within an open-pit mining area [19]. Obstacle detection using only two-dimensional image methods often produces inaccurate distance information; multisensor fusion that combines stereo vision with laser radar can provide more accurate results. Currently, YOLOv5 and YOLOX algorithms demonstrate favorable obstacle detection performance. Notably, YOLOV5-s and YOLOX-s are lightweight models recommended for mobile deployment devices but still require improvement in detecting occluded and small-scale targets.

Automatic driving solutions in the field of mining primarily comprise three modules: a central control system, automatic driving trucks, and other engineering vehicle coordination kits. In cloud-based remote monitoring, the scheduling platform serves as the central control system, whereas automatic driving trucks operate with perception abilities to make intelligent decisions automatically, thereby reducing operational costs while increasing transportation efficiency through improved safety measures between vehicles such as excavator coordination kits or other vehicle support tools. With the integration of these solutions into practices such as travel path planning during onsite applications or collaborative management activities located within various sections throughout the mine site, significant improvements have been made toward safer operations for lower overall cost.

3. 5G-Multi-UAV-Based IMOD Autonomous Driving Model

The complex and variable background information present on unstructured roads within open-pit mining areas, coupled with the varying size and characteristics of obstacles as well as dependence on natural lighting conditions, resulting in frequent changes to road information, pose a challenge for obstacle detection [20]. Although such roads are typically simple pavements, frequently being traversed by heavy-duty trucks increases the likelihood of pavement damage and deformation.

The one-stage target detection algorithm YOLOv5 utilizes mosaic image enhancement and adaptive anchor frame calculation at its input. Its backbone network integrates focus and CSP structures, whereas the neck module uses FPN structures to enhance semantic information across different scales. The PAN structure fosters location awareness across these scales, thereby improving multiscale target detection performance. The lightweight YOLOv5-s and YOLOX-s target detection algorithms have strong performance in detecting targets across multiple scales and are well-suited for deployment on resource-constrained devices.

Based on the 5G-multi-UAV architecture, Figure 2 displays the IMOD collaborative system for a mining scene. Using the YOLOv5-s network structure as a basic framework, this paper presents an IMOD obstacle target detection model that adapts feature fusion to address challenges related to high-density occlusion of targets, low detection accuracy, and miss rate in detecting small-scale targets within open-pit mining areas. The specific improvements are summarized as follows:

(1): To cope with the negative impact of adjacent scale feature fusion on models, we propose utilizing a feature fusion factor and improving the calculation method. By increasing effective samples post-fusion, this approach improves learning abilities toward small and medium-sized scale targets.
(2): To enhance the detection accuracy of smaller targets in open-pit mining areas, reinforcing shallow feature layer information extraction via added shallow detection layers is crucial.
(3): Adaptively selecting appropriate receptive field features during model training can help tackle insufficient feature information extraction in scenes containing vehicles and pedestrians with significant scaling changes. Therefore, an adaptive receptive field fusion module based on the concept of an RFB [21] network structure is proposed.
(4): For efficiently detecting dense small-scale targets with high occlusion, we introduce StrongFocalLoss as a loss function while incorporating the CA attention mechanism to alter model focus toward relevant features, resulting in improved algorithmic accuracy.

3.1. Effective Fusion of Adjacent Scale Features

In small-scale target detection tasks involving occluding vehicles and pedestrians, the difficulty lies in extracting feature information due to their limited scale. Shallow networks have limited capacity to learn such information, whereas deep networks fail to provide sufficient support for shallow networks, thus impacting the successful detection of small targets. The neck network of the YOLOv5-s algorithm utilizes the bidirectional feature pyramids and horizontal connection structures of PAFPN for different scale feature fusion. However, some scales exhibit large response values across adjacent feature maps, leading to the identification of only one layer during network learning based on rough response value estimation, resulting in poor detection accuracy and convergence effectiveness.

The challenge in multisystem collaborative target detection stems from variations in image scale, sparse distribution of targets, a high number of targets, and small target size. Balancing the computational demand for processing high-resolution UAV images with limited computing power presents additional difficulties. To address these problems, the IMOD model uses three layers of feature maps that differ in size to detect objects at different scales. The model employs YOLOv5 as its base feature network and leverages inter-layer connections to extract more semantically informative features that facilitate effective object recognition while minimizing interference information. Selective channel expansion used by the IMOD does not excessively impact the model’s parameter size and avoids unnecessary training operations, thereby maximizing detection accuracy while ensuring algorithmic speediness.

The effectiveness of the same sample can vary in characteristic maps of different scales. Deep and shallow layers contribute differently to target information at various scales, and their impact on other layers has both advantages and disadvantages. To alleviate the negative effects of feature fusion, it is necessary to adjust the participation rate of deep features in shallow feature learning by filtering out invalid samples during adjacent layer transmission. This ensures more effective samples are available for learning on deep feature maps, which improves detection performance for targets of different sizes. An improved adjacent scale feature fusion strategy is proposed here to address these challenges. The expression for the FPN’s feature fusion process is as follows:

P_{i} = f_{c o n v} (f_{l a t e r a l} (C_{i}) + a_{i}^{i + 1} * f_{u p s a m p l e} (P_{i + 1})) .

(1)

Here,

C_{i}

and

P_{i + 1}

signify the feature map of layer i prior to and after feature fusion, respectively. The term

f_{l a t e r a l}

represents the one-in-FPN horizontal connection by convolution operation, whereas

f_{u p s a m p l e}

denotes an operation that increases the resolution twofold. In addition,

f_{c o n v}

indicates a convolution operation for processing features, whereas

a_{i}^{i + 1}

signifies the factor for fusing features that should be multiplied when transferring layer

i + 1

feature maps into those of layer i.

This study derived its proposed feature fusion factors through statistical analysis with calculated corresponding target numbers for each layer using this formula:

a_{i}^{i + 1} = N_{P_{i + 1}}, / N_{P_{i}}

(2)

The proposed method in [22] utilizes an attention module for calculating the fusion factor, incorporating the BAM attention mechanism from [23] and enhancing the efficiency of feature fusion between adjacent scales, as illustrated in Figure 3. The feature fusion factor is computed to alleviate the negative effects during the feature fusion process, with its formula expressed as follows:

a_{i}^{i + 1} = M_{s} (C_{i}^{}, P_{i + 1}^{}) \cdot M_{c} (C_{i}^{}, P_{i + 1}^{})

(3)

Here,

C_{i^{'}}

represents the feature map obtained by

C_{i}

after a 1 × 1 convolution operation, whereas

P_{{i + 1}^{'}}

denotes a feature map that was likewise processed through a twofold upsampling operation from

P_{i + 1}

. Both

M_{s}

and

M_{c}

denote spatial and channel attention modules used within the adjacent scale feature high-efficiency fusion (AFHF) module, as depicted in Figure 3.

The spatial attention module plays a crucial role in analyzing the differences between adjacent feature maps at different layers of transmission and filtering out invalid samples passed from deep to shallow layers. The

M_{s}

module is represented by the following formula:

M_{s} (C_{i}^{^{'}}, P_{i + 1}^{^{'}}) = σ (f_{5} (S o f t m a x^{'} (f_{1} f_{3} f_{3} f_{1} (C_{i}^{^{'}})) - S o f t m a x^{'} (f_{1} f_{3} f_{3} f_{1} (P_{i + 1}^{^{'}})))

(4)

Here,

σ

represents the sigmoid activation function, and

f_{1}

,

f_{3}

, and

f_{5}

specify convolution operations with varying kernel sizes and shared parameter information. Additionally,

S o f t m a x^{'}

refers to an operation involving multiplication across the row and column dimensions of feature maps after a softmax operation has been applied. Overall, these techniques aid in accurately identifying invalid sample data within drone imagery datasets at various depths of analysis.

Each feature map channel contains a significant amount of information. The channel attention module (CAM) focuses on the meaningful content within the feature map, and, in conjunction with the spatial attention module (SAM), it can more effectively process features at the channel level to reduce meaningless channels for improved performance. The formula for the MC module is:

M_{c} (C_{i}^{^{'}}, P_{i + 1}^{^{'}}) = σ (M L P (G A P (C_{i}^{^{'}}); G A P (P_{i + 1}^{^{'}})))

(5)

Here, “GAP” represents global average pooling operation, while MLP represents a multilayer perceptron with a hidden layer composed of fully connected layers along with ReLU activation that reduces channel dimensions to 1/r times their original size before expanding them back out again. The specific experiment uses r = 16 in this instance, whereas sigmoid is used as an activation function.

3.2. Multiscale Wide Field-of-View Adaptive Fusion Module

In open-pit mining environments, heavily obscured targets such as vehicles and pedestrians present challenges due to the scale of the involved objects. In such scenarios, contextual information can be utilized to improve recognition performance. The YOLOv5-s algorithm leverages SPPF spatial pyramid pooling modules to increase the perceptual field while segregating critical contextual features from multiple sources for fusion purposes. However, this approach may hinder feature extraction accuracy during target detection by interfering with the extraction of significant key features, resulting in inadequate information capture.

Therefore, this paper proposes using an adapted RFB-s network structure, which effectively increases the receptive field area for adaptive fusion whilst addressing shortcomings experienced with previous approaches. Figure 4 illustrates the improved methodology employed in our study.

The proposed RFB-s module employs several techniques to optimize the structural design. First, the input feature map is subjected to a 1 × 1 convolution operation, which reduces both channel count and computational load. Asymmetric convolutional layers are then used to further reduce parameter size before applying 3 × 3 dilated convolutions that expand the feature perceptual field across three rates (1, 3, and 5). Each such branch undergoes stitching, followed by another round of 1 × 1 convolution fusion operations that yield a final output fulfilling the fusion requirements of each stage. Critically, our approach incorporates shortcut regularization [24], which not only accelerates training but also reconciles issues around exploding or vanishing gradient flow via the merging of multiscale perceptual features with their original counterparts.

To further enhance the receptive field, the improved module for object detection in drone imagery, referred to as SRFB-s (strong receptive field block), adopts the overall structure of multi-branch null convolution. It replaces 3 × 3 convolutions with more efficient 1 × 3 and 3 × 1 asymmetric convolutions to reduce parameter count and computational effort. The module also includes additional perceptual field branches to provide a wider range of features, including contextual information. Additionally, it utilizes the ASFF network [25] to adaptively fuse feature maps and selects optimal fusion methods based on scale targets during training to prevent irrelevant background noise degradation while enhancing detection capabilities under occlusion, large-scale changes, etc.

3.3. Attention Mechanism and Loss Function Optimization

In object detection in open-pit mining scenes, challenges concerning considerable scale variations together with occlusions are encountered. During feature extraction, the model integrates extensive amounts of invalid feature information stemming from background clutter as well as undetected targets, which negatively affect valid target information extraction. To mitigate such concerns, researchers worldwide utilize attention mechanisms that highlight crucial features while ignoring irrelevant data, thus improving overall model performance. The coordinate attention (CA) module, introduced in [26], filters out invalid details, instead emphasizing relevant ones by incorporating novel encoding methods along two spatial directions, integrating coordinate information into generated attention maps for lightweight networks.

The role of the loss function is essential to enhance object detection and localization in open-pit mining scene models. The loss function comprises three critical components: localization loss, confidence loss, and classification loss. The formula is described as follows:

L o s s = L o c a l i z a t i o n_{L o s s} + C o n f i d e n c e_{L o s s} + C l a s s i f i c a t i o n_{L o s s}

(6)

The localization loss function shows varying effectiveness depending on different detection scenarios. To enhance open-pit scenario detection of occluded targets, small-scale targets, and medium-sized targets, this model employs StrongFocalLoss for calculating target frame localization loss; it applies the FocalLoss function for confidence loss in contrast to the conventionally employed cross-entropy loss function BCEclsloss. The experiment uses the StrongFocalLoss loss function instead of cross-entropy for classification loss, resulting in more effective object recognition.

S F L (σ) = - | y - σ |^{α} ((1 - y) lg (1 - σ) + y lg (1 - σ))

(7)

where

σ

represents the prediction, and

| y - σ |^{α} (α \geq 0)

, with scaling factor

α

, is a quality label ranging from 0 to 1 that denotes the absolute value of distance. This hyperparameter controls the downscaling rate, which can be set for optimal performance; in the literature, examples such as the recent study of Yuan et al. [27] suggest an experimentally specific variety of this parameter, where

α = 2

.

In open-pit mining scenarios, targets are often occluded or densely distributed at small scales, causing overlaps between candidate frames and leading to reduced classification accuracy within model loss functions subject to non-maximal suppression post-processing analyses. These deficiencies may be compensated for by introducing SFL into models that uncover obscured targets or those appearing on dense small-scale settings found in open-pit mining scenes more accurately relative to earlier work.

3.4. Improved Multiscale Obstacle Object Detection Model

The YOLOv5-s algorithm detects large, medium, and small targets using three scales of 20 × 20, 40 × 40, and 80 × 80. Due to the presence of a considerable number of small-scale and densely occluded targets in open-pit scenes, target features mostly exist at shallow network layers. To improve the accuracy of detecting smaller objects, we add a 160 × 160 shallow detection layer. However, as model complexity increases with this change, we reduced the deep feature layer to 20 × 20. This reduction allows for obtaining enough information required for larger-scale object detection, such as trucks or forklifts, leading to significant decreases in accuracy overall.

Comparing models in terms of improvements based on multiple metrics indicates that adding an extra shallow layer substantially improves model performance without inferring complexities unacceptable for practical deployment while still retaining the crucial 20 × 20 deep detection layer necessary for the precise identification of large-scale objects such as trucks or sprinklers. Figure 5 shows our improved IMOD model architecture, which includes both the original scales and one added scale per the recommendations.

4. Experimental Analysis

4.1. Network Model Ablation Study

To fully examine the efficacy of the three proposed improvement strategies in this paper, their impact on the performance of YOLOv5-s was investigated by conducting ablation and robustness experiments. Evaluation metrics including parameter count, weight size, computation volume (GFLOPS), mean average precision (mAP), and single-frame detection time (FPS) were evaluated. The

m A P

was computed at an intersection over a union threshold of 0.5 using the following formula:

P = T P / (T P / F P)

(8)

R = T P / (T P + F P)

(9)

m A P = \frac{1}{C} \sum_{i = 1}^{c} A P_{i} = \frac{1}{C} \int_{0}^{1} P (R) d R

(10)

where P is the accuracy rate,

T P

is true positive cases,

F P

is false positive cases, and

F N

is false negative cases.

Figure 6 presents the training loss curves of localization, classification, and confidence losses, indicating that the improved model converges faster than before. Furthermore, Figure 7 reports an enhanced

m A P

achieved by the improved model.

4.2. Robustness Experiment

The team utilizes intelligent driving data centers to acquire on-road datasets for large-scale real-world scenarios, one of which is the BUUISE-MO dataset shown in Figure 8, a mine obstacle image dataset.

The BUUISE-MO dataset has a picture resolution of 1920 × 1080, a training set of 7220, and a test set of 2500, for a total of 9720 images, as shown in Figure 8. The dataset contains 15 categories including truck, forklift, car, excavator, person, signboard, and others, with 6041 (large), 9230 (medium), and 12,043 (small) labels. This dataset is appropriate for the task of detecting small objects.

4.3. Comparative Experiment

4.3.1. Network Model Ablation Experiment

To verify the feasibility of the improved strategy and to explore its effect on model performance, different improvement methods were integrated into the original network structure. Ablation experiments were performed on the BUUISE-MO dataset to evaluate all aspects of the index, such as changes in model structure, parameter count, and computational complexity due to these improvements. Table 1 reports specific ablation experiment results, where ✔ denotes integration with our proposed method. The proposed strategy displays varying levels of improvement across both datasets without impacting real-time detection capabilities. Additionally, it facilitates higher detection accuracy for target objects within open-pit mining areas while validating feasibility through the successful completion of the ablation study.

In this article, the IMOD model is proposed, which employs 5G-based multi-UAV and deep learning methods to enhance unmanned mining vehicle behavior via multisystem coordination in an open-pit mining environment. An autonomous mine obstacle image dataset is constructed and experimentally analyzed to address difficulties with recognizing small-scale targets amidst complex road scenes in mining areas. The IMOD model effectively enhances safety for driverless vehicles and personnel/equipment within mining zones. Future work should focus on improving the accuracy of small target recognition under abnormal illumination conditions and addressing error correction resulting from data desynchronization caused by multisystem coordinated power outages.

The BUUISE-MO dataset was chosen for quantitative and qualitative analysis to demonstrate the degree of improvement achieved by the improved algorithm on various targets. Table 2 displays the average accuracy of different enhanced algorithms for each target.

Upon comparison of Table 1 and Table 2, it was determined that the inferior performance of the model in open-pit mining environments can be attributed to an increase in occluded targets and smaller targets. Although the SRFB-S module and AFHF module have been demonstrated to improve the detection accuracy for trucks, signboards, excavators, persons, forklifts, and cars within this scene, real-time detection is sacrificed as a result. The addition of both CA attentional modules and SFL loss functions resulted in improvements of 0.3% and 0.2%, respectively, without compromising other features. Implementing four detection branches presents the potential for significantly enhancing small target detection accuracy at the cost of increased model complexity.

4.3.2. Robustness Experiment

To validate the robustness and generalization of our improved model in complex road scenarios with significant occlusion and shadow factors, we utilized the BUUISE-MO dataset [28] for experimental verification. After undergoing thorough training involving 320 iterations, our model achieved stability while producing results during validation that are detailed in Table 3.

To further validate whether the improved algorithm can be generalized to basic object detection scenarios, we removed small objects in the BDD100K dataset that have a significant impact on detection and evaluated our approach using both the BDD100K dataset (with six classes) and VOC dataset. After training for 320 epochs, the model showed stable performance, as demonstrated in Figure 9 and Figure 10.

The data analysis presented in the above table indicates that, although the algorithm can be applied to basic object detection scenarios, its performance does not exhibit significant improvements. However, when compared with the YOLOv5-s algorithm, the enhanced approach demonstrates improved detection accuracy in complex road-scene datasets. This finding provides evidence that the enhanced model is both robust and generalizable.

4.3.3. Comparative Experiment

To demonstrate the efficacy of the IMOD algorithm, we compared it with four state-of-the-art algorithms, namely faster R-CNN, YOLOv4, YOLOv5-s, and YOLOv5-m, using the BDD100K dataset [29] as a benchmark. Detection accuracy and speed were employed as evaluation metrics. The experimental results are presented in Table 4 for comparison purposes.

According to the results in Table 4, obtained from different algorithmic models applied on the BDD100K dataset, the enhanced YOLOv5 algorithm outperforms other prevalent models based on both detection accuracy and speed metrics. In contrast to the YOLOv5-s and YOLOv5-m algorithms, with swift performance but lower detection accuracy, the enhanced YOLOv5 algorithm manifests superiority over real-time model operation conditions while delivering prominent overall performance, resulting in improved object recognition in intricate road scenarios. This has practical significance, demonstrating its beneficial applicability scope.

5. Conclusions

We proposed an object detection algorithm for complex road scenes in open-pit mining environments, aiming to address the problems of low detection accuracy, false detection, and missed detection of road occlusion targets and small-scale targets. Our algorithm is based on adaptive feature fusion using the YOLOv5-s algorithm as a starting point. We introduce a feature fusion factor to reduce negative impacts caused by adjacent scale fusion strategies, increase effective samples after feature fusion, and improve learning ability for small- to medium-sized targets. Additionally, we propose an improved receptive field module that extracts more target feature information from shallow feature layers. Finally, we introduce a CA attention mechanism and StrongFocalLoss loss function to improve model accuracy for dense occlusion targets and small-scale targets.

We autonomously collect and construct a mine obstacle image dataset to facilitate experimental testing of our approach. Our results show that our approach effectively addresses issues of blockage and small-scale target recognition in complex road scenarios found in mining areas, with use cases demonstrating the IMOD model increases the safety of unmanned vehicles while preserving equipment fidelity required at industrial scales.

Future work will involve improving recognition accuracy under abnormal illumination conditions as well as correcting errors due to data synchrony caused by multisystem shutdowns during network operations. Developing lightweight architectures toward facilitating deployment on mobile terminals while simultaneously enhancing overall model accuracy is also essential.

Author Contributions

Conceptualization and methodology, X.X. and C.X.; software and validation, C.X. and Z.W.; formal analysis, Y.Z.; investigation, S.Z.; writing—original draft preparation, X.X.; writing—review and editing, C.X., X.X. and Z.W.; project administration, H.B. and X.Q. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported in part by a key project of the National Nature Science Foundation of China (Grant No. 61932012), in part by the National Natural Science Foundation of China (Grant No. 62102033), and in part by Support for high-level Innovative Teams of Beijing Municipal Institutions (Grant No. BPHR20220121).

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

Gao, Y.; Ai, Y.; Tian, B.; Chen, L.; Wang, J.; Cao, D.; Wang, F.Y. Parallel end-to-end autonomous mining: An IoT-oriented approach. IEEE Internet Things J. 2019, 7, 1011–1023. [Google Scholar] [CrossRef]
Ko, Y.; Kim, J.; Duguma, D.G.; Astillo, P.V.; You, I.; Pau, G. Drone Secure Communication Protocol for Future Sensitive Applications in Military Zone Number: 6. Sensors 2021, 21, 2057. [Google Scholar] [CrossRef] [PubMed]
Xu, C.; Wu, H.; Liu, H.; Gu, W.; Li, Y.; Cao, D. Blockchain-oriented privacy protection of sensitive data in the internet of vehicles. IEEE Trans. Intell. Veh. 2022, 8, 1057–1067. [Google Scholar] [CrossRef]
Chen, S.; Hu, J.; Shi, Y.; Zhao, L.; Li, W. A vision of C-V2X: Technologies, field testing, and challenges with chinese development. IEEE Internet Things J. 2020, 7, 3872–3881. [Google Scholar] [CrossRef] [Green Version]
Zhang, X.; Guo, A.; Ai, Y.; Tian, B.; Chen, L. Real-time scheduling of autonomous mining trucks via flow allocation-accelerated tabu search. IEEE Trans. Intell. Veh. 2022, 7, 466–479. [Google Scholar] [CrossRef]
Ma, N.; Li, D.; He, W.; Deng, Y.; Li, J.; Gao, Y.; Bao, H.; Zhang, H.; Xu, X.; Liu, Y.; et al. Future vehicles: Interactive wheeled robots. Sci. China Inf. Sci. 2021, 64, 1–3. [Google Scholar] [CrossRef]
Pan, Z.; Zhang, C.; Xia, Y.; Xiong, H.; Shao, X. An Improved Artificial Potential Field Method for Path Planning and Formation Control of the Multi-UAV Systems. IEEE Trans. Circuits Syst. II Express Briefs 2022, 69, 1129–1133. [Google Scholar] [CrossRef]
Krichen, M.; Adoni, W.Y.H.; Mihoub, A.; Alzahrani, M.Y.; Nahhal, T. Security Challenges for Drone Communications: Possible Threats, Attacks and Countermeasures. In Proceedings of the 2022 2nd International Conference of Smart Systems and Emerging Technologies (SMARTTECH), Riyadh, Saudi Arabia, 9–11 May 2022; pp. 184–189. [Google Scholar]
Girshick, R.; Donahue, J. Trevor DARRELL a Jitendra MALIK. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar]
Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 2015, 28, 1497. [Google Scholar] [CrossRef] [PubMed] [Green Version]
Xiang, X.; Lv, N.; Guo, X.; Wang, S.; El Saddik, A. Engineering vehicles detection based on modified faster R-CNN for power grid surveillance. Sensors 2018, 18, 2258. [Google Scholar] [CrossRef] [PubMed] [Green Version]
Ghosh, R. On-road vehicle detection in varying weather conditions using faster R-CNN with several region proposal networks. Multimed. Tools Appl. 2021, 80, 25985–25999. [Google Scholar] [CrossRef]
Luo, J.Q.; Fang, H.S.; Shao, F.M.; Zhong, Y.; Hua, X. Multi-scale traffic vehicle detection based on faster R-CNN with NAS optimization and feature enrichment. Def. Technol. 2021, 17, 1542–1554. [Google Scholar] [CrossRef]
Yin, G.; Yu, M.; Wang, M.; Hu, Y.; Zhang, Y. Research on highway vehicle detection based on faster R-CNN and domain adaptation. Appl. Intell. 2022, 52, 3483–3498. [Google Scholar] [CrossRef]
Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
Qiu, Z.; Bai, H.; Chen, T. Special Vehicle Detection from UAV Perspective via YOLO-GNS Based Deep Learning Network. Drones 2023, 7, 117. [Google Scholar] [CrossRef]
Koay, H.V.; Chuah, J.H.; Chow, C.O.; Chang, Y.L.; Yong, K.K. YOLO-RTUAV: Towards real-time vehicle detection through aerial images with low-cost edge devices. Remote Sens. 2021, 13, 4196. [Google Scholar] [CrossRef]
Zhang, Y.; Guo, Z.; Wu, J.; Tian, Y.; Tang, H.; Guo, X. Real-Time Vehicle Detection Based on Improved YOLO v5. Sustainability 2022, 14, 12274. [Google Scholar] [CrossRef]
Ultralytics. YOLOv5. Available online: https://github.com/ultralytics/yolov5 (accessed on 3 February 2023).
Lu, X.; Ai, Y.; Tian, B. Real-time mine road boundary detection and tracking for autonomous truck. Sensors 2020, 20, 1121. [Google Scholar] [CrossRef] [Green Version]
Liu, S.; Huang, D. Receptive field block net for accurate and fast object detection. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 404–419. [Google Scholar]
Wu, H.; Xu, C.; Liu, H. S-MAT: Semantic-Driven Masked Attention Transformer for Multi-Label Aerial Image Classification. Sensors 2022, 22, 5433. [Google Scholar] [CrossRef]
Park, J.; Woo, S.; Lee, J.Y.; Kweon, I.S. Bam: Bottleneck attention module. arXiv 2018, arXiv:1807.06514. [Google Scholar]
Geirhos, R.; Jacobsen, J.; Michaelis, C.; Zemel, R.; Brendel, W.; Bethge, M.; Wichmann, F. Shortcut Learning in Deep Neural Networks. Nat. Mach. Intell. 2020, 2, 665–673. [Google Scholar] [CrossRef]
Cheng, X.; Yu, J. RetinaNet with difference channel attention and adaptively spatial feature fusion for steel surface defect detection. IEEE Trans. Instrum. Meas. 2020, 70, 1–11. [Google Scholar] [CrossRef]
Hou, Q.; Zhou, D.; Feng, J. Coordinate attention for efficient mobile network design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13708–13717. [Google Scholar]
Yuan, J.; Wang, Z.; Xu, C.; Li, H.; Dai, S.; Liu, H. Multi-vehicle group-aware data protection model based on differential privacy for autonomous sensor networks. IET Circuits Devices Syst. 2023, 17, 1–13. [Google Scholar] [CrossRef]
Li, M.; Zhang, H.; Xu, C.; Yan, C.; Liu, H.; Li, X. MFVC: Urban Traffic Scene Video Caption Based on Multimodal Fusion. Electronics 2022, 11, 2999. [Google Scholar] [CrossRef]
Yu, F.; Chen, H.; Wang, X.; Xian, W.; Chen, Y.; Liu, F.; Madhavan, V.; Darrell, T. BDD100K: A Diverse Driving Dataset for Heterogeneous Multitask Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 2636–2645. [Google Scholar]

Figure 1. A 5G-multi-UAV vehicle-road-cloud integration scene in an open-air intelligent mining area.

Figure 2. The IMOD mining scene collaborative system architecture based on 5G-multi-UAV.

Figure 3. Adjacent scale feature high-efficiency fusion (AFHF).

Figure 4. SRFB-s module of IMOD.

Figure 5. The structure of the IMOD model.

Figure 6. Comparison of training loss curves.

Figure 7. Validation set

m A P

comparison map.

Figure 7. Validation set

m A P

comparison map.

Figure 8. The BUUISE-MO mine obstacle image dataset.

Figure 9. Comparison of detection results of different models for major targets.

Figure 10. Comparison of core parameters computing performance.

Table 1. Experimental results of the ablation study on the IMOD model using the BUUISE-MO dataset.

CA	SRFB-s	AFHF	SFL	4-Scale	mAP@0.5	FPS
	✔				0.495	144
		✔			0.494	146
✔					0.481	155
			✔		0.478	154
				✔	0.492	152
	✔	✔			0.514	123
✔	✔	✔			0.515	122
✔	✔	✔	✔		0.528	120
✔	✔	✔	✔	✔	0.543	120

Table 2. Average accuracy of improved models for various types of targets.

Algorithm	Truck	Signboard	Excavator	Person	Car	Forklift
YOLOv5-s	0.732	0.556	0.537	0.556	0.413	0.365
v5s-SRFB-s	0.754	0.582	0.563	0.587	0.406	0.398
v5s-AFHF	0.756	0.576	0.564	0.573	0.414	0.386
v5s-CA	0.726	0.562	0.541	0.555	0.417	0.373
v5s-SFL	0.744	0.546	0.542	0.546	0.414	0.313
v5s-4-scale	0.757	0.574	0.565	0.573	0.41	0.384
IMOD	0.815	0.592	0.612	0.596	0.489	0.396

Table 3. Experimental results showing improved model robustness on the BUUISE-MO dataset.

Algorithm	mAP@0.5	FPS	Parameter/M	GFLOPs
YOLOv5-s	0.466	155	7.08	16.2
YOLOv5-m	0.512	129	20.85	47
IMOD	0.535	120	11.19	20.4

Table 4. Performance comparison of different algorithms on the BDD100K dataset.

Algorithm	Backbone	mAP@0.5	FPS (V100)
YOLOv4	CSPDarknet53	0.452	65
IMOD	Darknet53	0.318	12
YOLOv5-s	CSPDarknet53	0.467	105
YOLOv5-m	CSPDarknet53	0.511	89
IMOD	CSPDarknet53	0.534	80

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Xu, X.; Zhao, S.; Xu, C.; Wang, Z.; Zheng, Y.; Qian, X.; Bao, H. Intelligent Mining Road Object Detection Based on Multiscale Feature Fusion in Multi-UAV Networks. Drones 2023, 7, 250. https://doi.org/10.3390/drones7040250

AMA Style

Xu X, Zhao S, Xu C, Wang Z, Zheng Y, Qian X, Bao H. Intelligent Mining Road Object Detection Based on Multiscale Feature Fusion in Multi-UAV Networks. Drones. 2023; 7(4):250. https://doi.org/10.3390/drones7040250

Chicago/Turabian Style

Xu, Xinkai, Shuaihe Zhao, Cheng Xu, Zhuang Wang, Ying Zheng, Xu Qian, and Hong Bao. 2023. "Intelligent Mining Road Object Detection Based on Multiscale Feature Fusion in Multi-UAV Networks" Drones 7, no. 4: 250. https://doi.org/10.3390/drones7040250

APA Style

Xu, X., Zhao, S., Xu, C., Wang, Z., Zheng, Y., Qian, X., & Bao, H. (2023). Intelligent Mining Road Object Detection Based on Multiscale Feature Fusion in Multi-UAV Networks. Drones, 7(4), 250. https://doi.org/10.3390/drones7040250

Article Menu

Intelligent Mining Road Object Detection Based on Multiscale Feature Fusion in Multi-UAV Networks

Abstract

1. Introduction

2. Related Work

2.1. Multisystem Collaboration Scenarios and Applications in Open-Pit Mines

2.2. 5G-Based Multi-UAV Collaboration Technology in Mining Areas

2.3. Obstacle Detection for Unmanned Mine Trucks

3. 5G-Multi-UAV-Based IMOD Autonomous Driving Model

3.1. Effective Fusion of Adjacent Scale Features

3.2. Multiscale Wide Field-of-View Adaptive Fusion Module

3.3. Attention Mechanism and Loss Function Optimization

3.4. Improved Multiscale Obstacle Object Detection Model

4. Experimental Analysis

4.1. Network Model Ablation Study

4.2. Robustness Experiment

4.3. Comparative Experiment

4.3.1. Network Model Ablation Experiment

4.3.2. Robustness Experiment

4.3.3. Comparative Experiment

5. Conclusions

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI