Joint Fusion and Detection via Deep Learning in UAV-Borne Multispectral Sensing of Scatterable Landmine

Compared with traditional mine detection methods, UAV-based measures are more suitable for the rapid detection of large areas of scatterable landmines, and a multispectral fusion strategy based on a deep learning model is proposed to facilitate mine detection. Using the UAV-borne multispectral cruise platform, we establish a multispectral dataset of scatterable mines, with mine-spreading areas of the ground vegetation considered. In order to achieve the robust detection of occluded landmines, first, we employ an active learning strategy to refine the labeling of the multispectral dataset. Then, we propose an image fusion architecture driven by detection, in which we use YOLOv5 for the detection part, to improve the detection performance instructively while enhancing the quality of the fused image. Specifically, a simple and lightweight fusion network is designed to sufficiently aggregate texture details and semantic information of the source images and obtain a higher fusion speed. Moreover, we leverage detection loss as well as a joint-training algorithm to allow the semantic information to dynamically flow back into the fusion network. Extensive qualitative and quantitative experiments demonstrate that the detection-driven fusion (DDF) that we propose can effectively increase the recall rate, especially for occluded landmines, and verify the feasibility of multispectral data through reasonable processing.


Introduction
Landmines have caused great harm to human life and health and greatly hindered the harmonious development of human society [1]. According to incomplete statistics, more than 110 million landmines are still present in post-war areas. To date, many methods have been developed to detect and remove landmines during war or post-war reconstruction. They can be divided into two categories based on their fundamental principles [2]. One employs the geophysical differences between certain parts of landmines and their surroundings, including ground-penetrating radar, low-frequency electromagnetic induction, metal detectors, ultrasonic devices, infrared imaging, and other combinations of detection techniques. The other has a wide range of research dimensions based on particular techniques, such as acoustics, biological sensing, and molecular tracking.
With the development of military technologies, scatterable landmines have emerged and have attracted increasing attention, as opposed to conventional landmines concealed under vegetation. Scatterable landmines naturally possess the advantages of speed, flexibility, and high mobility. Consequently, large-scale and large-area minefields can be deployed quickly during war. Moreover, aerial minelaying or projection from armored vehicles results in the randomized distribution of landmines across diverse topographies and complex vegetation environments. These characteristics pose new challenges for minefield clearance and make the long-distance and rapid detection of scatterable landmines increasingly necessary.
Meanwhile, massive advances in unmanned aerial vehicles (UAVs) have presented enormous potential [3]. Specifically, UAVs with various on-board electromagnetic sensors have yielded a considerable number of applications, including a rise in low-altitude landmine detection, as they can provide better safety performance and faster measurements than other mine detection platforms. Many investigations aiming at the combination of drones and predominantly ground-penetrating radar (GPR) have been conducted [4][5][6][7][8][9]. The authors in [4,6] aimed to achieve a precise and steady flight over the surface, to increase the positioning accuracy and reliability of GPR measurements, while on-board software-defined radio (SDR)-based radars were developed in [6,10] to detect landmines. Other customized sensors mounted on drones have also emerged. Among them, synthetic aperture radar (SAR) [9,11] and hyper-spectral [12] imaging techniques have shown unprecedented potential with advanced image processing and computer vision algorithms. In [11], the authors reported the capability of high-resolution 3D SAR images for the detection of buried hazards, while the hyper-spectral imaging of landmines was highlighted with different signal processing techniques in [12,13].
Numerous factors, such as the materials and shapes of the mines, terrain complexity, vegetation humidity, flight altitude, etc., limit the detection of scatterable mines. It is challenging for a single sensor to solve all of the aforementioned issues. It is well known that objects exhibit different features in different spectra, and multispectral remote sensing that obtains and comprehensively exploits different spectra information of the same target by means of photography or scanning is appealing. For instance, a particular spectrum may offer a unique advantage in certain terrains or settings, whereas the proper fusion of multispectral remote sensing might expand the observation range and reveal more information. Multispectral fusion greatly outperforms a single spectrum in pedestrian recognition, especially in challenging situations such as occluded appearances and background clutter [14]. However, the development of multispectral techniques in the detection of scattered landmines is still in a preliminary exploration stage. Research in this area mainly concentrates on the analysis of different spectral signatures and the simulation of ground environmental influences [15][16][17][18][19]. Differences in spectral signatures between soils and landmines are essential for detection, and a study has been performed regarding this [15]. The utilization of remote sensing thermal infrared technology has led to satisfactory results in PFM-1 minefield detection [16], and plastic mines such as PFM-1 have higher thermal conductivity when the heat flux is high in the early morning [18]. In addition, ref. [17] simulated the influence of environmental forces on the orientation of PFM-1, complicating the minefield environment.
In the past few years, deep learning, based on the cognitive structure of the human brain, has achieved great success in many areas. Compared with the traditional machine learning algorithm (shallow learning), it can extract more abstract features through multilayer nonlinear transformation. As one of the fundamental visual tasks, object detection based on deep learning has outperformed traditional methods over the last few years. Ideas and implementations based on convolutional neural networks (CNNs) have facilitated the success of deep learning in detection [20]. Widely acclaimed deep learning models, such as ResNet [21], regions with CNN features [22] (R-CNN), Faster R-CNN [23], etc., show outstanding transferability in mine detection scenes [19,[24][25][26]. Some customized deep learning models also show powerful feature extraction abilities and adaptability in varied scenes [27,28]. It is worth noting that different types of data were used in the above studies, such as thermal infrared images, GPR B-scans, and radar signals, but deep learning methods still achieve excellent performance.
The Binghamton University team [16][17][18][19] is conducting ongoing research into multispectral landmine detection. Comparative experiments indicate that thermal cameras can detect temperature variations between plastic PFM-1 landmines and the host geology during the early-morning hours, when the thermal flux is high [18]. Their preliminary studies in [16,17] demonstrate that a thermal camera mounted on a low-cost UAV can successfully detect PFM-1 mines at an elevation of 10 m above the ground, regardless of their orientations. In 2020, they incorporated an analysis of multispectral and thermal datasets and discovered that different bands are beneficial in different situations [19]. Furthermore, they employed the Faster R-CNN [23] model and achieved satisfactory accuracy on a partially or wholly withheld testing set. The employed data only consist of RGB visible images, without fusing other bands for detection, but their solid work provides confidence and inspiration for multispectral detection. In our past work [29], we combined the detection results of RGB and NIR images and achieved excellent performance, demonstrating the advantages of multispectral over monospectral images in detection. In this paper, to further investigate the application of multispectral imaging, we explore in depth a new approach as an extension of the aforementioned work.
Most of the existing research and trials only consider extremely elementary simulation environments and landmine settings. However, high-quality data from real environments, or diversified and high-fidelity experimental scenes, are the foundation for the sufficient generalization of supervised deep learning. Therefore, in this study, we were motivated to set up complex minefield conditions to simulate real scenes and collect raw high-quality multispectral data acquired via a drone with an integrated multispectral camera. One RGB sensor and four monochrome sensors collected data in discrete spectral bands: green, red, red edge, and near-infrared. More specifically, our dataset contains four types of scatterable landmines, four scenes with different ground vegetation, and various flight altitudes. To overcome the labeling issues introduced by the severe occlusion and the tiny size of the landmines and to improve the labeling efficiency, we utilize the idea of active learning to ensure the complete and accurate labeling of the dataset. To improve the detection performance on the obtained high-quality multispectral images, we present a detection-driven image fusion framework with YOLOv5 [30] plugged in. Compared with the original YOLOv5 [30] benchmark and other fusion-detection cascade models, our proposed fusion detection enhances the recall markedly; the RGB and NIR bands are used to demonstrate this in the study. There are 789 pairs of images in our dataset, which are partitioned into training and test sets in an 8:2 ratio. A total of 991 mines are labeled in the test dataset, and a recall rate of 0.848 is achieved when using RGB alone for prediction, while 0.906 is achieved with the fusion of NIR and RGB driven by the landmine detection task. As a plug-and-play module, it can be integrated into most current state-of-the-art object detection models.
The rest of the paper is organized as follows. Section 2 illustrates the details of our dataset, including the instruments and the data processing. Section 3 introduces the adopted deep learning method and proposed fusion module. Section 4 presents the experimental results and comparative analysis. Finally, conclusions are drawn in Section 5.

Multispectral Dataset
In order to address the shortage of public multispectral landmine datasets, we constructed a small landmine dataset, including labeled RGB and near-infrared (NIR) image pairs.

Landmine Samples
In the dataset, 40 mines with five different sizes and materials were selected in total, three of which were anti-personnel mines, while the others were anti-vehicle mines. Their primary materials were metal, and the shell parts contained a small amount of plastic and rubber. The specifications of the four types of mines are given in Figure 1. With the exception of the M93 landmine, they are all olive-drab-covered and cylindrical in shape. Specifically, the smallest anti-personnel mine M14 is less than 6 cm in diameter and primarily non-metallic. In comparison, the anti-vehicle M93 Hornet mine possesses a relatively large body size and is plastic-cased with a rubber top. Before the explosion, they are air-delivered and hidden well on the ground or in vegetation, remaining active without being noticed for long periods.

Equipment
We acquired the multispectral images using the Sequoia multispectral kit. The Sequoia sensor comprises four monochrome cameras with specific narrow-band filters in the NIR and visible domains, together with a high-resolution RGB lens, as shown in Table 1. The camera is equipped with a "sunlight sensor" that measures sky irradiance: its built-in spectroradiometer measures irradiance over 2048 pixels in the visible and NIR spectral ranges (350-1000 nm) with a precision of 1.5 nm. The drone equipped with the multispectral sensor (Figure 2) can sustain approximately one hour of continuous flight. The preparations before take-off include the calibration of the drone and multispectral cameras and radiometric correction through the sunlight sensor, which monitors radiation fluctuations so that multispectral images of stable quality can be obtained under different lighting and natural conditions. The flying height changes according to the host environment and whether sparse, tall plants are present. In this work, the flying heights are fixed at approximately 3 m and 8 m, respectively, to obtain images with a similar ground resolution. During the data acquisition process, the drone cruises at a constant speed with an area overlap rate of 80% to ensure complete coverage of the minefields within the given range. Moreover, the built-in GPS/GNSS of the multispectral camera records the device's location when capturing images, which makes it feasible to convert between the image coordinate system and the geographic coordinate system.
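As context for the flight altitudes above: for a nadir-pointing camera, the ground sample distance (GSD) scales linearly with altitude. The sketch below uses assumed nominal figures for a Sequoia-class monochrome band (3.75 µm pixel pitch, 3.98 mm focal length); these values are illustrative and not taken from this paper.

```python
# GSD = pixel_pitch * altitude / focal_length for a nadir-pointing camera.
# The Sequoia-like numbers used here are assumed nominal values.

def gsd_mm(pixel_pitch_um: float, focal_length_mm: float, altitude_m: float) -> float:
    """Ground footprint of one pixel, in millimetres."""
    return pixel_pitch_um * 1e-6 * altitude_m / (focal_length_mm * 1e-3) * 1e3

# Assumed monochrome-band specs: 3.75 um pitch, 3.98 mm focal length
low = gsd_mm(3.75, 3.98, 3.0)   # footprint per pixel at 3 m
high = gsd_mm(3.75, 3.98, 8.0)  # footprint per pixel at 8 m
print(f"GSD at 3 m: {low:.1f} mm/px, at 8 m: {high:.1f} mm/px")
```

The linear scaling makes clear why the flying height must be adapted to each scene: doubling the altitude doubles the per-pixel footprint for a fixed lens.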

Experimental Scenes
We conducted data collection experiments on grassland, rather than on simpler experimental scenes such as sand, cement roads, or gravel roads, since a more complex environment with vegetation better validates the feasibility of our method. We selected several low-vegetation environments to simulate minefields. As displayed in Figure 3, mines are easily obscured by wild vegetation. The mines were scattered randomly, with approximately 60% obscured by vegetation; in other words, only part of the mine body was visible. Moreover, a small number of them were entirely covered by ground vegetation, which caused extremely low visibility. In addition, there were many leaves, stones, and plant rhizomes similar in color and shape to the mines, which may cause false alarms.

Acquisition and Registration
With a constant capture interval of 2 s, the UAV can complete a minefield cruise in a very short time. Sequoia's monochrome and RGB cameras are not jointly calibrated during production, and their separate shutters preclude rigorous time synchronization. As a result, the collected multispectral raw images exhibit a spatio-temporal shift; thus, in addition to the typical nonlinear intensity differences, there are also spatio-temporal mismatches between the multi-modal data. For instance, the vegetation in the simulated landscape mostly consists of dense grasses that sway in the wind in different directions, so completely accurate image registration is almost impossible. This can be a significant problem for actual minefield detection, but not for host environments without vegetation. Registration aligns the image pixels for more precise fusion and also eases subsequent labeling. A registered infrared and visible image pair is shown in Figure 4. We also apply the channel feature of oriented gradients (CFOG) [31] method to register the multispectral images, which further improves the precision of the registration result. By comparing the positions of the corresponding landmines, we can directly assess the quality of the registration.

Labeling
Supervised learning strongly relies on high-quality labeled data, which usually comes with high costs. Manually labeled data are time-consuming to obtain and sometimes unreliable. Even in the well-known public datasets in the academic community, there are still many instances of mislabeling. In this paper, we labeled our dataset using the following steps: detection equipment assembly, data collection, data cleaning, image alignment, and labeling.
Firstly, random landmine placement and extreme occlusion substantially increase the difficulty of labeling. Moreover, nonsalient objects such as landmines usually blend in easily with their surroundings. In a 1280 × 960 resolution image, a typical target is estimated to occupy 100 pixels, while the smallest mine, the M14, occupies only approximately 10 pixels. Labeling each image manually is therefore not feasible. During the labeling process, we referred to the strategy described in [14]: we compared and labeled images in parallel after aligning the image pairs. Moreover, to accelerate the labeling, we introduced an active learning technique, which is shown in Figure 5. In real-world circumstances, it is worth considering how to collect valuable labeled data at a lower cost and subsequently increase the algorithm's performance. Active learning focuses on reducing the labeling burden by incorporating human experience into machine learning models. Firstly, a query function determines the candidate set of data to be labeled. Then, professionals with domain experience annotate the selected data. As manual annotations are continually added, the model is incrementally updated. The core principle behind active learning is to resolve the "hard samples" via human interference; in our situation, a "hard sample" is a landmine that is well hidden and difficult to detect. Furthermore, the training results assist in the improvement of the annotation. After three rounds of this iterative process, a dataset with reliable annotations was created.
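The query step of such a loop is often implemented as least-confidence sampling: images whose most confident detection is weakest are routed to the human annotator first. A minimal NumPy sketch under that assumption (function names are illustrative, not from the paper's code):

```python
import numpy as np

def least_confidence_query(scores_per_image, k):
    """Select the k images whose best detection is least confident.

    scores_per_image: list of 1-D arrays of detection confidences per image
                      (an empty array means nothing was detected).
    Returns indices of the k "hardest" images to send for manual labeling.
    """
    # An image with no detections at all is maximally uncertain -> score 0.0.
    best = np.array([s.max() if s.size else 0.0 for s in scores_per_image])
    return np.argsort(best)[:k]  # ascending: least confident first

# Toy round: five images with mock detector confidences
scores = [np.array([0.9, 0.7]), np.array([0.3]), np.array([]),
          np.array([0.95]), np.array([0.5, 0.2])]
hard = least_confidence_query(scores, k=2)
print(hard)  # the no-detection image and the 0.3-confidence image come first
```

Each labeling round then annotates the selected images, retrains the detector, and re-runs the query on the remaining pool.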

Methods
Existing image fusion algorithms tend to focus on the visual effect of the fused image itself and the pursuit of quantitative evaluation, while ignoring the promotion ability of image fusion as a low-level vision task for the subsequent high-level vision task. In fact, the ideal image fusion method should not only provide rich texture details and accord with human visual perception, but also help to conduct high-quality downstream tasks.
Compared with traditional architectures, a fusion architecture driven by high-level vision tasks can better conserve semantic information as well as exclude external noise. Liu et al. [32] designed a cascade structure to connect an image denoising module with a high-level vision network and jointly trained them to minimize the image reconstruction loss. Under the guidance of semantic information, the denoising network could further improve the visual effect and help to complete the semantic task. Similarly, Guo et al. [33] designed a residual network for semantic refinement and jointly trained the deraining network and the segmentation network in two stages, which was successfully utilized to overcome the impact of rainwater on sensors during automatic driving. Object detection, as a high-level vision task, has also been used to guide image super-resolution reconstruction. In [34], the authors built a novel end-to-end super-resolution framework in which the updating of the super-resolution network was influenced by the detection loss. The proposed task-driven super-resolution algorithm was verified to improve the detection accuracy on low-resolution images under various scale factors and conditions.

Detection-Driven Fusion Framework
By focusing on landmine detection, we introduce a joint training framework based on detection-driven fusion (DDF), as shown in Figure 6, in which we select YOLOv5 [30] for detection. The image fusion network is not limited to the preservation of low-level semantic information but is driven by the detection task to gather more valuable high-level semantic information. Meanwhile, the fusion network is partially influenced by the detection result, which can preserve fine-grained details while providing more semantic information for computer perception.

Fusion Network
On the basis of an auto-encoder, we design a lightweight fusion network (denoted as DDFNet) with two inputs and one output, in which the registered visible image $I_{rgb} \in \mathbb{R}^{H \times W \times 3}$ and near-infrared image $I_{nir} \in \mathbb{R}^{H \times W \times 1}$ serve as the original inputs. The images pass through the encoder, fusion layer, and decoder, respectively, as shown in Figure 7, where the encoder is used for feature extraction and the decoder for image reconstruction, which can be expressed as follows:

$$F_{rgb} = EN_{rgb}(I_{rgb}), \quad F_{nir} = EN_{nir}(I_{nir}), \tag{1}$$

$$F_f = \mathcal{C}(F_{rgb}, F_{nir}), \quad I_f = DE(F_f), \tag{2}$$

where $F_{rgb}$ and $F_{nir}$ are the depth features extracted by the two encoders $EN_{rgb}$ and $EN_{nir}$, respectively, $F_f$ denotes the fused features of the visible and infrared images obtained through the concatenation $\mathcal{C}(\cdot)$, and $I_f$ is the fused image reconstructed by the decoder $DE$. In Figure 7, the encoder part mainly comprises two general convolutional blocks and two dense residual blocks (DRB), shown in Figure 8, which combine the residual and dense blocks from [35]. The main branch of a DRB aggregates shallow and deep semantic information through dense connections, while the shortcut branch uses the classic 1 × 1 convolutional layer of the residual block to achieve identity mapping. The outputs of the main branch and the shortcut are then added element-wise to integrate deep features with fine-grained features. The superposition of two DRBs in the encoder thus ensures that abundant semantic features are fully exploited. The fusion layer then fuses the rich features extracted from the infrared and visible images via concatenation, and the fused features are fed into the decoder for image reconstruction. As the dimensions of the feature map shrink, the decoder outputs a three-channel fused image. In general, the design of our fusion network is simple and lightweight, which contributes to a higher fusion speed and eases subsequent deployment.
Moreover, the introduction of the auto-encoder architecture as well as the DRB makes it possible for the fused image to fully restore the details and semantic information of the source images.
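The dense residual block described above can be sketched framework-agnostically. The NumPy code below implements a DRB forward pass with random weights: a main branch of two 3 × 3 convolutions with a dense connection, a 1 × 1 shortcut, and an element-wise sum. Channel counts and the placement of the ReLU are illustrative, not the paper's exact configuration.

```python
import numpy as np

def conv2d(x, w):
    """'Same'-padded 2-D convolution + ReLU; x: (C_in,H,W), w: (C_out,C_in,kH,kW)."""
    c_out, c_in, kh, kw = w.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((0, 0), (ph, ph), (pw, pw)))
    _, H, W = x.shape
    out = np.zeros((c_out, H, W))
    for o in range(c_out):
        for i in range(c_in):
            for u in range(kh):
                for v in range(kw):
                    out[o] += w[o, i, u, v] * xp[i, u:u + H, v:v + W]
    return np.maximum(out, 0)  # ReLU activation

def drb(x, w1, w2, w_skip):
    """Dense residual block: dense concatenation on the main branch,
    1x1 convolutional shortcut, element-wise sum at the output."""
    y1 = conv2d(x, w1)                         # first 3x3 conv
    y2 = conv2d(np.concatenate([x, y1]), w2)   # dense: input features reused
    return y2 + conv2d(x, w_skip)              # 1x1 shortcut + residual add

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 16, 16))          # 4 input channels, 16x16 map
w1 = rng.standard_normal((4, 4, 3, 3)) * 0.1
w2 = rng.standard_normal((8, 8, 3, 3)) * 0.1  # sees x (4ch) + y1 (4ch)
w_skip = rng.standard_normal((8, 4, 1, 1)) * 0.1
print(drb(x, w1, w2, w_skip).shape)  # (8, 16, 16)
```

The dense concatenation is what lets the second convolution see both the raw input and the first-stage features, while the 1 × 1 shortcut keeps a low-cost identity-like path.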

Loss Function
The design of the loss function starts from two aspects. Firstly, the fusion network needs to fully integrate the complementary information of the source images, such as the salient objects in the infrared image and the texture details in the visible image. For this reason, the content loss $L_{content}$ is constructed, which aims to ensure the visual fidelity of the fused image. Secondly, the fused image should effectively promote the detection performance, so the semantic loss $L_{detect}$ is introduced to indicate the contribution of the fused image to the detection task. In summary, the joint loss function is designed as follows:

$$L_{all} = L_{content} + \lambda L_{detect}, \tag{3}$$

where the joint loss $L_{all}$ of the fusion network comprises the content loss $L_{content}$ and the semantic loss $L_{detect}$, and $\lambda$ is the weight of the semantic loss. Meanwhile, the content loss is composed of three terms, the image intensity loss $L_{intensity}$, the texture loss $L_{texture}$ [35], and the structural similarity loss $L_{ssim}$:

$$L_{content} = L_{intensity} + \beta_1 L_{texture} + \beta_2 L_{ssim}, \tag{4}$$

where $\beta_1$ and $\beta_2$ are the hyperparameters weighting the texture loss and the structural similarity loss, respectively. The image intensity loss $L_{intensity}$ assesses the difference in pixel intensity between the fused image and the source images, which can be more precisely formulated as

$$L_{intensity} = \frac{1}{HW} \left\| I_f - \max(I_{rgb}, I_{nir}) \right\|_1, \tag{5}$$

where $H$ and $W$ are, respectively, the height and width of the fused image $I_f$, $\|\cdot\|_1$ stands for the $l_1$-norm, and the element-wise $\max(\cdot)$ ensures that the fused image sufficiently maintains the intensity, which contributes to the positioning of landmines. The texture loss $L_{texture}$ is introduced to preserve the details. It can be calculated as

$$L_{texture} = \frac{1}{HW} \left\| \, |\nabla I_f| - \max(|\nabla I_{rgb}|, |\nabla I_{nir}|) \, \right\|_1, \tag{6}$$

where the absolute value operation is denoted by $|\cdot|$ and the Sobel gradient operator is denoted by $\nabla$. Although the standard $L_2$ distance can maintain edges well in image fusion, it cannot effectively remove noise.
Moreover, it lacks the capability to determine how structurally similar the images are, and it is highly sensitive to changes in lighting, color, etc., which poses a challenge in maintaining specificity in multispectral images. The Structural Similarity Index Measure (SSIM) proposed in [36] compares the structural similarity of an image before and after alteration. SSIM is sensitive to local structural changes, like the human visual system (HVS), and, more importantly, it can better tolerate the influence of outside information on the source image. The SSIM between two images $X$ and $Y$ can be expressed as follows:

$$SSIM(X, Y) = \frac{2\mu_X \mu_Y + C_1}{\mu_X^2 + \mu_Y^2 + C_1} \cdot \frac{2\sigma_X \sigma_Y + C_2}{\sigma_X^2 + \sigma_Y^2 + C_2} \cdot \frac{\sigma_{XY} + C_3}{\sigma_X \sigma_Y + C_3}, \tag{7}$$

where $\mu$ and $\sigma$ represent the mean and the standard deviation of an image, $\sigma_{XY}$ is the covariance between the two images, and $C_1$, $C_2$, $C_3$ are the parameters used to stabilize the measurement. The SSIM loss $L_{ssim}$ can therefore be formulated as

$$L_{ssim} = w_1 \left(1 - SSIM(I_f, I_{rgb})\right) + w_2 \left(1 - SSIM(I_f, I_{nir})\right), \tag{8}$$

where $w_1$ and $w_2$ denote the weights of the structural loss between the fused image $I_f$ and the source images $I_{rgb}$ and $I_{nir}$, respectively. Although the content loss can maintain information in the fused image and minimize distortion to some extent, it is challenging to measure the promotion impact of the fusion result on high-level vision tasks. As a consequence, we redefine the semantic loss using the standard loss function of YOLOv5 [30] to guide the training of the fusion network and ultimately enhance the detection performance. The object detection task normally requires us to refine the target size as well as identify and categorize the detected item. In particular, the accuracy of the results depends on three optimization directions:
• All targets in the image should be identified, with low missed and false alarm rates;
• The bounding boxes should completely and accurately enclose the target;
• The categorization of the detected object should be consistent with its label.
Therefore, the semantic loss $L_{detect}$ [30] is defined by the confidence loss $loss_{conf}$, the bounding box loss $loss_{rect}$, and the classification loss $loss_{cls}$ as follows:

$$L_{detect} = a \cdot loss_{conf} + b \cdot loss_{rect} + c \cdot loss_{cls}, \tag{9}$$

where $a$, $b$, and $c$ are their respective weights. The confidence loss generally weighs the most, followed by the bounding box loss and the classification loss.
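The content-loss terms of this subsection can be sketched in NumPy. The code below assumes single-channel images in [0, 1] (the visible image converted to grayscale for illustration), uses a global single-window SSIM rather than a sliding-window one, and picks illustrative weights; the semantic loss itself comes from the detector and is not reproduced here.

```python
import numpy as np

SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
SOBEL_Y = SOBEL_X.T

def sobel_mag(img):
    """L1 magnitude of the Sobel gradient of a 2-D image."""
    H, W = img.shape
    p = np.pad(img, 1, mode="edge")
    gx = np.zeros((H, W)); gy = np.zeros((H, W))
    for u in range(3):
        for v in range(3):
            win = p[u:u + H, v:v + W]
            gx += SOBEL_X[u, v] * win
            gy += SOBEL_Y[u, v] * win
    return np.abs(gx) + np.abs(gy)

def intensity_loss(fused, vis, nir):
    """Mean |I_f - max(I_vis, I_nir)|: keep the brighter of the two sources."""
    return np.abs(fused - np.maximum(vis, nir)).mean()

def texture_loss(fused, vis, nir):
    """Mean abs. difference to the element-wise max of the source gradients."""
    target = np.maximum(sobel_mag(vis), sobel_mag(nir))
    return np.abs(sobel_mag(fused) - target).mean()

def ssim(x, y, c1=1e-4, c2=9e-4):
    """Global (single-window) SSIM; production code uses a sliding window."""
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) / (mx**2 + my**2 + c1)) * \
           ((2 * cov + c2) / (vx + vy + c2))

def content_loss(fused, vis, nir, b1=1.0, b2=1.0, w1=0.5, w2=0.5):
    """L_intensity + b1 * L_texture + b2 * L_ssim with illustrative weights."""
    l_ssim = w1 * (1 - ssim(fused, vis)) + w2 * (1 - ssim(fused, nir))
    return intensity_loss(fused, vis, nir) \
        + b1 * texture_loss(fused, vis, nir) + b2 * l_ssim
```

A fused image that equals the element-wise maximum of the sources drives the intensity term to zero, and an image identical to both sources drives the whole content loss to zero, matching the intent of the formulation.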

Joint Training Algorithm
There are primarily two approaches to low-level vision model training that is motivated by high-level visual tasks. The first one is to train the low-level vision model under the guidance of the pre-trained high-level vision model, but the training result cannot be effectively fed back to the high-level vision model. The second one is to concurrently train the low-level and high-level vision models in one stage, and this strategy may bring difficulties in maintaining a balance between the performance of vision models at various levels. Therefore, we introduce a joint training method based on an iterative strategy to train our fusion network.
More specifically, we train the fusion and detection networks alternately, with the number of iterations set to $n$. Firstly, the fusion network and the detection network each employ the Adaptive Moment Estimation (Adam) optimizer to optimize their own parameters under the supervision of the joint loss $L_{all}$ and the detection loss $L_{detect}$, respectively. Secondly, the parameters of the detection model are frozen to generate a real-time semantic loss while the fusion network is updated. It should be noted that the hyperparameter $\lambda$ of the joint loss is dynamically adjusted during the iterative process, which is expressed as follows:

$$\lambda = (m - 1)\,\lambda_0, \quad m = 1, 2, \ldots, n, \tag{10}$$

where $m$ is the index of the current training round and $\lambda_0$ is a base weight. This schedule is primarily intended to guarantee that, as the fusion network is trained to a certain extent, the fraction of the semantic loss steadily increases and the visual effect becomes a secondary priority.
During the training process, we adopt the periodic joint training strategy demonstrated in Algorithm 1, which is iterated for five rounds, with the parameters of the updated detection model saved at the end of each round. In the first round, the fusion network only utilizes the content loss for backward propagation, while the semantic loss is set to 0. In the following four rounds, both the content and semantic losses contribute to the updating of the fusion network. In addition, the detection network is trained on the updated fused images in each round.
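The round structure can be sketched as a skeleton loop. The schedule below grows the semantic weight linearly with the round index and is zero in the first round, matching the behavior described above; the base weight value and stub steps are illustrative, not the paper's settings.

```python
# Skeleton of the periodic joint-training schedule: round 1 uses only the
# content loss; from round 2 the semantic (detection) loss is blended in
# with a steadily growing weight.

LAMBDA_0 = 0.1   # base semantic-loss weight (illustrative)
N_ROUNDS = 5

def semantic_weight(m, lam0=LAMBDA_0):
    """Weight of the semantic loss in round m (1-indexed): zero in round 1."""
    return lam0 * (m - 1)

def joint_training(rounds=N_ROUNDS):
    history = []
    for m in range(1, rounds + 1):
        lam = semantic_weight(m)
        # 1) update the fusion net on L_content + lam * L_detect,
        #    keeping the detector's parameters frozen during this step
        # 2) regenerate fused images, then update the detector on them
        #    and save its parameters
        history.append(lam)
    return history

print(joint_training())  # semantic weight per round: 0.0, then increasing
```

Freezing the detector while the fusion network updates (step 1) is what lets the semantic loss act as a fixed, differentiable critic within each round.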

Comparative Experiment
In our experiments, the landmine dataset is randomly divided into a training part and a testing part at a ratio of 8:2. The hyperparameters involved are all set to the default values, with no validation set for adjustment. This experimental setup makes it easier to demonstrate the feasibility of our landmine detection method.
To begin with, we conducted comparative fusion experiments for DDF and other algorithms. Figure 9 shows the fused results of the selected deep learning fusion models and some conventional algorithms in the Visible and Infrared Image Fusion Benchmark (VIFB) [37,38]. It should be noted that, since FusionGAN [39] and RFN-Nest [40] only work on single-channel images with a depth of 8 bits, their fusion results are single-channel grayscale images. The testing images are randomly selected from the infrared and visible image pairs of the multispectral landmine dataset, and part of the region is cropped to illustrate the fusion effect. There are two types of scatterable mines (M93 and T125) in the cropped area, of which the rose-red mine (M93) is easy to identify due to its large size and distinguishing characteristics. T125 is partially covered by vegetation and is therefore difficult to detect in visible images, but it reveals strong visual features in near-infrared images. It can be observed from Figure 9 that M93 remains clearly identifiable in the fusion results of most algorithms, while the distortion is severe in certain fused images. For example, the testing images fused with Densefuse [41], IFCNN [42], LatLRR [43], and GFCE [44] perform poorly in terms of pixel intensity and color. Moreover, the fused image generated by FusionGAN has the lowest image quality, and the target characteristics are nearly lost. Compared to the source images (RGB and NIR), RFN-Nest [40] successfully integrates the complementary features of the two spectra; however, although the significant features of the landmines in the two images are preserved in the fused image, the color information is lost and visibly different. In contrast, U2Fusion [45] maintains the color feature distribution well, but it introduces some noise in the white area.
As for MST-SR [46], NSCT-SR [46], and the DDF that we propose, all of them enhance the visual saliency of T125 while preserving the original color. Next, we randomly chose 20 pairs of near-infrared and visible images from the multispectral landmine dataset and processed them using different fusion algorithms. We performed a comprehensive fusion evaluation using 13 statistical metrics [37,38], which is shown in Table 2. To some extent, the evaluation metrics are designed to quantitatively simulate the human visual system (HVS). Some of them, such as the Root Mean Square Error (RMSE), Peak Signal to Noise Ratio (PSNR), and Structural Similarity Index Measure (SSIM), are measured based on the original RGB and NIR images as benchmarks, which can reflect the amount of complementary information retained from the source images in the fused image. Others, such as the Average Gradient (AG), Entropy Index (EI), and Entropy (EN), can be calculated with the fused image itself and be used for the evaluation of the image quality, such as edges, saliency, and other features, which may be helpful for the subsequent detection.
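Three of these metrics are simple enough to sketch directly. The NumPy code below computes RMSE and PSNR against a reference image and the no-reference entropy (EN), assuming images normalized to [0, 1]; the benchmark's official implementations may differ in detail (e.g., peak value, histogram binning).

```python
import numpy as np

def rmse(a, b):
    """Root Mean Square Error between two images."""
    return float(np.sqrt(((a - b) ** 2).mean()))

def psnr(a, b, peak=1.0):
    """Peak Signal to Noise Ratio in dB, for images in [0, peak]."""
    e = rmse(a, b)
    return float("inf") if e == 0 else 20 * np.log10(peak / e)

def entropy(img, bins=256):
    """Shannon entropy (EN) of the image histogram, in bits."""
    hist, _ = np.histogram(img, bins=bins, range=(0.0, 1.0))
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())
```

RMSE/PSNR quantify fidelity to a source image, while EN is computed from the fused image alone, mirroring the reference-based vs. no-reference split described above.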
In Table 2, blue represents the best-performing model for the corresponding metric. For the quantitative metrics designated with ↓, a lower value is better. The last row of Table 2 displays the evaluation metrics of DDF, where red, orange, and green, respectively, represent excellent, moderate, and poor performance of the fused image based on DDF compared with other fusion algorithms. It can be seen that LatLRR [43] ranks the highest for approximately half of the metrics, but it is not satisfactory regarding the visual effect shown in Figure 9. NSCT-SR [46] and MST-SR [46] perform better on the metrics Q CB [48] and Q CV [49]; as can be seen in Figure 9, both of them provide fusion results consistent with human vision. The DDF that we propose demonstrates satisfactory performance in most of the statistical indicators, while only providing unsatisfactory assessment results for Cross-Entropy (CE) and the Peak Signal to Noise Ratio (PSNR).

Generalization Experiment
In addition, to verify the generalization performance of the proposed method, we also chose 21 pairs of infrared and visible images from VIFB [37,38] for further evaluation. Figure 10 offers a qualitative comparison between DDF and the other 10 fusion models. It can be observed that the fusion results of Densefuse [41], U2Fusion [45], FusionGAN [39], RFN-Nest [40], GFCE [44], MST-SR [46], and NSCT-SR [46] are significantly affected by radiation or lose much of the color information; among them, RFN-Nest [40] shows blurred edges and a loss of detail. IFCNN [42] and DDF achieve satisfactory fusion results with excellent visual effects; for example, the textural details of the distant buildings are greatly enhanced compared with the original visible image. Table 3 lists the mean values of the fusion metrics over the 21 image pairs of the VIFB [37,38] dataset. It can be observed that the performance of DDF here is mostly moderate. The reason is that DDF is trained on the scatterable landmine images (specifically, near-infrared and visible pairs), whereas the infrared images in VIFB [37,38] are mainly thermal infrared. How individual statistical metrics contribute to the final high-level visual task remains difficult to analyze and determine, but in the experimental study of this section, DDF performs better on some metrics and retains an excellent visual effect.

Detection Performance
Furthermore, we compared the detection performance of the aforementioned fusion models. For training and testing, we employed two NVIDIA GeForce RTX 3090 graphics cards. We randomly shuffled and divided the collected data into a training set and a testing set at a ratio of 8:2, with 3964 and 991 pairs, respectively.

Table 3. Quantitative comparisons of 13 metrics [37,38], i.e., Average Gradient (AG), Cross-Entropy (CE), Entropy Index (EI), Entropy (EN), Mutual Information (MI), Peak Signal-to-Noise Ratio (PSNR), Q_AB/F [47], Q_CB [48], Q_CV [49], Root Mean Square Error (RMSE), Spatial Frequency (SF), Structural Similarity Index Measure (SSIM), and Standard Deviation (SD), on 21 pairs of infrared and visible images from the VIFB [37,38] dataset. Blue indicates the best result. Red, orange, and green represent excellent, moderate, and poor performance, respectively, of the fused image based on DDF compared with other fusion algorithms. ↓ indicates that a lower result is better; otherwise, a higher value is better.
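The 8:2 shuffle-and-split described above can be reproduced with a simple index split; this sketch uses a hypothetical helper and a fixed seed for illustration, not the authors' actual code:

```python
import random

def split_dataset(n_pairs, train_ratio=0.8, seed=0):
    """Shuffle image-pair indices and split them into train/test sets."""
    idx = list(range(n_pairs))
    random.Random(seed).shuffle(idx)
    cut = round(n_pairs * train_ratio)
    return idx[:cut], idx[cut:]

# 4955 collected pairs -> 3964 for training, 991 for testing
train_idx, test_idx = split_dataset(3964 + 991)
print(len(train_idx), len(test_idx))  # 3964 991
```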

Landmine Detection Evaluation
To further validate the proposed DDF under vegetation cover, we compared its detection performance with that of the other fusion methods. We randomly selected an image pair from the testing set for landmine detection; Figure 11 shows the detection results based on the different fusion methods. Apart from the source images RGB and NIR, the first two rows correspond to deep-learning fusion methods, and the last row to conventional fusion methods. In the source images, most landmines are more clearly distinguishable in RGB, but a small number, such as T125 and M93 (shown magnified in the ground truth), have clearer outlines in the NIR image, which demonstrates the complementarity of the RGB and NIR images. Accordingly, most of the landmines, apart from the aforementioned T125, can be detected directly in RGB, outperforming most of the fusion-based algorithms. DDF successfully detects all landmines, even those covered with vegetation. It is worth noting that the detection of T125 is less precise, as the main body of its long strip is obscured by weeds. MST-SR [46] also obtains a high recall rate, but it may cause false alarms. NSCT-SR [46] detects most of the landmines, with the exception of the severely occluded B91.
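Recall figures like those discussed here are obtained by matching predicted boxes to ground-truth boxes at an IoU threshold (0.5 for mAP@0.5). The following plain-Python sketch of greedy matching uses hypothetical (x1, y1, x2, y2) box tuples; YOLOv5's own evaluation code is more elaborate:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def recall_at_iou(preds, gts, thr=0.5):
    """Fraction of ground-truth boxes matched by some unused prediction."""
    used, matched = set(), 0
    for g in gts:
        for i, p in enumerate(preds):
            if i not in used and iou(p, g) >= thr:
                used.add(i)
                matched += 1
                break
    return matched / len(gts) if gts else 1.0
```

An occluded mine that no fusion result renders distinctly simply never gets a matching prediction, which is how occlusion depresses recall.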

Figure 11. Qualitative comparisons of detection performance based on different fusion models (RFN-Nest, DDF, LatLRR, NSCT-SR, GFCE, and MST-SR) and the same detection model (YOLOv5 [30]). The source image NIR is not included in the comparison, as it lacks the color information needed for detection; it is therefore shown as ground truth.
Then, we compiled statistics over the detection results of all the images in the testing set. The detection evaluation metrics of the various fusion models are shown in Table 4, in which red and blue, respectively, denote the best and second-best results. It can be observed that DDF achieves the best performance in terms of precision, recall, and mAP@0.5. In terms of recall, the detection results of most fusion models are even worse than those of the single-spectrum visible images, which also indicates that existing fusion algorithms do not focus sufficiently on high-level visual tasks. The superior performance of DDF demonstrates the validity and feasibility of our methodology.

Next, we evaluated the inference efficiency of the different fusion models, which is shown in Table 5. The experiment was conducted twice: once on 20 pairs of randomly selected landmine images and once on the VIFB [37,38] dataset of 21 visible and infrared image pairs. Only one NVIDIA GeForce RTX 3090 graphics card was utilized for inference. Table 5 reports the average time consumed by the fusion process for each pair of images; the measurement covers data loading and image storage, while the time for model loading is not included. Since traditional algorithms are generally much slower than deep learning models, which require only a forward pass, the table is separated into two parts: fusion models based on traditional methods and those based on deep learning. Green highlights the highest time cost in each column, and red the lowest. On one hand, we can observe that the well-performing traditional algorithm LatLRR [43] is much more time-consuming than the others. On the other hand, except for CNN [52], used as a reference, the differences between the deep learning models are not pronounced. Meanwhile, the inference efficiency of our model is moderate, whereas the IFCNN [42] model, which performs poorly in terms of visual effect and the detection task, shows a slight advantage in inference efficiency.
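The per-pair timing protocol described here (data loading and image storage inside the measured loop, model loading outside) can be sketched as follows; `fuse` and `image_pairs` are illustrative placeholders rather than the paper's interface:

```python
import time

def average_fusion_time(fuse, image_pairs):
    """Average wall-clock seconds per image pair for a fusion callable.

    The model behind `fuse` should already be loaded: only the per-pair
    work (including any I/O done inside `fuse`) is timed.
    """
    start = time.perf_counter()
    for rgb, nir in image_pairs:
        fuse(rgb, nir)
    return (time.perf_counter() - start) / len(image_pairs)
```

Using a monotonic clock such as `time.perf_counter` avoids distortions from system clock adjustments during the run.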

Conclusions
In recent years, the application of deep-learning-based techniques to landmine detection has progressed rapidly. This paper investigates recent advances and suggests some opportunities for future research in this field. Specifically, we propose a UAV-based solution for the detection of scatterable landmines. We establish a dataset of scatterable landmines comprising RGB and NIR image pairs collected by a multispectral sensor, covering complex scenes. In addition, to fully exploit the multispectral potential for detection, we employ a task-oriented training framework for image fusion that allows high-level semantic information to flow back into the upstream fusion network, boosting the quality of the fused image while purposefully enhancing the detection performance. We have carried out extensive comparative experiments to verify the effectiveness of the fusion and detection methods that we propose. Our attempt at multispectral utilization for landmine detection paves the way for future advancement.
Overall, the obscuration and burial of landmines remain the most challenging problems that we confront. We have focused on the concealment and occlusion of landmines by vegetation when constructing the dataset. However, the simulated settings are still far from the natural minefield environment. The aging, corrosion, and burial of mines in real situations are time-consuming and complex to reproduce. Furthermore, the range of host conditions, such as snow, gravel pavement, and so on, needs to be greatly enlarged. In the future, we will supplement the dataset with more hard samples (primarily obscured by weeds, fallen leaves, sand, snow, and gravel or buried under shallow soils). We expect that the proposed dataset will encourage the development of better landmine detection methods, and our efforts will be of assistance and inspiration to demining organizations.