Abstract

The breaching of tailings pond dams may lead to casualties and environmental pollution; therefore, timely and accurate monitoring is an essential aspect of managing such structures and preventing accidents. Remote sensing technology is suitable for the regular extraction and monitoring of tailings pond information. However, traditional remote sensing is inefficient and unsuitable for the frequent extraction of large volumes of highly precise information. Object detection, based on deep learning, provides a solution to this problem. Most remote sensing imagery applications for tailings pond object detection using deep learning are based on computer vision, utilizing the true-color triple-band data of high spatial resolution imagery for information extraction. The advantage of remote sensing image data is their greater number of spectral bands (more than three), providing more abundant spectral information. There is a lack of research on fully harnessing multispectral band information to improve the detection precision of tailings ponds. Accordingly, using a sample dataset of tailings pond satellite images from the Gaofen-1 high-resolution Earth observation satellite, we improved the Faster R-CNN deep learning object detection model by increasing the inputs from three true-color bands to four multispectral bands. Moreover, we used the attention mechanism to recalibrate the input contributions. Subsequently, we used a step-by-step transfer learning method to improve and gradually train our model. The improved model could fully utilize the near-infrared (NIR) band information of the images to improve the precision of tailings pond detection. Compared with that of the three true-color band input models, the tailings pond detection average precision (AP) and recall notably improved in our model, with the AP increasing from 82.3% to 85.9% and recall increasing from 65.4% to 71.9%.
This research could serve as a reference for using multispectral band information from remote sensing images in the construction and application of deep learning models.


Introduction
Tailings ponds house tailings after mining and beneficiation. The term usually refers to a place to store metal and non-metal tailings or other industrial waste after ore separation, with a dam enclosing such a site constructed across a valley mouth or on flat terrain [1]. Because of the complex composition of tailings ponds, their constituent tailings and tailing water usually contain harmful elements. Consequently, leakages or dam breaks have grave consequences for downstream residents and the environment [2]. In recent years, frequent environmental disasters have been caused by tailings pond failures, underscoring the need to improve the detection precision and effectiveness for tailings pond targets in high-resolution remote sensing images.
To summarize, most remote sensing imagery applications for tailings pond object detection using deep learning are based on computer vision, utilizing the true-color triple-band data of high spatial resolution imagery for information extraction. However, the advantage of remote sensing image data is their greater number of spectral bands (more than three), providing more abundant spectral information. In a previous study [23], we proposed an improved Faster R-CNN model based on the three true-color bands of high-resolution remote sensing data to improve the detection precision of tailings ponds. Therefore, in this study, we propose an improved method based on Faster R-CNN and transfer learning to detect tailings ponds in Gaofen-1 (GF-1) satellite images, which have four multispectral bands. The experimental results showed that the proposed method could utilize the near-infrared (NIR) band of GF-1 images and exploit their rich spectral information to significantly improve the precision of tailings pond detection.

Data and Preprocessing
The GF-1 satellite, launched on 26 April 2013, is China's first high-resolution Earth observation satellite. The satellite is equipped with two 2 m panchromatic/8 m multispectral cameras and four 16 m multispectral cameras. In this study, we used the data from the 2 m panchromatic/8 m multispectral cameras to study tailings pond object detection. The specific index parameters are shown in Table 1 [17]. The acquired data were at the L1A processing level and required preprocessing, including radiometric calibration, orthorectification, and image fusion. First, we performed radiometric calibration on the original data. Subsequently, to eliminate image geometric distortion and improve geometric accuracy, we performed orthorectification based on the rational polynomial coefficients file of the image and the digital elevation model data of the corresponding area. Finally, we performed image fusion on the Pan band data and the blue, green, red, and NIR multispectral band data to generate multispectral image data with a spatial resolution of 2 m and containing four bands. These data were used as the tailings pond sample dataset.
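As a minimal sketch of the radiometric-calibration step, the DN-to-radiance conversion can be written as L = gain × DN + offset per band. The gain and offset values below are hypothetical placeholders, not the actual GF-1 calibration coefficients, which are published per sensor and year:

```python
import numpy as np

def calibrate(dn, gain, offset):
    """Convert raw digital numbers (DN) to top-of-atmosphere radiance,
    band by band, using L = gain * DN + offset."""
    dn = dn.astype(np.float64)
    # gain/offset hold one value per band; broadcast over the spatial axes
    return dn * gain[:, None, None] + offset[:, None, None]

# Hypothetical 4-band (blue, green, red, NIR) 16-bit image, 4 x 64 x 64 pixels
rng = np.random.default_rng(0)
dn = rng.integers(0, 1024, size=(4, 64, 64), dtype=np.uint16)
gain = np.array([0.19, 0.16, 0.17, 0.18])   # placeholder coefficients
offset = np.zeros(4)                        # placeholder offsets
radiance = calibrate(dn, gain, offset)
```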

Sampling Data Generation
We selected Hebei Province in China as the research area because of its substantial number of tailings ponds. Based on the GF-1 image data for this area, a total of 963 tailings pond samples were manually identified. The geographical distribution of the selected tailings pond samples is shown in Figure 1. The shape of a tailings pond depends on the natural landscape as well as artificial and engineering features [24]. Tailings ponds can be divided into four types based on a range of factors, such as their topography and geomorphology, the resource being mined, the mining technology employed, and the scale of operations, namely cross-valley type, hillside type, stockpile type, and cross-river type [17]. Cross-valley type tailings ponds refer to those formed by building a dam at the mouth of a valley. The main characteristics are that the initial dam is relatively short, and the reservoir area is long and deep. Hillside-type tailings ponds refer to those surrounded by a dam body built at the foot of a hill. The main characteristics of these tailings ponds are that the initial dam is relatively long, and the depth of the reservoir area is short. Stockpile type tailings ponds are formed by building a dam around materials on a gently sloping area. Such tailings ponds require significant work to create the initial dam as well as to subsequently fill the dam, and these dams are generally not very high. Cross-river type tailings ponds are formed by damming the upper and lower reaches of rivers. Their primary feature is a large upstream catchment area and a complex tailings pond and upstream drainage system [23]. Cross-river tailings ponds are rarely found in Hebei Province, and the tailings pond sample in this study does not include any of this type, i.e., it only includes cross-valley type, hillside type, and stockpile type tailings ponds. Annotated GF-1 true-color fused images showing the features of the various types of tailings ponds are shown in Figure 2.
This study identified a total of 963 tailings ponds, 80% of which were used as the training sample set and 20% as the test sample set. Before inputting the GF-1 image data, slicing had to be performed.
To maximally ensure the integrity of tailings ponds in the sample slices, and given the limitations of computing hardware, such as graphics processing unit memory, we set the sample slice size to 1024 × 1024 pixels during slicing. The overlap between adjacent slices was set to 128 pixels to augment the characteristics of tailings ponds and increase the number of sample slices. The GF-1 image data are 16 bit, and the sample slices were converted to 8 bits. After the above processing and the filtering of invalid slices, we obtained a tailings pond sample dataset of GF-1 images. The detailed information of the dataset is shown in Table 2.
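The slicing scheme above can be sketched as follows. This is a simplified illustration, not the authors' code; in particular, the linear 2% percentile stretch used for the 16-to-8-bit conversion is our assumption, since the paper does not specify the conversion method:

```python
import numpy as np

TILE, OVERLAP = 1024, 128
STRIDE = TILE - OVERLAP  # an 896-pixel step gives a 128-pixel overlap

def tile_origins(width, height):
    """Upper-left corners of 1024 x 1024 slices covering the image."""
    xs = list(range(0, max(width - TILE, 0) + 1, STRIDE))
    ys = list(range(0, max(height - TILE, 0) + 1, STRIDE))
    # make sure the right and bottom edges are covered
    if xs[-1] + TILE < width:
        xs.append(width - TILE)
    if ys[-1] + TILE < height:
        ys.append(height - TILE)
    return [(x, y) for y in ys for x in xs]

def to_8bit(band16):
    """Linear stretch of a 16-bit band to 8 bits (2%/98% clip assumed)."""
    lo, hi = np.percentile(band16, (2, 98))
    scaled = np.clip((band16.astype(np.float64) - lo) / max(hi - lo, 1), 0, 1)
    return (scaled * 255).astype(np.uint8)

origins = tile_origins(4096, 4096)
```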

Improved Faster R-CNN Model
In the field of computer vision, Faster R-CNN is a classic object detection model based on deep learning. The model has high recognition accuracy and efficiency when applied to large target areas and has been widely used for object detection from remote sensing images [10,12,23]. In this study, building on the research results in [23], we introduced an improved Faster R-CNN model to make full use of the multispectral band information of GF-1 images and improve the precision of tailings pond detection. The structure of our model is shown in Figure 3. (1) The model inputs are the four spectral bands of GF-1 image data, namely blue, green, red, and NIR. After being passed through the proposed feature extraction network, the model outputs are multi-layer features (C1, C2, C3, C4, and C5). (2) The feature pyramid network (FPN) fuses shallow and deep features, using the semantic information of deep features and the location information of shallow features to further improve the performance of the network [25].
The multi-layer features C2, C3, C4, and C5 are the FPN inputs for feature merging, with 256, 512, 1024, and 2048 channels, respectively. First, through a 1 × 1 convolution operation (Conv1 × 1), C2, C3, C4, and C5 were subjected to dimensionality reduction, with the corresponding outputs CC2, CC3, CC4, and CC5, and the number of channels for each output was set to 256. Using the nearest neighbor interpolation method, CC5, CC4, and CC3 were up-sampled twice (2× up), and element-wise addition (⊕) was performed with CC4, CC3, and CC2 to merge the features of different layers, with the corresponding outputs F4, F3, and F2 (256 channels). A 3 × 3 convolution (Conv3 × 3) was conducted on F2, F3, and F4, generating P2, P3, and P4 (256 channels), and Conv3 × 3 was conducted on CC5, generating P5 (256 channels).
Maximum pooling with a 1 × 1 kernel and a stride of 2 (Max_pool 1 × 1, s = 2) was conducted on P5, outputting P6 (256 channels). Feature merging was thus completed with the FPN, the final set of feature maps being {P2, P3, P4, P5, P6}. (3) The multi-scale feature maps {P2, P3, P4, P5, P6} were sent to the region proposal network (RPN), with the anchor areas set to {32², 64², 128², 256², 512²} pixels and the aspect ratios of the anchors set to {1:2, 1:1, 2:1}, to generate region proposals. (4) The region proposal feature maps needed to be sliced from {P2, P3, P4, P5}, and the following formula was used to select the most appropriate scale: k = ⌊k0 + log2(√(w × h)/H)⌋, where k is the feature map layer corresponding to the region proposal (rounded down during the calculation); k0 is the highest layer of the feature maps and, as there are four layers of feature maps in this study, we set k0 to 4; w and h represent the width and height of the region proposal, respectively; and H is the height and width of the model input. After ROI pooling, the proposal feature maps were sent to the subsequent fully connected (FC) layer to determine the object category and obtain the precise position of the bounding box. Based on the research results of [23], we made the following two improvements: (1) The feature extraction network was improved. The number of input channels was increased from three to four, the number of bands in GF-1 images. Unlike most studies, which use the attention mechanism to recalibrate the contribution of the extracted feature channels, we used the attention mechanism to recalibrate the contribution of the original four bands. Details of our methodology are presented in the "Proposed Feature Extraction Network" section below. (2) A step-by-step transfer learning method was adopted to gradually improve and train the model; details are presented in the "Transfer Learning" section below.
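The level-selection rule in step (4) can be illustrated with a short sketch. This is our reconstruction of the standard FPN assignment formula; the clamping of k to the valid layer range and the exact indexing of the four layers are our assumptions:

```python
import math

def assign_fpn_level(w, h, k0=4, H=1024, k_min=1, k_max=4):
    """Select the feature map layer k for a w x h region proposal:
    k = floor(k0 + log2(sqrt(w * h) / H)), clamped to the valid layers."""
    k = math.floor(k0 + math.log2(math.sqrt(w * h) / H))
    return max(k_min, min(k_max, k))

# A proposal covering the whole 1024 x 1024 input maps to the top layer,
# while smaller proposals are routed to shallower, higher-resolution layers.
print(assign_fpn_level(1024, 1024))  # 4
print(assign_fpn_level(128, 128))    # 1
```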

Proposed Feature Extraction Network
In recent years, the attention mechanism has been widely used in the field of deep learning to improve performance, and it has become an important concept in neural network models [26]. Essentially, the attention mechanism mimics the human brain by devoting more attention resources to the object area (i.e., the focus) to obtain more detailed information, while suppressing information from non-important areas. Mnih et al. [27] used the attention mechanism in a recurrent neural network model to improve the model's performance in image classification. Bahdanau et al. [28] used the attention mechanism in the field of natural language processing on a machine translation task to simultaneously translate and align. In the feature layer of a convolutional neural network, each channel represents a different feature, and these features differ in importance and in their contribution to the network performance. In object detection based on deep learning for remote sensing images, Li et al. [29] used the attention mechanism and Mask R-CNN, a convolutional neural network, to design an improved top-down FPN, which improved the detection precision of small objects with complex backgrounds in remote sensing images. Based on the YOLO network model [30], Hu et al. [31] proposed a more advanced small marine vessel object detection method using the attention mechanism of spatial and channel information. Squeeze-and-excitation networks [32] are based on the principle of the attention mechanism, automatically obtaining the importance of each feature channel. Based on this importance, features with a large contribution are promoted and features with a small contribution are suppressed, thereby improving the performance of image classification.
Rather than recalibrating the contribution of the extracted feature channels, our study designed an attention mechanism module for GF-1 image slices and recalibrated the contribution of the four bands of the input. With the addition of only a few parameters and calculations, we significantly improved the detection precision for tailings ponds. The structure of the module is shown in Figure 4. Image slice refers to the multi-band (blue, green, red, and NIR bands) data of GF-1 images. The slice size is 1024 × 1024 pixels, and the tensor size is 1024 × 1024 × 4. After applying global average pooling (GAP), the input tensor was compressed into a 1 × 1 × 4 one-dimensional vector. Each value in the vector has a global receptive field, representing the global distribution of responses on the corresponding input band. An FC layer was used for compression into a 1 × 1 × 2 one-dimensional vector. After applying the ReLU activation function, the second FC layer restored the number of channels to four, producing a 1 × 1 × 4 one-dimensional vector. The sigmoid function was subsequently used to obtain normalized weights, resulting in a 1 × 1 × 4 one-dimensional vector, with each value in the vector representing the contribution of the corresponding input band. Finally, in the scale step, the contribution weight of each band was broadcast to the dimensions of its corresponding band and multiplied by the input band to obtain the recalibrated features, which were used as the input of the subsequent feature extraction network.
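The SAMB computation can be sketched in a few lines of NumPy. This is a simplified illustration with randomly initialized weights and a smaller slice than the 1024 × 1024 used in the paper; the actual block is trained end-to-end, and only the 4 → 2 → 4 squeeze-and-excitation layout follows the description above:

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 2))  # first FC layer: squeeze 4 bands -> 2
W2 = rng.normal(size=(2, 4))  # second FC layer: restore 2 -> 4

def samb(x):
    """Slice attention mechanism block for an H x W x 4 image slice."""
    z = x.mean(axis=(0, 1))              # global average pooling -> (4,)
    h = np.maximum(z @ W1, 0)            # FC + ReLU -> (2,)
    s = 1.0 / (1.0 + np.exp(-(h @ W2)))  # FC + sigmoid -> (4,) band weights
    return x * s                         # scale: reweight each input band

x = rng.random((256, 256, 4))  # stand-in slice (paper uses 1024 x 1024 x 4)
y = samb(x)
```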
The proposed feature extraction network in this study comprised the slice attention mechanism block (SAMB) and the five convolution blocks of ResNet-101 [33]; the structure is shown in Figure 5. The four multispectral bands of the image slice served as the input of the model. After the contribution of each band was recalibrated using the SAMB, the output of the SAMB was sent to the subsequent network, where Conv1, Conv2_x, Conv3_x, Conv4_x, and Conv5_x represent the five convolution blocks of ResNet-101. In the network, the number of channels of the Conv1 convolution kernel was expanded from three to four, the size and number of feature output channels remained unchanged, and the structure of the remaining convolution blocks remained unchanged. Each convolution block corresponds to the output features at a different layer (C1, C2, C3, C4, and C5). In the case of C1, 64 × 256 × 256 means that the number of channels is 64 and the size is 256 × 256 pixels (the channels and sizes of the other layers are read in the same fashion). Layers C2, C3, C4, and C5 were used as input for the subsequent FPN.

Transfer Learning
Insufficient training samples and computing power are commonly encountered problems during machine learning tasks. In recent years, transfer learning has emerged as an important technique for overcoming these issues. The essence of transfer learning is the transfer and reuse of knowledge [13]. Transfer learning consists of two elements, namely domains and tasks [34]. A domain is the main body of learning and contains two elements: the sample feature space 𝒳 and the probability distribution P(X), where X = (x1, x2, ..., xn) ∈ 𝒳. Given a specific domain D = {𝒳, P(X)}, a task T consists of two parts, namely a label space Y = (y1, y2, ..., yn) and an objective predictive function f. The task data consist of pairs {xi, yi}, where xi ∈ X and yi ∈ Y, and the target predictive function f is used to predict the corresponding label f(x) of a new sample x. From the perspective of probability, f(x) can be considered the conditional probability P(y|x), and task T can be expressed as T = {Y, P(y|x)}. Given a source domain Ds with learning task Ts and a target domain Dt with learning task Tt, where Ds ≠ Dt or Ts ≠ Tt, transfer learning aims to use the knowledge of Ds and Ts to improve the learning of the predictive function fT in the target domain Dt [35]. Transfer learning methods can be divided into four types, namely instance transfer, feature representation transfer, parameter transfer, and relational knowledge transfer [36]. Instance transfer assumes that parts of the data in the source domain can be reused in the target domain by reweighting.
Feature representation transfer uses a good feature representation to reduce the difference between the source domain and the target domain and to reduce model errors. Relational knowledge transfer involves the mapping of relevant knowledge between the source domain and the target domain. The parameter transfer approach refers to the sharing of model parameters and prior knowledge between the source domain and the target domain, which is a commonly used transfer learning method in the field of deep learning.
Yosinski et al. [36] investigated the transferability of features in deep neural networks. Their results showed that initializing a deep neural network with pre-trained parameters and fine-tuning those parameters on a new task can better overcome differences in the data, improving the training efficiency and performance of the model.
The limited number of GF-1 tailings pond samples could lead to overfitting, and as we intended to use multispectral band information to improve the tailings pond detection precision, we adopted a step-by-step transfer learning method to gradually improve and train the model. In the transfer learning method of our initial research, an ImageNet dataset was used as the source domain and the GF-1 four-band sample dataset was used as the target domain. The five convolution blocks (Conv1, Conv2_x, Conv3_x, Conv4_x, and Conv5_x) of ResNet-101 pre-trained on the source domain were used as the feature extraction network of the target model. The number of band inputs was increased from three to four. The number of channels of the convolution kernel Conv1 was increased from three to four, whereas the size and number of feature output channels remained unchanged. The first three channels of the convolution kernel Conv1 were initialized with pre-trained parameters, and the new fourth channel was initialized with the pre-trained third-channel parameters. The rest of the feature extraction network was initialized with pre-trained parameters. The target model was trained and tested on the target domain and named Bands_4. Our results showed that although the Bands_4 model included the additional NIR band information, the tailings pond detection precision of the model did not significantly improve.
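The Conv1 channel expansion described above can be sketched as follows. Randomly generated weights stand in for the pre-trained ImageNet parameters; the (64, 3, 7, 7) kernel shape matches ResNet-101's first convolution:

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for the pre-trained Conv1 weights: (out_channels, in_channels, kH, kW)
conv1_rgb = rng.normal(size=(64, 3, 7, 7))

# Expand the input channels from three to four: keep the pre-trained RGB
# channels as-is and initialize the new fourth (NIR) channel with a copy
# of the pre-trained third-channel parameters.
conv1_4band = np.concatenate([conv1_rgb, conv1_rgb[:, 2:3]], axis=1)
```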
To address this failure, we adopted a step-by-step transfer learning method to gradually improve and train the model. Only the feature extraction network was improved, whereas the other parts of the model remained unchanged. The process was as follows: (1) An ImageNet dataset was used as the source domain, and a GF-1 true-color sample dataset (only the three true-color bands of the GF-1 sample data) was used as the target domain. The five convolution blocks (Conv1, Conv2_x, Conv3_x, Conv4_x, and Conv5_x) of ResNet-101 pre-trained on the source domain were used as the feature extraction network of the target model, and initialization was conducted with the pre-trained parameters.
Transfer learning methods could be divided into four types, namely instance transfer, feature representation transfer, parameter transfer, and relational knowledge transfer [36]. Instance transfer assumes that parts of the data in the source domain could be reused in the target domain by reweighting. Feature representation transfer uses a good feature representation to reduce the difference between the source domain and the target domain and model errors. Relational knowledge transfer involves the mapping of relevant knowledge between the source domain and the target domain. The parameter transfer approach refers to the sharing of model parameters and prior knowledge between the source domain and the target domain, which is a quite commonly used transfer learning method in the field of deep learning.
Yosinski et al. [36] investigated the transferability of features in deep neural networks. Their results showed that initializing a model with the parameters of a trained deep neural network and fine-tuning those parameters on a new task could better overcome differences in the data, improving the training efficiency and performance of the model.
The limited number of GF-1 tailings pond samples could lead to overfitting. Because we intended to use multispectral band information to improve the tailings pond detection precision, we adopted a step-by-step transfer learning method to gradually improve and train the model. In the transfer learning method of our initial research, an ImageNet dataset was used as the source domain and the GF-1 four-band sample dataset was used as the target domain. The five convolution blocks (Conv1, Conv2_x, Conv3_x, Conv4_x, and Conv5_x) of the source domain pre-trained ResNet-101 were used as the feature extraction network of the target model. The number of band inputs was increased from three to four: the number of channels of the convolution kernel Conv1 was increased from three to four, whereas the size and number of feature output channels remained unchanged. The first three channels of the convolution kernel Conv1 were initialized with pre-trained parameters, and the new fourth channel was initialized with the pre-trained third channel parameters. The rest of the feature extraction network was initialized with pre-trained parameters. The target model was trained and tested on the target domain and named Bands_4. Our results showed that although the Bands_4 model included the additional NIR band information, its tailings pond detection precision did not significantly improve.
To address this failure, we adopted a step-by-step transfer learning method to gradually improve and train the model. Only the feature extraction network was improved, whereas the other parts of the model remained unchanged. The process was as follows: (1) An ImageNet dataset was used as the source domain, and a GF-1 true-color sample dataset (only the three true-color bands of the GF-1 sample data) was used as the target domain. The five convolution blocks (Conv1, Conv2_x, Conv3_x, Conv4_x, and Conv5_x) of the source domain pre-trained ResNet-101 were used as the feature extraction network of the target model, and initialization was conducted with pre-trained parameters. The training and testing of the target model were completed based on the target domain. The target model in this step was named Bands_3.
(2) The GF-1 true-color sample dataset was used as the source domain, and the GF-1 four-band sample dataset was used as the target domain. The number of input bands of the feature extraction network of the source domain pre-trained model Bands_3 was increased from three to four. The number of channels of the convolution kernel Conv1 was also increased from three to four, with the size and number of feature output channels unchanged, and this network served as the feature extraction network of the target model. In the feature extraction network of the target model, the newly added fourth channel of Conv1 was initialized with the pre-trained third channel parameters, and the rest was initialized with the corresponding pre-trained parameters. We compared the use of the pre-trained third channel parameters, second channel parameters, first channel parameters, and the average of the three channel parameters to initialize the newly added fourth channel of Conv1, and found that using the third channel parameters yielded the highest precision. The training and testing of the target model were completed based on the target domain. The target model in this step was named Bands_4_sub.
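The Conv1 weight expansion in step (2) can be sketched as follows. This is an illustrative NumPy sketch, not the framework code used in the study; `inflate_conv1` is a hypothetical helper name, and random values stand in for the actual pre-trained ResNet-101 weights:

```python
import numpy as np

# Stand-in for the pre-trained Conv1 weights of ResNet-101:
# 64 filters, 3 input channels (RGB), 7x7 kernels.
rng = np.random.default_rng(0)
pretrained_conv1 = rng.standard_normal((64, 3, 7, 7))

def inflate_conv1(weights, new_in_channels=4, copy_from=2):
    """Expand a conv kernel from 3 to 4 input channels.

    The first three channels keep the pre-trained parameters; the new
    fourth (NIR) channel is initialized from the third channel (index 2),
    the strategy that gave the highest precision in the comparison above.
    """
    out_c, in_c, kh, kw = weights.shape
    inflated = np.zeros((out_c, new_in_channels, kh, kw), dtype=weights.dtype)
    inflated[:, :in_c] = weights                              # copy RGB channels
    inflated[:, in_c:] = weights[:, copy_from:copy_from + 1]  # NIR from 3rd channel
    return inflated

conv1_4band = inflate_conv1(pretrained_conv1)  # shape (64, 4, 7, 7)
```

Because only Conv1 touches the input bands, the remaining convolution blocks can be transferred unchanged.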
(3) The GF-1 four-band sample dataset was used as both the source domain and the target domain. We transferred the feature extraction network and parameters of the pre-trained Bands_4_sub model and added a SAMB module to improve it; this served as the feature extraction network of the target model. The training and testing of the target model were completed based on the target domain. The target model in this step was named Bands_4_sub_SAMB.
During our research, Kaiming normal initialization was used for model parameters (including SAMB), except for the feature extraction network, for which we used pre-trained model parameters. The steps of the transfer learning process are shown in Figure 6.
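Kaiming normal initialization draws weights from a zero-mean normal distribution with standard deviation sqrt(2/fan_in). A minimal NumPy sketch of the scheme (the study itself would use the deep learning framework's built-in initializer):

```python
import numpy as np

def kaiming_normal(shape, rng):
    """Kaiming (He) normal initialization for a conv kernel of shape
    (out_channels, in_channels, kh, kw): std = sqrt(2 / fan_in)."""
    fan_in = int(np.prod(shape[1:]))  # in_channels * kh * kw
    return rng.standard_normal(shape) * np.sqrt(2.0 / fan_in)

w = kaiming_normal((256, 64, 3, 3), np.random.default_rng(0))
```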
Figure 6. Step-by-step transfer learning.

Accuracy Assessment
When evaluating object detection results, the ground truth bounding box (GT) is the true bounding box of the target, whereas the predicted bounding box (PT) is the box output by the model for that target. The total area encompassed by the predicted bounding box and the ground truth is denoted as the area of union, their intersection is denoted as the area of overlap, and the intersection over union (IOU) is calculated as follows:

IOU = Area of Overlap / Area of Union
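The IOU computation translates directly into code; a minimal Python sketch, assuming boxes are represented as (x_min, y_min, x_max, y_max) tuples:

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes
    given as (x_min, y_min, x_max, y_max)."""
    # Corners of the intersection rectangle (empty if boxes are disjoint).
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    overlap = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - overlap
    return overlap / union if union > 0 else 0.0
```

For example, two 2x2 boxes offset by one unit in each direction overlap in a 1x1 square, giving IOU = 1/7.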

where TP (true positive) refers to the number of detection boxes with correct detection results and an IOU > 0.5; false positive (FP) refers to the number of detection boxes with incorrect detection results and an IOU ≤ 0.5; and false negative (FN) refers to the number of GTs that are not detected. The model evaluation indicators used in this study were precision and recall. Precision refers to the ratio of the number of correct detection boxes to the total number of detection boxes, whereas recall refers to the ratio of the number of correct detection boxes to the total number of true bounding boxes. Their corresponding calculation formulas are as follows:

Precision = TP / (TP + FP)

Recall = TP / (TP + FN)

The average precision (AP) of the target, precision-recall curve (PRC), and mean average precision (mAP) are three common indicators widely applied to evaluate the performance of object detection methods. AP is typically the area under the PRC, and mAP is the average of the AP values over all classes; the larger the mAP value, the better the object detection performance. As this study only detects one target class, namely tailings ponds, AP was used as the main model evaluation indicator, with the recall and time consumption of a single iteration used as reference indicators [23].
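These definitions can be sketched in plain Python. In this illustrative sketch, detections are assumed to be pre-sorted by descending confidence and pre-flagged as TP/FP against the 0.5 IOU threshold; AP is taken as the area under the stepwise precision-recall curve:

```python
def precision(tp, fp):
    """Correct detections / all detections."""
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    """Correct detections / all ground truth boxes."""
    return tp / (tp + fn) if tp + fn else 0.0

def average_precision(is_tp_flags, n_gt):
    """AP as the area under the precision-recall curve.

    is_tp_flags: booleans (True = TP), in descending-confidence order.
    n_gt: total number of ground truth boxes.
    """
    ap, tp, fp, prev_recall = 0.0, 0, 0, 0.0
    for flag in is_tp_flags:
        tp, fp = tp + flag, fp + (not flag)
        r = tp / n_gt
        ap += (r - prev_recall) * (tp / (tp + fp))  # step integration
        prev_recall = r
    return ap
```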

Loss Function
In this study, we used the loss function of Faster R-CNN, which can be expressed as follows [11]:

L({pi}, {ti}) = (1/Ncls) Σi Lcls(pi, pi*) + α (1/Nreg) Σi pi* Lreg(ti, ti*)

where Ncls represents the number of anchors in the mini-batch, Nreg represents the number of anchor locations, α represents the weight balance parameter (set to 10 in this study), and i represents the index of an anchor in a mini-batch. Furthermore, pi represents the predictive classification probability of the anchor: when the anchor is positive, pi* = 1, and when it is negative, pi* = 0. Anchors that met either of the following two conditions were considered positive: (1) the anchor has the highest intersection-over-union (IOU) overlap with a ground truth box; or (2) the IOU overlap of the anchor with a ground truth box is >0.7. Conversely, when the IOU overlap of the anchor with every ground truth box was <0.3, the anchor was considered negative. Anchors that were neither positive nor negative were not included in the training. For the bounding box regression, we adopted the parameterization of four coordinates, defined as follows:

tx = (x − xa)/wa, ty = (y − ya)/ha, tw = log(w/wa), th = log(h/ha)
tx* = (x* − xa)/wa, ty* = (y* − ya)/ha, tw* = log(w*/wa), th* = log(h*/ha)

where x and y represent the coordinates of the center of the bounding box, and w and h represent the width and height of the bounding box, respectively. Furthermore, x, xa, and x* correspond to the predicted box, anchor box, and ground truth box, respectively (likewise for y, w, and h).
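The coordinate parameterization translates directly into code. A small sketch with hypothetical `encode`/`decode` helpers, assuming boxes are center-size tuples (x, y, w, h):

```python
import math

def encode(box, anchor):
    """Regression targets (tx, ty, tw, th) for a box relative to an
    anchor; both given as (x_center, y_center, width, height)."""
    x, y, w, h = box
    xa, ya, wa, ha = anchor
    return ((x - xa) / wa, (y - ya) / ha,
            math.log(w / wa), math.log(h / ha))

def decode(t, anchor):
    """Invert encode(): recover the box from regression offsets."""
    tx, ty, tw, th = t
    xa, ya, wa, ha = anchor
    return (tx * wa + xa, ty * ha + ya,
            wa * math.exp(tw), ha * math.exp(th))
```

Encoding a ground truth box against its anchor yields the targets t*, and decoding the network's predicted offsets yields the final predicted box, so the two functions are exact inverses.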

Training Environment and Optimization
The network was trained on a 64-bit Ubuntu 20.04 LTS operating system with an NVIDIA GeForce RTX 3080 GPU, a Xeon E5 CPU, and CUDA version 11.1. The model was trained for 30 epochs on the training set. Stochastic gradient descent was used as the optimizer; the initial learning rate was set to 0.02, momentum to 0.9, weight decay to 0.0001, and the batch size to 2.

Results
Based on a sample dataset of tailings pond images from the GF-1 satellite, we improved the Faster R-CNN object detection model. We expanded the input from the three true-color bands to four multispectral bands and used the attention mechanism to recalibrate the contribution of the model input bands. Model improvement and training were gradually completed using a step-by-step transfer learning method.
The training loss curves for the different models shown in Figure 7 indicate that the curvilinear trends of Bands_3 and Bands_4 are remarkably similar, and the curvilinear trends of Bands4_sub and Bands4_sub_SAMB are quite similar, with excellent convergence in both cases. However, the loss values of the latter were significantly lower than those of the former.

The test precision curves for the various models shown in Figure 8 indicate that the curvilinear trends of Bands_3 and Bands_4 are quite similar, with gradually increasing and then stabilizing test precision. However, comparing Bands_3 and Bands_4 after NIR was added to the model indicated that the tailings pond detection precision did not significantly improve. It is evident from the Bands_4_sub test precision curve that the initial test precision value was relatively high, and it rapidly increased further before stabilizing, reflecting the fact that the transfer learning method significantly improved the model training efficiency.
Bands_4_sub_SAMB used the attention mechanism to recalibrate the contribution of the original remote sensing image bands in the model. Its initial test precision value was higher still, and it quickly rose and reached a stable state. Compared with that of the Bands_4_sub model, a more notable improvement was shown in the precision of tailings pond detection. Table 3 shows the specific model evaluation values. The Bands_3 model has an AP of 82.3%, and the Bands_4 model has a value of 82.5%, indicating that the latter did not effectively utilize the NIR band information to improve detection precision. After using the step-by-step transfer learning method, compared with those of the Bands_3 model, the AP and recall of the Bands_4_sub model greatly improved, increasing from 82.3% to 84.9% (up by 2.6%) and from 65.4% to 70.5% (up by 5.1%), respectively. The Bands_4_sub_SAMB model added the SAMB module to the feature extraction network of the Bands_4_sub model and recalibrated the contribution of the four bands in the image slices to further improve performance. The AP and recall increased by a further 1.0% and 1.4%, reaching 85.9% and 71.9%, respectively.
Looking at specific predicted tailings pond targets in Figure 9, (a) is the prediction result of the Bands_3 model, (b) is the prediction result of the Bands_4 model, (c) is the prediction result of the Bands_4_sub model, and (d) is the prediction result of the Bands_4_sub_SAMB model. In the images (as well as in Figures 10 and 11), the red bounding box represents the GT of the tailings pond and the green bounding box represents the PT of the tailings pond. In (a), the model predicted two different bounding boxes for one tailings pond, showing poor accuracy. In (b), (c), and (d), the models predicted one bounding box for the one tailings pond, and the accuracy of the PTs gradually improved.

Figure 10 shows that in (a) and (b), the tailings pond target was missed, but in (c) and (d), such missed detection was avoided. In addition, compared with the GT, the PT in (d) was significantly superior to that in (c). In Figure 12, (a)-(d) are feature heat maps of the corresponding images in Figure 10; with the improvement of the model, the characteristics of the tailings pond become gradually more obvious and prominent, and the predictions are correspondingly better, which is consistent with the results in Figure 10.

Figure 11 shows that in (a), the tailings pond target in the top left of the image slice had multiple prediction results with poor accuracy, with some mistakenly detecting the shadow of the mountain and snow as part of the tailings pond. The smaller tailings pond target in the top right of the image slice had only one prediction result, but its accuracy was poor. In (b), the prediction result of the top-left tailings pond improved, but the prediction result of the smaller top-right tailings pond was worse than before, with multiple prediction bounding boxes appearing that showed poor accuracy. In (c) and (d), all the target prediction results significantly improved. In the comparison of (d) with (c), although the smaller top-right tailings pond had two prediction results in (d), one of the prediction results was more consistent with the GT; therefore, it showed greater accuracy.

Figure 13 shows that in (a), the red bounding box represents the GT of the small tailings pond, and (b) is the prediction result of all models. For some small tailings ponds, the improved method still cannot identify them. This may be because the tailings pond area is small, so the additional NIR band information contributes little to the characteristic information. We will continue to address this problem in future studies.

Discussion
Remote sensing image object detection, based on deep learning, primarily uses models from the field of computer vision. Transfer learning, hyperparameter adjustment, attention mechanisms, and other advanced deep learning techniques are used to improve models so that they adapt to the features of a target, enabling the highly precise and intelligent extraction of information. In the specific case of tailings ponds, most methods are based on improved deep learning object detection models and use the information of the three true-color bands from high-resolution remote sensing imagery to intelligently extract tailings pond information. An advantage of remote sensing image data, however, is that it has more than three spectral bands, yet few studies have considered how to utilize the rich information contained in the many spectral bands of remote sensing imagery to improve the precision of tailings pond detection. To address this deficiency, using a sample dataset of tailings pond images from the Gaofen-1 satellite, our study improved the deep learning object detection model Faster R-CNN by increasing its inputs from three true-color bands to four multispectral bands (adding the NIR band) and by using an attention mechanism to recalibrate the contributions of the four multispectral bands in the model. Subsequently, we used a step-by-step transfer learning method to gradually improve and train the model.
The training loss curves for the different models (Figure 7) show that Bands_4_sub and Bands_4_sub_SAMB have lower loss values than Bands_3 and Bands_4. The test precision curves (Figure 8) and the evaluation values of the various models (Table 3) show that the precisions of the Bands_3 and Bands_4 models are similar, indicating that the Bands_4 model could not fully utilize the NIR band information. By contrast, the precision of the Bands_4_sub model significantly improved compared with those of Bands_3 and Bands_4, indicating that the step-by-step transfer learning method could effectively utilize NIR band information to significantly improve detection precision. The precision of the Bands_4_sub_SAMB model improved compared with that of Bands_4_sub, indicating that recalibrating the contributions of the four input bands further improved performance. The prediction results of the different models also show that the proposed method can improve the accuracy of tailings pond detection and reduce mistaken and missed detections (Figures 9-12).

Discussion
Remote sensing image object detection, based on deep learning, primarily uses models from the field of computer vision. Transfer learning, hyperparameter adjustment, attention mechanisms, and other advanced deep learning techniques are used to improve models that adapt to the features of a target, enabling the highly precise and intelligent extraction of information. In the specific case of tailings ponds, most methods are based on improved deep learning object detection models and use information of the three true-color bands from high-resolution remote sensing imagery to intelligently extract tailings pond information. An advantage of remote sensing image data, however, is that it has more than three spectral bands, but few studies have considered how to utilize the rich information contained in the many spectral bands of remote sensing imagery to improve the precision of tailings ponds detection. To address this deficiency, using a sample dataset of tailings pond images from the Gaofen-1 satellite, our study has improved the deep learning object detection model known as Faster R-CNN by increasing its inputs from three true-color bands to four multispectral bands (adding the NIR band) as well as an attention mechanism to recalibrate the contributions of the four multispectral bands in the model. Subsequently, we used a step-by-step transfer learning method to gradually improve and train the model.
Although the attention mechanism has been widely used in deep learning and remote sensing applications, it has mostly been applied to recalibrate extracted feature channels; there are few examples of applying it to the original multispectral bands of remote sensing data. In this study, we adopted this approach and improved detection precision with only a minor increase in parameters and computation (e.g., the single-iteration time increased by only 0.095 s). Similarly, in contrast to the common transfer learning method, we adopted a step-by-step transfer learning method to fully utilize the multispectral band information of remote sensing images and improve model performance.
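Applying channel attention to the raw bands rather than to deep feature maps can be sketched with a small squeeze-and-excitation-style gate placed before the backbone. The module below is an illustrative assumption of this design (the class name and hidden-layer size are ours, not the paper's): it learns one scalar weight per input band and rescales the four bands before they enter the network.

```python
import torch
import torch.nn as nn

class BandAttention(nn.Module):
    """Squeeze-and-excitation-style gate over the raw input bands.

    Recalibrates the contribution of each multispectral band before the
    backbone (a sketch of the idea; layer sizes are assumptions).
    """
    def __init__(self, bands: int = 4, hidden: int = 8):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)   # squeeze: global mean per band
        self.fc = nn.Sequential(              # excitation: one weight per band
            nn.Linear(bands, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, bands),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                          # reweighted bands, same shape

x = torch.randn(2, 4, 128, 128)               # batch of 4-band patches
out = BandAttention()(x)
print(out.shape)                              # torch.Size([2, 4, 128, 128])
```

Because the gate operates on only four channels, it adds a handful of parameters, which is consistent with the near-zero computational overhead reported above.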
The experimental results showed that the improved model makes full use of the NIR band information of GF-1 images to improve tailings pond detection precision. Compared with the triple-band input model, the AP and recall of tailings pond detection improved significantly, with AP increasing from 82.3% to 85.9% and recall increasing from 65.4% to 71.9%. These results indicate that the proposed model and method can utilize the multispectral band information of GF-1 images to improve object detection precision for tailings ponds.

Conclusions
In this study, we used a sample dataset of tailings pond images from the GF-1 high-resolution Earth observation satellite. We improved the Faster R-CNN object detection model by increasing its inputs from the three true-color bands to four multispectral bands and by using an attention mechanism to recalibrate their input contributions. A step-by-step transfer learning method was subsequently used to gradually improve and train the model. The experimental results showed that the improved model fully utilizes the NIR band information of the GF-1 images to improve tailings pond detection precision. Compared with the three true-color band input model, tailings pond detection AP and recall notably improved, with AP increasing from 82.3% to 85.9% (up by 3.6 percentage points) and recall increasing from 65.4% to 71.9% (up by 6.5 percentage points). The proposed model and method thus made full use of the multispectral band information of GF-1 images to improve the accuracy of tailings pond identification. With the rapid development of remote sensing technology, image data are becoming increasingly available at higher spatial resolutions and with a greater number of spectral bands. Our research could serve as a reference for using the multispectral band information of remote sensing images in the construction and application of deep learning models. In future work, we will study deep learning models based on multispectral remote sensing images to extract tailings pond boundaries and determine tailings mineral types, and we will investigate the use of synthetic aperture radar data for tailings pond monitoring.