1. Introduction
Change detection (CD) is one of the major tasks in remote sensing (RS). It identifies semantic changes in bi-temporal images by comparing satellite remote sensing images of the same region acquired at different times and assigning a binary label to each pixel in the area. Change detection using remote sensing images has been widely used in various fields and plays a crucial role in urban area expansion research [1,2], land use change analysis [3,4], forest vegetation cover monitoring [5,6,7] and natural disaster damage assessment [8,9]. In recent years, with the rapid development of small unmanned aerial vehicles (SUAVs) and the maturity of UAV low-altitude remote sensing technology, more and more researchers have turned to SUAV-based imaging. SUAV low-altitude remote sensing is convenient, real-time and highly maneuverable, and many change detection studies use bi-temporal SUAV low-altitude remote sensing images in their experiments. However, due to wind changes and positioning system errors during flight, the viewing angle changes between acquisitions, so the bi-temporal images captured by SUAVs over the same area differ by an affine transformation. Current studies often apply a separate registration method to the bi-temporal images before performing change detection, which not only adds an extra processing stage but also makes the quality of the change detection results directly dependent on the quality of the registration.
Traditional registration methods apply feature descriptors to image pairs and use the nearest neighbor criterion to globally match keypoints and obtain pixel-to-pixel correspondences. However, these methods cannot accurately extract the correspondence between pairs of bi-temporal remote sensing images, because such image pairs contain not only viewpoint changes but also color changes due to season and semantic changes due to changes in ground objects. Traditional methods cannot use semantic information to extract and match keypoints between image pairs containing these changes. Owing to the strong performance of convolutional neural networks (CNNs) in extracting abstract semantic information from images, many recent studies have used CNNs to extract semantic features of bi-temporal remote sensing images and perform registration by comparing pixels in feature space [10]. Building on CNNs, recent optical flow methods exploit local correlation layers to estimate correspondences between image semantics and have achieved great success in predicting accurate pixel-level displacements. Nevertheless, the optical flow method only evaluates similarity in a local area around each pixel coordinate. It is therefore only suitable for small displacements, cannot capture large changes in viewpoint and distance, and performs poorly in SUAV low-altitude remote sensing image registration tasks.
Assuming that the image pairs have been registered, traditional pixel-based and object-based change detection methods are ineffective for increasingly complex semantic changes, and more and more research focuses on change detection using CNNs. U-Net, a network designed for image segmentation, shows outstanding performance in change detection and has established an encoder–decoder benchmark structure for subsequent change detection networks. Meanwhile, extracting bi-temporal image features with a Siamese network [11,12,13,14,15,16,17,18,19,20,21,22] is widely used as a standard step in change detection. To improve detection performance, some methods use a feature pyramid to extract multi-scale features during the down-sampling process of the encoder, which enriches the feature expression during up-sampling. Other methods apply attention mechanisms (channel attention and spatial attention) during encoder down-sampling and decoder up-sampling to obtain better feature representations, since attention-based methods are effective in establishing global information. Although these methods achieve good results, the continuous down-sampling and up-sampling in encoder–decoder change detection networks cause the loss of accurate location information from the shallow layers of the network, which leads to blurred edges of the change areas and missed detection of small change areas.
In this paper, we propose a convolutional neural network architecture, called RegCD-Net, for end-to-end change detection in the bi-temporal SUAV low-altitude remote sensing images. We build a multi-task CNN that implements bi-temporal low-altitude remote sensing image registration and subsequent change detection in a single network and it can be trained end-to-end. We use a combination of local and global correlation to solve the problem that bi-temporal low-altitude remote sensing image registration using the optical flow method performs poorly under large viewpoint changes and large displacement, and use the optical flow pyramid for layer-by-layer optimization. In order to solve the problem of loss of location information caused by continuous down-sampling in change detection, we use nested connections to combine deep semantic information and shallow location information and use the attention mechanism to perform network deep supervision and optimize feature representation.
The main contributions of this paper are as follows:
- (1) We propose an end-to-end CNN architecture, RegCD-Net, which integrates registration and change detection in a single network, to achieve registration and change detection in bi-temporal SUAV low-altitude remote sensing images.
- (2) We integrate global and local correlations to generate an optical flow pyramid and realize image registration through layer-by-layer optical flow fields.
- (3) We utilize nested connections to combine effective information in different layers and perform deep supervision through a combined attention module to achieve change detection.
- (4) We propose a method for generating a change detection dataset with viewing angle changes using optical flow fields and generate a bi-temporal SUAV low-altitude remote sensing dataset for change detection in the garbage-scattered areas of nature reserves.
The remainder of this paper is organized as follows. Section 2 reviews the related work. Section 3 describes the proposed network architecture in detail. The experiments designed to evaluate our method are presented in Section 4. The results of the comparison experiments are discussed in Section 5 and the ablation studies are shown in Section 6. Finally, the discussion and conclusions are presented in Section 7 and Section 8.
2. Related Work
Change detection is an important task in computer vision and a critical method in remote sensing image analysis. Before it can be used to detect changes in bi-temporal low-altitude remote sensing image pairs, the image pairs must first be registered. In recent years, CNNs have transformed most fields of computer vision, and CNN-based methods have improved greatly on traditional methods. Here, we focus on CNN-based methods for image registration and change detection, as these are most relevant to our work.
The feasibility of using CNNs for image registration originated from the Spatial Transformer Network, a fully convolutional neural network built by Jaderberg et al. [23] for handwritten letter correction. However, its structure is too simple: it can only predict the deformation field of images with simple semantics and is not suited to realistic image registration. The motion relationship between bi-temporal images of the same real scene can also be regarded as the change (viewpoint change and semantic change) between the two images, and this change can be represented by an optical flow map. The optical flow map describes the change field between the two images, so the two images can be registered using this field. Based on U-Net [24], Dosovitskiy et al. [25] proposed the first optical flow estimation network, FlowNet, which directly estimates the optical flow between the original and target images using a local correlation layer, providing strong clues for image registration. Ilg et al. [26] stacked several basic FlowNet models into a concatenated network, FlowNet2, which uses correlation layers in a coarse-to-fine manner to estimate the similarity within the neighborhood of the center pixel and uses the intermediate optical flow to warp the target image for registration. Ranjan et al. [27] introduced SpyNet, an optical flow estimation network combined with a spatial image pyramid, which warps the target image at each pyramid level with the current optical flow and computes an updated optical flow to estimate the displacement between images layer by layer. PWC-Net, proposed by Sun et al. [28], combines the advantages of the above networks (a spatial image pyramid, layer-by-layer warping with the intermediate optical flow and a correlation cost volume) in a small and efficient optical flow estimation network.
Although the above-mentioned optical flow estimation models perform well when the deformation between images is small, they cannot cope with obvious deformations, large displacements and significant differences in visual appearance (semantic changes) between images. To address this, Melekhov et al. [29] proposed DGC-Net, which exploits global correlation layers [30] to extract similarities between deep features and generates dense 2D correspondences, solving the optical flow prediction problem under strong geometric transformations between images. However, the network builds its global cost volume at the coarsest resolution, which limits its accuracy in estimating small pixel displacements in high-resolution images.
In the field of change detection using CNNs, U-Net, proposed by Ronneberger et al. [24] for image segmentation, has shown outstanding performance in the change detection task due to its ability to extract deeper feature information, and it established an encoder–decoder benchmark structure for subsequent change detection research. Subsequently, Xiao et al. [31] and Guan et al. [32] applied the connection schemes of ResNet [33] and DenseNet [34] to U-Net and proposed Res-Unet and FD-Unet, respectively, which further enhanced the feature extraction ability of the original U-Net. Although change detection models based on U-Net and its variants [18,35] can accurately predict the change areas, the encoding and decoding process of U-Net is too direct and simple, and the accurate location information in shallow layers is often lost during successive down-sampling and up-sampling, leading to blurred change area edges and missed detection of small change areas.
To extract more accurate location features, some studies applied feature pyramids [15,17]; for example, STA-Net, proposed by Bi et al. [36], uses a feature pyramid to extract multi-scale features in the down-sampling process, enriching the representation of location features during up-sampling. Other studies adopted attention mechanisms [11,12,13,14,16,21]; for example, Zhang et al. [37] proposed IFN-Net by adding CBAM [38] to the change detection network and using it to fuse features in the up-sampling process, which improves the accuracy of boundary reconstruction in the detection results. Zhou et al. [39] proposed UNet++, which improves region edge segmentation accuracy by adding nested dense connections and skip-path convolutional layers between the encoder and decoder, and adds a deep supervision mechanism that draws on the outputs of different decoding layers. Building on this, Fang et al. [40] proposed SNUNet-CD, a change detection network based on UNet++. The network uses a Siamese network to extract bi-temporal image features, feeds these features into a densely connected encoder–decoder structure to output deeply supervised features at different layers, and finally uses an attention mechanism to filter the features for output. However, the network requires well-registered bi-temporal images as input, and poorly registered images lead it to produce more non-semantic change areas.
3. Materials and Methods
In this work, our goal is to detect the garbage-scattered areas in the nature reserves by means of change detection. We plan the flight paths of multiple SUAVs through the multi-UAV collaboration platform, realize the capture of bi-temporal images of the large-scale ground in the nature reserve and use the CNN to accomplish the detection of changes in the garbage-scattered areas in the bi-temporal images.
In the classic remote sensing change detection task using CNNs, the bi-temporal images usually need to be pre-registered to ensure that the perspectives of the two images are consistent before change detection can be performed. There, registration and change detection are two separate processes that are accomplished by different networks. In contrast, in this work we treat registration and change detection as two consecutive tasks in the same network and propose an end-to-end network that addresses both the registration preprocessing of bi-temporal images and the subsequent change detection.
The architecture of the proposed network is presented in Section 3.1. In Section 3.2 and Section 3.3, the optical flow registration subnetwork and the subsequent change detection subnetwork are described in detail. Section 3.4 provides details of the multi-SUAV collaboration platform. In Section 3.5, the training details are presented, including the structure of the loss function and the generation of the dataset.
3.1. Network Architecture
Overall, the proposed network has an encoder–decoder structure. As shown in Figure 1, the network is mainly composed of three parts: a down-sampling backbone, an optical flow registration subnetwork and a nested connection change detection subnetwork. The bi-temporal images are first encoded by the down-sampling backbone, and the two images are then registered through the optical flow registration subnetwork. Finally, the registered pair of images is sent to the nested connection change detection subnetwork for decoding to obtain the final change detection result.
The ResNet18 [33] backbone is used as the down-sampling network in the encoder. The Siamese down-sampling network, which consists of two parallel ResNet18 backbones, is designed for simultaneous down-sampling of the source and target images. As shown in Figure 1a, the bi-temporal images are input into the two branches of the Siamese down-sampling network and down-sampled simultaneously. Owing to weight sharing between the two branches, the two down-sampling networks have the same set of parameters and extract approximately the same features from the two images. Then, concatenation is used to fuse the two extracted feature maps into a single one that contains the same and different features of the two maps. It should be emphasized that one of the feature maps involved in the fusion is the registered feature map obtained after warping by optical flow.
Before the decoder performs up-sampling, the input bi-temporal images must be registered to ensure that the two images share the same perspective. That is to say, the feature maps of the target images at each level are registered to the perspective of the corresponding feature maps of the source images. The estimated displacement field $w$, which is often called optical flow, is used to warp the target images $I_t$ to the source images $I_s$ as follows:
$$I_s(x) \approx I_t\big(x + w(x)\big)$$
where $x$ is the coordinate of each pixel in the images. Through the field $w$, the coordinate $x$ is linked directly to its corresponding location $x + w(x)$ in the target images, and sampling the target images at that location completes the registration of the two images.
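As a minimal illustration of warping with a displacement field, the following NumPy sketch backward-warps a target image so that it lines up with the source. It assumes the convention that the registered value at pixel x is sampled from the target at x + w(x), uses nearest-neighbour sampling rather than the bilinear sampling typically used inside networks, and fills out-of-range pixels with zeros. The function name `warp_nearest` and the toy data are hypothetical and are not taken from RegCD-Net.

```python
import numpy as np


def warp_nearest(target, flow):
    """Backward-warp `target` (H, W) with `flow` (H, W, 2) storing (dx, dy) per pixel.

    Assumed convention: warped[y, x] = target[y + dy, x + dx].
    """
    h, w = target.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Sampling positions in the target image, rounded to the nearest pixel.
    src_x = np.rint(xs + flow[..., 0]).astype(int)
    src_y = np.rint(ys + flow[..., 1]).astype(int)
    # Positions that fall outside the image are left at zero.
    valid = (src_x >= 0) & (src_x < w) & (src_y >= 0) & (src_y < h)
    warped = np.zeros_like(target)
    warped[valid] = target[src_y[valid], src_x[valid]]
    return warped


# Toy example: the target is the source shifted one pixel to the right.
source = np.arange(20.0).reshape(4, 5)
target = np.zeros_like(source)
target[:, 1:] = source[:, :-1]

# A constant flow of +1 pixel in x undoes the shift.
flow = np.zeros((4, 5, 2))
flow[..., 0] = 1.0
registered = warp_nearest(target, flow)
```

In this toy case the registered image matches the source everywhere except the last column, whose sampling position lies outside the target.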
The optical flow registration is applied at each feature map level and carried out layer by layer from the deep to the shallow layers of the network. As shown in Figure 1b, in the deepest layer of the down-sampling, the two feature maps are sent to the global correlation module $C_{glo}$ to calculate the global cost volume, and the result is sent to the global mapping decoder $D_{glo}$ to estimate an optical flow field. The target feature map in the deepest layer is warped by this optical flow field, and the field is passed to the next shallower layer after 2× up-sampling. In the shallow layers, the target feature map is first warped by the optical flow field from the deeper layer, and the local cost volume is then calculated together with the source feature map in the local correlation module $C_{loc}$. The optical flow field from the deeper layer is not only decoded by the local mapping decoder $D_{loc}$ together with the local cost volume into a new optical flow field, but also added to this new field to generate the optical flow field of this layer. The optical flow field $w^l$ in the shallow layers is defined as:
$$w^l = U(w^{l+1}) + D_{loc}\Big(C_{loc}\big(F_s^l, \mathcal{W}(F_t^l, U(w^{l+1}))\big),\; U(w^{l+1})\Big)$$
where $U$ denotes the 2× up-sampling, and $l$ denotes the level of the feature layer. $F_s^l$ and $F_t^l$ refer to the feature maps of the source and target images. $C_{loc}$ is the operation of computing the local correlation, and $D_{loc}$ is the decoding operation using a local mapping decoder. $\mathcal{W}$ is the operation that uses the optical flow field to warp the target feature map to achieve registration. In the top layer, the target image is warped directly to generate the registered original-size image. More details about optical flow registration are available in Section 3.2.
The nested connection up-sampling structure acts as a decoder to densely up-sample the feature map of each layer. The paired source and warped target feature maps of each down-sampling layer are sent together to the up-sampling network at the corresponding feature map size. The semantic and spatial information in the feature maps of different layers is fused through densely nested connections, and four feature maps of the original size are generated by stepwise up-sampling. Eventually, these four feature maps are refined by an attention module and convolved to generate the final change detection map, as shown in Figure 1c. Please refer to Section 3.3 for details of the nested connection and change detection. The complete inference procedure of RegCD-Net is shown in Algorithm 1.
Algorithm 1 Inference of RegCD-Net for change detection
3.2. Optical Flow Registration
The optical flow is generated by a sequence of layers in the optical flow estimation network that measure the local and global correspondence between the source and target feature maps in each layer. The result of the similarity calculation between two feature maps, generally called the cost volume [25], quantifies the correspondence between the two feature maps in each layer and provides a strong basis for the network to estimate optical flow layer by layer. Depending on the range over which the feature correspondence is calculated, the cost volume can be computed by local correlation or global correlation.
3.2.1. Local Correlation
The local correlation layer only calculates the feature correspondence within a pixel neighborhood between the source and target feature maps [41,42]. The local correlation $c_{loc}^l$ between the source $F_s^l$ and target $F_t^l$ feature maps is defined as:
$$c_{loc}^l(x, d) = F_t^l(x)^{T} F_s^l(x + d), \quad \|d\|_\infty \le R$$
where $l$ refers to the level in the feature pyramid, $x$ is a coordinate in the target feature map and $d$ is the offset from this coordinate. The maximum offset in any direction is constrained by the neighborhood radius $R$, and the correlations are only calculated within this radius. In theory, the correlation result is a 4D tensor indexed by the two position coordinates and the two offset coordinates. In practice, the cost volume $c_{loc}^l$ is organized as a 3D tensor of size $H^l \times W^l \times (2R+1)^2$.
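The local correlation described above can be sketched in NumPy as follows. This is an illustrative sketch under assumptions rather than the implementation used in the paper: feature maps are stored as (H, W, C) arrays, positions outside the source map are zero-padded, and the offsets are flattened row by row into the last axis. The name `local_correlation` is hypothetical.

```python
import numpy as np


def local_correlation(f_src, f_tgt, radius):
    """Correlate each target feature with source features within `radius`.

    f_src, f_tgt: arrays of shape (H, W, C).
    Returns a cost volume of shape (H, W, (2 * radius + 1) ** 2).
    """
    h, w, _ = f_src.shape
    k = 2 * radius + 1
    # Zero-pad the source so that every offset stays inside the array.
    padded = np.pad(f_src, ((radius, radius), (radius, radius), (0, 0)))
    cost = np.empty((h, w, k * k), dtype=f_src.dtype)
    idx = 0
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            # shifted[y, x] equals f_src[y + dy, x + dx] (zero outside the map).
            shifted = padded[radius + dy:radius + dy + h,
                             radius + dx:radius + dx + w]
            # Dot product over channels between target and shifted source.
            cost[..., idx] = np.sum(f_tgt * shifted, axis=-1)
            idx += 1
    return cost


rng = np.random.default_rng(0)
f_s = rng.standard_normal((6, 7, 8))
f_t = rng.standard_normal((6, 7, 8))
cost_local = local_correlation(f_s, f_t, radius=2)
print(cost_local.shape)  # (6, 7, 25)
```

The channel at the centre of the last axis (index 12 for radius 2) corresponds to zero offset, i.e., the plain per-pixel dot product of the two feature maps.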
3.2.2. Global Correlation
The global correlation layer is only used in the deepest feature layer $L$ to calculate the global correlation between the coarsest feature maps. It evaluates the correspondence between all locations of the source and target feature maps [43,44]. The global correlation $c_{glo}^L$ is defined as follows:
$$c_{glo}^L(x, x') = F_t^L(x)^{T} F_s^L(x')$$
where $F_s^L(x')$ and $F_t^L(x)$ refer to the feature vectors extracted at all source feature map coordinates $x'$ and target feature map coordinates $x$, respectively. The cost volume $c_{glo}^L$ is organized as a 3D tensor of size $H^L \times W^L \times (H^L W^L)$.
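A global correlation of this kind reduces to a single matrix product. The sketch below assumes (H, W, C) feature maps and flattens the source coordinates row by row into the last axis; the function name `global_correlation` and the toy shapes are hypothetical.

```python
import numpy as np


def global_correlation(f_src, f_tgt):
    """Correlate every target location with every source location.

    f_src, f_tgt: arrays of shape (H, W, C).
    Returns a cost volume of shape (H, W, H * W), where the last axis
    indexes source coordinates as y' * W + x'.
    """
    h, w, c = f_src.shape
    src_flat = f_src.reshape(h * w, c)
    # (H, W, C) @ (C, H*W) broadcasts to (H, W, H*W).
    return f_tgt @ src_flat.T


rng = np.random.default_rng(0)
g_s = rng.standard_normal((4, 5, 8))
g_t = rng.standard_normal((4, 5, 8))
cost_global = global_correlation(g_s, g_t)
print(cost_global.shape)  # (4, 5, 20)
```

Each entry is the dot product between one target feature vector and one source feature vector, so no neighborhood radius limits the displacement that can be represented.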
3.2.3. Local and Global Correlation Assemble
According to their range of correlation calculation, the local and global correlation layers have complementary characteristics. The local correlation layer is widely used in optical flow estimation networks to evaluate the displacement between two feature maps. Because its calculation is limited to a neighborhood radius, the local correlation layer can be applied to high-resolution feature maps to estimate small displacements precisely. For the same reason, it fails to estimate large offsets between the source and target feature maps. In contrast, the global correlation calculates correspondences over the whole feature map without a maximum radius; therefore, it can estimate large-scale displacements between the two feature maps.
Moreover, the cost volume of the global correlation is calculated as a tensor of size $H \times W \times H \times W$, because each coordinate in the source feature map must be correlated, in both directions, with all coordinates in the target feature map. In high-resolution feature maps, the very large space complexity $O\big((HW)^2\big)$ means that the cost volume occupies a huge amount of memory. Hence, the global correlation layer is only applied to the coarsest-resolution feature maps. The proposed network therefore combines local and global correlation to estimate optical flow (Figure 2).
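To make the memory argument concrete, the following short calculation compares the number of entries in a global and a local cost volume. The feature map sizes and the radius of 4 are hypothetical values chosen for illustration and are not taken from the paper.

```python
def cost_volume_elements(h, w, radius=None):
    """Number of entries in a global (radius=None) or local cost volume."""
    per_pixel = h * w if radius is None else (2 * radius + 1) ** 2
    return h * w * per_pixel


for h, w in [(16, 16), (64, 64), (256, 256)]:
    n_global = cost_volume_elements(h, w)
    n_local = cost_volume_elements(h, w, radius=4)
    # 4 bytes per entry assumes 32-bit floats.
    print(f"{h}x{w}: global {n_global * 4 / 2**20:.2f} MiB, "
          f"local {n_local * 4 / 2**20:.2f} MiB")
```

At 256 × 256 the global volume already holds 4,294,967,296 entries (16 GiB in 32-bit floats), while the local volume with radius 4 holds 5,308,416 entries, which is why the global correlation is restricted to the coarsest level.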
3.2.4. Flow Decoder
The flow decoder estimates the optical flow field of the current layer from the optical flow field generated by the previously processed (deeper) layer and the cost volume produced by the correlation of the current layer. In the deepest feature layer $L$, only the resulting global correlation $c_{glo}^L$ needs to be sent into the flow decoder $D_{glo}$, as implemented in DGC-Net [29], to estimate the optical flow field $w^L$ at the coarsest level of the feature layers:
$$w^L = D_{glo}\big(c_{glo}^L\big)$$
In the remaining levels $l$ of the feature layers, the residual flow $\Delta w^l$, which is the output of the flow decoder $D_{loc}$, is defined as:
$$\Delta w^l = D_{loc}\Big(C_{loc}\big(F_s^l, \tilde{F}_t^l; R\big),\; U(w^{l+1})\Big)$$
where $C_{loc}(\cdot, \cdot; R)$ refers to the local correlation with search radius $R$, and $U$ denotes the 2× up-sampling. $\tilde{F}_t^l$ is the registered target feature map, i.e., $F_t^l$ warped by the up-sampled optical flow field $U(w^{l+1})$ from the next deeper layer, which is defined as:
$$\tilde{F}_t^l(x) = F_t^l\big(x + U(w^{l+1})(x)\big)$$
where $x$ is a coordinate in the maps. The complete optical flow field in layer $l$ is defined as:
$$w^l = \Delta w^l + U(w^{l+1})$$
The flow decoders $D_{glo}$ and $D_{loc}$ both consist of five convolutional layers with dense connections [34]. The numbers of channels in the convolutional layers are 128, 128, 96, 64 and 32, respectively, and all convolutional kernels are 3 × 3. Finally, the estimated optical flow field is output through a 2D linear convolution.
3.2.5. Optical Flow Pyramid
In a typical CNN, as down-sampling proceeds, the feature maps in deeper layers contain richer semantic information. However, the accurate location information of the shallow layers is continuously lost in the down-sampling process. The feature pyramid structure [15,27,28,45] is used to combine the rich semantic information in deeper feature maps with the accurate location information in shallow feature maps to achieve a more precise feature representation.
Similarly, the pyramid structure is used in this paper to fuse the optical flows from different feature layers. The optical flow in coarse layers can estimate large-range displacements, while the optical flow in high-resolution layers is used to determine small offsets precisely. During the optical flow up-sampling process, the optical flow pyramid fuses the deep and shallow optical flows layer by layer, eventually generating an optical flow that estimates the exact displacement between the two images at the original resolution. The optical flow pyramid architecture is shown in Figure 2.
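The layer-by-layer fusion can be sketched as a simple coarse-to-fine loop in which each level adds a residual to the up-sampled flow of the deeper level. This is a sketch under stated assumptions: the residual functions stand in for the correlation and flow decoder of each level, up-sampling is nearest-neighbour, and displacements are doubled on up-sampling on the assumption that flow is measured in pixels of its own level (the text does not state this). All names are hypothetical.

```python
import numpy as np


def upsample_flow(flow):
    """2x nearest-neighbour up-sampling of a flow field of shape (H, W, 2).

    Assumption: flow is expressed in pixels of its own level, so the
    displacements are doubled along with the resolution.
    """
    return 2.0 * flow.repeat(2, axis=0).repeat(2, axis=1)


def coarse_to_fine(coarsest_flow, residual_fns):
    """Refine a flow field from the coarsest level to the finest level.

    residual_fns: one callable per finer level, mapping the up-sampled
    flow to a residual flow (placeholder for correlation + decoder).
    """
    flow = coarsest_flow
    flows = [flow]
    for residual_fn in residual_fns:
        up = upsample_flow(flow)
        # Flow of this level = up-sampled deeper flow + residual flow.
        flow = up + residual_fn(up)
        flows.append(flow)
    return flows


# Toy example with constant residuals standing in for decoder outputs.
coarse = np.full((2, 2, 2), 1.0)
residuals = [
    lambda up: np.full_like(up, 0.5),
    lambda up: np.full_like(up, -0.25),
]
flows = coarse_to_fine(coarse, residuals)
print([f.shape for f in flows])  # [(2, 2, 2), (4, 4, 2), (8, 8, 2)]
```

In the toy run, the coarse estimate of 1 pixel becomes 2.5 pixels at the middle level and 4.75 pixels at the finest level, showing how a large displacement is carried up from the coarse level while small corrections are added at higher resolutions.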
3.3. Nested Connection Change Detection
The standard encoder–decoder architecture is widely used in change detection. The deeper the feature extraction layer of the encoder, the richer the semantic information of the extracted feature map, but the vaguer its location information. To resolve this conflict, a decoder is added to fuse the semantic information of the deep layers with the location information of the shallow layers while up-sampling to the original resolution step by step. Inspired by ResNet, skip connections are used in the decoder to connect more deep and shallow layers during up-sampling and thus achieve better information fusion.
3.3.1. Nested Connection Up-Sampling
Unlike the traditional encoder–decoder structure, which connects encoder and decoder feature maps directly, UNet++ [39] expands this direct connection into dense connections by means of skip connections. To maintain the location information in shallow feature maps and the semantic information in deep ones, and to bridge the semantic gap between the encoder and decoder feature maps, dense skip connections between the encoder and decoder are used in the proposed network.
The dense skip connection is shown in Figure 3. The source and target feature maps are down-sampled by the two branches of the feature extraction backbone. Then, at each feature level, the extracted feature maps of the two branches are concatenated to generate a single one that contains the same and different features of the two maps. The concatenated feature maps participate in the nested connection up-sampling process, transmitting the feature information from different layers to the decoder through skip connections and compensating for the loss of location information in deep layers. For example, the feature maps $F_s^0$ and $F_t^0$ are extracted by the two branches of the down-sampling backbone, and $X^{0,0}$ is then generated by convolution after concatenating the two. The three blocks $F_s^1$, $F_t^1$ and $X^{1,0}$ are the one-level-deeper counterparts of $F_s^0$, $F_t^0$ and $X^{0,0}$, respectively. To obtain $X^{0,2}$, $X^{0,1}$ is first obtained by convolution after concatenating $X^{0,0}$ with the 2× up-sampled $X^{1,0}$. Then, the intermediate unit $X^{1,1}$ is generated by convolving the concatenation of $X^{1,0}$ and the 2× up-sampled $X^{2,0}$. Finally, through the skip connection, $X^{0,0}$ is concatenated with $X^{0,1}$ and the 2× up-sampled $X^{1,1}$, and $X^{0,2}$ is generated by convolution. Up-sampling is needed for every unit except for the original-size units, in order to achieve dense nested connections throughout the up-sampling process of the decoder.
The convolution block is designed as a residual unit [33] with a skip connection. Each convolution block is preceded by a concatenation block, which concatenates the outputs of the previous convolution blocks at the same feature map level with the up-sampled output of the convolution block one level deeper; the result is then unified by the convolution block. The structure of the convolution block is shown in Figure 4.
Formally, let $x^{i,j}$ denote the output of unit $X^{i,j}$, where $i$ denotes the down-sampling layer level and $j$ denotes the number of skip connections received by this unit. The output $x^{i,j}$ is defined as follows:
$$x^{i,j} = \begin{cases} H\big([F_s^i, F_t^i]\big), & j = 0 \\ H\big([x^{i,0}, U(x^{i+1,0})]\big), & j = 1 \\ H\big(\big[[x^{i,k}]_{k=0}^{j-1}, U(x^{i+1,j-1})\big]\big), & j > 1 \end{cases}$$
where function $H(\cdot)$ denotes the convolution operation of the convolution block, function $U(\cdot)$ is a 2× up-sampling operation, and $[\cdot]$ denotes the concatenation operation on the channel dimension. Specifically, $F_s^i$ and $F_t^i$ are sourced from the down-sampling, and other units $X^{i,j}$ at level $j = 1$ are concatenated by $x^{i,0}$ and $U(x^{i+1,0})$; units at level $j > 1$ receive both the outputs of the previous units in the same sampling level and an up-sampled output of the deeper unit. The location and semantic information in different levels of the encoder is transmitted to the decoder in succession for concatenation, convolution and up-sampling through these dense skip connections.
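The wiring of the nested connections can be made explicit with a small bookkeeping sketch that lists, for each unit, which outputs it concatenates. It follows the UNet++-style indexing (level i, position j along the skip pathway) and does not perform any convolution. The depth of 5 levels and all names are assumptions made for illustration.

```python
def unit_inputs(i, j):
    """Inputs of unit X^{i,j} in a UNet++-style nested decoder."""
    if j == 0:
        # Column 0 comes straight from the encoder at level i.
        return [("encoder", i)]
    # All earlier units on the same level arrive through dense skips...
    skips = [("unit", i, k) for k in range(j)]
    # ...plus the 2x up-sampled output of the unit one level deeper.
    return skips + [("upsampled_unit", i + 1, j - 1)]


def build_order(depth):
    """All units (i, j) with i + j < depth, ordered so inputs come first."""
    return [(i, j) for j in range(depth) for i in range(depth - j)]


DEPTH = 5  # hypothetical number of encoder levels
order = build_order(DEPTH)
full_resolution_outputs = [(i, j) for (i, j) in order if i == 0 and j > 0]

print(unit_inputs(0, 2))
print(full_resolution_outputs)
```

With five levels, the decoder contains fifteen units in total, four of which are full-resolution decoder outputs on level 0.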
3.3.2. Channel Attention
The outputs of the nested connection up-sampling are four feature maps, which are the outputs of the units $X^{0,1}$, $X^{0,2}$, $X^{0,3}$ and $X^{0,4}$, with the same size as the original images. However, the four outputs differ in semantic level and spatial detail, because they are generated through different levels of skip connections and up-sampling pathways. The outputs from shallow pathways have precise location information and finer-grained features; by contrast, the outputs from deep pathways have richer semantic information and coarser-grained features. Therefore, a selection mechanism is needed to screen out effective feature representations when fusing the four feature maps.
On the basis of the channel attention module (CAM) [38], a selection mechanism called the channel group attention module (CGAM) is proposed in this paper to select more appropriate feature information and focus on more effective feature representations between the groups of feature maps. As shown in Figure 5, the four groups of output feature maps are first concatenated, and a CAM is used to extract the inter-group channel relationship. Meanwhile, another CAM is used to extract the intra-group relationship after summing the four groups of feature maps. Finally, the refined output is obtained by sequentially multiplying the concatenated feature map by the two CAM outputs. In short, the CGAM of feature map
is defined as follows:
where
$\sigma$ denotes the sigmoid function, and MLP is the multi-layer perceptron. AP and MP denote average pooling and max pooling operations, respectively.
denotes the concatenation operation, and ⊗ denotes the element-wise multiplication between feature maps and attention maps.
Through the CGAM, refined feature maps with channel attention are generated. As a complement to CGAM, the spatial attention module (SAM) [38] is added to focus on the more precise positions of the semantic and location information in the feature map. Finally, a feature map with both channel and spatial refinement is obtained, and the final change map is generated after passing it through a 1 × 1 convolutional layer.
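For reference, a single CBAM-style channel attention module of the kind CGAM builds on can be sketched in NumPy as below. It shows only one CAM, combining average pooling, max pooling, a shared two-layer MLP and a sigmoid; it does not reproduce the inter-group and intra-group combination of CGAM. The random weights, the reduction to 2 hidden units and the names are hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def channel_attention(feat, w1, w2):
    """CBAM-style channel attention weights for `feat` of shape (C, H, W).

    w1: (C_hidden, C) and w2: (C, C_hidden) are the shared MLP weights.
    Returns one weight per channel.
    """
    avg = feat.mean(axis=(1, 2))  # average pooling over space -> (C,)
    mx = feat.max(axis=(1, 2))    # max pooling over space -> (C,)

    def mlp(v):
        # Shared two-layer perceptron with a ReLU in between.
        return w2 @ np.maximum(w1 @ v, 0.0)

    return sigmoid(mlp(avg) + mlp(mx))


rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 4, 4))
w1 = rng.standard_normal((2, 8))
w2 = rng.standard_normal((8, 2))

weights = channel_attention(feat, w1, w2)
# Element-wise multiplication of features by their channel weights.
refined = feat * weights[:, None, None]
print(weights.shape, refined.shape)  # (8,) (8, 4, 4)
```

Each channel of the feature map is rescaled by a single weight between 0 and 1, which is the "selection" effect the text relies on when fusing the four groups of output feature maps.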
3.4. Multi-SUAV Collaboration Platform
A multi-SUAV collaboration platform is established for change detection of garbage-scattered areas in nature reserves. The detection and localization tasks run on a computer, which reduces the demands on the SUAVs' energy consumption and on-board performance. At the same time, thanks to its open interfaces, the platform can connect to multiple SUAVs simultaneously for data transmission. The platform supports path planning for multiple SUAVs on a visual map interface and displays the video streams and flight information returned by the SUAVs in real time. Meanwhile, the location information of the detected garbage-scattered areas can be read and then marked on the map. The visual interface of the platform is shown in Figure 6. Functionally, the platform is mainly composed of a multi-SUAV collaboration module, a path planning module and a location module.
3.4.1. Multi-SUAV Collaboration
The multi-SUAV collaboration module is used to build a multi-SUAV collaborative remote sensing control system with open interfaces. The system can connect to multiple SUAVs at the same time, receive the video streams returned by each SUAV and display them in real time. Meanwhile, the cruise path of each SUAV can be planned by manually marking waypoints on the visual map interface, and the real-time flight record of each SUAV can be displayed as a text stream.
DJI civilian-grade SUAVs are chosen as the video acquisition terminals of our multi-SUAV platform because DJI provides a stable software development kit (SDK), the DJI Mobile SDK, which is convenient for developing our system according to practical applications [46,47,48]. Through this SDK, the flight data of the UAVs can be accessed in real time to realize functions such as automatic cruise, remote gimbal control, real-time video streaming, real-time GPS information acquisition and SUAV status monitoring.
3.4.2. Path Planning
The path planning module is designed for the global planning of flight range and route. Path planning is essentially a waypoint task: the waypoint path is first marked on the map, and the mission is then carried out by the UAVs passing in turn through the ordered GPS coordinates with elevation information [49, 50]. A path planning method based mainly on satellite maps and supplemented by the digital elevation model (DEM) [51, 52, 53] is adopted, which not only realizes safer UAV route planning in nature reserves but also makes the location of garbage-scattered areas more accurate.
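The waypoint task described above can be sketched as follows. This is only an illustrative outline, not the platform's implementation: the `Waypoint` type, the `plan_route` function, the `dem_lookup` callback and the toy DEM are all hypothetical names introduced here, and the 30 m clearance simply reuses the flight height mentioned for dataset collection.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Waypoint:
    lat: float         # latitude in degrees
    lon: float         # longitude in degrees
    altitude_m: float  # absolute altitude in metres


def plan_route(
    picked: List[Tuple[float, float]],
    dem_lookup: Callable[[float, float], float],
    clearance_m: float = 30.0,
) -> List[Waypoint]:
    """Turn points marked on the map into an ordered waypoint list.

    Each waypoint is placed `clearance_m` above the terrain elevation
    returned by the (hypothetical) DEM lookup, so the route follows the relief.
    """
    return [
        Waypoint(lat, lon, dem_lookup(lat, lon) + clearance_m)
        for lat, lon in picked
    ]


def toy_dem(lat: float, lon: float) -> float:
    """Hypothetical DEM: terrain at 1200 m with a 50 m rise north of 30.001 degrees."""
    return 1250.0 if lat > 30.001 else 1200.0


route = plan_route([(30.0000, 104.0000), (30.0020, 104.0000)], toy_dem)
```

Keeping the DEM behind a callback means the same planning code could be fed from any elevation source; the order of `picked` is preserved, so the UAV visits the waypoints in the order they were marked.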
3.4.3. Location
The location module is mainly based on GPS information and coordinate transformation, which is the core of the garbage-scattered area location. Using the GPS information [54, 55] contained in the video stream transmitted back by the UAVs, the pixel coordinates of the detected change area in each video frame are converted into GPS coordinates so as to locate the garbage-scattered areas.
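A minimal sketch of such a pixel-to-GPS conversion is given below. It is not the transformation used by the platform, whose details are not stated here; it assumes a camera pointing vertically downward, flat ground, a UAV position that coincides with the image centre, and a small-offset (local tangent plane) approximation. The function name and all parameters are hypothetical.

```python
import math

EARTH_RADIUS_M = 6378137.0  # WGS-84 equatorial radius


def pixel_to_gps(u, v, width, height, lat0, lon0, altitude_m, hfov_deg, yaw_deg):
    """Map pixel (u, v) of a nadir image to approximate (lat, lon) in degrees.

    lat0, lon0 : UAV GPS position, assumed to project onto the image centre
    altitude_m : height above ground
    hfov_deg   : horizontal field of view of the camera
    yaw_deg    : heading of the image top edge, clockwise from north
    """
    # Ground sampling distance in metres per pixel.
    gsd = 2.0 * altitude_m * math.tan(math.radians(hfov_deg) / 2.0) / width
    # Offset from the image centre in the camera frame (right, forward).
    right_m = (u - width / 2.0) * gsd
    forward_m = (height / 2.0 - v) * gsd
    # Rotate the camera-frame offset into east/north components.
    yaw = math.radians(yaw_deg)
    east_m = right_m * math.cos(yaw) + forward_m * math.sin(yaw)
    north_m = -right_m * math.sin(yaw) + forward_m * math.cos(yaw)
    # Convert metric offsets to degrees of latitude and longitude.
    lat = lat0 + math.degrees(north_m / EARTH_RADIUS_M)
    lon = lon0 + math.degrees(
        east_m / (EARTH_RADIUS_M * math.cos(math.radians(lat0)))
    )
    return lat, lon
```

In practice, terrain relief and gimbal tilt would both violate the flat-ground and nadir assumptions, which is one reason elevation data can improve the location accuracy.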
3.5. Training
3.5.1. Loss Function
Our network, which combines the image registration and change detection tasks, is trained end-to-end. The pre-trained ResNet18 feature extractor backbone is unfrozen and its parameters are updated during training. According to the task performed at each stage, the loss function of the proposed network is divided into the loss function of the image registration phase and the loss function of the change detection phase.
In the image registration phase, the endpoint error (EPE), which is the standard error measure for optical flow estimation, is used as the training loss. It is the Euclidean distance between the estimated optical flow vector and the ground truth. As proposed in FlowNet [25], we use the optical flow fields of the different optical flow pyramid layers to form a multi-scale training loss. The multi-scale EPE loss can be formulated as:
$$L_{reg} = \sum_{l=1}^{L} \alpha_l \frac{1}{H_l W_l} \sum_{\mathbf{x}} \left\| F^{l}(\mathbf{x}) - F^{l}_{GT}(\mathbf{x}) \right\|_2$$
where $l$ denotes the level of the $L$-level optical flow pyramid. $H_l$ and $W_l$ are the feature map height and width at level $l$. $F^{l}$ and $F^{l}_{GT}$ respectively denote the optical flow field estimated by the network at pyramid level $l$ and the corresponding ground truth field. $\mathbf{x}$ is the coordinate in the optical flow field. $\alpha_l$ is the weight coefficient that adjusts the contribution of each pyramid level.
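As an illustration of the multi-scale EPE described above, the following pure-Python sketch averages the endpoint error over each pyramid level and sums the levels with per-level weights. The function names and the nested-list flow representation are assumptions made for readability; an actual training loop would use tensors in a deep learning framework, and the weight values used in the paper are not given here.

```python
import math


def epe(flow, flow_gt):
    """Mean endpoint error over one H x W flow field of (u, v) tuples."""
    total, count = 0.0, 0
    for row, row_gt in zip(flow, flow_gt):
        for (u, v), (u_gt, v_gt) in zip(row, row_gt):
            # Euclidean distance between estimated and true flow vectors.
            total += math.hypot(u - u_gt, v - v_gt)
            count += 1
    return total / count


def multiscale_epe(flows, flows_gt, weights):
    """Weighted sum of the per-level mean EPE over all pyramid levels."""
    return sum(w * epe(f, g) for f, g, w in zip(flows, flows_gt, weights))
```

Dividing by the number of pixels at each level keeps coarse and fine levels on a comparable scale, so the weights alone control how much each level contributes.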
In the change detection phase, the focal loss [56] is used as the loss function to address the sample imbalance problem, in which the number of changed pixels is much smaller than the number of unchanged pixels. In addition, the dice coefficient is added to help tackle this problem. Formally, the combined loss function of change detection is defined as follows:
$$L_{cd} = L_{focal} + L_{dice}$$
$$L_{focal} = -(1 - \hat{p})^{\gamma} \log(\hat{p})$$
$$L_{dice} = 1 - \frac{2\,|Y \cap \hat{Y}|}{|Y| + |\hat{Y}|}$$
where $-\log(\hat{p})$ is the simplified form of the cross entropy loss, and $(1 - \hat{p})^{\gamma}$ is the modulating factor. $Y$ denotes the ground truth and $\hat{Y}$ denotes the predicted change map.
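The combination of focal loss and dice coefficient for imbalanced change maps can be sketched as below. This is a generic formulation under stated assumptions, not the paper's exact implementation: the focusing parameter `gamma=2.0`, the smoothing constant `eps`, the soft (probability-based) dice term and the flat-list inputs are all choices made for this example.

```python
import math


def focal_dice_loss(pred, target, gamma=2.0, eps=1e-7):
    """Focal loss plus dice loss for a binary change map.

    pred   : flat list of predicted change probabilities in [0, 1]
    target : flat list of ground-truth labels (0 = unchanged, 1 = changed)
    """
    focal = 0.0
    for p, y in zip(pred, target):
        # Probability assigned to the true class of this pixel.
        p_t = p if y == 1 else 1.0 - p
        p_t = min(max(p_t, eps), 1.0)  # avoid log(0)
        # Modulating factor (1 - p_t)^gamma down-weights easy pixels.
        focal += -((1.0 - p_t) ** gamma) * math.log(p_t)
    focal /= len(pred)

    # Soft dice: overlap between prediction and ground truth.
    intersection = sum(p * y for p, y in zip(pred, target))
    dice = 1.0 - (2.0 * intersection + eps) / (sum(pred) + sum(target) + eps)
    return focal + dice
```

The focal term acts per pixel, while the dice term measures region overlap, so the rare changed pixels still contribute strongly even when unchanged pixels dominate the image.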
The final loss function $L$ is the sum of the registration loss $L_{reg}$ and the change detection loss $L_{cd}$, which is defined as:
$$L = L_{cd} + \lambda L_{reg}$$
where $\lambda$ is a hyperparameter. The final task of the entire network is to perform change detection; however, the accuracy of this task is affected by the quality of the preceding registration task. Therefore, the hyperparameter $\lambda$ is added to scale the loss function $L_{reg}$, so as to balance the two tasks, which in turn optimizes the final change detection result.
3.5.2. Dataset Generation
Our network requires supervised training data consisting of image pairs, optical flow pairs, warp optical flow and change detection ground truth. Unlike the datasets of other vision tasks, the dataset of low-altitude remote sensing garbage in nature reserves contains not only the ground truth of warp optical flow for image registration but also the ground truth of garbage change detection. In general, obtaining such a dataset is hard, and no public dataset satisfies the requirements of the proposed network. Therefore, we build our own dataset for training and validating the network.
As shown in Figure 7, the artificial dataset is generated according to the following steps: (1) High-resolution images of the nature reserve ground are shot vertically downward by SUAVs at a height of 30 m. (2) A large number of common garbage images are collected, and their backgrounds are removed with matting software to generate a large set of garbage image patches. (3) Garbage patches are selected randomly to form several garbage distribution areas following a certain probability distribution, and these areas are pasted at random positions onto the original image to generate the source image, which contains a small number of garbage-scattered areas, together with the corresponding binary change map. (4) An affine transformation is applied to the original image, and the target image is generated by pasting more garbage-scattered areas onto the transformed image; its binary change map is produced as well. At the same time, the affine matrix is saved, and the optical flow map is generated from this matrix to record the perspective change from the source image to the target image. (5) The change map of the target image is restored to the perspective of the source image through the inverse of the affine matrix and added to the change map of the source image to generate the change map ground truth of the source and target images. (6) A set of dataset images is obtained by cropping the source, target, change and optical flow maps of one group at the same position with small square boxes of the same size. Multiple sets can be obtained by repeating this operation at random positions on the same group. (7) After step (6) has been performed on all source image groups, the entire dataset is obtained by deleting the image sets with overly large black border areas or repeated cropping areas. (8) To complete the dataset, data augmentation is applied to the target images to increase their difference from the source images and to improve the robustness of the trained network.
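Steps (4) and (5) rely on two small pieces of geometry: deriving a dense optical flow map from the saved affine matrix, and inverting that matrix to bring the target change map back to the source view. The sketch below illustrates both for a 2 × 3 affine matrix. The function names are hypothetical, and the convention that the flow at a source pixel points to its position in the target image is an assumption of this example.

```python
def affine_flow(matrix, width, height):
    """Dense flow induced by a 2x3 affine matrix.

    For each source pixel (x, y), the flow is A @ [x, y, 1] - [x, y],
    i.e. the displacement to its location in the target image.
    The result is indexed as flow[y][x] = (dx, dy).
    """
    (a, b, tx), (c, d, ty) = matrix
    return [
        [(a * x + b * y + tx - x, c * x + d * y + ty - y) for x in range(width)]
        for y in range(height)
    ]


def invert_affine(matrix):
    """Inverse of a 2x3 affine matrix, used to map target coordinates back."""
    (a, b, tx), (c, d, ty) = matrix
    det = a * d - b * c
    if det == 0:
        raise ValueError("affine matrix is not invertible")
    ia, ib, ic, id_ = d / det, -b / det, -c / det, a / det
    return (
        (ia, ib, -(ia * tx + ib * ty)),
        (ic, id_, -(ic * tx + id_ * ty)),
    )
```

For a pure translation the flow is constant over the image, while rotation and scaling produce a flow that varies linearly with pixel position, which is why one saved matrix is enough to produce the full ground truth flow field.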
5. Results
To evaluate the performance of the proposed method, we compare it with several representative and state-of-the-art (SOTA) change detection methods on our dataset.
FC-Siam-Conc [58]: The baseline model for change detection, which is fully convolutional. It is a simple combination of UNet and Siamese networks and uses feature concatenation to fuse the bi-temporal information.
FC-Siam-Diff [58]: The baseline model for change detection, whose architecture is similar to FC-Siam-Conc but which uses multi-scale feature differences to fuse the bi-temporal information.
UNet++_MSOF [59]: Feature fusion method, which feeds concatenated bi-temporal images into UNet++ and uses multiple side output fusion for deep supervision.
IFN [37]: Multi-scale feature concatenation method, which fuses the multi-level deep features of images with different features by attention modules, and uses a deep supervision strategy for optimization.
DASNet [12]: Attention-based method, which extracts features with a Siamese backbone and uses a dual attention mechanism to build connections between local features to obtain more discriminative feature representations.
BIT [60]: Transformer-based method, which models contexts within the spatial-temporal domain through multi-head attention, and projects them back to the pixel space to refine the representation of the original features.
SNUNet-CD [40]: Multi-scale feature concatenation method, which combines UNet++ and a Siamese network, and uses the ensemble channel attention module to integrate multi-level outputs for deep supervision.
RDP-Net [61]: Feature fusion method, which uses a region detail preserving network to improve the detection performance on boundaries and small regions.
We implement the above CD methods using their public code with default hyperparameters. We train these networks on our dataset to examine their change detection performance on realistic bi-temporal SUAV low-altitude remote sensing images with viewpoint changes. Furthermore, for comparison, these CD networks are also trained on our dataset with registered bi-temporal images. The bi-temporal images are pre-registered by the optical flow registration network GLU-Net [62], which we train with the optical flow in our dataset.
Table 2 reports the overall comparison of detection accuracy, number of parameters and FPS on our dataset. Our proposed RegCD-Net outperforms the other change detection methods with relatively few parameters, high speed and better integration, since no additional registration network is needed. In comparison with UNet++_MSOF, IFN, BIT and SNUNet-CD, our method achieves the highest P (96.74%), R (95.92%) and F1 (96.32%) with the fewest parameters (20.66 M). Owing to their simple network structures, FC-Siam-Conc, FC-Siam-Diff and RDP-Net have the fewest parameters overall, and the former two also have the highest FPS. By contrast, our RegCD-Net achieves at least a 1% accuracy improvement with only about 5 M more parameters and 0.7 lower FPS. In addition, although DASNet achieves the highest R (96.42%), our RegCD-Net achieves the second-best R (0.5% lower) with only 32% of the parameters of DASNet.
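For reference, the P, R and F1 scores compared above follow the standard pixel-level definitions of precision, recall and their harmonic mean, which can be computed from confusion counts as in this short sketch. The function name and the example counts are illustrative and do not come from the experiments.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 from pixel-level confusion counts.

    tp : changed pixels correctly detected
    fp : unchanged pixels wrongly marked as changed
    fn : changed pixels that were missed
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative counts only.
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
```

Because F1 is a harmonic mean, it is pulled toward the lower of precision and recall, so a method has to keep both high to rank well on it.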
The visual comparison of different methods on our dataset is displayed in Figure 9. In the figure, true positives, true negatives, false positives and false negatives are indicated by white, black, red and green, respectively. From Figure 9, we can observe that our RegCD-Net achieves better detection performance than the other methods, mainly for three reasons. Firstly, our RegCD-Net employs global and local correlations in the deepest and shallow layers, respectively, to generate optical flow, which achieves better registration performance on bi-temporal images with large viewpoint changes and thereby indirectly improves change detection accuracy. Secondly, our RegCD-Net uses nested connections in the up-sampling process, which combine rich semantic information with precise location information, to obtain finer edges in the change map. Furthermore, we also use the attention module CGAM to fuse features from paths of different semantic levels, which automatically emphasizes a more precise representation of change edges. Thirdly, our RegCD-Net integrates the registration and change detection sub-networks into a single network, enabling end-to-end optimization of change detection on bi-temporal low-altitude remote sensing images with viewpoint changes, which reduces the influence of imprecise pre-registration results on change detection performance. Benefiting from its end-to-end structure, our method also has good real-time performance.
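The colour coding used in the visual comparison (white, black, red and green for true positives, true negatives, false positives and false negatives) can be reproduced from a binary prediction and its ground truth with a sketch such as the following. The function name and the nested-list image representation are assumptions made for this example.

```python
# RGB colours following the convention described for the visual comparison.
COLORS = {
    "TP": (255, 255, 255),  # white: change correctly detected
    "TN": (0, 0, 0),        # black: no change, correctly ignored
    "FP": (255, 0, 0),      # red: false alarm
    "FN": (0, 255, 0),      # green: missed change
}


def error_map(pred, gt):
    """Colour-code a binary prediction against its ground truth, pixel by pixel."""

    def label(p, g):
        if p and g:
            return "TP"
        if p:
            return "FP"
        if g:
            return "FN"
        return "TN"

    return [
        [COLORS[label(p, g)] for p, g in zip(pred_row, gt_row)]
        for pred_row, gt_row in zip(pred, gt)
    ]
```

With this coding, red regions show where a method raises false alarms and green regions show missed changes, so the two failure modes can be told apart at a glance.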