Learning Color Distributions from Bitemporal Remote Sensing Images to Update Existing Building Footprints

Abstract: For most cities, municipal governments have constructed basic building footprint datasets that need to be updated regularly for the management and monitoring of urban development and ecology. Cities can change considerably within a short period, and the extent of change varies; hence, automated methods for generating up-to-date building footprints are urgently needed. However, labels for current buildings or changed areas are usually lacking, and the acquisition conditions of images from different periods are not perfectly consistent, which can severely limit deep learning methods when attempting to learn deep information about buildings. In addition, common update methods can ignore the strictly accurate historical labels of unchanged areas. To solve the above problems, we propose a new update algorithm that brings an existing building database up to the current state without manual relabeling. First, the difference between the data distributions of images from different time phases is reduced using an image color translation method. Then, a semantic segmentation model predicts the segmentation results of the images from the latest period, and, finally, a post-processing update strategy is applied to strictly retain the existing labels of unchanged regions and attain the updated results. We apply the proposed algorithm to the Wuhan University change detection dataset and the Beijing Huairou district land survey dataset to evaluate its effectiveness in building surface and complex labeling scenarios in urban and suburban areas. The F1 scores of the updated results on both datasets exceed 96%, which demonstrates the applicability of our proposed algorithm and its ability to efficiently and accurately extract building footprints in real-world scenarios.


Introduction
With the rapid expansion and renewal of cities around the world [1], updating existing building databases has become routine for the generation of up-to-date building footprint information [2], a routine which can contribute to sustainable urban development [3] and ecology [4]. Traditionally, a building update is performed by manually interpreting the change area and outlining the building boundaries within it, which is a time-consuming and labor-intensive process, especially when dealing with large areas (e.g., nationwide). Therefore, automation is essential for facilitating building change detection and building database updates.
In recent years, the main challenge of the update task has been to maintain the building detection capability of the model for unchanged areas while generating accurate segmentation results for changed areas (including building additions and demolitions). There are two main approaches to performing building updates: one based on building extraction [5] and the other on building change detection [6]. The former is trained using pre-temporal images and labels, and the model is fine-tuned on post-temporal images to generate the latest building segmentation results; EANet [7] and SRINet [8] are representative examples. A further challenge lies in the inconsistent acquisition conditions of bitemporal remote sensing images. For example, significant spectral differences in vegetation can occur due to seasonal differences. Even for images taken at different times of the day, the brightness of the same object may vary significantly. In addition, due to atmospheric effects, in some cases, even images collected by the same satellite sensor may have very different radiation intensities [31], which makes the segmentation task more difficult; image color translation can effectively mitigate the differences in spectral features between images from different periods.
Image color translation involves the translation of the color style of the source domain to the color style of the target domain. Early color translation methods included linear and nonlinear methods [32]. The most commonly used nonlinear method is histogram matching (HM) [33] and linear methods include the image regression (IR) method [34], Reinhard method [35], and pseudo-invariant feature (PIF) method [36]. Although the above methods are frequently used in color translation, traditional methods still have limitations when addressing complex scenes and object changes in remote sensing images.
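As a concrete illustration of the nonlinear traditional approach, histogram matching remaps each band of the source image so that its cumulative distribution follows that of the reference image. Below is a minimal single-band NumPy sketch (real pipelines, e.g. scikit-image's `match_histograms`, handle multi-band images and edge cases):

```python
import numpy as np

def histogram_match(source, reference):
    """Match the value distribution of `source` to that of `reference`
    (single band). Classic CDF-based histogram matching."""
    src_values, src_idx, src_counts = np.unique(
        source.ravel(), return_inverse=True, return_counts=True)
    ref_values, ref_counts = np.unique(reference.ravel(), return_counts=True)

    # Empirical CDFs of both images, scaled to [0, 1].
    src_cdf = np.cumsum(src_counts) / source.size
    ref_cdf = np.cumsum(ref_counts) / reference.size

    # For each source quantile, look up the reference value at the same quantile.
    matched = np.interp(src_cdf, ref_cdf, ref_values)
    return matched[src_idx].reshape(source.shape)
```

For RGB remote sensing tiles, the function would simply be applied to each band independently.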
Generative adversarial networks (GANs) [37] can align the data distributions of the source and target domains with the aim of generating pseudo-source domain images that are statistically indistinguishable from the target domain images [31]. Pix2Pix [38] uses a conditional GAN to learn input-to-output image mappings but requires paired data. Some recent works have relaxed the dependence of image translation learning on paired training data. The coupled GAN (CoGAN) [39] learns a joint data distribution from samples drawn from the marginal distributions by forcing the discriminators and generators in the source and target domains to share parameters at the low levels. UNIT [40] further extends the CoGAN by assuming the existence of a shared low-dimensional latent space between the source and target domains. MUNIT [41] and DRIT [42] extend this idea to multimodal image-to-image translation by assuming two latent representations, one for "style" and another for "content"; cross-domain image translation is then performed by combining different content and style representations. DiscoGAN [43] and CycleGAN [44] overcome the corrupted-semantic-structure problem with a cycle consistency loss, which encourages the generated pseudo-source domain images to be faithfully reconstructed when mapped back to the source domain. The attention-guided GAN (AGGAN) [45] adds an attention mechanism (AM) to each generator of CycleGAN and assigns different weights to different image positions. The attention-guided color consistency GAN (ACGAN) [46] can extract the high-level features of images, reducing the color distribution differences between multitemporal remote sensing images. CyCADA [47] segments the original and generated images using a classifier trained on the original data and minimizes the cross-entropy loss between the segmentations. Unlike existing GANs, the generator in ColorMapGAN [31] does not have any convolution or pooling layers.
It learns to translate the colors of the training data to the colors of the test data by performing only one element-by-element matrix multiplication operation and one matrix addition operation. We compare the CycleGAN, UNIT, DRIT, HM, and Reinhard image color translation methods and analyze the effect of each method on the experimental results.
In this paper, we apply CycleGAN to the image color translation of bitemporal remote sensing images to reduce the differences among their data distributions. In addition, based on UNet with EfficientNet [48] as the encoder, we make full use of the a priori information in the existing database to segment the buildings of the latest period and can directly predict the added and demolished change areas. Moreover, the image color translation method does not require relabeling the post-temporal images for fine-tuning to fit the data distribution, which greatly saves time and cost. Finally, a post-processing update strategy is proposed to strictly retain the historical labels of unchanged areas: we calculate the ratio of the intersection area between the prediction and the historical label to the area of the historical label, and set an appropriate threshold to decide whether to replace the segmentation result with the historical label, which is of great significance for high-accuracy urban mapping. The main contributions of this paper are as follows.

1.
Image color translation is performed on different-phase remote sensing images using CycleGAN to smoothly translate the color distribution from the source domain to the target domain in an unsupervised manner.

2.
A priori information is obtained based on a historical database using UNet(EfficientNet) to update buildings (additions and demolitions) without relabeling.

3.
We propose a post-processing update strategy to replace the segmentation of unchanged regions using strictly accurate historical labels to solve the problem of inaccurate prediction edges.
The rest of this paper is organized as follows. Section 2 describes the approach of this paper in detail. Section 3 verifies the effectiveness of the CycleGAN method and UNet(EfficientNet) by comparing them with other excellent image color translation methods and semantic segmentation models, respectively, and analyzes the segmentation improvement yielded by the post-processing update strategies. Section 4 discusses the ablation experiments and threshold selection for the post-processing update strategy. Finally, Section 5 summarizes the paper.

Methods
This section is structured as follows (Figure 1). First, the architecture of CycleGAN and its loss are described. Second, the UNet(EfficientNet) architecture and its loss are introduced. Finally, we discuss the post-processing update strategy proposed in this paper.

Image Color Translation
CycleGAN is an unsupervised image-to-image translation method that converts information from one form to another. Its principle is based on image style translation: the style of an image is translated into that of another while the semantic information of the original image is preserved. CycleGAN uses two discriminators and two generators to implement a mutual mapping between the source domain X and the target domain Y (Figure 2a). A generator takes a source domain image as input and generates a synthesized image with the style of the target domain. A discriminator takes an image as input and tries to identify whether it is an original image or a generated one.
CycleGAN is implemented with a forward generator G that translates image x from domain X to image G(x) in domain Y and a backward generator F that translates image y in domain Y to F(y) in domain X. A discriminator D_X is used to determine whether an image comes from domain X or from generator F, and another discriminator D_Y is used to distinguish whether an image comes from domain Y or from generator G. The training procedure of CycleGAN comprises two supervised processes. The first is cycle consistency supervision, in which an image is translated from domain X to domain Y and then translated back from domain Y to domain X: the synthesized image G(x) is passed through the backward generator F to produce F(G(x)), which should be as close as possible to the original input x. The cycle consistency loss constrains the network and addresses the problem that an unconstrained GAN cannot guarantee outputting the corresponding image. The same process is applied to domain Y. The forward cycle consistency process (Figure 2b) is x → G(x) → F(G(x)) ≈ x, and the backward cycle consistency process (Figure 2c) is y → F(y) → G(F(y)) ≈ y. The second process determines whether an image is original or generated, i.e., the discriminator D_Y discriminates whether the generated image ŷ of generator G comes from domain Y; the same process is applied by the discriminator D_X in domain X. In this case, the generator uses a residual neural network (ResNet) architecture, and the discriminator is a PatchGAN [38]. The losses of CycleGAN include the adversarial loss and the cycle consistency loss. The adversarial loss is used to distinguish whether a generated image is real or fake, while the cycle consistency loss improves the ability of the model to recover the images. In the forward process of the GAN, the generator G translates an image in domain X to G(x) in domain Y, and the discriminator D_Y distinguishes whether it is a synthesized image.
Thus, the generator G and discriminator D_Y constitute the forward loss, as shown in Equation (1):

L_GAN(G, D_Y, X, Y) = E_{y∼p_data(y)}[log D_Y(y)] + E_{x∼p_data(x)}[log(1 − D_Y(G(x)))]    (1)
where L_GAN(G, D_Y, X, Y) is the forward adversarial loss of the mapping from domain X to domain Y. The data distributions are denoted x ∼ p_data(x) and y ∼ p_data(y), X denotes the source domain, Y denotes the target domain, and E(·) denotes the expectation over the distribution. Analogously, in the backward process of the GAN, the generator F and the discriminator D_X constitute the backward loss, as shown in Equation (2):

L_GAN(F, D_X, Y, X) = E_{x∼p_data(x)}[log D_X(x)] + E_{y∼p_data(y)}[log(1 − D_X(F(y)))]    (2)
where L_GAN(F, D_X, Y, X) is the backward adversarial loss when mapping domain Y to domain X. Both the generated images and the original images are used as inputs for discrimination. Randomly combining source and target domain images for translation would let the model learn many different mapping relationships; therefore, relying on the adversarial loss alone does not guarantee that the function maps a single input x_i to the desired output G(x_i), nor that the translation process will not distort the image content, even for paired images. Therefore, to further reduce the space of possible mappings and guarantee the quality of the generated images, the training process of the GAN should exhibit cycle consistency. The forward cycle consistency of CycleGAN recovers image x of domain X back to the original image after cycle translation, i.e., x → G(x) → F(G(x)) ≈ x. Similarly, for image y of domain Y, generators G and F satisfy backward cycle consistency: y → F(y) → G(F(y)) ≈ y. The cycle consistency loss is shown in Equation (3):

L_cyc(G, F) = E_{x∼p_data(x)}[‖F(G(x)) − x‖₁] + E_{y∼p_data(y)}[‖G(F(y)) − y‖₁]    (3)
where L_cyc(G, F) uses the L1 norm. The final loss is shown in Equation (4):

L(G, F, D_X, D_Y) = L_GAN(G, D_Y, X, Y) + L_GAN(F, D_X, Y, X) + λ·L_cyc(G, F)    (4)
where λ denotes the coefficient of the cycle consistency loss: the higher the weight, the more important the cycle consistency loss becomes.
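The CycleGAN objective above can be sketched numerically as follows. This is an illustrative NumPy version of Equations (1)–(4), not the training code: in practice the losses are computed on network outputs inside an autograd framework, and the discriminator outputs here are assumed to already be probabilities in [0, 1].

```python
import numpy as np

def gan_loss(d_real, d_fake, eps=1e-7):
    # Equations (1)/(2): E[log D(real)] + E[log(1 - D(fake))],
    # negated so that the generator/discriminator pair minimizes it.
    return -(np.mean(np.log(d_real + eps)) + np.mean(np.log(1.0 - d_fake + eps)))

def cycle_loss(x, x_rec, y, y_rec):
    # Equation (3): L1 distance between originals and cycle-reconstructed images.
    return np.mean(np.abs(x_rec - x)) + np.mean(np.abs(y_rec - y))

def total_loss(d_y_real, d_y_fake, d_x_real, d_x_fake,
               x, x_rec, y, y_rec, lam=10.0):
    # Equation (4): forward + backward adversarial losses + weighted cycle loss.
    return (gan_loss(d_y_real, d_y_fake)
            + gan_loss(d_x_real, d_x_fake)
            + lam * cycle_loss(x, x_rec, y, y_rec))
```

With perfect reconstruction (x_rec = x, y_rec = y) the cycle term vanishes, leaving only the adversarial terms, which matches the role of λ as a weight on cycle consistency.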

Semantic Segmentation
UNet was proposed in the field of medical imaging and is a typical encoder-decoder architecture. The encoder uses convolution and pooling layers to increase the number of channels and reduce the spatial size, extracting deep features and underlying representations of the image, and the decoder recovers the original size and detail information of the image. In addition, UNet introduces skip connections to combine the shallow, low-level, fine-grained feature maps of the encoder subnetwork with the deep, semantic, coarse-grained feature maps of the decoder subnetwork [49]. Here, the encoder uses EfficientNet [50,51], which balances three dimensions (network depth, width, and resolution) to capture richer, more complex, and more detailed image features, as shown in Figure 3. Since the image color translation method can produce both a generated counterpart Fake X of the pre-temporal image and a generated counterpart Fake Y of the post-temporal image, two training strategies are possible. The generated image Fake X and the pre-temporal label L can be used as the training pair (Fake X, L) to learn the semantic information of buildings, with the post-temporal image Y used for testing to obtain the segmentation result P. Alternatively, the pre-temporal image and label can be used as the training pair (X, L), with the generated post-temporal image Fake Y used for testing to obtain the segmentation result P. The choice between the two strategies depends on the relative quality of the pre- and post-temporal images: the original image is usually the upper bound of image quality, and the better the image color translation method, the closer the generated image comes to this bound. Therefore, generating the synthesized image from the higher-quality original image for semantic segmentation yields better results. In this paper, we adopt the scheme of training on the data pair (Fake X, L) and testing on the data Y.
The losses for semantic segmentation include the binary cross-entropy loss L_bce and the Dice loss L_dice [52]. The binary cross-entropy loss treats each pixel as an independent sample, while the Dice loss treats the image holistically. The binary cross-entropy loss and Dice loss are shown in Equations (5) and (6), respectively:

L_bce = −(1/N) Σ_{i=1}^{N} [y_i log p(y_i) + (1 − y_i) log(1 − p(y_i))]    (5)
where N is the total number of pixels in the image, y_i is the label of pixel i, and p(y_i) is the predicted probability.

L_dice = 1 − 2|P ∩ G| / (|P| + |G|)    (6)
where |P ∩ G| denotes the number of pixels shared by the prediction P and the ground truth G. The final segmentation loss is shown in Equation (7):

L_seg = L_bce + γ·L_dice    (7)
where γ controls the importance of L_dice in L_seg.
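A minimal NumPy sketch of the segmentation loss in Equations (5)–(7), computed on soft prediction probabilities (a practical implementation would use the framework's built-in, numerically stable losses):

```python
import numpy as np

def bce_loss(p, y, eps=1e-7):
    # Equation (5): mean binary cross-entropy over all N pixels.
    p = np.clip(p, eps, 1.0 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def dice_loss(p, y, eps=1e-7):
    # Equation (6): 1 - 2|P ∩ G| / (|P| + |G|), on soft probabilities.
    inter = np.sum(p * y)
    return 1.0 - (2.0 * inter + eps) / (np.sum(p) + np.sum(y) + eps)

def seg_loss(p, y, gamma=1.0):
    # Equation (7): weighted combination of the two terms.
    return bce_loss(p, y) + gamma * dice_loss(p, y)
```

For a perfect prediction both terms approach zero, while a prediction disjoint from the ground truth drives the Dice term toward one.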

Post-Processing Update Strategy
Since semantic segmentation is a pixel-level classification task, the central pixel is influenced by its neighboring pixels; if the neighboring pixels belong to the interior of the target to be segmented, the classification of the central pixel is favored. Conversely, when the neighboring pixels lie at the boundary of the target, they negatively affect the correct classification of the central pixel, which often results in inaccurate edges. In addition, in the building update task, the changed area generally accounts for a smaller percentage than the unchanged area, so it is important to strictly maintain the corrected historical labels for urban mapping. We extract the building contours in the prediction P and transform them into a polygon set P = {p₁, p₂, …, p_n}; similarly, the pre-temporal label L is transformed into the set L = {l₁, l₂, …, l_k}. For each p in P, we loop through the set L. If p intersects with l, we obtain the intersection region i and calculate the ratio α = S_i / S_l of the area S_i of the intersection region to the area S_l of the historical label region. We set a threshold θ; if α > θ, we replace p with l; otherwise, we keep p. A larger θ means that the update result depends more on the prediction, suiting scenes where a longer time interval causes more change areas, the image resolution is higher, and there are local changes in buildings. Conversely, a smaller θ means that the update result depends more on the historical labels, suiting scenes where a shorter time interval causes fewer change areas, the image resolution is lower, and the labels are more complex. The post-processing update strategy is given in Algorithm 1.
The proposed method is summarized as follows. We are given a pre-temporal image X with its corresponding label L, as well as a post-temporal image Y. The pre-temporal image X is translated into the generated image Fake X by the image color translation method, a semantic segmentation model is trained on the data pair (Fake X, L), and the post-temporal image Y is tested to generate the building prediction P. For each building, the area S_i of the intersection between the prediction P and the pre-temporal label L is compared to the area S_l of the pre-temporal label. Each building segmentation p whose ratio exceeds the threshold is replaced by the corresponding pre-temporal building label l to obtain the final updated result P̂.

Algorithm 1 Post-processing update strategy
Step 1: Transform the pre-temporal label L and the post-temporal prediction P into the polygon sets L = {l₁, l₂, …, l_k} and P = {p₁, p₂, …, p_n}, respectively, and set the threshold θ.
Step 2: Calculate α and update: for each p in P, loop through L; if p intersects l, compute α = S_i / S_l and, if α > θ, replace p with l.
Step 3: Convert the set of polygons P̂ = {p̂₁, p̂₂, …, p̂_n} into pixel-level update results.
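Algorithm 1 can be sketched as follows. To keep the example dependency-free, axis-aligned boxes `(x0, y0, x1, y1)` stand in for the building polygons; a real implementation would extract contour polygons (e.g. with OpenCV) and compute intersections with a geometry library such as shapely. All names here are illustrative.

```python
def box_area(b):
    """Area of an axis-aligned box (x0, y0, x1, y1); 0 if degenerate."""
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])

def intersection_area(a, b):
    """Area S_i of the intersection of two boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    return box_area((ix0, iy0, ix1, iy1))

def update_footprints(predictions, labels, theta):
    """For each prediction p, if some historical label l overlaps it with
    alpha = S_i / S_l > theta, keep the historical label l instead of p."""
    updated = []
    for p in predictions:
        replacement = p
        for l in labels:
            s_i = intersection_area(p, l)
            if s_i > 0 and s_i / box_area(l) > theta:
                replacement = l  # unchanged building: trust the historical label
                break
        updated.append(replacement)
    return updated
```

Note that with θ = 1 the condition α > θ can never hold, so the strategy degenerates to keeping only the predictions, consistent with the threshold analysis in Section 4.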

Experiments and Results Analysis
In this section, we first describe the two utilized datasets, introduce the implementation details of the proposed algorithm in this paper, and compare it with other excellent methods to explore the effectiveness of image color translation for the semantic segmentation of buildings from remote sensing images with different time phases. Finally, the performance of the proposed algorithm is evaluated by optimizing the segmentation process to achieve building updates using post-processing update strategies.

Datasets and Experimental Details
(1) Wuhan University Building Change Detection Dataset [53]. The study area is located in Christchurch, New Zealand, covering 20 km². The pre- and post-temporal images were obtained in 2012 and 2016, respectively; they are 3-band aerial images with a spatial resolution of 0.3 m (Figure 4). The pre-temporal and post-temporal building databases contain 9938 and 12,091 labels, respectively. In the pre-processing stage, the original images and building labels are cropped into patches of 256 × 256 pixels with 50% overlap, yielding 30,107 crops in total (30,107 pre-temporal tiles for training and 30,107 post-temporal tiles for testing).

(2) Beijing Huairou District Land Survey Dataset. The study area is located in Huairou district, Beijing. The dataset contains remote sensing images and building labels from February 2018 to October 2019, covering obvious construction sites, rural settlements, soccer fields, and other infrastructure with complex labeling scenarios (Figure 5). The images are 3-band images with a resolution of 2 m. The pre- and post-temporal databases both contain 3308 labels, and the numbers of added and demolished buildings are small due to the short interval between the two temporal settings of the land survey dataset and the large coverage of some labels. Similar to the above pre-processing, the final cropping operation yields 8775 images in total (8775 pre-temporal tiles for training and 8775 post-temporal tiles for testing).

In the image color translation task, the generator contains three convolutional layers, nine residual blocks, two fractionally strided convolutional layers with a stride of 1/2, and one convolutional layer that maps the feature map to RGB. The convolutional layers are followed by instance normalization [54]. The discriminator uses 70 × 70 PatchGANs to discriminate whether an overlapping 70 × 70 image patch is real or fake.
The initial learning rate was 0.0002, the optimizer was Adam with a batch size of 4, and the total number of training epochs was 100, with the learning rate linearly decayed to 0 starting from the 50th epoch; the data were augmented using a flipping strategy. In addition, we used the least-squares loss instead of the negative log loss to make the model training process more stable, generate higher-quality images, and reduce model oscillation [55]. The discriminator was updated using historically generated images instead of only the images generated by the current generator [56]. In the semantic segmentation task, the encoder used EfficientNet-b1 to extract image features, and the PSPNet encoder used part of the ResNet50 architecture with a block depth of 3. The initial learning rate was 0.001, the optimizer was AdamW with a batch size of 128, and the total number of training epochs was 60; the learning rate increased linearly over the first three epochs, after which the PolyLR strategy was used to decay it. The data were augmented by random cropping to 128 pixels and a flipping strategy. All experiments in this paper were performed on one NVIDIA RTX 3090 GPU.
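The CycleGAN learning-rate schedule described above (constant for the first 50 epochs, then linear decay to zero by epoch 100) can be written as a small helper; the function name and defaults are ours:

```python
def cyclegan_lr(epoch, base_lr=2e-4, total_epochs=100, decay_start=50):
    """Constant learning rate for the first half of training, then linear
    decay to zero by the final epoch."""
    if epoch < decay_start:
        return base_lr
    return base_lr * (total_epochs - epoch) / (total_epochs - decay_start)
```

For example, the rate stays at 0.0002 through epoch 49, halves to 0.0001 at epoch 75, and reaches 0 at epoch 100.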
To quantify the experimental results, five evaluation metrics, including accuracy, intersection over union (IoU), precision, recall, and F1, were used to evaluate the performance of the proposed building update method for all buildings in the whole region. First, the numbers of false-negative (FN), true-negative (TN), true-positive (TP), and false-positive (FP) pixels were calculated from the prediction and the ground truth. TP indicates pixels that are correctly predicted to be positive; conversely, FN indicates pixels that are incorrectly predicted to be negative. The evaluation metrics were then calculated as follows:

Accuracy = (TP + TN) / (TP + TN + FP + FN), IoU = TP / (TP + FP + FN), Precision = TP / (TP + FP), Recall = TP / (TP + FN), F1 = 2 × Precision × Recall / (Precision + Recall).

In addition, to eliminate the differences in evaluation due to image cropping size and overlap in preprocessing, all results were calculated on the merged large map.
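The five metrics can be computed directly from the pixel-level confusion counts; a small sketch (the helper name is ours):

```python
def metrics(tp, fp, tn, fn):
    """Pixel-level evaluation metrics computed from the confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    iou = tp / (tp + fp + fn)          # intersection over union for the positive class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "iou": iou,
            "precision": precision, "recall": recall, "f1": f1}
```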

Visualization of Image Color Translation
We first performed image color translation on the pre- and post-temporal images of the two datasets and evaluated the performance of each method by visually comparing the generated and real images, as shown in Figures 6 and 7. HM translates the whole area of each image, and the Reinhard method makes the contrast between local areas of the image more obvious; however, for the land survey dataset, there are obvious seasonal differences between the two temporal phases. The traditional methods are not ideal for reconstructing bare soil as vegetation in the image, whereas the deep learning methods can learn deeper mapping relationships between images. DRIT migrates the color style of the target domain to the source domain but causes some building roof colors in the generated image to be close to the color of bare soil or vegetation. UNIT shares the latent space during translation, which can enhance the similarity between a local area of the source domain and the corresponding area of the target domain but ignores detailed information such as building edges. CycleGAN is more stable than the other methods on both datasets and can effectively reconstruct vegetation features and preserve the edge information of buildings.
To better analyze the effect of each method on the RGB bands of the images, we plotted histograms of the different generated images, as shown in Figures 8 and 9. HM can fit the data distribution of the target domain well, but it incurs a loss of semantic information due to the discontinuity of the digital numbers (DNs) of the image. The Reinhard method makes the DN distribution more uniform and enhances image contrast, but it fits the overall data distribution poorly. The histogram peaks of DRIT differ from those of the target domain, which affects the realism of the building roof colors. The distribution of UNIT has a wider range and enhances local contrast, but its peaks are not distinct. CycleGAN fits the RGB distributions better than the other methods.

Numerical Results and Semantic Segmentation Visualization
We set the model trained on the original images without image color translation as the baseline and compared the gains achieved by the various image color translation methods when combined with the semantic segmentation model UNet(Eff-b1) in terms of five metrics.
The segmentation results obtained on the change detection dataset are shown in Table 1. Compared with the baseline, the use of image color translation can greatly improve the semantic segmentation performance for the buildings from the latest period. CycleGAN has the best overall performance compared with the other methods, with IoU, precision, recall, accuracy, and F1 improvements of 10.93%, 9.22%, 2.94%, 2.43%, and 6.17%, respectively. HM achieves the optimal precision with a 9.62% improvement. In addition, for high-resolution images and datasets with small seasonal differences, there is little difference between the traditional and deep learning methods. Figure 10 shows the segmentation results of the different methods. Holes and edge inaccuracies exist in the baseline results, in addition to certain degrees of false and missed detections, which are caused by the different data distributions of the two temporal images. Using translation methods to align the data distributions yields better segmentation results, especially the CycleGAN method, which can help the segmentation model learn richer and more detailed information.

The segmentation results obtained on the land survey dataset are shown in Table 2. Due to the low resolution and complex building labeling scenes in this dataset, training the model to predict the images of the latest period using only the pre-temporal data leads to a dramatic performance decrease. Compared with the other methods, CycleGAN achieves the best results with 16.93%, 16.72%, 4.6%, 3.98%, and 11.38% improvements in the IoU, precision, recall, accuracy, and F1 metrics, respectively. The deep learning method outperforms the traditional methods in low-resolution and complex scenes and is able to learn deeper mapping relationships between different temporal images. The ground truths of the buildings in Figure 11 are divided using obvious roads or bare woodland.
The baseline and traditional image color translation methods prevent the models from learning rich label information well, and the segmentation effects are poor. Deep learning-based translations can help the segmentation model better adapt to complex scenes. In addition, we further investigated the effects of the images generated by CycleGAN on different semantic segmentation models, including PSPNet, DeepLabV3, OCRNet, Segformer, SwinTransformer, UNet(ResNet50), and UNet(EfficientNet-b1). As seen in Tables 3 and 4, UNet with EfficientNet-b1 as the encoder achieves the best performance on most metrics on both the change detection dataset and the land survey dataset, outperforming the other competitive CNNs and transformer networks.
On the change detection dataset, the IoU, precision, recall, accuracy, and F1 metrics of UNet(Eff-b1) segmentation are 0.9366, 0.9697, 0.9647, 0.9878, and 0.9672, respectively, while on the land survey dataset they are 0.8121, 0.8806, 0.9125, 0.9689, and 0.8963, with a precision 2.13% lower than that of DeepLabV3. As for why the transformer results are lower than those of UNet(Eff-b1), we believe the reasons are as follows. The transformer captures global contextual information in an attentional manner to establish long-distance dependencies on the target object; however, the generated image after translation still exhibits some distortion and distribution shift compared with the real image, which causes errors to accumulate repeatedly when capturing global context information and thus affects the final segmentation. In addition, there is more noise in low-spatial-resolution images, which further limits the application of transformer networks to remote sensing images. We also analyzed the efficiency of the different models (Table 5). UNet with EfficientNet-b1 as the encoder is more efficient than with ResNet50 and, at 0.637 GFLOPs, compares favorably with most models, while its parameter count is only 0.065 M higher than that of PSPNet. Compared with the other models, UNet(Eff-b1) strikes a better balance between accuracy and complexity, so it can meet the needs of large-area complex scenarios and real-world application deployment. From the prediction results of the different models (Figures 12 and 13), the differences among the models on the change detection dataset are small, and the advantage of UNet(Eff-b1) shows in the more accurate edges produced for small objects. For the land survey dataset, UNet(Eff-b1) has better visual effects; its results are not only globally closer to the ground truth but also exhibit less jaggedness at local edges.
Therefore, UNet(Eff-b1) can achieve stable and accurate results under different labeling scenes and different resolutions.

Effectiveness Analysis of the Post-Processing Update Strategy
During the urban building update process, most of the buildings in the area remain unchanged, especially when the time interval between the acquisition of the two time-phased images is short. Moreover, the historical database has often already been manually processed with strictly accurate labels for urban mapping, and these labels serve as the primary reference for the post-temporal ground truth. Therefore, we propose replacing predictions that meet the overlap requirement with the corresponding historical labels to optimize the final update results. Tables 6 and 7 show the update results obtained for the two datasets using different thresholds after executing CycleGAN and UNet(Eff-b1). On the change detection dataset, the update results depend more on the segmentation of the post-temporal images because of the long time interval between the pre- and post-temporal images, the higher image resolutions, and the more obvious local changes in the buildings. The post-processing update strategy works best with a threshold of 1, i.e., it degenerates to using only the predictions as the final update results. The reason is that some buildings exhibit coordinate shifts between the pre- and post-temporal images, i.e., systematic errors in the topological positions of the historical labels to be substituted relative to the ground truths of the latest-period images, which affects the evaluation of the update results and inhibits the effectiveness of the post-processing update strategy (Figure 14). It is worth noting that the update results obtained with a threshold of 1 deviate slightly from the segmentation accuracy reported above because of the post-processing step that filters non-polygon pixels and the polygon simplification applied during contour extraction.
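The replacement rule described above can be sketched as follows. This is a minimal illustration on pixel-coordinate sets, assuming the overlap measure is the intersection-over-union between a predicted building and a historical label; the function names are ours, not taken from the released code.

```python
def iou(a, b):
    """Intersection over union of two pixel-coordinate sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def post_process_update(predictions, historical, threshold):
    """Replace each predicted building with its best-matching historical
    label when the overlap meets the threshold; otherwise keep the
    prediction (treated as a changed building)."""
    updated = []
    for pred in predictions:
        best = max(historical, key=lambda h: iou(pred, h), default=None)
        if best is not None and iou(pred, best) >= threshold:
            updated.append(best)   # unchanged building: retain historical label
        else:
            updated.append(pred)   # changed building: keep new prediction
    return updated
```

With `threshold = 1`, a footprint is replaced only when it matches a historical label pixel for pixel, so for shifted footprints the strategy degenerates to keeping the predictions, consistent with the behavior reported for the change detection dataset.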
On the land survey dataset, however, the update results depend more on the pre-temporal ground truth due to the shorter time interval between the pre- and post-temporal images, the lower image resolutions, and the wider label ranges. Here, the post-processing update strategy with a threshold of 0.2 greatly improves the update accuracy, with IoU, precision, recall, accuracy, and F1 improvements of 11.52%, 8.34%, 6.06%, 2%, and 6.6%, respectively. Moreover, the accuracy reaches its best value at a threshold of 0.4 and the recall at a threshold of 0. As shown in Figure 15, using the post-processing update strategy to replace the segmentation results that satisfy the overlap condition with historical labels effectively optimizes the update results by making full use of the a priori knowledge contained in the historical database. Column 5 predicts the added buildings in the changed area while accurately retaining the unchanged area, and column 6 updates the demolished buildings in the changed area without being influenced by the historical database.

Discussion
Section 4.1 discusses the ablation experiments of the proposed update algorithm. Section 4.2 explains the reasons for the differences in thresholds across datasets.

Ablation Study
Ablation experiments of the proposed update algorithm were conducted on the change detection dataset and the land survey dataset. Here, UNet(Eff-b1) trained on the original images without image color translation served as the baseline; CycleGAN and the post-processing update strategy were then added on top of it step by step to verify the effectiveness of each part of the update method. The results of the ablation experiments are shown in Tables 8 and 9.
After adding CycleGAN to the baseline, the IoU, precision, recall, accuracy, and F1 of the change detection dataset and the land survey dataset improved by 10.93%, 9.22%, 2.94%, 2.43%, and 6.17% and by 16.93%, 16.72%, 4.6%, 3.98%, and 11.38%, respectively, indicating that CycleGAN can mitigate the differences in image color distribution between different time phases. Figures 16 and 17 compare the visualization results in the first to third rows, showing that the baseline recognizes building edges better after adding CycleGAN and thus validating its effectiveness. After further adding the post-processing update strategy to baseline + CycleGAN, the IoU, precision, recall, accuracy, and F1 on the land survey dataset improved by 11.52%, 8.34%, 6.06%, 2%, and 6.6%, respectively, suggesting that retaining strictly accurate historical labels helps improve the final update results. The change detection dataset, however, is subject to systematic errors caused by coordinate shifts, so its segmentation results are used directly as the final update results.

Thresholds in the Post-Processing Update Strategy
The post-processing update strategy proposed in this paper is implemented based on the intersection ratio; therefore, the intersection ratio between the pre-temporal label and the post-temporal segmentation result determines the appropriate threshold. For a dataset with high image resolution, the semantic segmentation model can generally predict the target well, i.e., the main body of a building is predicted accurately and errors occur mostly at the edges. The resulting high overlap between the segmentation result and the pre-temporal label therefore allows a larger threshold. However, for datasets with low image resolution or complex label ranges, it is extremely challenging for the semantic segmentation model to predict the target accurately, because the segmentation result often covers only part of the pre-temporal label or overlaps with it only slightly. The partial overlap between the segmentation result and the pre-temporal label therefore calls for a smaller threshold.
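A small numeric illustration of this reasoning, assuming the intersection ratio is intersection over union on pixel sets (the sets below are synthetic, chosen only to contrast the two regimes):

```python
def intersection_ratio(a, b):
    """Intersection over union of two pixel-coordinate sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

label = set(range(100))                # a pre-temporal building label
edge_error_pred = set(range(5, 100))   # high resolution: only edge pixels missed
fragment_pred = set(range(0, 100, 3))  # low resolution: a sparse fragment

print(intersection_ratio(edge_error_pred, label))  # 0.95: a high threshold still matches
print(intersection_ratio(fragment_pred, label))    # 0.34: only a low threshold matches
```

This is why a threshold near 1 worked for the high-resolution change detection dataset, whereas the land survey dataset, with fragmentary predictions, required a threshold as low as 0.2.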
In summary, an appropriate threshold should be set for each dataset by considering the image resolution, the label range, and the overlap between the segmentation results and the pre-temporal labels. If a dataset resembles the change detection dataset, the threshold can be set close to 1; conversely, if it resembles the land survey dataset, the threshold should be reduced appropriately.

Conclusions
In this paper, an update algorithm that requires no manual relabeling is proposed to address the differences between the data distributions of pre- and post-temporal images in the building update process. First, we used CycleGAN to reduce the color differences among satellite images from different time phases in an unsupervised way; we then used UNet(Eff-b1) to learn deep semantic information about buildings from the generated images and the historical database and used this information to predict the images of the latest period. In addition, a post-processing update strategy is proposed to strictly retain the historical labels of unchanged regions. In the experiments, the characteristics of different image color translation methods, the improvements achieved by various semantic segmentation models, and the effectiveness of the post-processing update strategy were compared. The final IoU, precision, recall, accuracy, and F1 metrics of the update results are 0.9363, 0.9692, 0.9649, 0.9877, and 0.9671 on the change detection dataset and 0.9272, 0.9581, 0.9663, 0.9888, and 0.9622 on the land survey dataset, improvements of 10.9%, 9.17%, 2.96%, 2.42%, and 6.16% and of 28.44%, 24.47%, 9.98%, 5.97%, and 17.97%, respectively, over the baseline. However, this work does not fully exploit the a priori knowledge contained in the existing labels during image color translation, and the post-processing update strategy requires an appropriate threshold for each dataset. In future work, we will try to use the label information of the target category during translation to couple it more tightly with the semantic segmentation model, and we will study the characteristics of the changed and unchanged regions of different categories across multiple datasets to better utilize the label contours of unchanged regions.
In addition, we will further explore the applicability of an adaptive post-processing update strategy to the update task. The source code is publicly available at https://github.com/wangzehui20/building-footprints-update.