Do Game Data Generalize Well for Remote Sensing Image Segmentation?

Abstract: Despite the recent progress in deep learning and remote sensing image analysis, the adaptation of a deep learning model between different sources of remote sensing data still remains a challenge. This paper investigates an interesting question: do synthetic data generalize well for remote sensing image applications? To answer this question, we take building segmentation as an example by training a deep learning model on the city map of the well-known PC game “Grand Theft Auto V” and then adapting the model to real-world remote sensing images. We propose a Generative Adversarial Network (GAN)-based segmentation framework to improve the adaptability of the segmentation model. Our model consists of a CycleGAN model and a ResNet-based segmentation model, where the former is a well-known image-to-image translation framework that learns a mapping of images from the game domain to the remote sensing domain, and the latter learns to predict pixel-wise building masks from the transformed data. All models in our method can be trained in an end-to-end fashion, and the segmentation model can be trained without requiring any additional ground truth reference of the real-world images. Experimental results on a public building segmentation dataset suggest the effectiveness of our adaptation method. Our method outperforms other state-of-the-art semantic segmentation methods, e.g., Deeplab-v3 and UNet. Another advantage of our method is that by introducing semantic information into the image-to-image translation framework, the image style conversion can be further improved.


Introduction
Remote sensing has opened a door for people to better understand the earth, changing all walks of life. Remote sensing technology has very broad applications, including disaster relief, land monitoring, city planning, etc. With the rapid development of imaging sensors, the modality of remote sensing data is becoming more and more diversified. People can now easily acquire and access up-to-date remote sensing images from a variety of imaging platforms (e.g., airborne, spaceborne) with a wide spectral range (from multi-spectral to hyper-spectral) at multiple spatial resolutions (from centimeters to kilometers).
Despite the recent success of deep learning in automatic remote sensing image analysis, the adaptation of a deep learning model between different sources of remote sensing data still remains a challenge. On one hand, most of the previous methods in this field are designed and trained on images of a specific resolution or modality. When these methods are applied across different platforms (e.g., to remote sensing images with a different modality or image resolution), their performance degrades severely. On the other hand, with the fast increase of deep neural networks' capacity, the training of deep learning models requires a huge amount of data with high-quality annotations, but the manual labeling of these data is time-consuming, expensive and may heavily rely on domain knowledge.
Arguably, improving the adaptability of a model between different sources of images can essentially be considered a visual domain adaptation problem [29][30][31]. The mechanism behind the degradation lies in the non-independent and identically distributed (non-i.i.d.) data between training and deployment, i.e., the "domain gap" [29,32] between different sources. Since the training of most deep CNN models can essentially be considered a maximum likelihood estimation process under the i.i.d. assumption [2], once the data distribution changes after training, the performance can be severely affected. An important idea for tackling this problem is to learn a mapping/transformation between the two groups of data (e.g., the training data and the testing data) so that they will have, in principle, the same distribution.
In the computer vision field, efforts have been made to generalize a model trained on rendered images [27,28] (e.g., from computer games) to real-world computer vision tasks, and they have obtained promising results [33][34][35][36]. In recent open-world PC games, such as Grand Theft Auto, Watch Dogs, and Hitman, the game developers feature extensive and highly realistic worlds to improve a player's immersion. The realism of these games lies not only in the high fidelity of material appearance and light transport simulation, but also in the content of the game worlds: the layout of structures, objects, vehicles and environments [28]. In addition to realism, the game maps are growing explosively and are made more and more sophisticated. For example, in the well-known game "Grand Theft Auto V (GTA-V)", Los Santos, a fictional city featured in the game's open world, covers an area of over 100 km² with unprecedented details. Reviewers praised its design and its similarity to Los Angeles. Fig. 1 shows a part of its official map and an in-game rendered frame.
In this paper, we investigate an interesting question: do synthetic data generalize well for remote sensing applications? To answer this question, we train a remote sensing image segmentation model on the city map of the well-known video game GTA-V and then adapt it to real-world remote sensing application scenarios. Due to the "domain gap" between the game data and the real-world data, simply applying models trained on game data may lead to a high generalization error in real-world applications. A general practice to tackle this problem is "neural style transfer", i.e., to transform the game images with a deep neural network so that the transformed images share a similar style with the real-world ones while keeping their original image contents unchanged [37][38][39][40]. The Fully Convolutional Adaptation Network (FCAN) [33] is a representative method of this group. The FCAN aims to improve street-view image segmentation across different domains. By transforming the game data to the style of urban street scenes based on neural style transfer, it narrows the domain gap and improves the segmentation performance. More recently, Generative Adversarial Networks (GANs) [41,42] have greatly promoted the progress of image translation [41,43,44,45]. Zhu et al. proposed a method called CycleGAN [44], which has achieved impressive results in a variety of image-to-image translation tasks. Owing to the "cycle consistency loss" they introduced, people are now able to obtain realistic transformations between two domains even without the supervision of paired training data. CycleGAN has since been applied to improve visual domain adaptation tasks [34,46].
In this paper, we choose an important application in remote sensing image analysis, i.e., building segmentation, as an example by training a deep learning model on the city map of the game GTA-V and then adapting our model to real-world remote sensing images. We build our dataset based on the GTA-V official game maps. Different from any previous methods [33][34][35][36] and any previous image segmentation dataset [27,28] that focus on game images generated from the "first-person perspective" or from the "street view", our dataset is built from an "aerial view" of the game world, resulting in more abundant ground features and richer spatial relationships among different ground objects.
We further propose a generative adversarial training-based method, called "CycleGAN based Fully Convolutional Networks (CGFCN)", built on top of the CycleGAN [44], to improve the adaptability of a deep learning model to different sources of remote sensing data. Our model consists of a CycleGAN model [44] and a deep Fully Convolutional Network (FCN) [13,47] based segmentation model, where the former learns to transform the style of an image from the "game domain" to the "remote sensing domain" and the latter learns to predict pixel-wise building masks. The two models can be trained in an end-to-end fashion without requiring any additional ground truth reference of the real-world images. Fig. 2 shows an overview of our method.
Different from previous methods like FCAN [33], where the image transformation model and the segmentation model are trained separately, our model can be jointly trained in a unified framework, which leads to additional performance gains. Experimental results on Massachusetts Buildings [48], a well-known building segmentation dataset, suggest the effectiveness of our adaptation method. Our method outperforms other state-of-the-art semantic segmentation methods, e.g., Deeplab-v3 [47] and UNet [49]. In addition, by introducing semantic information into the image-to-image translation framework, the image style conversion of the CycleGAN can be further improved by our method.
The contributions of this paper are summarized as follows:
• We investigate an interesting question: do game data generalize well for remote sensing image segmentation? To answer this, we study the domain adaptation ability of deep learning-based segmentation methods by training our model on rendered in-game data and then applying it to real-world remote sensing tasks.

• We introduce a synthetic dataset for building segmentation based on the well-known PC game GTA-V. Different from the previous datasets [27,28] that focus on rendering street-view images from the "first-person perspective", we build our dataset from the "aerial perspective" of the city. To the best of our knowledge, this is the first synthetic dataset that focuses on aerial view image segmentation tasks. We will make our dataset publicly available.
The rest of this paper is organized as follows. We give a detailed introduction to our method in Section 2. Our experimental datasets and evaluation metrics are introduced in Section 3. The experimental results are given in Section 4, and the conclusions are drawn in Section 5.

Methodology
In this section, we will first give a brief review of some related methods, including the vanilla GAN [41] and the CycleGAN [44]. Then, we will introduce the proposed CGFCN and our implementation details.
Generative Adversarial Networks

The key to GAN's success is the idea of adversarial training, where two networks, a generator G and a discriminator D, contest with each other in a minimax two-player game, forcing the generated data to be, in principle, indistinguishable from real data. In this framework, the generator aims to learn a mapping G(z) from a latent noise space z ∈ Z to a particular data distribution of interest. The discriminator, on one hand, aims to discriminate between instances from the true data distribution x ∼ p_data and the generated ones G(z); on the other hand, it feeds its output back to G to further make the generated data indistinguishable from real data. The training of a GAN can be considered as solving the following minimax problem:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))], \qquad (1)$$

where x and z represent a true data point and an input random noise. The above problem can be solved by iteratively updating D and G: first fixing G and updating D to maximize V(D, G), and then fixing D and updating G to minimize V(D, G). As the adversarial training progresses, D gains more powerful discriminative ability and thus the images generated by G become more and more realistic. As suggested by Goodfellow et al. [41], instead of training G to minimize log(1 − D(G(·))), in practice many researchers choose to maximize log D(G(·)). This is because in the early stage of learning, log(1 − D(G(·))) tends to saturate; this revised objective provides much stronger gradients.
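To make the alternating update concrete, the following is a minimal PyTorch sketch of one training step using the non-saturating generator objective discussed above. The networks G and D, the optimizers, and the data are illustrative assumptions, not released code.

```python
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()  # D is assumed to output raw logits

def gan_step(G, D, opt_G, opt_D, real, z):
    # Update D: maximize log D(x) + log(1 - D(G(z)))
    opt_D.zero_grad()
    fake = G(z).detach()                      # block gradients into G
    pred_real, pred_fake = D(real), D(fake)
    loss_D = bce(pred_real, torch.ones_like(pred_real)) + \
             bce(pred_fake, torch.zeros_like(pred_fake))
    loss_D.backward()
    opt_D.step()

    # Update G with the non-saturating objective: maximize log D(G(z))
    opt_G.zero_grad()
    pred = D(G(z))
    loss_G = bce(pred, torch.ones_like(pred))
    loss_G.backward()
    opt_G.step()
    return loss_D.item(), loss_G.item()
```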

CycleGAN for Image to Image Translation
Suppose X represents a source domain (e.g., the game maps), Y represents a target domain (e.g., the real-world remote sensing images), and x_i ∈ X and y_j ∈ Y are their training samples. In a GAN-based image-to-image translation task [44,54], we aim to learn a mapping G: X → Y such that the distribution of the images G(x_i) is indistinguishable from the distribution of Y under an adversarial loss. In this case, the random noise vector z of the vanilla GAN is replaced by an input image x. In addition, the generator G and the discriminator D are usually constructed as deep convolutional networks [51]. Similar to the vanilla GAN [41], G and D are also trained to compete with each other. Their objective function can be rewritten as follows:

$$\min_G \max_D V(D, G) = \mathbb{E}_{y \sim p_y(y)}[\log D(y)] + \mathbb{E}_{x \sim p_x(x)}[\log(1 - D(G(x)))], \qquad (2)$$

where x and y represent two images from the domains X and Y, and p_x(x) and p_y(y) are their data distributions.
In 2017, Zhu et al. proposed CycleGAN [44] for solving the image-to-image translation problem. The main contribution of the CycleGAN is the introduction of the "cycle consistency loss" into the adversarial training framework. CycleGAN breaks the limits of previous GAN-based image translation methods, whose models need to be trained with pair-wise images between the source and target domains. Since no pair-wise training data is provided, CycleGAN couples the mapping with an inverse mapping F: Y → X and enforces F(G(X)) ≈ X (and vice versa).
A CycleGAN consists of four networks: two generative networks G_Y, G_X, and two discriminative networks D_Y, D_X. To transform the style of an image x_i ∈ X to the domain Y, a straightforward implementation would be training G_Y to learn a mapping from X to Y so as to fool D_Y, making it fail to tell which domain the images belong to. The objective function for training G_Y and D_Y can thus be written as follows:

$$\mathcal{L}_{X \to Y}(G_Y, D_Y) = \mathbb{E}_{y \sim p_y(y)}[\log D_Y(y)] + \mathbb{E}_{x \sim p_x(x)}[\log(1 - D_Y(G_Y(x)))], \qquad (3)$$

where G_Y(x) maps the data from domain X to domain Y, and D_Y(G_Y(x)) is trained to classify whether the transformed data is real or fake. Similarly, G_X can be trained to map the data from Y to X, and D_X is trained to classify its output. The objective function for the training of G_X and D_X can thus be written as L_{Y→X}(G_X, D_X).
However, because the mapping is highly under-constrained, with large enough model capacity the networks G_X and G_Y can map the same set of input images to any random permutation of images in the target domain if no pair-wise training supervision is provided [44], and thus may fail to learn image correspondence between the two domains. To this end, the CycleGAN introduces a cycle consistency loss that further enforces a transformed image to be mapped back to itself in the original domain:

$$\mathcal{L}_{cyc}(G_X, G_Y) = \mathbb{E}_{x \sim p_x(x)}[\| G_X(G_Y(x)) - x \|_1] + \mathbb{E}_{y \sim p_y(y)}[\| G_Y(G_X(y)) - y \|_1], \qquad (4)$$

where ‖·‖₁ represents the pixel-wise ℓ₁ loss (the sum of absolute differences of each pixel between the input and the back-projected output). The CycleGAN uses the pixel-wise ℓ₁ loss rather than the ℓ₂ loss since the former encourages less blurring.
The final objective function of the CycleGAN can be written as the sum of (3), its counterpart L_{Y→X}, and (4):

$$\mathcal{L}_{CycleGAN}(G, D) = \mathcal{L}_{X \to Y}(G_Y, D_Y) + \mathcal{L}_{Y \to X}(G_X, D_X) + \lambda \mathcal{L}_{cyc}(G_X, G_Y), \qquad (5)$$

where G = (G_X, G_Y) and D = (D_X, D_Y), and λ > 0 controls the balance between the different objectives.
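As a concrete reference, a minimal sketch of the generator-side part of Eq. (5) might look as follows in PyTorch. The networks G_X, G_Y, D_X, D_Y and the batches x, y are assumptions for illustration; note that the public CycleGAN implementation replaces the log-likelihood adversarial terms with a least-squares loss, which this sketch follows.

```python
import torch
import torch.nn as nn

mse, l1 = nn.MSELoss(), nn.L1Loss()  # least-squares GAN loss and L1 cycle loss

def cyclegan_generator_loss(G_X, G_Y, D_X, D_Y, x, y, lam=10.0):
    fake_y, fake_x = G_Y(x), G_X(y)
    # adversarial terms: make the discriminators score the fakes as real
    pred_y, pred_x = D_Y(fake_y), D_X(fake_x)
    adv = mse(pred_y, torch.ones_like(pred_y)) + \
          mse(pred_x, torch.ones_like(pred_x))
    # cycle consistency (Eq. 4): X -> Y -> X and Y -> X -> Y round trips
    cyc = l1(G_X(fake_y), x) + l1(G_Y(fake_x), y)
    return adv + lam * cyc
```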

CGFCN
We build our model on top of the CycleGAN. Our model consists of five networks: G_X, G_Y, D_X, D_Y and F, where (G_X, G_Y, D_X, D_Y) correspond to a CycleGAN, and F is a standard FCN-based image segmentation network. Fig. 2 shows an overview of the proposed method.
The goal of the proposed CGFCN is twofold. On one hand, we aim to learn two mappings G_Y(x) and G_X(y), where the former maps the data from X to Y and the latter maps the data from Y to X. On the other hand, we aim to train F to predict pixel-wise building masks on the transformed data G_Y(x). Since the CycleGAN can convert the source data to the target style while keeping the content unchanged, we use it to generate target-like images. In this way, the transformed data G_Y(x) is given as the input of F, and the ground truth of the original game data is given as the reference when training the segmentation network.
Suppose x̂ ∈ {0, 1} represents the pixel-wise binary label of the image x, where "1" indicates a pixel belonging to the category "building" and "0" indicates a pixel belonging to the category "background". As the segmentation is essentially a pixel-wise binary classification process, we design the loss function of the segmentation network F as a standard pixel-wise binary cross-entropy loss:

$$\mathcal{L}_{seg}(F) = -\mathbb{E}_{x \sim p_x(x)}[\hat{x} \log F(G_Y(x)) + (1 - \hat{x}) \log(1 - F(G_Y(x)))]. \qquad (6)$$

Combining the CycleGAN's objectives with the above segmentation loss, the final objective function of our method can be written as follows:

$$\mathcal{L}_{CGFCN}(G, D, F) = \mathcal{L}_{CycleGAN}(G, D) + \mu \mathcal{L}_{seg}(F), \qquad (7)$$

where µ > 0 controls the balance between the image translation task and the segmentation task. The training of our model can be considered as a minimax optimization process where G and F try to minimize the objective while D tries to maximize it:

$$\min_{G, F} \max_{D} \mathcal{L}_{CGFCN}(G, D, F). \qquad (8)$$
Since all networks of our model are differentiable, the image segmentation network F can be jointly trained with the CycleGAN networks in an end-to-end fashion.
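A minimal sketch of Eq. (7) as a differentiable training loss is shown below, reusing the hypothetical cyclegan_generator_loss from the previous sketch; the segmentation network F and the game ground-truth mask x_label are likewise illustrative assumptions.

```python
import torch.nn as nn

seg_bce = nn.BCEWithLogitsLoss()  # pixel-wise binary cross-entropy, Eq. (6)

def cgfcn_loss(G_X, G_Y, D_X, D_Y, F, x, y, x_label, lam=10.0, mu=1.0):
    # style transfer part, Eq. (5)
    l_style = cyclegan_generator_loss(G_X, G_Y, D_X, D_Y, x, y, lam)
    # segmentation part, Eq. (6): F segments the transformed game image,
    # supervised by the binary building mask of the original game data
    l_seg = seg_bce(F(G_Y(x)), x_label)
    return l_style + mu * l_seg
```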
A complete optimization pipeline of our method is summarized as follows (a training-loop sketch is given after this list):
• Step 1. Initialize the weights of the networks (G, D) with random initialization. Initialize F with ImageNet pre-trained weights.
• Step 2. Fix G and D, and update F to minimize L_CGFCN.
• Step 3. Fix F and D, and update G to minimize L_CGFCN.
• Step 4. Fix F and G, and update D to maximize L_CGFCN.
• Step 5. Repeat Steps 2-4 until the maximum epoch number is reached.
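A possible PyTorch rendering of this schedule is sketched below, reusing the hypothetical seg_bce and cgfcn_loss helpers from the earlier sketches. The data loader, optimizers, and the discriminator_loss helper are assumptions; the key point is toggling requires_grad so that each step updates only one group of networks.

```python
def set_requires_grad(nets, flag):
    for net in nets:
        for p in net.parameters():
            p.requires_grad = flag

for epoch in range(200):                         # Step 5: repeat
    for x, x_label, y in loader:                 # unpaired game/real batches
        # Step 2: fix G and D, update F to minimize L_CGFCN
        set_requires_grad([G_X, G_Y, D_X, D_Y], False)
        set_requires_grad([F], True)
        opt_F.zero_grad()
        seg_bce(F(G_Y(x)), x_label).backward()
        opt_F.step()

        # Step 3: fix F and D, update G to minimize L_CGFCN
        set_requires_grad([G_X, G_Y], True)
        set_requires_grad([F, D_X, D_Y], False)
        opt_G.zero_grad()
        cgfcn_loss(G_X, G_Y, D_X, D_Y, F, x, y, x_label).backward()
        opt_G.step()

        # Step 4: fix F and G, update D to maximize L_CGFCN
        # (discriminator_loss is a hypothetical helper scoring real vs. fake)
        set_requires_grad([D_X, D_Y], True)
        set_requires_grad([G_X, G_Y, F], False)
        opt_D.zero_grad()
        discriminator_loss(D_X, D_Y, x, y,
                           G_X(y).detach(), G_Y(x).detach()).backward()
        opt_D.step()
```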

Implementation Details
We build our generators G and discriminators D by following the configurations of the CycleGAN paper [44]. We build D as a local perception network that only penalizes image structures at the scale of patches (a.k.a. the Markovian discriminator or "PatchGAN"). The D tries to classify whether each N × N patch in an image is real or generated (fake). This type of architecture can be equivalently implemented as a fully convolutional network with N × N receptive fields. Such a design is more computationally efficient since the responses of all patches can be obtained with a single forward pass. We build G by following the configuration of the UNet [49]. We add skip connections to our generator between layer i and layer n − i for learning both high-level semantics and low-level details.
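For reference, a minimal PatchGAN discriminator in the spirit of the CycleGAN configuration is sketched below; the exact channel widths are assumptions. Each spatial position of the output corresponds to one N × N patch of the input, so all patches are classified in a single forward pass.

```python
import torch.nn as nn

def build_patchgan(in_ch=3, base=64):
    layers = [nn.Conv2d(in_ch, base, 4, stride=2, padding=1),
              nn.LeakyReLU(0.2, inplace=True)]
    ch = base
    for mult in (2, 4, 8):
        stride = 2 if mult < 8 else 1
        layers += [nn.Conv2d(ch, base * mult, 4, stride=stride, padding=1),
                   nn.InstanceNorm2d(base * mult),
                   nn.LeakyReLU(0.2, inplace=True)]
        ch = base * mult
    # one real/fake logit per patch, computed fully convolutionally
    layers.append(nn.Conv2d(ch, 1, 4, stride=1, padding=1))
    return nn.Sequential(*layers)
```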
Our segmentation network F is built based on the ResNet-50 [6] by removing its fully connected layers and adding an additional 1 × 1 convolution layer at its output end. In this way, the network F can process an input image of arbitrary size and aspect ratio. Besides, we change the convolutional stride from 2 to 1 at the "Conv_3" and "Conv_4" layers of the ResNet-50, which enlarges the output resolution from 1/32 to 1/8 of the input.
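A sketch of this backbone modification using torchvision's ResNet-50 is given below. Mapping the paper's "Conv_3"/"Conv_4" onto torchvision's layer3/layer4 is our reading, and the class is illustrative, not the authors' implementation.

```python
import torch.nn as nn
from torchvision.models import resnet50

class Res50FCN(nn.Module):
    def __init__(self):
        super().__init__()
        net = resnet50(weights="IMAGENET1K_V1")   # ImageNet pre-trained
        # set every stride-2 convolution in the last two stages to stride 1,
        # raising the output resolution from 1/32 to 1/8 of the input
        for stage in (net.layer3, net.layer4):
            for m in stage.modules():
                if isinstance(m, nn.Conv2d) and m.stride == (2, 2):
                    m.stride = (1, 1)
        self.backbone = nn.Sequential(*list(net.children())[:-2])  # no fc/pool
        self.head = nn.Conv2d(2048, 1, kernel_size=1)  # pixel-wise logits

    def forward(self, x):
        return self.head(self.backbone(x))  # building-mask logits at 1/8 scale
```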
During training, D, G, and F are alternately updated. The maximum training iteration is set to 200 epochs. We train D, G, and F using the Adam optimizer [61]. D and G are trained from scratch: for the first 100 epochs we set the learning rate to 0.0001, and for the remaining epochs we reduce it to 1/10 of that value. F is trained from an ImageNet [62] pre-trained initialization with a learning rate of 1 × 10⁻³, decayed to 90% of its value every 10 epochs. We set λ = 10.0 and µ = 1.0. To increase the diversity of the training images, data augmentation is used during training, including random image rotation (90°, 180°, 270°) and vertical/horizontal image flipping.
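The optimization and augmentation settings above could be expressed as follows; the parameter groups and variable names are assumptions.

```python
import random
import torch
import torchvision.transforms.functional as TF

opt_D = torch.optim.Adam(d_params, lr=1e-4)   # D and G trained from scratch
opt_G = torch.optim.Adam(g_params, lr=1e-4)
opt_F = torch.optim.Adam(f_params, lr=1e-3)   # F from ImageNet weights

# D/G: drop the learning rate to 1/10 after the first 100 of 200 epochs
sched_G = torch.optim.lr_scheduler.MultiStepLR(opt_G, milestones=[100], gamma=0.1)
sched_D = torch.optim.lr_scheduler.MultiStepLR(opt_D, milestones=[100], gamma=0.1)
# F: decay the learning rate to 90% of its value every 10 epochs
sched_F = torch.optim.lr_scheduler.StepLR(opt_F, step_size=10, gamma=0.9)

def augment(img, mask):
    # random 90/180/270-degree rotation plus horizontal/vertical flips,
    # applied identically to the image and its building mask
    k = random.choice([0, 90, 180, 270])
    img, mask = TF.rotate(img, k), TF.rotate(mask, k)
    if random.random() < 0.5:
        img, mask = TF.hflip(img), TF.hflip(mask)
    if random.random() < 0.5:
        img, mask = TF.vflip(img), TF.vflip(mask)
    return img, mask
```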

Dataset and Evaluation Metrics
We build our aerial view image segmentation dataset based on the game map of the PC game GTA-V. We use a sub-region of the rendered satellite map as our training data. This part of the map is located in the urban part of the fictional city "Los Santos". We build its ground truth map based on the official legend (8,000 × 8,000 pixels) by manually annotating the building regions. As the GTA-V official map contains a Google-Maps-style color legend for the various ground features, the manual annotation can be very efficient: it only takes half an hour for a single person to complete the annotation. Our dataset covers most ground features of a typical coastal city, e.g., buildings, roads, green land, mountains, beaches, harbors, wasteland, etc. In Fig. 3, the first two rows show some representative samples and their ground truth from our synthetic dataset.
We test our model on a real-world remote sensing dataset, the Massachusetts buildings dataset [48]. The CycleGAN focuses on reducing the style difference between two sets of images, which requires the two datasets to have similar contents. Therefore, we use a subset of the Massachusetts buildings dataset as our test set, where the images are captured over urban areas. All images in both our training set and test set are cropped into 500 × 500 pixel slices with a resolution of ∼1 m/pixel before being fed into our networks. Table 1 gives the statistics of our experimental datasets.
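As a simple illustration of this preprocessing step, non-overlapping 500 × 500 slices can be cut from a large image as follows; this NumPy helper is an assumption about the pipeline, not released code.

```python
import numpy as np

def tile(image: np.ndarray, size: int = 500):
    """Yield non-overlapping size x size crops of an H x W (x C) array;
    border remainders smaller than `size` are discarded."""
    h, w = image.shape[:2]
    for top in range(0, h - size + 1, size):
        for left in range(0, w - size + 1, size):
            yield image[top:top + size, left:left + size]
```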
We use the Intersection over Union (IOU) as our evaluation metric for the segmentation results. The IOU metric is commonly used in previous building segmentation literature [63,64]. Given a segmentation output and a ground truth reference image of the same size, the IOU is defined as follows:

$$IOU = \frac{N_{TP}}{N_{TP} + N_{FP} + N_{FN}}, \qquad (9)$$

where N_TP, N_FP and N_FN represent the numbers of true positive, false positive and false negative pixels of the segmentation result.
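Eq. (9) translates directly into code; a NumPy sketch for binary masks is given below.

```python
import numpy as np

def iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """IOU of Eq. (9) for binary building masks of identical shape."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    n_tp = np.logical_and(pred, gt).sum()    # true positives
    n_fp = np.logical_and(pred, ~gt).sum()   # false positives
    n_fn = np.logical_and(~pred, gt).sum()   # false negatives
    return n_tp / float(n_tp + n_fp + n_fn)
```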

Experimental Results
In this section, we first compare our method with other state-of-the-art segmentation methods. Then an ablation experiment is conducted to evaluate the effectiveness of each of our technical components. Finally, some additional controlled experiments investigate whether the integration of semantic labels helps style conversion.


Comparison with Other Methods
We compare our model with state-of-the-art semantic segmentation models, including Deeplab-v3 [47] and UNet [49]. These models are first trained on our training set and then directly evaluated on our test set without the help of style transfer. All models are fully optimized for a fair comparison. For each of these methods, we repeat the training five times and then record the accuracy of each model on our test set (marked as "Test-1" ∼ "Test-5").
Table 2 shows their accuracy over the five repeated tests. It should be noted that although we do not apply any other tricks (e.g., feature fusion and dilated convolution) to increase the feature resolution, as are used in UNet [49] and Deeplab-v3 [47], our method still achieves the best results in terms of both mean accuracy and stability (standard deviation).
Fig. 4 shows some image translation examples of our method, where the first two rows show some rendered game images and the "game → real world" translation results. The last two rows show some real-world images from the Massachusetts buildings dataset and the "real world → game" translation results. It can be seen that the style of these images has been transformed to the other domain while their contents are retained.

Ablation Analysis
To further evaluate the effectiveness of the proposed method, an ablation experiment is conducted to analyze the importance of each component, including "domain adaptation" (Adaptation) and "end-to-end training" (End-to-End). For a fair comparison, we configure our method and its variants with the same data augmentation and the same training hyper-parameters. We first compare with a weak baseline, "Res50-FCN", where our segmentation network F is trained only according to Eq. (6) without the help of adversarial domain adaptation (the first row in Table 3). Then, we gradually add the other technical components, as listed below.

• Res50-FCN: We train our segmentation network F according to Eq. (6). The training is performed on game data and the evaluation is performed on real data without the help of domain adaptation (our weak baseline).
• Adaptation: We first train a CycleGAN model separately to transform the game data to the real-world data style. Then we train our segmentation network F on the transformed data while freezing the parameters of the CycleGAN part (our strong baseline).
• End-to-End: We jointly train the CycleGAN and our segmentation network F according to Eq. (8) in an end-to-end fashion (our full implementation).
Table 3 shows the evaluation results of all the above variants. We can see that the integration of domain adaptation and end-to-end learning yields noticeable improvements in segmentation accuracy. Fig. 5 shows some building segmentation results of the UNet [49], Deeplab-v3 [47], and all the above-mentioned ablation variants. The green, yellow, and red pixels represent "true positives", "false negatives", and "false positives", respectively. Although style transfer (Res50-FCN+Adaptation, our strong baseline) improves the segmentation result, it still has some limitations. As shown in the third row of Fig. 5, the flyover is falsely labeled as a building by the UNet, the Res50-FCN, and the Res50-FCN+Adaptation variant, while our end-to-end model (full implementation) can effectively remove most of these false alarms.
This improvement (∼4%) is mainly owing to the introduction of semantic information, which helps our method generate more precisely stylized results. This indicates that integrating semantic information into the style transfer process helps reduce the style difference between the two datasets, and thus a semantic segmentation model jointly trained with a style transfer model yields improved segmentation results. Another reason for the improvement is the perturbation of the data introduced by the end-to-end training process, where the intermediate results produced by the CycleGAN introduce small input variations to the segmentation network. This variation can be considered a form of data augmentation, which helps improve the generalization ability.

Do Semantic Labels Help Style Conversion?
Another advantage of our end-to-end training framework is that it introduces semantic information into the style transfer framework and thus benefits style conversion. Fig. 6 gives a comparison example with and without the help of semantic guidance when performing CycleGAN style conversion. There are subtle differences between the results produced by the two configurations: the stylized images generated by our end-to-end trained CGFCN are much closer to the distribution of the target domain than those of the original CycleGAN [44]. This improvement helps in generating more accurate segmentation results.
To further evaluate the effectiveness of our method, we quantitatively compare with the CycleGAN on their generated images, as shown in Table 4. We use three image similarity evaluation metrics: the Peak Signal-to-Noise Ratio (PSNR), the Structural Similarity (SSIM) index [65], and the Fréchet Inception Distance (FID) [66]. The PSNR and SSIM are two classic metrics for evaluating image restoration results. The FID is a more recent and popular metric that better evaluates visual perceptual quality: it measures the deviation between the distribution of deep features of generated images and that of real images, and is widely used in adversarial image synthesis. It should be noted that while the PSNR and SSIM are computed by comparing the resulting image to a reference image, which requires paired inputs, the FID can be evaluated without such a restriction. As there is no ground truth reference in the "Game→Real" experiment, we only report the FID score for that setting in Table 4. To compute the FID, we randomly divide the generated results and the real data into five groups and then compute their average FID. We evaluate the style conversion results under two settings: 1) "game → real-world" conversion, and 2) "game → real-world → game" conversion. Our method achieves the best conversion results in terms of all evaluation metrics.
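For the paired metrics, scikit-image provides reference implementations; the snippet below is a sketch of how the "game → real-world → game" scores could be computed. The set-based FID would be computed separately over the five groups, e.g. with a package such as pytorch-fid; that helper is not shown here.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def paired_scores(generated: np.ndarray, reference: np.ndarray):
    """PSNR/SSIM between a back-converted image and its original.
    Both arrays are H x W x 3, uint8. (Use multichannel=True instead of
    channel_axis on scikit-image < 0.19.)"""
    psnr = peak_signal_noise_ratio(reference, generated, data_range=255)
    ssim = structural_similarity(reference, generated,
                                 channel_axis=-1, data_range=255)
    return psnr, ssim
```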

Conclusion
We investigated an interesting question: whether game data generalize well for remote sensing image segmentation. To do this, we trained a deep learning model on the city map of the game "GTA-V" and then adapted the model to real-world remote sensing building segmentation tasks. To tackle the "domain shift" problem, we proposed a CycleGAN-based FCN model in which the mappings between the two domains are jointly learned with the building segmentation network. With the above methods, we obtained promising results. Experimental results suggest the effectiveness of our method for both segmentation and style conversion.

Table 4. Evaluation results on image style transfer with different similarity metrics: FID [66], PSNR, and SSIM [65]. The column "Game→Real": we compute the similarity between a group of real images and the generated ones. The column "Game→Real→Game": we back-convert the generated images to the game domain and then compute their "self-similarity". For FID, lower scores are better; for PSNR (dB) and SSIM, higher scores are better. The CGFCN (ours) achieves the best results, which suggests that introducing semantic supervision helps improve image style conversion.
Foundation under the Grant 4192034 and the National Defense Science and Technology Innovation Special Zone Project.

Figure 1.
Figure 1. An official map of the PC game GTA-V: the city of Los Santos. (a) The satellite imagery rendered from an aerial view. (b) An in-game frame rendered from the "first-person perspective". (c) The part of the game map that is used in our experiment. (d) The legend of the map (in a similar fashion to Google Maps). Different from the previous datasets [27,28] that focus on rendering street-view images from the "first-person perspective" (like (b)), we build our dataset from the "aerial perspective" of the city (c-d).

Figure 2.
Figure 2. An overview of our method. Our method consists of five networks: G_X, G_Y, D_X, D_Y, and F. The former four networks learn two mappings between the game domain X and the remote sensing domain Y (G_X: Y → X, G_Y: X → Y). The last network F learns to predict building masks of the transformed data. In our method, we first transfer the style of a synthetic game map x to a real one, G_Y(x), and then train the network F based on the transformed image G_Y(x) (input) and the game map legend (ground truth).

Figure 3.
Figure 3. A preview of our two experimental datasets. The first two rows show some representative images and their ground truth labels from our synthetic dataset (our training and validation set). The last two rows show some image pairs from the real remote sensing dataset [48] (our testing set).

Figure 4.
Figure 4. (Better viewed in color) Some image translation results obtained by our method. The first two rows show translation results from the game domain to the real-world domain. The last two rows show the inverse translation results from the real-world domain to the game domain.

Figure 6.
Figure 6. (Better viewed in color) A visual comparison of style transfer results under different configurations. The first row corresponds to the results of the original CycleGAN [44] and the second row corresponds to the results of our method. Columns: (a) input game data, (b) game → real, (c) game → real → game. It can be seen that the images generated by our method are closer to the target domain, which helps cross-domain segmentation. Our method also removes some unrelated image contents (e.g., the pollution area marked by the arrows) during the conversion.

Table 1.
Table 1. The details of our experimental datasets. We train our model on the synthetic remote sensing dataset (GTA-V game map) and then run the evaluation on the real remote sensing dataset (Massachusetts Buildings).

Table 2.
Table 2. A comparison of different methods trained on synthetic data and then tested on real data: UNet [49], Deeplab-v3 [47], and CGFCN (ours). For each method, we repeat the training five times and record the accuracy of each model on our test set (marked as "Test-1" ∼ "Test-5"). The CGFCN obtains the best results in terms of both mean accuracy and stability.

Table 3.
Table 3. Results of our ablation analysis on "domain adaptation" and "end-to-end training". Baseline method: Res50-FCN. Adaptation: the style transfer networks (D, G) and the segmentation network F are trained separately. End-to-End: all networks are jointly trained in an end-to-end fashion. The integration of domain adaptation and end-to-end learning yields noticeable improvements in segmentation accuracy.