Semantic Segmentation of Remote-Sensing Imagery Using Heterogeneous Big Data: International Society for Photogrammetry and Remote Sensing Potsdam and Cityscape Datasets

: Although semantic segmentation of remote-sensing (RS) images using deep-learning networks has demonstrated its e ﬀ ectiveness recently, compared with natural-image datasets, obtaining RS images under the same conditions to construct data labels is di ﬃ cult. Indeed, small datasets limit the e ﬀ ective learning of deep-learning networks. To address this problem, we propose a combined U-net model that is trained using a combined weighted loss function and can handle heterogeneous datasets. The network consists of encoder and decoder blocks. The convolutional layers that form the encoder blocks are shared with the heterogeneous datasets, and the decoder blocks are assigned separate training weights. Herein, the International Society for Photogrammetry and Remote Sensing (ISPRS) Potsdam and Cityscape datasets are used as the RS and natural-image datasets, respectively. When the layers are shared, only visible bands of the ISPRS Potsdam data are used. Experimental results show that when same-sized heterogeneous datasets are used, the semantic segmentation accuracy of the Potsdam data obtained using our proposed method is lower than that obtained using only the Potsdam data (four bands) with other methods, such as SegNet, DeepLab-V3 + , and the simpliﬁed version of U-net. However, the segmentation accuracy of the Potsdam images is improved when the larger Cityscape dataset is used. The combined U-net model can e ﬀ ectively train heterogeneous datasets and overcome the insu ﬃ cient training data problem in the context of RS-image datasets. Furthermore, it is expected that the proposed method can not only be applied to segmentation tasks of aerial images but also to tasks with various purposes of using big heterogeneous datasets.


Introduction
Semantic segmentation involves the allocation of a semantic label to each pixel of an image containing an object, which can deliver high-level structure information [1]. Semantic segmentation is a crucial task in intelligent applications, such as mobile robots and autonomous driving vehicles, because it can provide an accurate understanding of a scene [2]. Recently, significant advances have been achieved in semantic segmentation techniques of natural RGB scenes owing to the development of deep convolutional neural networks (CNNs). Deep-learning models can learn high-level abstract features from raw images with excellent performance; however, these approaches rely on large training samples [3]. To satisfy this requirement, various public datasets have been proposed for scene labeling. For example, PASCAL VOC [4] is a large-scale dataset used for object class recognition and contains 2913 images with pixel-level labeling with 20 classes, such as vehicles and animals. Although similar for satellite image segmentation of heterogeneous datasets. They used active learning, wherein a trained network was refined using a few sample images in the heterogeneous training and test datasets, and their method improved the network's performance with minimal human intervention. Several studies have been conducted using heterogeneous datasets with similar types of images and labels; however, very few studies have employed natural RGB-and RS-image datasets. For the segmentation of RS-images, there were cases in which pre-trained segmentation networks from natural-image datasets, such as ImageNet, were used as initial values of the segmentation network for RS-images, but the research on directly learning RS and natural RGB-images together is insufficient [3,29].
To address this paucity of research, we analyze the possibility of sharing networks of heterogeneous datasets and identify the impact of learning using the combined weighted loss function between two networks trained on different datasets to overcome the limitations imposed by a lack of training datasets. In this study, aerial and road-view image datasets were used to evaluate the proposed method. Finally, it is expected that the proposed method can be applied to not only segmentation tasks of aerial images, but also the segmentation or object detection of street-view image on web sites and scanned floorplans. The remainder of this paper is organized as follows. The architecture of the proposed method is presented in Section 2, and the datasets and environmental conditions for the experiments are described in Sections 3 and 4. The results and discussion are provided in Sections 5 and 6, respectively, and the conclusions are drawn in Section 7.

Methods
The aim of this study is to develop a deep-learning network that can improve the accuracy of semantic segmentation of RS-images using large-scale natural-image datasets with different characteristics. To this end, assuming that the number of RS-images is insufficient, experiments are conducted to determine whether the problem of insufficient training data can be solved when training is performed using both RS-image and natural-image datasets.

Proposed Combined U-net Model
The proposed method performs semantic segmentation using a technique based on a simplified version of U-net. U-net is a representative deep-learning model used for segmenting RS-images. To reduce the training burden on the proposed model, simplified U-net is used here. Figure 1 shows the architecture of the combined U-net model. semantic relationships between the labels of each dataset. Ghassemi et al. [28] designed an encoder and decoder network for satellite image segmentation of heterogeneous datasets. They used active learning, wherein a trained network was refined using a few sample images in the heterogeneous training and test datasets, and their method improved the network's performance with minimal human intervention. Several studies have been conducted using heterogeneous datasets with similar types of images and labels; however, very few studies have employed natural RGB-and RS-image datasets. For the segmentation of RS-images, there were cases in which pre-trained segmentation networks from natural-image datasets, such as ImageNet, were used as initial values of the segmentation network for RS-images, but the research on directly learning RS and natural RGBimages together is insufficient [3,29] To address this paucity of research, we analyze the possibility of sharing networks of heterogeneous datasets and identify the impact of learning using the combined weighted loss function between two networks trained on different datasets to overcome the limitations imposed by a lack of training datasets. In this study, aerial and road-view image datasets were used to evaluate the proposed method. Finally, it is expected that the proposed method can be applied to not only segmentation tasks of aerial images, but also the segmentation or object detection of street-view image on web sites and scanned floorplans. The remainder of this paper is organized as follows. The architecture of the proposed method is presented in Section 2, and the datasets and environmental conditions for the experiments are described in Sections 3 and 4. The results and discussion are provided in Sections 5 and 6, respectively, and the conclusions are drawn in Section 7.

Methods
The aim of this study is to develop a deep-learning network that can improve the accuracy of semantic segmentation of RS-images using large-scale natural-image datasets with different characteristics. To this end, assuming that the number of RS-images is insufficient, experiments are conducted to determine whether the problem of insufficient training data can be solved when training is performed using both RS-image and natural-image datasets.

Proposed Combined U-net Model
The proposed method performs semantic segmentation using a technique based on a simplified version of U-net. U-net is a representative deep-learning model used for segmenting RS-images. To reduce the training burden on the proposed model, simplified U-net is used here. Figure 1 shows the architecture of the combined U-net model. Figure 1. Architecture of the combined U-net model. "Conv2D" denotes the two-dimensional (2D) convolutional layers, and "Conv2DTranspose" denotes a transposed 2D convolutional layer. "Concatenate" denotes a concatenated layer. Figure 1. Architecture of the combined U-net model. "Conv2D" denotes the two-dimensional (2D) convolutional layers, and "Conv2DTranspose" denotes a transposed 2D convolutional layer. "Concatenate" denotes a concatenated layer. The combined U-net model is based on the U-net architecture. The two main components of the U-net architecture are encoder and decoder. For encoding, the combined U-net encoder consists of three main blocks that are shared between the two datasets during the training phase. The model receives input from two different data sources (i.e., the RS-image and natural-image datasets), whose input widths and heights are n × n. The three encoding blocks receive data from the two sources and share the training-parameter weights between them. Every block in the shareable blocks mainly consists of two 2D convolutional layers followed by a max-polling layer for downscaling. Thus, at the end of the encoding phase, the feature map has a size of n 8 × n 8 . The shared convolutional layers can learn common information that applies to all datasets across all domains [3]. Lee et al. [3] confirmed that shared layers across a domain can be more effective in CNN optimization instead of using only a single dataset because the sharing-layer approach notably improves the classification accuracy compared with the individually trained case. However, in the study by Lee et al. [3], only the middle of the network was shared and only three RS-images were used to train the combined network. In the present study, the initial three encoder blocks are shared during the learning process and the later blocks handle the dataset-specific segmentation tasks. After finishing the encoding phase, the feature maps are decoded separately, implying that every dataset follows a separate decoding path and has separate training weights.
There are two different decoding paths. The first path for the RS-image data consists of three blocks. The initial two blocks primarily consist of one transposed convolutional layer to upscale the feature map by two, followed by two convolutional layers. Furthermore, the last block contains eight convolutional layers. The second path for the natural-image data consists of six blocks. Each block mainly consists of one transposed convolutional layer to upscale the feature maps, followed by two convolutional layers. The decoding blocks end with a feature map of size n × n × 16, which is scanned using a 1 × 1 2D convolutional filter (3,3) to generate an output of size n × n × c, which matches the shape of the data labels; c is the number of label classes.
In this paper, we use natural-image and RS-image datasets. During the training, the combined U-net handles two inputs in parallel by sharing the three encoder blocks. Moreover, the network is trained with the combined weighted loss, L c , which is defined as the weighted sum of losses of the two paths. The combined model is then updated using the combined weighted losses ( Figure 2). The losses of the first and second paths are denoted by L n1 and L n2 , respectively, and the spatial cross entropy loss can be defined as shown in Equation (1). ISPRS Int. J. Geo-Inf. 2020, 9, x FOR PEER REVIEW 4 of 15 The combined U-net model is based on the U-net architecture. The two main components of the U-net architecture are encoder and decoder. For encoding, the combined U-net encoder consists of three main blocks that are shared between the two datasets during the training phase. The model receives input from two different data sources (i.e., the RS-image and natural-image datasets), whose input widths and heights are × . The three encoding blocks receive data from the two sources and share the training-parameter weights between them. Every block in the shareable blocks mainly consists of two 2D convolutional layers followed by a max-polling layer for downscaling. Thus, at the end of the encoding phase, the feature map has a size of × . The shared convolutional layers can learn common information that applies to all datasets across all domains [3]. Lee et al. [3] confirmed that shared layers across a domain can be more effective in CNN optimization instead of using only a single dataset because the sharing-layer approach notably improves the classification accuracy compared with the individually trained case. However, in the study by Lee et al. [3], only the middle of the network was shared and only three RS-images were used to train the combined network. In the present study, the initial three encoder blocks are shared during the learning process and the later blocks handle the dataset-specific segmentation tasks. After finishing the encoding phase, the feature maps are decoded separately, implying that every dataset follows a separate decoding path and has separate training weights.
There are two different decoding paths. The first path for the RS-image data consists of three blocks. The initial two blocks primarily consist of one transposed convolutional layer to upscale the feature map by two, followed by two convolutional layers. Furthermore, the last block contains eight convolutional layers. The second path for the natural-image data consists of six blocks. Each block mainly consists of one transposed convolutional layer to upscale the feature maps, followed by two convolutional layers. The decoding blocks end with a feature map of size × × 16, which is scanned using a 1 × 1 2D convolutional filter (3,3) to generate an output of size × × , which matches the shape of the data labels; is the number of label classes.
In this paper, we use natural-image and RS-image datasets. During the training, the combined U-net handles two inputs in parallel by sharing the three encoder blocks. Moreover, the network is trained with the combined weighted loss, , which is defined as the weighted sum of losses of the two paths. The combined model is then updated using the combined weighted losses ( Figure 2). The losses of the first and second paths are denoted by and , respectively, and the spatial cross entropy loss can be defined as shown in Equation (1).
where H × represents the height and width of input x to the network, respectively, c is the number of classes, and , and , are the ground-truth and predicted values for the ith pixel  where H ×W represents the height and width of input x to the network, respectively, c is the number of classes, and y i,k andŷ i,k are the ground-truth and predicted values for the ith pixel x i and kth class among C different possible classes, respectively [30]. The combined weighted loss, L c , is defined as follows: where w 1 and w 2 are the weights of the two networks; these values are empirically determined. Because the ultimate goal of this network is to improve the semantic segmentation accuracy of RS-images, the weight of the network handling RS-image data is set higher than natural-image data.

Performance Evaluation
There are a number of different ways to evaluate the accuracy of semantic segmentation. In this paper, the overall accuracy (OA), precision, recall, and F1 score are used. OA represents the proportion of correctly classified observations relative to the ground-truth values, and it can be described in terms of true positive (TP), true negative (TN), false negative (FN), and false positive (FP) as OA is a simple and easy approach for evaluating the classification accuracy; however, when the class distributions are dissimilar, OA cannot be appropriately used to show the effectiveness of the results. Instead, the F1 score is a better approach for evaluating the results when there are imbalanced classes, similar to our case. The F1 score (Equation (4)) is the harmonic mean of precision and recall, which in turn measure the correctly identified positive cases from all predicted and actual positive cases, respectively, as given in Equations (5) and (6):

Datasets
Here, two different heterogeneous datasets are used to show the effectiveness of the proposed method. The ISPRS Potsdam and Cityscape datasets are used as the RS and natural-image datasets, respectively. Considering that the aim of this paper is to perform effective semantic segmentation of RS-images using a large-scale natural-image dataset, two datasets with high similarity between the data-label types were selected, even though the properties of the images are different. Both Cityscape and ISPRS Potsdam datasets mainly consist of images of civil areas and have several identical semantic-level features such as roads, cars, and buildings.

ISPRS Potsdam/Germany Dataset
The ISPRS 2D semantic label Potsdam/Germany dataset is an open benchmark dataset provided online [31] that contains high-resolution airborne images with a spatial resolution of 5 cm and consists of near-infrared (NIR), red, blue, and green orthorectified imagery with corresponding digital surface models. Furthermore, it contains ground-truth images comprising impervious surfaces, buildings, trees, low vegetation, cars, and unknown objects ( Figure 3).
The 2D Potsdam dataset contains 38 patches; only 24 images with corresponding labeling images were used for training and validation. Table 1 shows the patch numbers of the labeled data. Twenty-four large multispectral images of size 6000 × 6000 × 4 pixels in the.tiff format were used as inputs. As the size of the ISPRS images is very large, slices of 256 × 256 × 4 pixels with labels were extracted, separated into batches, and then stored. We trained the network using only a subset of the subimages to consider cases where RS-image data are limited.
ISPRS Int. J. Geo-Inf. 2020, 9, x FOR PEER REVIEW 6 of 15 extracted, separated into batches, and then stored. We trained the network using only a subset of the subimages to consider cases where RS-image data are limited.

Cityscape Dataset
Cityscape provides a large and diverse set of stereo video sequences recoded from streets from 50 different cities with pixel-level ground-truth images [7]. It is a large road scene dataset that can provide semantic understanding of urban street scenes. Cityscape defines 19 semantic labels containing the "void" class for "do-not-care" regions.  Table 2 shows the semantic labels of the Cityscape dataset.

Cityscape Dataset
Cityscape provides a large and diverse set of stereo video sequences recoded from streets from 50 different cities with pixel-level ground-truth images [7]. It is a large road scene dataset that can provide semantic understanding of urban street scenes. Cityscape defines 19 semantic labels containing the "void" class for "do-not-care" regions.    To improve the performance efficiency of the proposed method by minimizing the differences between the two datasets, several preprocessing steps were employed. For instance, in the case of the Cityscape dataset, which has deeper semantic-level details than the ISPRS Potsdam dataset, the final classes were adjusted to match those in the Potsdam dataset ( Figure 5). Thus, the final classes of the Cityscape dataset included flat areas, construction, nature, vehicles, and voids. Furthermore, all items not included in the above final classes were reclassified as voids. Table 3    To improve the performance efficiency of the proposed method by minimizing the differences between the two datasets, several preprocessing steps were employed. For instance, in the case of the Cityscape dataset, which has deeper semantic-level details than the ISPRS Potsdam dataset, the final classes were adjusted to match those in the Potsdam dataset ( Figure 5). Thus, the final classes of the Cityscape dataset included flat areas, construction, nature, vehicles, and voids. Furthermore, all items not included in the above final classes were reclassified as voids. Table 3

Experimental Conditions
To demonstrate the effectiveness of the proposed method, several experiments were conducted to compare the segmentation results of the Potsdam images. SegNet, DeepLab-V3+, and simplified U-net, which followed the RS-image path in the combined U-net model, were used to segment the Potsdam dataset with its original four bands. In addition, the Potsdam dataset was trained with only RGB bands using the simplified U-net model to compare with the combined U-net model, which only deals with the RGB bands in the Potsdam images to share convolutional layers. We randomly selected a subset of images from the Potsdam dataset to train the networks. In particular, 1600 images were used as training data, and 400 and 150 images were used as validation and test data, respectively. Finally, the combined U-net model trained the Potsdam dataset with different two conditions. In Case 1, the same number of training datasets of Cityscape dataset was used. Moreover, we varied the number of Cityscape images to confirm the effect of the Cityscape dataset size when training using the combined U-net model. In Case 2, 3000, 550, and 300 images from the Cityscape dataset were used as training, validation, and test data, respectively. Further, to provide larger weights to the Potsdam dataset, (the weight of loss in RS-image path) and (the weight of loss in natural-image path) were set as 0.8 and 0.2, respectively.
The combined U-net model was trained on the free Google Colaboratory (Colab) platform [32]. The final epoch was set to 1000 of Adam with a learning rate of 10 . Considering the available memory on Colab, the batch size was set to 4. The two inputs had the same size (256 × 256 × 3); hence, we could share the weights in the initial layers. Moreover, 256 × 256 × 3 is a suitable size when using limited training resources such as RAMs and GPUs. Therefore, the original 6000 × 6000 × 4 Potsdam images were divided into smaller samples of 256 × 256 × 3, and the Cityscape images

Experimental Conditions
To demonstrate the effectiveness of the proposed method, several experiments were conducted to compare the segmentation results of the Potsdam images. SegNet, DeepLab-V3+, and simplified U-net, which followed the RS-image path in the combined U-net model, were used to segment the Potsdam dataset with its original four bands. In addition, the Potsdam dataset was trained with only RGB bands using the simplified U-net model to compare with the combined U-net model, which only deals with the RGB bands in the Potsdam images to share convolutional layers. We randomly selected a subset of images from the Potsdam dataset to train the networks. In particular, 1600 images were used as training data, and 400 and 150 images were used as validation and test data, respectively. Finally, the combined U-net model trained the Potsdam dataset with different two conditions. In Case 1, the same number of training datasets of Cityscape dataset was used. Moreover, we varied the number of Cityscape images to confirm the effect of the Cityscape dataset size when training using the combined U-net model. In Case 2, 3000, 550, and 300 images from the Cityscape dataset were used as training, validation, and test data, respectively. Further, to provide larger weights to the Potsdam dataset, w 1 (the weight of loss in RS-image path) and w 2 (the weight of loss in natural-image path) were set as 0.8 and 0.2, respectively.
The combined U-net model was trained on the free Google Colaboratory (Colab) platform [32]. The final epoch was set to 1000 of Adam with a learning rate of 10 −3 . Considering the available memory on Colab, the batch size was set to 4. The two inputs had the same size (256 × 256 × 3); hence, we could share the weights in the initial layers. Moreover, 256 × 256 × 3 is a suitable size when using limited training resources such as RAMs and GPUs. Therefore, the original 6000 × 6000 × 4 Potsdam images were divided into smaller samples of 256 × 256 × 3, and the Cityscape images were rescaled to 256 × 256 × 3. In particular, in the second path of the combined U-net, as the six decoding block ends output a feature map of size 1024 × 2048 × 16, this feature map was scanned using a 1 × 1 2D convolution filter to generate output maps of size 1024 × 2048 × 5 to match the shape of the labels.
We did not rescale the label images to obtain the same size as the input images because rescaling would lead to a loss of class information, which in turn would yield a distorted output. Thus, we decided to match the output of the Cityscape path with the size of the original label size (i.e., 1024 × 2048) to keep the class information untouched. Figure 6 shows the learning graphs of the OA of the training and validation sets for the six aforementioned cases. Since a limited number of images were used for training, there was a difference in the OA between the training and validation sets. When a model has high training and low validation accuracies, this case is probably known as overfitting. Since only a part of the Potsdam dataset was used for training, insufficient training data can sometimes lead to overfitting problems [33]. DeepLab-V3+ showed a higher OA for the validation set in comparison with the SegNet and simplified U-net models (Figure 6b). Also, the difference in the OAs between the validation and training sets was less than when using the SegNet model. In the case of the simplified U-net model, when training was performed by only using the RGB bands of the Potsdam images, the OA of the validation set was lower than when using the original four bands (Figure 6c,d). Also, the learning graph of the simplified U-net model of the training Potsdam images with its four bands showed that the OA of the training set was higher than in the other two cases with the combined U-net models; however, the OA of the validation set was relatively lower than that of the training set. When comparing between the simplified U-net model training with original four bands and Case 1 in the combined model, it can be seen that the OAs of the validation sets of both are similar, but the OAs of the training set in Case 1 were lower than those in the simplified U-net model (Figure  6d,e). Although the combined U-net model used for Case 1 trained the Potsdam and Cityscape datasets together, it used only the RGB-images from the Potsdam data. In contrast, the simplified Unet model also used the NIR band of the Potsdam data rather than simply using only the RGB bands, which aided the meaningful classification of objects such as trees and low vegetation that are When comparing between the simplified U-net model training with original four bands and Case 1 in the combined model, it can be seen that the OAs of the validation sets of both are similar, but the OAs of the training set in Case 1 were lower than those in the simplified U-net model (Figure 6d,e). Although the combined U-net model used for Case 1 trained the Potsdam and Cityscape datasets together, it used only the RGB-images from the Potsdam data. In contrast, the simplified U-net model also used the NIR band of the Potsdam data rather than simply using only the RGB bands, which aided the meaningful classification of objects such as trees and low vegetation that are particularly prominent at NIR wavelengths.

Results
In Case 2 of the combined U-net model, in which the size of the Cityscape dataset was increased relative to Case 1, the OAs of the training and validation sets further improved relative to Case 1. Although the OA of the training data in Case 2 was lower than that of DeepLab-V3+ and simplified U-net, the gap in training and validation accuracy was decreased. As the amount of data in the Cityscape dataset increased, the overfitting problem was effectively reduced. For the OAs of the Cityscape dataset in Case 1 and Case 2, they showed a similar tendency; furthermore, as the amount of data increased, the validation accuracy of the Cityscape dataset improved. Table 3 shows the average F1 scores of the five classes and the OA of the test set for the six models. The OAs of the SegNet, DeepLab-V3+, simplified U-net models with the two cases, and combined U-net models with the two cases are 0.8346, 0.8605, 0.8477, 0.7841, 0.8268 and 0.8721, respectively. Among the single models, the OA of the DeepLab-V3+ was highest, and Case 2 of the proposed combined U-net model had a higher OA than that of the DeepLab-V3+. The simplified U-net model trained using only the RGB bands of the Potsdam images had the lowest OA.
In addition, in the simplified U-net trained using original four bands, the F1 scores of the impervious surface, building, low vegetation, tree, and car classes are 0.8623, 0.8535, 0.8205, 0.8457 and 0.8123, respectively. Although Case 1 in the combined U-net model exhibited higher F1 scores for the impervious surface and building classes, it showed lower F1 scores for the vegetation and tree classes. In particular, Case 2 in the combined U-net model exhibited higher F1 scores for the impervious surface, building, and car classes. The road boundaries and car shapes were well predicted in Case 2. In order to visually analyze the segmentation results when using the combined U-net model rather than when using the simplified U-net model, we selected sites that can show the characteristics of three cases as an example, and the semantic segmentation results are shown in Figure 7. In the simplified U-net model trained with original four bands, there were errors in classifying roads and buildings; however, compared with Cases 1 and 2, low vegetation and tree classes were effectively distinguished (Figure 7a-j). In Case 1, materials on the roof were misclassified as car because their shape and colors were similar (Figure 7k-o). Case 2 showed the most efficiency in classifying buildings and roads; however, it could not clearly distinguish between trees and low vegetation (Figure 7k-u). ISPRS Int. J. Geo-Inf. 2020, 9, x FOR PEER REVIEW 11 of 15 6. Discussion

Comparison with Other Algorithms
Among the single models, the DeepLab-V3+ model had higher segmentation OAs for the Potsdam images than the SegNet and simplified U-net models. In particular, the simplified U-net model had the lowest OA among the other networks compared with when training with RGB-images only. This is because the segmentation accuracies of the tree and low vegetation classes were reduced when training without the NIR band. In this context, although the Potsdam dataset was trained using an identically sized Cityscape dataset in Case 1, the OA and F1 scores of the vegetation, tree, and car classes showed no improvements compared with DeepLab-V3+ and simplified U-net. The results show that training using the Potsdam data in its original four bands is more effective in classifying vegetation classes than that using just the RGB bands. This is because the NIR band is a key band for classifying trees and low vegetation; in addition, several trees and low vegetation with low reflectance had similar colors to ground and impervious surfaces in the RGB-images. However, the semantic segmentation accuracy was improved by increasing the size of the Cityscape dataset when training using the combined U-net model. In particular, the F1 scores of impervious surface, buildings,

Comparison with Other Algorithms
Among the single models, the DeepLab-V3+ model had higher segmentation OAs for the Potsdam images than the SegNet and simplified U-net models. In particular, the simplified U-net model had the lowest OA among the other networks compared with when training with RGB-images only. This is because the segmentation accuracies of the tree and low vegetation classes were reduced when training without the NIR band. In this context, although the Potsdam dataset was trained using an identically sized Cityscape dataset in Case 1, the OA and F1 scores of the vegetation, tree, and car classes showed no improvements compared with DeepLab-V3+ and simplified U-net. The results show that training using the Potsdam data in its original four bands is more effective in classifying vegetation classes than that using just the RGB bands. This is because the NIR band is a key band for classifying trees and low vegetation; in addition, several trees and low vegetation with low reflectance had similar colors to ground and impervious surfaces in the RGB-images. However, the semantic segmentation accuracy was improved by increasing the size of the Cityscape dataset when training using the combined U-net model. In particular, the F1 scores of impervious surface, buildings, and cars improved in Case 2 relative to other networks. Furthermore, when training Potsdam using the Cityscape dataset, the overfitting problem was decreased. This is because although there are various roads, cars, and buildings in the Cityscape dataset and the shooting angle and shapes are different relative to such objects in the Potsdam images, there may be common points in terms of their color and/or relationships with the surrounding environment.

Cityscape Dataset Impact
By comparing the results of Case 1 and Case 2, it was confirmed that the combined U-net model was affected by the number of Cityscapes. We varied the number of Cityscape training data and changed the weight values of the Potsdam and Cityscape datasets (w 1 and w 2 ) (Figure 8). When the same number of Potsdam datasets was used; for example, 1600 images were used as training data and 400 and 150 images were used as validation and test data, respectively, the accuracy of the Potsdam test data improved with the increase in the training data of the Cityscape. In particular, when the number of Cityscape images was about 2500-2900, the OA of the combined U-net model became similar to that of the single models such as SegNet, simplified U-net, and DeepLab-V3+, and the number of Cityscape images was 3000, which is about twice that of the Potsdam dataset, so the OA dramatically increased (Figure 8a). ISPRS Int. J. Geo-Inf. 2020, 9, x FOR PEER REVIEW 12 of 15 and cars improved in Case 2 relative to other networks. Furthermore, when training Potsdam using the Cityscape dataset, the overfitting problem was decreased. This is because although there are various roads, cars, and buildings in the Cityscape dataset and the shooting angle and shapes are different relative to such objects in the Potsdam images, there may be common points in terms of their color and/or relationships with the surrounding environment.

Cityscape Dataset Impact
By comparing the results of Case 1 and Case 2, it was confirmed that the combined U-net model was affected by the number of Cityscapes. We varied the number of Cityscape training data and changed the weight values of the Potsdam and Cityscape datasets ( and ) (Figure 8). When the same number of Potsdam datasets was used; for example, 1600 images were used as training data and 400 and 150 images were used as validation and test data, respectively, the accuracy of the Potsdam test data improved with the increase in the training data of the Cityscape. In particular, when the number of Cityscape images was about 2500-2900, the OA of the combined U-net model became similar to that of the single models such as SegNet, simplified U-net, and DeepLab-V3+, and the number of Cityscape images was 3000, which is about twice that of the Potsdam dataset, so the OA dramatically increased (Figure 8a). In the experiments, to give larger weights to the Potsdam dataset, and were fixed to 0.8 and 0.2, respectively. We examined the weight values effect through an experiment, where was set to 0.2 and was set to 0.8, where the weights of the two datasets were the same. In this case, the number of Potsdam and Cityscape datasets was the same as in Case 2 of the combined U-net model. As a result, with the decrease in the weight of the Potsdam dataset, the OA of the test set decreased. This is because the loss due to the Potsdam dataset was less reflected when training the combined U-net model (Figure 8b). In particular, when and were set to 0.2 and 0.8, respectively, the OA was less than with the DeepLab-V3+ model.

Limitations and Future Work
Although the combined U-net model can train heterogeneous datasets, its network structure is relatively complex, and it takes a long time to perform learning processes compared with network learning with a single dataset. For example, when training the combined U-net model in Google Colab, when the batch size was set to be more than 4 (8 or 16), we faced a memory problem. Furthermore, there was a limitation in that the performance of the combined U-net model was changed according to the number of Cityscapes and weight values.
In addition, since RS-images generally have more than four bands, including the NIR band, not using the additional bands for learning with natural RGB-images can lower the segmentation accuracy of vegetation-related classes. To overcome these limitations, future work is needed to develop a method that can effectively include the NIR band of RS-images while learning with natural RGB-images. Furthermore, to examine the effect of the architecture of the combined U-net model, we In the experiments, to give larger weights to the Potsdam dataset, w 1 and w 2 were fixed to 0.8 and 0.2, respectively. We examined the weight values effect through an experiment, where w 1 was set to 0.2 and w 2 was set to 0.8, where the weights of the two datasets were the same. In this case, the number of Potsdam and Cityscape datasets was the same as in Case 2 of the combined U-net model. As a result, with the decrease in the weight of the Potsdam dataset, the OA of the test set decreased. This is because the loss due to the Potsdam dataset was less reflected when training the combined U-net model (Figure 8b). In particular, when w 1 and w 2 were set to 0.2 and 0.8, respectively, the OA was less than with the DeepLab-V3+ model.

Limitations and Future Work
Although the combined U-net model can train heterogeneous datasets, its network structure is relatively complex, and it takes a long time to perform learning processes compared with network learning with a single dataset. For example, when training the combined U-net model in Google Colab, when the batch size was set to be more than 4 (8 or 16), we faced a memory problem. Furthermore, there was a limitation in that the performance of the combined U-net model was changed according to the number of Cityscapes and weight values.
In addition, since RS-images generally have more than four bands, including the NIR band, not using the additional bands for learning with natural RGB-images can lower the segmentation accuracy of vegetation-related classes. To overcome these limitations, future work is needed to develop a method that can effectively include the NIR band of RS-images while learning with natural RGB-images. Furthermore, to examine the effect of the architecture of the combined U-net model, we plan to see how the OA of the Potsdam dataset changes when the shared phase in the encoder and decoder part of the combined U-net model is changed (e.g., non-shared encoder and shared decoder).

Conclusions
In this paper, we proposed the combined U-net model that can train RS-images using a natural-image dataset. The network composed of encoder and decoder blocks, and the encoder blocks were shared with the two different datasets (Potsdam and Cityscape); the network was updated using the combined weighted loss function. The results obtained from the experiments indicated that when training using the identically sized Potsdam data with RBG bands and Cityscape data, the OA was decreased compared with single models training using only the original Potsdam data. However, the accuracy of the Potsdam dataset improved with an increase in the size of the Cityscape dataset. These results show that the use of a large-scale natural-image dataset can improve the semantic segmentation accuracy of the RS-image dataset using the proposed method. The proposed method can solve the problem of insufficiently large RS-image datasets for semantic segmentation. Furthermore, this study confirms the possibility of learning heterogeneous datasets at the same time by sharing the encoder phase and generated weights from two datasets in deep learning networks. It is expected that this approach can not only be applied to segmentation tasks of aerial images but also to tasks with various purposes of using big heterogeneous datasets. For example, when using a relatively limited number of datasets such as newly constructed floorplans and street-view images on web sites for special tasks such as classification and object detection, the big datasets, such as existing scanned floorplans and satellite or unmanned aerial vehicle images which are datasets acquired from different times and sensors but with similar characteristics, can be used to solve the problem of insufficient training data in input dataset by sharing specific part of deep learning networks.
However, the computational burden of the proposed method is relatively high because the combined U-net model trains heterogeneous datasets at the same time. In addition, the segmentation accuracy can be changed according to the number of Cityscape datasets and the weight values between Potsdam and Cityscape. Also, there was a problem in that the provided information by the NIR band could not be used because only the RGB bands were used. For future work, we aim to improve the combined U-net model structure by conducting experiments in regard to including the NIR band information of the Potsdam dataset.