Generative Learning for Postprocessing Semantic Segmentation Predictions: A Lightweight Conditional Generative Adversarial Network Based on Pix2pix to Improve the Extraction of Road Surface Areas

Remote sensing experts have been actively using deep neural networks to solve extraction tasks in high-resolution aerial imagery by means of supervised semantic segmentation operations. However, the extraction operation is imperfect, due to the complex nature of geospatial objects, limitations of sensing resolution, or occlusions present in the scenes. In this work, we tackle the challenge of postprocessing semantic segmentation predictions of road surface areas obtained with a state-of-the-art segmentation model and present a technique based on generative learning and image-to-image translation concepts to improve these initial segmentation predictions. The proposed model is a conditional Generative Adversarial Network based on Pix2pix, heavily modified for computational efficiency (92.4% decrease in the number of parameters in the generator network and 61.3% decrease in the discriminator network). The model is trained to learn the distribution of the road network present in official cartography, using a novel dataset containing 6784 tiles of 256 × 256 pixels in size, covering representative areas of Spain. Afterwards, we conduct a metrical comparison using the Intersection over Union (IoU) score (measuring the ratio between the overlap and union areas) on a novel testing set containing 1696 tiles (unseen during training) and observe a maximum increase of 11.6% in the IoU score (from 0.6726 to 0.7515). Finally, we conduct a qualitative comparison to visually assess the effectiveness of the technique and observe great improvements with respect to the initial semantic segmentation predictions.


Introduction
Remotely sensed images have been used lately by researchers in machine vision applications such as object identification [1,2], detection [3], or extraction [4]. At the same time, deep learning algorithms proved to be useful for classification tasks and land use analysis [5] in satellite imagery data [6,7], an important remote sensing application in which semantic segmentation techniques (based on supervised learning) are applied to assign a land cover class to every pixel of an image. This extraction task is generally carried out by means of semantic segmentation and can be considered very challenging due to the complex nature of geospatial objects, defects present in the imagery (noise, occlusions, etc.), imperfections in the ground-truth segmentation masks, or particularities of the segmentation algorithms applied. In one of our previous works [8], we studied the appropriateness of using state-of-the-art segmentation models for extracting the surface areas of secondary roads and conducted a large-scale evaluation on unseen areas, obtaining IoU and F1 scores of 0.5790 and 0.7120, respectively (with 97.87% of the samples being correctly classified). However, even the best performing state-of-the-art segmentation model (U-Net [9] as base architecture with SEResNeXt50 [10] as backbone network) displayed the problem of inaccurate extraction. Many resulting segmentation masks presented discontinuities, and the connection points were often overlooked, resulting in road segments that were unconnected. We also identified higher rates of "false positive labels in areas where the materials used in the road pavement have a similar spectral signature with their surroundings, or areas where geospatial objects with similar features are present (such as dry riverbeds, railroads, or irrigation canals) and higher rates of false negatives in sections where other objects cover large portions of the roads were covered" (page 13 in [8]).
Similar problems are still observed in recent works dealing with road extraction from high-resolution aerial imagery; improving the road extraction task remains an active area of research [11][12][13][14].
To overcome the deficiencies observed in our previous work [8], we developed a postprocessing technique based on image-to-image translation [15] concepts to operate over the initial semantic segmentation predictions and improve the road surface extraction for automatic mapping purposes. In this work, we apply generative learning techniques and propose a conditional Generative Adversarial Network (cGAN) architecture based on Pix2pix [15], heavily modified for computational efficiency (92.4% decrease in the number of parameters in the generator G network and 61.3% decrease in the discriminator D network, when compared to the original Pix2pix) to improve the initial road surface area extraction. For training and testing the model's performance, we use a novel dataset containing 8480 rasterized masks of roads tagged at pixel level, covering a land area of approximately 181 km² from representative areas of Spain.
The goal is to conditionally generate new synthetic images, based on the initial segmentation predictions, that are similar to samples belonging to the domain of official cartography, in this way reducing the effect of the inaccurate extraction. We evaluated the model on unseen data and calculated the Intersection over Union score (IoU score). This metric is a number from 0 to 1 that specifies the amount of overlap between predictions and ground-truth masks (the area of intersection divided by the area of union); for any two sets M and N, IoU score (M, N) = |M ∩ N|/|M ∪ N|. We observed average improvements in IoU scores of the order of 11.3% when compared to the initial predictions, and 4.04% when compared to the original Pix2pix model. Finally, we conducted a perceptual validation to assess the quality of the postprocessing operation and observed significant improvements.
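As an illustration of the IoU definition above, the following minimal Python sketch computes the score for two small binary masks. The function name `iou_score`, the nested-list mask representation, and the convention of returning 1.0 when both masks are empty are assumptions made for this example and are not taken from our implementation.

```python
def iou_score(pred, truth):
    """Intersection over Union for two binary masks given as nested lists of 0/1."""
    intersection = 0
    union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            intersection += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else intersection / union


prediction = [[1, 1, 0],
              [0, 1, 0]]
ground_truth = [[1, 0, 0],
                [0, 1, 1]]
# 2 pixels are marked in both masks and 4 pixels are marked in at least one.
print(iou_score(prediction, ground_truth))  # 0.5
```

A score of 1 is reached only when the prediction and the ground-truth mask mark exactly the same pixels.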
Our contributions can be summarized as follows.
• We propose a conditional GAN architecture based on Pix2pix (which we heavily modify for computational efficiency) to postprocess binary semantic segmentation predictions of road surface areas. Our generator G is based on the U-Net [9] architecture (modified to reduce the number of parameters by 92.4%), while our discriminator D is a modified version of PatchGAN [15] that processes larger image patches (128 × 128, instead of 32 × 32) while reducing the number of parameters by 61.3%.
• We train the proposed architecture on a new dataset composed of 6784 real segmentation maps tagged at pixel level (representing our target domain) and their corresponding initial segmentation masks (representing our conditional information) obtained with a state-of-the-art segmentation model, after applying Gaussian noise to the input.
• We study the appropriateness of applying generative learning techniques for postprocessing initial semantic segmentation predictions of road surface areas by conducting a metrical (IoU score) comparison and a perceptual validation on a new test set composed of 1696 real segmentation maps and their corresponding semantic segmentation predictions (unseen during training).
We proceed as follows. In Section 2, we discuss related works. In Section 3, we describe the task from a mathematical perspective. In Section 4, we present the dataset used for training and testing. Details of our proposed model are presented in Section 5. The experiments carried out are described in Section 6. The results obtained in the postprocessing operation are analyzed in Section 7 from a metrical and a perceptual perspective. Finally, Section 8 offers the conclusions.

Related Works
Unsupervised learning is a paradigm of learning where only input variables, X (and no output variables, y), are given to the model. The goal is to learn the underlying hidden distribution of the data using only unlabeled data. In [16], unsupervised learning based on grammar-guided genetic programming has been successfully applied to obtain new Convolutional Neural Network (CNN) architectures specialized in road recognition in aerial imagery. In [17], unsupervised training was applied to generate filters that improved the predictions of the road detection model.
Generative models are a class of unsupervised learning, where the goal is to generate new samples from an unlabeled distribution. This task addresses density estimation either in an explicit way (e.g., PixelRNN [18] or variational autoencoders [19], which learn lower-dimensional feature representations from unlabeled training data) or in an implicit way, with Generative Adversarial Networks (GANs). GANs [20] were introduced by Goodfellow et al. in 2014, and act like a system composed of two networks (called Generator G and Discriminator D) trained simultaneously in an adversarial setting to create variations in data. The goal of the training is to implicitly find the probability density function that best describes the training examples. This way, the model learns to successfully map from random noise z to an output image y, G : z → y.
GANs evolved over the following years [21]. Deep Convolutional Generative Adversarial Networks (DCGANs) [22] are an extension of the GAN architecture that uses deep convolutional neural networks with certain architectural constraints for both the G and D networks (e.g., use of upsampling networks with fractionally-strided convolutions), allowing them to learn representations from unlabeled image data. After training, DCGANs are able to generate high-quality synthetic images from the learned underlying distribution.
Conditional Generative Adversarial Networks (cGANs) [23] applied to computer vision tasks make use of image conditional information as additional input to both the generator and the discriminator. The mapping to the output image y is learned from the observed image x and the random noise z, G : {x, z} → y. Popular applications of cGANs in land use analysis are centred on improving the representations obtained from remotely sensed imagery. Some examples are the creation of higher resolution images in the image super-resolution task [24][25][26][27], and texture synthesis and realistic reconstructions [28][29][30][31].
Pix2pix [15] was introduced by Isola et al. as an extension of the cGAN architecture for the task referred to as image-to-image translation. The model uses a U-Net-based [9] network as G (updated to include the vector distance from the target output image), and the PatchGAN architecture as discriminator network (only penalising structures at the scale of image patches). This task is one of the most important applications of cGANs, directly applicable to land use analysis and geospatial element detection, and can be used for mapping from one domain to another (e.g., assigning a class to every pixel in a remotely sensed image, turning aerial images into maps, etc.), with G and D being conditioned during training with additional information. The model evolved in recent years with the introduction of CycleGAN [32], DiscoGAN [33], and DualGAN [34].
Postprocessing of semantic segmentation predictions has traditionally been done by means of conditional random fields [35]. In [36], shape filtering is applied to improve the extraction of the roads' centerlines by combining high-resolution imagery with LiDAR data and vectorial data from OpenStreetMap. In [37], the authors extract the linear characteristics of selected road segments by constructing a geometric knowledge base of rural roads to improve the initial results. In [38], a network for road extraction with a final module to highlight high-level information and improve the classification is proposed.
Recently, GAN-based approaches for postprocessing road extraction emerged. The authors of [39] tackle the road extraction task by adding the Wasserstein distance and gradient penalty to a standard GAN and applying ensembling techniques to achieve an IoU score of 0.73 and obtain road geometries in Chinese rural areas. In [40], a GAN is trained to synthesize arbitrary-sized road network patches and enrich the attributes in areas where the extraction is difficult (where discontinuities are present or in complex areas, such as intersections or highway ramps). In [41], the authors propose a Multi-conditional Generative Adversarial Network (McGAN) to refine the road topology and obtain complete road network graphs. McGAN is composed of two discriminators (one to employ the original spectral information, and the other to refine the road network topology) and a generator. The authors of [42] propose a method for extracting roads consisting of a GAN stage for detecting road edges and a second stage of smoothing-based optimization, postprocessing at pixel level to improve the initial segmentation masks.

Problem Description
The goal is to learn a correct transformation from a simple distribution (e.g., Gaussian distribution) to the complex target distribution using a neural network, while applying a condition. In our case, the cGAN will learn a mapping from random Gaussian noise vector z of dimension d, to the target domain containing the real road network features present in official cartography, y, while incorporating conditional information from x (initial segmentation masks) coming from a similar distribution, as proposed in [43]. By adding conditional information, the generation of the output image is conditioned on a source image x, and z will not be completely random anymore (the training will not be m).
The training procedure can be seen as a two-player game, where G tries to "fool" D by producing images that look real, while D trains to distinguish between real and fake images. As proposed in [20], D and G are trained jointly in a Minimax game, where the Minimax objective function min max(D, G) performs a gradient ascent on D, max [log D(x, y|x) + (1 − log(D(x, G(z|x))))], and a gradient descent on G, min [1 − log(D(x, G(z|x)))]. This encourages G to produce samples with a low probability of being fake. The objective function of the model can be expressed as L(G, D) = E_{y,x}[log D(x, y|x)] + E_{z,x}[1 − log D(x, G(z|x))], where E_{z,x} is the expected value over all generated fake instances G(z|x) given the condition x, and E_{y,x} is the expected value over all real data instances (belonging to the density distribution we wish to replicate) given x [15,44]. The generator and discriminator are therefore two "players" that alternate in updating their model weights with the criteria of maximizing the likelihood of D being wrong, and maximizing the likelihood of G being right (maximizing the probability of the images produced by G being real) [45]. A simplification of the described cGAN training procedure is presented in Figure 1.

Dataset
First, the target domain (or real samples, y) was obtained from the National Topographical Map, scale 1:25,000 (Spanish: Mapa Topográfico Nacional 1:25,000, or MTN25), by rasterizing the openly available information containing the road network in vectorial format and dividing the resulting images into tiles of 256 × 256 pixels in size. We binarized the tiles to black and white to represent the classes "Road exists" and "No road" (background). The dataset contains 8480 tiles (covering 181 km² of representative areas from the Spanish territory) and was divided by applying the 80:20 criterion (allowing for more training data [8]), resulting in 6784 tiles used for training and 1696 tiles used for testing (data unseen during training). Compared to our previous work [8], we enlarged the dataset and included more road structures (highways, paved roads, urban roads, etc.). Therefore, the target domain y is represented by the ground-truth tiles of 256 × 256 pixels present in MTN25.
Second, we obtained our source domain (x) by retraining the semantic segmentation model that statistically proved to be the most suitable for road extraction tasks in [8] (U-Net as base architecture and SEResNeXt50 as backbone network). We used the resulting model to evaluate the entire dataset, this way obtaining the initial segmentation masks, stored in the lossless PNG (Portable Network Graphics) format. These predictions represent our conditional information (x). The performance of the segmentation model was evaluated on the corresponding test set (containing tiles from our target domain, y), with the model achieving an IoU score of 0.6726. As intuition, the IoU score will be 1 if the prediction completely overlaps with the ground-truth mask; a model obtaining an IoU score greater than 0.5 is considered to have a good performance [46]. Figure 2 shows the correspondence between the aerial orthoimage, the binarized ground-truth segmentation mask (or target domain, y), and the initial segmentation prediction (or source domain, x) for six random tiles.
In this work, we will use the initial segmentation masks (third row) as conditional information for training and testing the model. The goal is to map this initial semantic segmentation distribution of data to the target domain containing the distribution of the road network in official cartography. The training procedure will enable the generator to synthesize fake examples belonging to the targeted data domain, improving this way the initial segmentation predictions. We will evaluate G's performance on the test set and compare its IoU score with the one obtained by the segmentation model on the same set. We are actively increasing the size of this dataset and plan to make it publicly available once we reach around 500,000 tiles.
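The 80:20 division of the 8480 tiles can be reproduced with a few lines of Python. This is only a sketch: the helper `split_tiles`, the use of integer tile identifiers, and the seeded random shuffle are assumptions for illustration, as the selection procedure itself is not detailed here.

```python
import random


def split_tiles(tile_ids, train_fraction=0.8, seed=0):
    """Shuffle tile identifiers reproducibly and split them into train/test."""
    ids = list(tile_ids)
    random.Random(seed).shuffle(ids)  # seeded shuffle (assumed, for repeatability)
    # round() guards against floating-point error in the product.
    n_train = round(len(ids) * train_fraction)
    return ids[:n_train], ids[n_train:]


train_ids, test_ids = split_tiles(range(8480))
print(len(train_ids), len(test_ids))  # 6784 1696
```

The same identifiers would be used to select both the ground-truth tiles (y) and the initial predictions (x), so that each pair stays in the same subset.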

Figure 2. The relation between the aerial orthoimage (first row (a-f)), the rasterized segmentation mask (ground-truth or real sample coming from MTN25, found in the second row (g-l)), and the semantic segmentation predictions (conditional information, x, seen in third row (m-r)). Note: The train set used as conditional information for G contains the same 6784 tiles as the set from the target distribution y. The same is valid for the test set (unseen data).

Proposed cGAN Architecture
The cGAN architecture presented in this work is based on Pix2pix [15]. The code from Pix2pix's implementation [47] was used as a basis for building the model; however, we changed the original implementation in order to make the computation more efficient and better suited for postprocessing semantic segmentation predictions containing road surface areas. The proposed model was defined and trained using the open source deep learning library TensorFlow version 2.2.0 [48] on a Linux server with a 12-core Intel Core i7 processor and an Nvidia RTX 2060 graphics card with 6 gigabytes of memory.

Generator
The generator G takes random noise z and a condition x (initial semantic segmentation mask) as input and will output a synthetic image, G(z|x). In our case, z is randomly generated from a normal distribution. The generator will apply a generative function to obtain a new sample, this time from the generative model. The output sample, G(z|x), should be reasonably similar to the training data distribution.
G is a fully convolutional network consisting of an encoder and a decoder linked by skip connections (introduced in U-Net [9]) and takes as input conditional Gaussian noise (with the condition x applied). In the encoder part, we used strided convolutions (with a stride of 2) instead of pooling layers and applied padding to allow more space for the kernel to convolve and not lose information near the borders. We used the ReLU [49] activation function in all generator layers, except for the last one, where we used tanh (instead of sigmoid), and applied Batch Normalisation [50] to all layers, except the input layer, following the recommendations in [22].
In the decoder, we used transposed convolutions (with a stride of 2) to upsample and resize the output to the input's dimensions. As a means to avoid overfitting, we applied a dropout operation with a rate of 0.5 between the encoder and decoder. Similarly to U-Net [9], skip connections were added between the encoder and decoder parts, to avoid the loss of low-level information and enable the sharing of information between different stages across the network. Details about the generator architecture can be found in Figure 3.
Figure 3. Description of the generator architecture. Note: The input is passed through a series of layers that progressively downsample its size until the bottleneck, where the process is reversed, and the tensor is upsampled.
G takes in the input image (the segmentation mask with Gaussian noise) and passes it through a series of convolution and upsampling layers to produce an output image that has the same size as the input. The generator is trained to produce synthetic outputs indistinguishable from "real" images. By using this architecture, the total number of parameters in G decreased from 54,425,859 in the original Pix2pix to 4,117,825 (a 92.4% reduction).
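To make the effect of the stride-2 layers concrete, the short Python sketch below traces the spatial size of a 256 × 256 tile through an encoder and back through a decoder. It assumes padding that keeps the output size at ceil(n/2) per strided convolution and doubles it per transposed convolution; the depth `num_stages=4` is a hypothetical value chosen for the example and does not describe the actual number of layers in G.

```python
import math


def encoder_sizes(input_size, num_stages, stride=2):
    """Spatial size after each strided convolution (padding assumed to give ceil(n/stride))."""
    sizes = [input_size]
    for _ in range(num_stages):
        sizes.append(math.ceil(sizes[-1] / stride))
    return sizes


def decoder_sizes(bottleneck_size, num_stages, stride=2):
    """Spatial size after each transposed convolution (assumed to multiply the size by stride)."""
    sizes = [bottleneck_size]
    for _ in range(num_stages):
        sizes.append(sizes[-1] * stride)
    return sizes


down = encoder_sizes(256, num_stages=4)  # num_stages=4 is a hypothetical depth
up = decoder_sizes(down[-1], num_stages=4)
print(down)  # [256, 128, 64, 32, 16]
print(up)    # [16, 32, 64, 128, 256]
```

Because the sizes mirror each other, each decoder stage can be paired with the encoder stage of the same resolution, which is what makes the skip connections possible.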

Discriminator
The discriminator network takes as input both the conditioned real sample from the target domain, D(x, y|x), and the conditioned fake sample generated by G, D(x, G(z|x)). D analyzes the distribution of data and decides whether the data are generated or coming from the target domain (using a sigmoid function that outputs a probability between 0 and 1). The output of discriminator D represents the probability that the sample is coming from the training distribution. The discriminator D will take G(z|x) and the real y and will output whether the image is real or fake; every input of D has a 1/2 probability of being real and 1/2 of being fake (D acts like a binary classifier for the generated data, training to detect as well as possible the synthetic images produced by G).
Our D is a modified PatchGAN (introduced in [15]). We used Leaky ReLU activation [51] instead of ReLU activation for all of D's layers to reduce gradient sparsity and applied Batch Normalization in all layers, except for the input. Instead of pooling operations, we again used strided convolutions (with a stride of 2) and applied zero padding to avoid the loss of information near the borders. Details about D's architecture can be found in Figure 4.

[15]). Note: In our case, we will concatenate z and x as input to the generator G(z, x) with the objective of learning the distribution of the target domain, y.
D operates convolutionally over the 256 × 256 tile and is capable of classifying larger image patches (128 × 128, instead of 32 × 32) to decide if the input is real or fake. The discriminator takes four patches of 128 × 128 as input and classifies which distribution each of them comes from (real or fake). The decision scores are averaged to obtain a final prediction for the input tile (as seen in Figure 5).
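The averaging of patch decisions can be sketched in plain Python as follows. The non-overlapping quadrant split and the stand-in scorer `mean_intensity` are assumptions for illustration only; in the actual network, the per-patch probability comes from the learned convolutional layers of D.

```python
def split_into_patches(tile, patch_size):
    """Cut a square tile (nested lists) into non-overlapping square patches."""
    size = len(tile)
    patches = []
    for top in range(0, size, patch_size):
        for left in range(0, size, patch_size):
            patches.append([row[left:left + patch_size]
                            for row in tile[top:top + patch_size]])
    return patches


def tile_score(tile, patch_scorer, patch_size=128):
    """Average the per-patch decision scores into one score for the tile."""
    scores = [patch_scorer(p) for p in split_into_patches(tile, patch_size)]
    return sum(scores) / len(scores)


def mean_intensity(patch):
    """Hypothetical stand-in for the learned probability that a patch is real."""
    values = [v for row in patch for v in row]
    return sum(values) / len(values)


# Toy 256 x 256 tile: left half is 1.0, right half is 0.0.
tile = [[1.0 if col < 128 else 0.0 for col in range(256)] for _ in range(256)]
print(len(split_into_patches(tile, 128)))  # 4
print(tile_score(tile, mean_intensity))    # 0.5
```

With the toy tile, the two left patches score 1.0 and the two right patches score 0.0, so the averaged tile score is 0.5.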
By applying these changes, the total number of parameters was reduced from 2,770,433 to 1,071,105, a 61.3% decrease when compared to the original PatchGAN [15] network used in Pix2pix.
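The reported reductions follow directly from the parameter counts given for both networks, as the short check below shows. The helper name `percent_reduction` is introduced here only for the example.

```python
def percent_reduction(original, reduced):
    """Relative decrease, in percent, from the original to the reduced count."""
    return 100.0 * (original - reduced) / original


generator = percent_reduction(54_425_859, 4_117_825)      # G: original Pix2pix vs. ours
discriminator = percent_reduction(2_770_433, 1_071_105)   # D: original PatchGAN vs. ours
print(f"{generator:.1f}% {discriminator:.1f}%")  # 92.4% 61.3%
```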

Figure 5. Analysis of the input data distribution with the proposed discriminator D (estimation of the probability that the input is real).

Experiments
As explained in Section 3, cGANs [23] take conditional information x (in our case, the initial semantic segmentation results) as an extension to the latent space z. The conditioning is performed by concatenating x to the corresponding image tensors G(z|x) and y. In the generator, the prior input Gaussian noise and x are combined in a joint hidden representation, with the resulting G(z|x) being fed into D, together with its corresponding tile from the target domain. The discriminator tries to identify which images are real and which ones come from G (it produces a guess about how realistic they look), while G trains to maximise the log-probability of the discriminator D(x, G(z|x)) being mistaken [52]. When D cannot distinguish real images from fake images, the optimal state is reached. The intuition is that over time, the generator will be forced to create synthetic data that comes as close as possible to the distribution of the target domain, while the discriminator will become better at telling them apart. Details about the training procedure of the proposed cGAN can be found in Figure 6.
As preprocessing for the input data, we applied online data augmentation, rotating randomly selected training images by 90, 180, or 270 degrees, thereby reducing overfitting.
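The rotation-based augmentation can be sketched as follows. This is a minimal illustration using plain Python lists, not the pipeline used in the experiments: the function names are hypothetical, and we assume that one angle is drawn per sample and that the conditioning tile and its target tile are rotated by the same angle so that they stay aligned.

```python
import random


def rot90(tile):
    """Rotate a 2-D tile (list of rows) by 90 degrees clockwise."""
    return [list(row) for row in zip(*tile[::-1])]


def random_rotation(source_tile, target_tile, rng=random):
    """Rotate both tiles by the same randomly chosen angle (90, 180 or 270 degrees)."""
    quarter_turns = rng.choice([1, 2, 3])
    for _ in range(quarter_turns):
        source_tile = rot90(source_tile)
        target_tile = rot90(target_tile)
    return source_tile, target_tile
```

Because the rotation is applied on the fly each time a sample is drawn, the stored dataset does not grow in size.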
As for the cost functions, the total generator loss is a modified version of the loss introduced in [15]: we add the adversarial loss to the mean absolute error (also called L1 regularization) between the generated image and the target image (the expected output image), applying a weight λ = 1000 to the L1 loss, $\mathcal{L}(G) = (1 - \log D(x, G(z|x))) + \lambda \cdot \mathcal{L}_{L1}(G)$. The adversarial loss is a cross-entropy between 1 (as G tries to fool D, and the real values have the label 1 assigned) and the value predicted by D for G(z|x) (0, in case D realises that G(z|x) is fake), and it encourages the generator to produce images similar to the training domain. In [15], it was shown that the L1 regularization allows G(z|x) to become structurally similar to the target image y, $\mathcal{L}_{L1}(G) = \mathbb{E}_{x,y,z}\big[\,|y - G(z|x)|\,\big]$, where $\mathbb{E}_{x,y,z}$ is the expected value over all generated fake instances G(z|x) and real data instances y given the condition x [15,44]. Therefore, the generator is updated via a weighted sum of the adversarial loss and the L1 loss (with a weight of 1000 to 1 in favor of the L1 loss), to encourage G to produce more realistic images (and obtain plausible translations).
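A minimal sketch of this weighted generator loss is given below, operating on flattened lists of pixel values. The function names and the clipping constant are assumptions made for illustration. The adversarial term is implemented as the cross-entropy against the label 1, that is, −log D(x, G(z|x)); the constant offset in the printed formula is omitted because it does not affect the gradients.

```python
import math

LAMBDA = 1000.0  # weight of the L1 term, as stated in the text
EPS = 1e-7       # assumed clipping constant to avoid log(0)


def binary_cross_entropy(predictions, label):
    """Mean cross-entropy between predictions in (0, 1) and a constant label."""
    total = 0.0
    for p in predictions:
        p = min(max(p, EPS), 1.0 - EPS)
        total -= label * math.log(p) + (1.0 - label) * math.log(1.0 - p)
    return total / len(predictions)


def l1_loss(target, generated):
    """Mean absolute error between target pixels y and generated pixels G(z|x)."""
    return sum(abs(t - g) for t, g in zip(target, generated)) / len(target)


def generator_loss(d_on_fake, target, generated, lam=LAMBDA):
    """Adversarial loss (G wants D to output 1 on fakes) plus the weighted L1 loss."""
    adversarial = binary_cross_entropy(d_on_fake, 1.0)
    return adversarial + lam * l1_loss(target, generated)
```

With λ = 1000, the L1 term dominates unless the generated tile is already very close to the target, which is consistent with the emphasis on structural similarity described above.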
Second, D's inputs are x and G(z|x) concatenated in a tensor (x, G(z|x)), which should be classified as fake, and the conditional information x concatenated to the target image y, which should be classified as real. When a pixel comes from the target domain (real sample), the true label is 1, whereas when a patch is fake, the true label of its pixels is 0.
The final layer of D is a sigmoid function, and D's cost function is the sum of the cross-entropies between D's predictions and the actual labels for these two inputs (an array of zeros in the case of the fake images, and an array of ones in the case of the real images).
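The discriminator cost described above can be sketched in the same style. The names and the clipping constant are again assumptions; the inputs are D's sigmoid outputs for the real pair (x, y) and for the fake pair (x, G(z|x)), flattened into lists.

```python
import math

EPS = 1e-7  # assumed clipping constant to avoid log(0)


def binary_cross_entropy(predictions, label):
    """Mean cross-entropy between predictions in (0, 1) and a constant label."""
    total = 0.0
    for p in predictions:
        p = min(max(p, EPS), 1.0 - EPS)
        total -= label * math.log(p) + (1.0 - label) * math.log(1.0 - p)
    return total / len(predictions)


def discriminator_loss(d_on_real, d_on_fake):
    """Sum of the cross-entropy on real pairs (labels of ones) and fake pairs (labels of zeros)."""
    real_loss = binary_cross_entropy(d_on_real, 1.0)
    fake_loss = binary_cross_entropy(d_on_fake, 0.0)
    return real_loss + fake_loss
```

A discriminator that outputs 0.5 everywhere yields a loss of 2 log 2, while one that separates real from fake perfectly drives the loss towards zero.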
The model was trained from scratch with minibatch stochastic gradient descent, with a batch size of 10; each epoch took around 350 seconds on the GPU. G's starting weights were randomly initialised from a Gaussian distribution with a mean of 0 and a standard deviation of 0.04. For training, we used the Adam optimizer [53] with learning rates of 1 × 10⁻⁴ for G and 2 × 10⁻⁴ for D. The experiments apply a decay rate of β1 = 0.5 for the first moment estimate and a decay rate of β2 = 0.999 for the second moment estimate. We adopt different learning rates for the generator and the discriminator to improve the convergence of GANs (as proposed in [54]). In our experiments, we found that not adding noise to the discriminator inputs improves the predictions, although some researchers have found it to be a form of regularization that improves convergence [55].
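To make these hyperparameters concrete, the sketch below shows the Gaussian weight initialisation and a single Adam update written out in plain Python. It is an illustration only: the function names are hypothetical, the learning rates are read as 1e-4 for G and 2e-4 for D, and the value of the numerical-stability constant `eps` is an assumption that is not stated in the text.

```python
import math
import random

LR_G, LR_D = 1e-4, 2e-4      # learning rates for G and D (as read from the text)
BETA1, BETA2 = 0.5, 0.999    # decay rates of the first and second moment estimates


def init_weights(n, seed=0):
    """Draw n starting weights from a Gaussian with mean 0 and standard deviation 0.04."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 0.04) for _ in range(n)]


def adam_step(weights, grads, m, v, t, lr, beta1=BETA1, beta2=BETA2, eps=1e-8):
    """Apply one Adam update; t is the 1-based step count. Returns (weights, m, v)."""
    new_w, new_m, new_v = [], [], []
    for w_i, g_i, m_i, v_i in zip(weights, grads, m, v):
        m_i = beta1 * m_i + (1.0 - beta1) * g_i          # first moment estimate
        v_i = beta2 * v_i + (1.0 - beta2) * g_i * g_i    # second moment estimate
        m_hat = m_i / (1.0 - beta1 ** t)                 # bias correction
        v_hat = v_i / (1.0 - beta2 ** t)
        new_w.append(w_i - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_w, new_m, new_v
```

On the very first step, the bias-corrected update has a magnitude of approximately the learning rate regardless of the gradient scale, so D initially moves about twice as far per step as G under these settings.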
The training is done via backpropagation, the model alternating between training D and training G. The weights of D are adjusted based on the classification error produced. G is then updated via the discriminator network (during generator training, D does not update its weights, but its gradients are used so that the generator can update its weights) to minimize G's cost function. Over time, G improves its output to better reproduce the real data distribution, iteratively changing the produced synthetic data to make it more realistic.
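The alternating schedule can be illustrated with a deliberately simplified sketch. The class below is a hypothetical stand-in that only counts weight updates; it is not a neural network and is not taken from the implementation used in this work. It shows the order of the two phases and the fact that D's weights stay fixed while G is being updated.

```python
class CountingNet:
    """Hypothetical stand-in for a network: it only counts weight updates."""

    def __init__(self):
        self.trainable = True
        self.updates = 0

    def apply_gradients(self):
        # Weights change only while the network is trainable.
        if self.trainable:
            self.updates += 1


def train_on_batch(generator, discriminator):
    # Phase 1: D learns from real pairs (x, y) and fake pairs (x, G(z|x)).
    discriminator.trainable = True
    discriminator.apply_gradients()

    # Phase 2: G is updated through D. D is frozen, so the error signal
    # passes through it but only G's weights are adjusted.
    discriminator.trainable = False
    generator.apply_gradients()
    discriminator.apply_gradients()  # no effect while frozen
    discriminator.trainable = True


def train(num_batches):
    """Run the alternating schedule and return the update counts of (G, D)."""
    generator, discriminator = CountingNet(), CountingNet()
    for _ in range(num_batches):
        train_on_batch(generator, discriminator)
    return generator.updates, discriminator.updates
```

Each batch therefore produces exactly one weight update for D and one for G.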
We repeated the experiments described above five times using random weight initializations. For comparison purposes, we also trained the original Pix2pix model. Each time, the trained generator was exported to h5 format and used to evaluate the capacity of G to map from the source domain to the target domain.

Metrical Analysis and Perceptual Validation of the Results
To evaluate the model's performance, we used the test set containing the initial predictions (described in Section 4), which was unseen during training. These initial semantic segmentation predictions were passed through the trained generator, and the outputs were compared with the ground truth segmentation masks and accumulated in a confusion matrix to calculate the IoU score with the formula IoU score = True Positives/(True Positives + False Positives + False Negatives). A comparison between the IoU scores obtained is reported in Table 1. We observe a significant increase in the IoU score when using the proposed cGAN model, from 0.6726 to an average of 0.7530 (an 11.27% increase over the performance metric obtained by the semantic segmentation model). At the same time, the original Pix2pix model achieved an average IoU score of 0.7232 on the test set containing data unseen during training (an average increase of 7.25% when compared to the initial IoU score of 0.6726). Our model obtained an average improvement of 4.04% when compared to the original Pix2pix model. The maximum IoU scores are 0.7300 in the case of the original Pix2pix, and 0.7556 in the case of our proposed cGAN architecture.
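The IoU computation from accumulated confusion-matrix counts can be sketched as follows. This is a minimal illustration on binary tiles given as lists of rows (1 = road, 0 = background); the function names and the handling of the empty case are assumptions.

```python
def confusion_counts(pred, truth):
    """Count TP, FP, FN and TN for one binary tile given as lists of rows."""
    tp = fp = fn = tn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                tp += 1
            elif p:
                fp += 1
            elif t:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn


def iou_over_tiles(pairs):
    """Accumulate counts over (prediction, ground truth) pairs, then compute the IoU."""
    tp = fp = fn = 0
    for pred, truth in pairs:
        tile_tp, tile_fp, tile_fn, _ = confusion_counts(pred, truth)
        tp, fp, fn = tp + tile_tp, fp + tile_fp, fn + tile_fn
    denominator = tp + fp + fn
    # Assumed convention: the score is undefined if no road pixel is present or predicted.
    return tp / denominator if denominator else float("nan")
```

Accumulating the counts over all tiles before dividing gives a single score for the whole test set, which is not the same as averaging per-tile IoU values. True negatives do not enter the formula, so large background areas do not inflate the score.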
To see what these gains in IoU score mean for the postprocessing operation, we conducted a perceptual validation using the predictions obtained by passing the test set (data unseen by the model, described in Section 4) through the trained generator. These predictions were subsequently stored in PNG format. Figure 7 presents a comparison between the results for eight random scenes.
The qualitative inspection of the generated images shows clear improvements over the initial segmentation predictions. The trained generator enabled a correct image-to-image translation (G : {x, z} → y) and learned to produce synthetic images similar to those belonging to the target domain. The results from Figure 7 support our metrical comparison from Table 1.
Compared to the initial semantic segmentation masks, we can observe a thinner road line representation, similar to the representation from the target domain, y. We also observe a consistent elimination of unconnected parts (e.g., Figure 7(c2-b2),(c6-e6)). Compared to the original Pix2pix, we observe smoother road line representations (e.g., Figure 7(d3-e3),(d6-e6)) and cleaner road predictions (e.g., Figure 7(d1-e1),(d5-e5)). However, we still observe imperfections, mostly in urban areas (e.g., Figure 7(e5,e2)), which we believe are caused by insufficient urban road data in the training set. We are actively working on solving this drawback by building a bigger dataset. Nonetheless, the predictions show a clear enhancement over the initial data.
Figure 7. Perceptual validation carried out on eight random tiles from the test set (data unseen during training). The first row (a1-a8) represents the aerial orthoimage, the second row (b1-b8) represents the ground truth mask (target domain, y), the third row (c1-c8) represents the initial semantic segmentation prediction (conditional information, x), the fourth row (d1-d8) presents the predictions obtained with the original Pix2pix, and the fifth row (e1-e8) presents the prediction obtained by the model proposed in this paper.
The increases in performance metrics from Table 1 demonstrate the suitability of conditional Generative Adversarial Networks for postprocessing initial semantic segmentation predictions. Although the results from Figure 7 are not perfect, they represent a great improvement over the predictions of state-of-the-art semantic segmentation models and demonstrate the appropriateness of applying generative learning techniques in postprocessing tasks.

Conclusions
We presented an effective approach for postprocessing road segmentation masks in an adversarial way using a cGAN architecture based on Pix2pix, greatly modified for computational efficiency. We demonstrated its effectiveness by training and testing the model on a new dataset containing initial road surface area predictions and observed average IoU increases of 11.3% when compared to a state-of-the-art segmentation model and 4.04% when compared to the original Pix2pix architecture, while achieving a 92.4% reduction in the number of parameters in G and a 61.3% decrease in the number of parameters in D.
The proposed architecture delivered significant improvements and can be viewed as a postprocessing technique for semantic segmentation predictions of road surface areas. The metrical comparison presented in Table 1 and the perceptual validation carried out in Section 7 proved the efficacy of the network; we presume that the quality of these predictions can be further improved by using a bigger dataset. We also strongly believe that the proposed architecture can be used to enhance the extraction of other continuous geospatial objects such as rivers, railroads and other transportation networks, or irrigation canals. Furthermore, based on the results obtained in this work, we consider that the proposed procedure can be applied to other land analysis tasks or operations that involve the extraction of geospatial objects from remote sensing images.
As future lines of research, we are actively working to increase the size of the dataset and plan to integrate a complete end-to-end solution capable of recognizing, segmenting, and postprocessing road network elements at a large scale by means of classification, semantic segmentation, and GAN operations. The end goal is to obtain a robust system capable of monitoring and mapping the changes occurring in the road network across the whole national territory, thereby reducing the human effort involved in updating existing road cartography. We also leave for a future study the implementation of the proposed cGAN architecture for postprocessing predictions from urban and rural areas where multiple land cover classes are present.

Data Availability Statement:
The data presented in this study are available on request from the corresponding author. The data are not publicly available due to ongoing efforts to considerably increase the size of the dataset to around 500,000 tiles.