Aerial Image Road Extraction Based on an Improved Generative Adversarial Network

Abstract: Aerial photographs and satellite images are one of the resources used for earth observation. In practice


Introduction
Roads are a fundamental unit for many geographic information system applications, such as vehicle navigation, traffic management, and emergency response. They are also an important element of military surveying and mapping. In addition, rapid city development requires frequent road updating to keep road data current and accurate, and the growing development of aerial technologies provides an efficient, low-cost, and reliable way to acquire up-to-date road information. Besides aerial images, other kinds of remote sensing data can also be used for road extraction, such as hyperspectral images (HSI) [1,2], synthetic aperture radar (SAR) data [3][4][5], airborne laser scanning (ALS) data [6][7][8], and mobile laser scanner (MLS) data [9][10][11]. In this paper, we focus only on aerial images.
Traditional road network data mainly come from manual extraction, which is labor-intensive. Aerial images provide abundant information about ground cover. With the improvement of spatial resolution, aerial images have become an increasingly important data source for extracting road network information. As road information is updated continuously, traditional manual operation can no longer meet the demand. Combining remote sensing technology with computer vision to extract road information from aerial images helps automate and accelerate road monitoring.
Research on road extraction has a long history that predates the wide use of deep learning methods. Here we summarize the traditional road extraction methods on three levels: the feature, object, and knowledge levels.
(1) Road extraction methods on the feature level: In previous work, roads were often extracted using their spectral, geometric, topological, and textural features. For example, template matching was used to extract a certain number of seed pixels or specific templates [12][13][14], and roads were then generated from the extracted seed points or templates. Using the characteristic edges of roads (e.g., parallelism), road edges and parallel lines were extracted with a distance function [15][16][17][18][19]. Road edges were also examined using established mathematical models, such as the Snake model [20][21][22][23] and Markov models. Specific filters can be used to enhance road pixels for better road extraction [24]. Feature-level algorithms were usually based on the spectral features of images [24,25]; their design was relatively simple and they were efficient. However, these algorithms cannot produce satisfactory results in complex environments and may produce 'salt and pepper' noise [24].
(2) Road extraction methods on the object level: Object-based road extraction methods usually first segment the image or cluster it into a number of small areas (objects), and then take each small area as a unit for road extraction. One example is the multi-resolution analysis method [26][27][28][29], which works on aerial images of different resolutions or on a single image at different scales and improves the accuracy of road extraction by combining the two. Other examples are regional statistical analysis methods [30,31], which are based on a probability model, and the widely used road unit trimming [32][33][34] and joining [35] methods. These algorithms achieve good performance in complex situations by merging pixels into homogeneous regions, which helps to reduce the influence of noise. However, they usually require an initial segmentation or clustering of the image, which significantly influences the final extraction precision, and they are prone to the "sticky" phenomenon [29].
(3) Road extraction methods on the knowledge level: Knowledge-based road extraction methods usually use prior knowledge about roads or supplementary information to extract the targets. Commonly used methods include multi-source data analysis, in which existing road databases guide or assist the extraction of road networks [36,37]. Other methods exploit the characteristics of the roads themselves, such as spectrum and context [38][39][40][41][42][43][44]. These methods have achieved satisfactory results in complex situations [45][46][47][48][49]. However, these algorithms are not efficient. It is reasonable to combine multi-source data to distinguish "different bodies with the same spectrum" or "same body with different spectra", but multi-source data are relatively difficult to acquire, which limits the applications of these methods.
In recent years, deep learning has been applied successfully in image analysis [50] and natural language processing [51]. By combining low-level features, high-level representations of images can be formed that are capable of mining the distribution of the data. The essence of deep learning is to learn more meaningful features from massive training data by constructing networks with multiple hidden layers, and its goal is to improve the accuracy of classification or prediction. Deep learning can therefore be regarded as a form of feature learning. Deep learning methods transform features from one layer to the next, which makes classification or prediction much easier.
Several deep learning based road extraction methods are available. Convolutional neural networks (CNNs) have achieved excellent performance in image classification. Unlike the classification of a whole image, road extraction is a binary classification problem at the pixel level: each pixel in the input aerial image must be classified as road or background. Therefore, CNN-based extraction methods usually use a sliding window [52,53], from which the category of the central pixel of the window is obtained. Recently, a cascaded end-to-end convolutional neural network called CasNet [54] was proposed to detect roads and extract their centerlines from an aerial image, and ARCNet [55] combines a CNN with an attention mechanism to classify scenes, including roads, in very high resolution remote sensing images. For 3D data, such as HSI and ALS data, end-to-end 3D CNNs [56][57][58] have been proposed for detection and classification. A fully convolutional network (FCN) can output pixel-level classification information with the same size as the input image, which makes it suitable for road extraction in aerial images [59][60][61][62]. An FCN includes down-sampling and up-sampling. The up-sampling process uses the features obtained from the down-sampling process and increases their dimension with deconvolution layers, so that the classification map has the same dimension as the input image. FCNs can be divided into FCN-32s, FCN-16s, FCN-8s, and FCN-4s, corresponding to up-sampling steps of 32, 16, 8, and 4, respectively. Some works also use the Generative Adversarial Network (GAN) [63] model to extract roads from aerial images [64,65]. GAN is a deep learning model inspired by zero-sum game theory. It contains a generator model G and a discriminator model D, where the generator G captures the distribution of the sample data, and the discriminator D is a binary classifier that checks whether its input is real data or a generated sample. Generally speaking, GAN-based methods regard road extraction as an image-to-image translation task, so in [64,65] an FCN was used as the architecture of the generator and a CNN as the discriminator. [64] used an encoder-decoder architecture in the generator and added an entropy loss term to the loss function; [65] used a two-stage framework to extract roads, in which two GANs were first used to detect roads and intersections, and the best covering road graph was then found by applying a smoothing-based graph optimization procedure. Both methods use an encoder-decoder architecture in the generator, which limits the generator's ability to produce fine details.
In this paper, we propose an improved GAN model to extract roads from aerial images; the overall framework of the proposed model is shown in Figure 1. In comparison to the available GAN-based road extraction methods, our proposed method has a simpler architecture than the two-stage method [65] and a simpler loss function than [63]. In addition, GANs can produce promising results with a small number of samples, which alleviates the scarcity of remote sensing images compared with natural images when using deep learning methods. Following the characteristics of GANs, we train a model to automatically generate a binary image of roads and background. For the specific road extraction task, we enhance the original GAN loss function by adding a content-based loss term to make the generated image more accurate. Our approach improves the extraction results compared with other methods based on deep learning. The remainder of this paper is organized as follows: Section 2 introduces the standard generative adversarial network and the network structure we use for road extraction. Section 3 presents our experimental results and comparisons. Section 4 gives a summary and future expectations.

Generative Adversarial Network
GAN has received much attention in recent years due to its ability to generate new samples similar to the training samples by learning their probability distribution. Unlike other deep learning models, GAN consists of two networks, i.e., the generative and discriminative networks. As the name suggests, GAN learns the probability distribution of the given dataset and generates new samples similar to the given samples from random noise. The training process of GAN can be seen as the two networks optimizing themselves against each other. Briefly, the generative network needs to generate more realistic samples to 'fool' the discriminative network, whereas the discriminative network needs to learn a better way to detect the fake images generated by the generative network. Once the GAN is trained, the generative network can be used to generate new samples, and the discriminative network can be used for feature extraction or classification. The loss function of GAN is defined as follows:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] \tag{1}$$

where $p_{data}$ denotes the distribution of the real data (usually real images), $p_z$ denotes the distribution of the input noise, $D$ denotes the discriminative network, $G$ denotes the generative network, $x$ denotes the input from the real data, and $z$ denotes the input from the random noise. The discriminative network needs to detect fake samples so as to make $D(x) \to 1$ and $D(G(z)) \to 0$, whilst the generative network needs to generate realistic samples, leading to $D(G(z)) \to 1$.
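To make the two terms of the GAN loss concrete, the following minimal sketch estimates the value function on a toy batch of discriminator outputs. The function name `gan_value` and the probabilities are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
import math


def gan_value(d_real, d_fake):
    """Batch estimate of V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))].

    d_real: discriminator outputs D(x) on real samples, each in (0, 1).
    d_fake: discriminator outputs D(G(z)) on generated samples, each in (0, 1).
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term


# Hypothetical outputs: a discriminator that separates real from fake well...
confident = gan_value(d_real=[0.9, 0.8], d_fake=[0.1, 0.2])
# ...and one that has been fooled and outputs 0.5 everywhere.
fooled = gan_value(d_real=[0.5, 0.5], d_fake=[0.5, 0.5])
```

The discriminator tries to push this value up (here `confident > fooled`), while the generator tries to push it down; when the discriminator outputs 0.5 for every input, the value equals -log 4.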
In the last few years, a number of variants of the original GAN have been proposed, such as the Deep Convolutional Generative Adversarial Network (DCGAN), which combines CNN with GAN [66]; the Conditional Generative Adversarial Network (CGAN), which takes random noise together with certain conditions as input [67]; Wasserstein GAN [68], which uses the Wasserstein distance to define the loss function in order to address the vanishing gradient problem [68,69]; and CycleGAN [70], which has outstanding performance on image translation tasks.
DCGAN replaces the multilayer perceptron in GAN with a CNN. In our approach, the generative network uses fractionally-strided convolutions for up-sampling, which map a low-dimensional input vector into a high-dimensional image. The discriminative network uses strided convolutions for down-sampling and maps the input image to a value in (0, 1), which denotes the probability that the input belongs to the real data.
The input of CGAN contains random noise $z$ together with a given condition $y$. The loss function of CGAN can be defined as follows:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x \mid y)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z \mid y)))] \tag{2}$$

For different tasks and datasets, the condition $y$ is different, but for a given task and dataset, the condition $y$ should be the same for different samples in the same category.
Different from the previous methods, CycleGAN has outstanding performance on image translation tasks where paired datasets (the original and the transformed images) are not available. Image translation is treated as a reversible process: when one GAN translates image $a \in A$ into image $b \in B$, where $a$ and $b$ have the styles of $A$ and $B$ respectively, another GAN can translate image $b$ back into image $a$. CycleGAN uses a cycle consistency loss function to guarantee this reversibility:

$$L_{cyc}(G_{AtoB}, G_{BtoA}) = \mathbb{E}_{a \sim p_{data}(a)}\left[\lVert G_{BtoA}(G_{AtoB}(a)) - a \rVert_1\right] + \mathbb{E}_{b \sim p_{data}(b)}\left[\lVert G_{AtoB}(G_{BtoA}(b)) - b \rVert_1\right] \tag{3}$$

where $G_{AtoB}$ and $G_{BtoA}$ denote the two generators, $a$ and $b$ denote images belonging to styles $A$ and $B$ respectively, and $\lVert \cdot \rVert_1$ denotes the L1-norm. The loss function of CycleGAN can be defined as follows:

$$L(G_{AtoB}, G_{BtoA}, D_A, D_B) = L_G(G_{AtoB}, D_A) + L_G(G_{BtoA}, D_B) + L_{cyc}(G_{AtoB}, G_{BtoA}) \tag{4}$$

where $D_A$ and $D_B$ denote the discriminators corresponding to the generators $G_{AtoB}$ and $G_{BtoA}$ respectively, $L_G$ denotes the GAN loss, and $L_{cyc}$ denotes the cycle consistency loss. Other GAN-based methods for image translation, also called image-to-image translation, include DiscoGAN [71], DualGAN [72], and pix2pix [73].
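The cycle consistency idea can be illustrated with a small sketch. Images are represented as flat lists of pixel values, and the "generators" `double`, `halve`, and `identity` are hypothetical stand-ins for trained networks, used only to show when the loss vanishes and when it does not.

```python
def l1_distance(u, v):
    """L1 distance between two images given as flat lists of pixel values."""
    return sum(abs(a - b) for a, b in zip(u, v))


def cycle_loss(batch_a, batch_b, g_a2b, g_b2a):
    """Batch estimate of the cycle consistency loss for two generators."""
    forward = sum(l1_distance(g_b2a(g_a2b(a)), a) for a in batch_a) / len(batch_a)
    backward = sum(l1_distance(g_a2b(g_b2a(b)), b) for b in batch_b) / len(batch_b)
    return forward + backward


# Hypothetical toy generators (not neural networks).
def double(img):
    return [2.0 * p for p in img]


def halve(img):
    return [0.5 * p for p in img]


def identity(img):
    return list(img)


batch_a = [[1.0, 2.0]]
batch_b = [[4.0, 6.0]]
consistent = cycle_loss(batch_a, batch_b, double, halve)       # exact inverses
inconsistent = cycle_loss(batch_a, batch_b, double, identity)  # not inverses
```

When the two generators are exact inverses of each other the loss is zero; any mismatch between a translated-and-back image and its original is penalized pixel by pixel.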

The Structure of Generative Adversarial Network for Road Extraction
Inspired by the outstanding performance of CGAN on image translation tasks, we use CGAN to extract roads from aerial images. With CycleGAN and some other CGAN models, images can easily be translated into another style, such as horse into zebra, day into night, or summer into winter; we use the same idea to translate aerial images into label images.
Road extraction can be regarded as a binary classification problem at the pixel level, in which we predict whether a pixel belongs to the road or the background. Since the target is a binary image, regardless of how it is produced, the task can be regarded as an image translation in which the aerial image is translated into a binary image that depicts the road and the background.
Remote Sens. 2019, 11, x FOR PEER REVIEW 5 of 19
We combine DCGAN and CGAN in our model to extract roads from aerial images. In short, we use a DCGAN structure with conditions, and the only condition we input is the aerial image, without any random noise. Because the input and the output of our task are images of the same size, we replace the deconvolution layers of the generator with an FCN. The structure of the discriminator is the same as that of the DCGAN discriminator. Our framework is shown in Figure 1. The structure of the FCN is shown in Figure 2, where the blue blocks denote the down-sampling layers and the pink ones denote the up-sampling layers. The numbers on the blocks are the numbers of feature maps in each layer. Our FCN model inherits the traditional FCN-4s structure. We also collect low-level features by adding them to the up-sampling feature maps. Unlike the traditional FCN-4s model, we remove the pooling layers used for down-sampling, and our FCN has different numbers of layers and feature maps.
In fact, there are many choices for the structure of the FCN part. Here we use Unet [74] due to its deeper architecture and good performance in the road extraction task (see Figure 3). In our approach, Unet has a symmetric structure of down-sampling and up-sampling layers, including eight convolutional or deconvolutional layers and no pooling layers, as shown in Figure 4.
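Because down-sampling is done by strided convolutions rather than pooling, the spatial size of the feature maps can be traced with the usual output-size formulas. The sketch below is only an illustration under stated assumptions: the kernel size 4, stride 2, padding 1, the 256-pixel input, and the depth of four down-sampling plus four up-sampling layers are hypothetical choices (one possible reading of "eight layers"), not values given in the text.

```python
def conv_out(size, kernel, stride, padding):
    """Output size of a strided convolution along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def deconv_out(size, kernel, stride, padding):
    """Output size of a transposed (fractionally-strided) convolution."""
    return (size - 1) * stride - 2 * padding + kernel


def trace_sizes(size, depth, kernel=4, stride=2, padding=1):
    """Spatial sizes through `depth` down-sampling then `depth` up-sampling layers."""
    sizes = [size]
    for _ in range(depth):
        sizes.append(conv_out(sizes[-1], kernel, stride, padding))
    for _ in range(depth):
        sizes.append(deconv_out(sizes[-1], kernel, stride, padding))
    return sizes


# Assumed 256 x 256 input and depth 4: each stride-2 layer halves or doubles the size.
sizes = trace_sizes(256, depth=4)
```

With these assumed settings the symmetric structure returns to the input size, which is the property that lets the generator output a label image of the same size as the aerial image.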

Loss Function
Different from CycleGAN, we do not need two GANs or any cycle consistency, because our road dataset contains matched pairs of aerial and binary images. Moreover, the road extraction task only needs translation in one direction, from the aerial image to the label image. Our loss function can be defined as follows:

$$L(G, D) = \alpha L_{cGAN}(G, D) + \beta L_{content}(G) \tag{5}$$

where $L_{cGAN}$ denotes the loss of CGAN, $L_{content}$ denotes the content loss, and $\alpha$ and $\beta$ are hyper-parameters that balance the two losses. First, we use an L1-norm loss function due to its simple form, i.e., $L_{content}(G) = L_1(G) = \mathbb{E}_{x \sim X}\left[\lVert G(x) - y \rVert_1\right]$, in which $\lVert \cdot \rVert_1$ denotes the L1 distance, $G(x)$ denotes the binary image generated from the input aerial image, $y$ denotes the ground truth, and $X$ denotes the aerial images in the training batch. $\mathbb{E}_{x \sim X}\left[\lVert G(x) - y \rVert_1\right]$ can also be written per batch as Equation (6), in which $m$ denotes the batch size during training, $i$ denotes the index of the samples in the current batch, $k$ denotes a sample different from the $i$th sample, $j$ denotes the index of the pixels in each image, and $M \times N$ denotes the size of the image. The loss function of our model with the L1-norm can be written as Equation (7). The L2-norm loss function is another choice, i.e., $L_{content}(G) = L_2(G) = \mathbb{E}_{x \sim X}\left[\lVert G(x) - y \rVert_2\right]$, and $\mathbb{E}_{x \sim X}\left[\lVert G(x) - y \rVert_2\right]$ can be written as Equation (8). The loss function of our model with the L2 loss can be written as Equation (9).
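As a rough illustration of how the weighted sum in Equation (5) combines an adversarial term with a content term, consider the sketch below. It is not the exact batch form of Equations (6) to (9): averaging over all pixels, and using the mean squared difference for the L2 variant, are assumptions made here for simplicity, and the function names and toy values are hypothetical. The defaults `alpha=1.0` and `beta=300.0` follow the values reported in the parameter settings.

```python
def l1_content(pred, target):
    """Mean absolute pixel difference over a batch (assumed normalization)."""
    diffs = [abs(p - t)
             for img_p, img_t in zip(pred, target)
             for p, t in zip(img_p, img_t)]
    return sum(diffs) / len(diffs)


def l2_content(pred, target):
    """Mean squared pixel difference over a batch (one reading of an L2 loss)."""
    diffs = [(p - t) ** 2
             for img_p, img_t in zip(pred, target)
             for p, t in zip(img_p, img_t)]
    return sum(diffs) / len(diffs)


def total_loss(adversarial, content, alpha=1.0, beta=300.0):
    """Weighted sum of the adversarial (CGAN) term and the content term."""
    return alpha * adversarial + beta * content


# Hypothetical batch with one 2 x 2 image flattened to four pixels.
pred = [[0.9, 0.1, 0.8, 0.2]]   # generator output G(x)
truth = [[1.0, 0.0, 1.0, 0.0]]  # ground-truth road mask y
loss_l1 = total_loss(adversarial=0.7, content=l1_content(pred, truth))
loss_l2 = total_loss(adversarial=0.7, content=l2_content(pred, truth))
```

With a large weight on the content term, pixel-level agreement with the ground truth dominates the generator objective, while the adversarial term acts as an additional constraint on the realism of the label image.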
The results of our model with the L1-norm and the L2-norm are shown in Figure 5. Both give satisfactory performance, and we choose the L2 loss as our element-wise loss function because it performs better. A detailed comparison is given in the experimental section. In summary, we use Equation (9) as the loss function of our proposed method to extract roads from aerial images.

Training Algorithm
In our proposed method, we choose Adaptive Moment Estimation (Adam) [75] to train our network because it is one of the most widely used algorithms in deep learning for optimizing network parameters. Adam is usually regarded as a combination of Stochastic Gradient Descent with Momentum (SGDM) [76] and Root Mean Square Propagation (RMSprop). When we update the parameters of the generative network, each iteration of the Adam algorithm can be written as Equations (10) to (13).
We first use Equation (10) to calculate the gradient of the generative network, as in other training algorithms based on mini-batch gradient descent. We then use Equation (11) to calculate the momentum term, which avoids oscillation of the gradient and accelerates convergence: a fraction $\rho_1$ of the accumulated gradient from previous iterations is retained, and only a fraction $(1 - \rho_1)$ of the gradient in the current iteration is added to update the parameters. The factor $1 - \rho_1^t$ in the denominator mainly removes the bias of the gradient estimate in the first few iterations. Next, we calculate the RMSprop term by Equation (12); unlike Equation (11), it uses the square of the current gradient in place of the current gradient, and the result is used to adapt the learning rate. Finally, we use Equation (13) to update the parameters of the generative network in the current iteration. In these equations, $g_G$ denotes the gradient of the parameters in the generative network, and $s_G$ and $r_G$ correspond to the bias-corrected first moment estimate and second raw moment estimate, respectively.
In other words, $s_G$ can be regarded as a momentum term and $r_G$ as an RMSprop term. $\rho_1$ and $\rho_2$ are two hyper-parameters called exponential decay rates, which are usually set to 0.9 and 0.999 in experiments; $\varepsilon$ denotes the learning rate and $t$ denotes the number of iterations; $\delta$ is a small positive number that keeps Equation (13) numerically stable and is experimentally set to $10^{-8}$; and $\odot$ denotes the element-wise product of matrices.
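The update described above can be sketched in a few lines. This follows the standard Adam formulation (gradient, momentum term, RMSprop term, bias correction, parameter step) and should be read as an assumption about the form of Equations (10) to (17), not a transcription of them. The function name `adam_step` and the `direction` argument are hypothetical; the default learning rate 0.0002 is the value from the parameter settings, and the decay rates are the typical values quoted in the text.

```python
import math


def adam_step(theta, grad, s, r, t, lr=0.0002, rho1=0.9, rho2=0.999,
              delta=1e-8, direction=-1.0):
    """One Adam update on a list of parameters.

    theta, grad, s, r: lists of floats (parameters, gradient, first and
    second moment accumulators). t: iteration counter starting at 1.
    direction: -1.0 for descent (generator), +1.0 for ascent (discriminator).
    """
    # Momentum term: keep rho1 of the history, add (1 - rho1) of the new gradient.
    s = [rho1 * si + (1.0 - rho1) * g for si, g in zip(s, grad)]
    # RMSprop term: same recursion on the squared gradient.
    r = [rho2 * ri + (1.0 - rho2) * g * g for ri, g in zip(r, grad)]
    # Bias correction for the first few iterations.
    s_hat = [si / (1.0 - rho1 ** t) for si in s]
    r_hat = [ri / (1.0 - rho2 ** t) for ri in r]
    # Parameter step with a per-parameter adaptive learning rate.
    theta = [th + direction * lr * sh / (math.sqrt(rh) + delta)
             for th, sh, rh in zip(theta, s_hat, r_hat)]
    return theta, s, r


# First iteration on a single hypothetical parameter with gradient 2.0.
theta_g, s_g, r_g = adam_step([1.0], [2.0], s=[0.0], r=[0.0], t=1)
theta_d, s_d, r_d = adam_step([1.0], [2.0], s=[0.0], r=[0.0], t=1, direction=1.0)
```

On the first iteration the bias correction makes the step size approximately equal to the learning rate regardless of the gradient magnitude, which is why the corrected estimates matter early in training.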
For the discriminative network, we use Equations (14) to (17) to update the parameters in each iteration, where $g_D$ denotes the gradient of the parameters in the discriminative network, and $s_D$ and $r_D$ correspond to the bias-corrected first moment estimate and second raw moment estimate, respectively.
Since our model has two networks, i.e., the generative and discriminative networks, but only one loss function, defined in Equation (9), we train the two parts alternately. We first train the discriminative network using stochastic gradient ascent, as shown in Equation (17), for one iteration, and then train the generative network using stochastic gradient descent, as shown in Equation (13), for one iteration; this is repeated until the training loss converges.

Datasets
All the experiments are conducted on the Massachusetts Roads Dataset. This dataset contains aerial images depicting urban, suburban, and rural areas in the state of Massachusetts, USA. The dataset consists of 1171 aerial images, each with a size of 1500 × 1500 pixels. Of these, 1108 images have been randomly assigned to the training set. The remaining 49 and 14 images are allocated to the test and validation sets, respectively. The dataset covers an area of approximately 2600 square kilometers in total, which corresponds to a Ground Sample Distance (GSD) of about 1.0 meter per pixel.
Each aerial image has an accompanying binary label image, indicating whether a pixel in the aerial image belongs to either the road or non-road class.Road centerline vectors retrieved from the OpenStreetMap project were used to generate the label images.The vectors were rasterized as white lines with a line thickness of 7 pixels, which, based on the GSD, is equivalent to 7 meters on the ground.An aerial and label image pair example from this dataset is illustrated in Figure 6.
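The figures quoted for the dataset can be cross-checked with a few lines of arithmetic. The calculation below assumes that the tiles do not overlap, which is an assumption made here only for the purpose of the estimate.

```python
# Values quoted in the text.
TRAIN, TEST, VALIDATION = 1108, 49, 14
IMAGE_SIDE_PX = 1500
GSD_M = 1.0            # ground sample distance in meters per pixel
LINE_THICKNESS_PX = 7  # rasterized road width in the label images

total_images = TRAIN + TEST + VALIDATION
tile_side_km = IMAGE_SIDE_PX * GSD_M / 1000.0
tile_area_km2 = tile_side_km ** 2
total_area_km2 = total_images * tile_area_km2  # assumes non-overlapping tiles
road_width_m = LINE_THICKNESS_PX * GSD_M
```

The three splits add up to the stated 1171 images, each tile covers 2.25 square kilometers, and the total of about 2635 square kilometers agrees with the quoted figure of approximately 2600.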

(1) Pix2pix [73]: Pix2pix is a framework that achieves state-of-the-art performance in image-to-image translation. The source code can be found at https://github.com/phillipi/pix2pix. During training and testing, the architecture and hyper-parameters are the same as in [73].
(2) CNN [62]: The architecture and hyper-parameters are exactly the same as in [62], which used a CNN with 6 hidden layers (3 convolutional layers, 1 pooling layer and 2 fully connected layers) to extract roads from aerial images. The input of the network is a sliding window of size 64 × 64, and the output is a label image of size 16 × 16 corresponding to the central region of the input. During training, we randomly choose 200 windows of 64 × 64 in each training image and use Nesterov's Accelerated Gradient (NAG) algorithm [77] with a learning rate of 0.0025 and a maximum of 100 epochs. During testing, we slide a 64 × 64 window with stride 16 over each image of the test set. Furthermore, before training, we augment the data by mirroring and reversal to obtain more training samples, and we discard some of the selected windows that contain only background to keep the road and background classes balanced.
(3) FCN [56]: We use the FCN-8s and FCN-4s models, both of which have 13 convolutional layers and 5 pooling layers; FCN-8s has 2 deconvolutional layers and FCN-4s has 3. During down-sampling, we fine-tune part of the VGG16 parameters to accelerate convergence. The network is trained with stochastic gradient descent (SGD) with a small learning rate, and we set the maximum number of epochs to 3000.
(4) DCGAN: The framework of DCGAN is shown in Figure 7; the input of the discriminator is only the output of the generator, without any condition. Since we use only the GAN loss, the loss function consists of the adversarial term alone. The FCN network in Figure 7 is the same as that in Figure 4, and the CNN with Sigmoid has the same structure as in Figure 1. We use mini-batch Adam with a learning rate of 0.0002, momentum 0.9, batch size 32, and a maximum of 500 epochs.

(5) C-DCGAN: We add conditions to DCGAN in order to improve performance. The structure is the same as in Figure 1. We use the loss function shown in Equation (9) with α = 1, β = 0. Other parameters are the same as those in (4).
(6) L2 loss only: To explore the influence of the L2 loss in our model, we run an experiment that uses only the L2 loss, i.e., the loss function shown in Equation (9) with α = 0, β = 300. Other parameters are the same as those in (5).
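The sliding-window test scheme of the CNN baseline in item (2) can be sketched as follows: a 64 × 64 window moves with stride 16, and each window predicts the 16 × 16 patch at its center. The helper names are hypothetical, and the handling of image borders (where the central patches do not reach the edge) is not described in the text, so it is left open here.

```python
def window_origins(image_side, window=64, stride=16):
    """Top-left coordinates (along one axis) of all windows that fit."""
    return list(range(0, image_side - window + 1, stride))


def central_patch(origin, window=64, out=16):
    """Half-open pixel interval [start, end) predicted by one window."""
    margin = (window - out) // 2
    return origin + margin, origin + margin + out


origins = window_origins(1500)
windows_per_image = len(origins) ** 2   # same origins along both axes

first = central_patch(origins[0])
second = central_patch(origins[1])

print(f"{len(origins)} window positions per axis, "
      f"{windows_per_image} windows per image")
print(f"first central patch: {first}, next: {second}")
```

Because the stride equals the output size, consecutive central patches tile the image interior without gaps or overlap, which is why a stride of 16 pairs naturally with a 16 × 16 output.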

Experimental Results
We use two types of figures to show our results (Figures 8-13). In the first type (Figures 8, 10 and 12), we compare the images extracted by the different methods against the label images. In the second type (Figures 9, 11 and 13), we produce hit/miss images by superimposing the label images and the extracted images on the original images to reveal the hit and miss areas. In these figures, green lines (or points) denote areas that are extracted correctly, red lines (or points) denote areas that contain roads but are not extracted by the model, and blue lines (or points) denote areas that contain no road but are incorrectly extracted as roads.
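The color coding of the hit/miss images can be illustrated with a small sketch. It assumes images stored as rows of RGB tuples and masks stored as rows of 0/1 values; this representation and the function name are illustrative choices, not the code used to produce the figures.

```python
GREEN = (0, 255, 0)   # road in the label and extracted by the model (hit)
RED = (255, 0, 0)     # road in the label but not extracted (miss)
BLUE = (0, 0, 255)    # extracted as road but not in the label (false alarm)


def hit_miss_overlay(image, label, pred):
    """Return a copy of `image` with hit/miss colors drawn on top.

    image: rows of RGB tuples; label, pred: rows of 0/1 values (1 = road).
    Pixels that are background in both masks keep their original color.
    """
    overlay = []
    for img_row, label_row, pred_row in zip(image, label, pred):
        row = []
        for pixel, y, p in zip(img_row, label_row, pred_row):
            if y and p:
                row.append(GREEN)
            elif y and not p:
                row.append(RED)
            elif p and not y:
                row.append(BLUE)
            else:
                row.append(pixel)
        overlay.append(row)
    return overlay


gray = (128, 128, 128)
image = [[gray, gray], [gray, gray]]
label = [[1, 1], [0, 0]]
pred = [[1, 0], [1, 0]]
overlay = hit_miss_overlay(image, label, pred)
print(overlay)
```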
We choose three images to demonstrate the system performance, shown in Figures 8-13. The first image (Figures 8 and 9) shows an area in which roads are narrow, such as the countryside or the edge of a city. The second image (Figures 10 and 11) shows an area in which some roads are wide and the road network is relatively complex, such as a city center. The third image (Figures 12 and 13) contains a body of water, with roads both inland and along the waterside. From the sixth column of the figures, we can see that although DCGAN uses deep neural networks, it fails to distinguish roads from the background. The main reason is that DCGAN is free to generate random images [61], so the generated images may contain much noise.
Figures 8i, 9i, 10i, 11i, 12i and 13i show the results when we use only the L2 loss as our loss function. The results show that the L2 loss leads to better performance than C-DCGAN, whose results are shown in Figures 8h, 9h, 10h, 11h, 12h and 13h.
In our proposed model, we assign a large weight to the L2 loss (α = 1, β = 300); the results of our model are shown in, for example, Figures 8j and 9j.
Table 1 shows the quantitative results of these 8 models, including average accuracy, precision, recall, and F1-score, which are calculated by Equations (18) to (21). The accuracy, precision and recall values in Table 1 are averages over the whole test set (49 images), and the F1-score is computed from the average precision and recall. From Table 1, we can see that our method performs best, with the highest F1-score. CNN also achieves good results, but CNN-based methods usually require data augmentation and sliding windows before training, which leads to higher computational cost. In this sense, CNN-based methods are not an end-to-end framework for road extraction.
Although our method achieves the best performance, the results shown in the last row of Table 1 can still be improved. Our model needs further improvement in areas where road networks are complex, especially for thin roads such as country roads. When some objects, such as roofs, look similar to roads, our model has difficulty distinguishing these regions, e.g., the red blocks shown in Figure 14. In reality, different roads have different widths, but in the Massachusetts Roads Dataset all roads are labeled with the same width (7 pixels), i.e., every road in the dataset is labeled as 7 meters wide. Therefore, some of the errors attributed to our model come from road width that is missing in the ground truth. This issue usually occurs when there are wide roads in the images, like the purple block shown in Figure 14. For the green blocks shown in Figure 14, some roads are not properly labeled in the ground truth, which also leads to apparent mistakes in road extraction.


Parameter Analysis
As stated in Section 2.2, we choose Unet as the FCN element of the generative network due to its good performance, and in Section 2.3 we choose the L2 loss as our element-wise loss.
In this section, we report the accuracy, recall, precision and F1-score of our FCN structure, Unet, L1 loss and L2 loss in Tables 2 and 3 to validate the choice of the FCN structure and the element-wise loss. From Table 3, we observe that the L2 loss performs better than the L1 loss. Another important aspect is choosing a proper size for the convolutional kernels of the CNN. We test two groups of kernel sizes, [4,4,4,4,3,3,3,3] and [11,11,7,7,5,5,4,4]; the results are shown in Figure 15 and Table 4.
From Figure 15 and Table 4, we notice that the small kernel sizes give better results than the large ones. Generally speaking, the kernel size determines the size of the receptive field. In road extraction, we do not need a large receptive field because roads in aerial images are usually small, so a smaller kernel size leads to better results.
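The link between kernel size and receptive field can be made concrete with the standard recurrence for a stack of convolutions. The strides of the layers are not given in this section, so the sketch below treats them as a parameter and uses stride 1 as a default purely for illustration; the absolute values for the actual network may therefore differ, while the ordering of the two groups is what matters.

```python
def receptive_field(kernel_sizes, strides=None):
    """Receptive field (in input pixels) of a stack of convolutions.

    Each layer with kernel k adds (k - 1) * jump pixels, where jump is the
    product of the strides of all preceding layers.
    """
    if strides is None:
        strides = [1] * len(kernel_sizes)   # illustrative assumption
    rf, jump = 1, 1
    for k, s in zip(kernel_sizes, strides):
        rf += (k - 1) * jump
        jump *= s
    return rf


SMALL_KERNELS = [4, 4, 4, 4, 3, 3, 3, 3]
LARGE_KERNELS = [11, 11, 7, 7, 5, 5, 4, 4]

rf_small = receptive_field(SMALL_KERNELS)
rf_large = receptive_field(LARGE_KERNELS)
print(f"stride 1: small group {rf_small} px, large group {rf_large} px")

# The ordering is the same if every layer down-samples by 2 (also assumed).
stride2 = [2] * 8
print(receptive_field(SMALL_KERNELS, stride2) <
      receptive_field(LARGE_KERNELS, stride2))
```

With unit strides the large group sees more than twice as many pixels along each axis as the small group, which illustrates why it is less suited to thin structures such as roads.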
We use Equation (9) as the loss function in our model. How to balance the C-DCGAN loss and the L2 loss is therefore a problem to be solved for the road extraction task. The above experiments show that the L2 loss plays an important role in road extraction, so we need to choose the best weight of the L2 loss in Equation (9). We run experiments on the test set with different weights of the L2 loss and plot the curves in Figure 16.
From Figure 16, we observe that the F1-score on the whole test set increases as the weight increases. The improvement is most significant when the weight is in the range 0 to 250, where the F1-score increases from 0.74 to 0.86. When the weight is larger than 250, the F1-score increases slowly and reaches 0.87 when the weight equals 300. Beyond this point, the F1-score does not change. We therefore set the weight to 300, and use α = 1, β = 300 in Equation (9).
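The role of the two weights can be sketched as follows, under the assumption that Equation (9) is a weighted sum of the adversarial (C-DCGAN) term and the L2 term, with α weighting the former and β the latter. The per-pixel mean used for the L2 term, the function names and the toy numbers are illustrative assumptions rather than the exact form of Equation (9).

```python
def l2_loss(pred, label):
    """Mean squared error between predicted and labeled pixel values."""
    return sum((p - y) ** 2 for p, y in zip(pred, label)) / len(pred)


def total_loss(adv_loss, pred, label, alpha=1.0, beta=300.0):
    """Assumed weighted sum: alpha * adversarial term + beta * L2 term."""
    return alpha * adv_loss + beta * l2_loss(pred, label)


# (alpha, beta) pairs of the variants compared in the experiments.
VARIANTS = {
    "C-DCGAN": (1.0, 0.0),
    "L2 loss only": (0.0, 300.0),
    "proposed": (1.0, 300.0),
}

# Toy values for illustration only; they are not results from the paper.
adv = 0.7
pred = [0.5, 1.0]
label = [1.0, 1.0]

losses = {
    name: total_loss(adv, pred, label, alpha, beta)
    for name, (alpha, beta) in VARIANTS.items()
}
print(losses)
```

With β = 300 the L2 term dominates the sum for any prediction that is not already close to the label, which matches the observation that the L2 loss plays the larger role in this task.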


Conclusions
In this paper, a novel end-to-end generative adversarial network has been proposed to perform road extraction in aerial images. A conditional GAN with L2 loss achieves better performance than the state-of-the-art methods. Our proposed method, road extraction based on generative adversarial networks, does not need large training datasets and still achieves the best performance. Compared with the other methods that also achieve good performance, our method is an end-to-end framework for road extraction and has a lower computational cost.
Although the proposed model achieves the best performance, the extraction results on country roads and complex road networks need further improvement. The performance of remote sensing image processing methods based on deep neural networks relies on the given training dataset. Data

Figure 1. The framework of our method.

Figure 2. Structure of the FCN we used.

Figure 3. Results of the FCN and Unet. (a) Ground Truth; (b) Result of FCN; (c) Result of Unet.

Figure 4. Structure of the Unet we used.

Figure 6. Image and label example taken from the test set in the Massachusetts Roads Dataset. (a) Aerial image; (b) Label image.


Figure 7. The framework of DCGAN.

Figure 15. Results of different kernel sizes. (a) Ground Truth; (b) Result of large kernel size; (c) Result of small kernel size.

Table 1. Performance comparison of different methods on the test set.

Table 2. Performance comparison of our FCN and Unet on the test set.

Table 3. Performance comparison of L1 loss and L2 loss on the test set.

Table 4. Performance comparison of different kernel sizes on the test set.