Deep Learning Approaches Applied to Remote Sensing Datasets for Road Extraction: A State-Of-The-Art Review

Abstract: One of the most challenging research subjects in remote sensing is feature extraction, such as the extraction of road features, from remote sensing images. Such extraction supports multiple applications, including map updating, traffic management, emergency tasks, road monitoring, and others. Therefore, a systematic review of deep learning techniques applied to common remote sensing benchmarks for road extraction is conducted in this study. The review is organized around four main types of deep learning methods, namely, generative adversarial networks (GANs), deconvolutional networks, fully convolutional networks (FCNs), and patch-based convolutional neural networks (CNNs). We also compare these deep learning models as applied to remote sensing datasets to show which methods perform well in extracting road parts from high-resolution remote sensing images. Moreover, we describe future research directions and research gaps. Results indicate that the highest reported performance is achieved by deconvolutional networks applied to remote sensing images, and that the F1 scores of the GAN model, the DenseNet method, and FCN-32 applied to UAV and Google Earth images are high: 96.08%, 95.72%, and 94.59%, respectively.


Introduction
Spaceborne, airborne, and drone-based sensors using advanced Earth observation and remote sensing technologies have obtained large amounts and different types of high-resolution images. Such images are extensively used in several applications, such as urban planning [1], disaster management [2], and emergency tasks [3]. Among topographic object classes, road objects are essential urban features. Therefore, the constant updating of road databases is necessary to achieve several geospatial information systems (GIS) goals, such as emergency functions, automated navigation, urban planning, and traffic control [4]. A road database can be created and updated using feature extraction from high-spatial-resolution satellite imagery [5]. Consequently, developing novel automatic techniques for extracting road classes from high-resolution satellite images and keeping road networks up to date in GIS databases is useful for a variety of applications [6]. High-resolution remote sensing imagery can produce a massive amount of data and has become the main data source.
A set of inclusion and exclusion criteria was established to identify previous studies and subjects relevant to the purpose of this work. The exclusion factors were as follows:
• The full text of the paper was not provided by the publisher;
• Remote sensing images were not used in the paper.
The inclusion factors were as follows:
• Articles written in English;
• Peer-reviewed conference and journal papers;
• Papers published during the 10-year period 2010-2019;
• Works that presented a deep learning technique for road extraction from remote sensing images.
A total of 38 records were initially identified. Subsequently, we excluded redundant papers and those that did not use remote sensing images for road extraction; thus, only 25 studies were accepted. Finally, we classified the selected documents by purpose as an outcome integration process and present the results in detail in Section 4. In Section 4, we present major findings, including the benefits and drawbacks of current approaches to road segmentation from remote sensing imagery via deep learning models, as well as evidence for each main outcome. We also discuss several recommendations for future research.

Remote Sens. 2020, 12, 1444

Results
This section elaborates on prior studies on deep learning methods that were applied to remote sensing images to extract road sections. We split the results into several subsections based on the type of deep learning method used (Figure 2).

Road Extraction Based on the Patch-Based CNN Model
In the patch-based CNN model, the probability that road is present is first predicted patch by patch with a particular stride, and the label map of the whole image is then produced by assembling all of the label patches. Figure 3 illustrates a general architecture of the patch-level CNN model. The initial part consists of convolutional and max pooling layers, followed by fully connected layers that act as a linear discriminator. In this section, we describe the prior studies that used the CNN model for road extraction.
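The patch-by-patch prediction and assembly described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of any reviewed paper: the names `predict_label_map` and `mean_brightness` are hypothetical, the trained CNN is replaced by a trivial stand-in classifier, and each patch receives a single road probability (many patch-based CNNs predict a small label patch instead). Overlapping patches are combined by averaging their votes.

```python
from typing import Callable, List

Image = List[List[float]]


def predict_label_map(image: Image, patch_size: int, stride: int,
                      classify_patch: Callable[[Image], float]) -> List[List[int]]:
    """Slide a window over the image, score each patch, and assemble a label map."""
    height, width = len(image), len(image[0])
    votes = [[0.0] * width for _ in range(height)]
    counts = [[0] * width for _ in range(height)]
    for top in range(0, height - patch_size + 1, stride):
        for left in range(0, width - patch_size + 1, stride):
            patch = [row[left:left + patch_size]
                     for row in image[top:top + patch_size]]
            prob = classify_patch(patch)  # stand-in for the trained CNN
            for r in range(top, top + patch_size):
                for c in range(left, left + patch_size):
                    votes[r][c] += prob
                    counts[r][c] += 1
    # Average overlapping votes; pixels never covered by a patch stay non-road.
    return [[1 if counts[r][c] and votes[r][c] / counts[r][c] >= 0.5 else 0
             for c in range(width)] for r in range(height)]


def mean_brightness(patch: Image) -> float:
    """Hypothetical stand-in classifier: treats bright patches as road."""
    values = [v for row in patch for v in row]
    return sum(values) / len(values)


image = [[1.0, 1.0, 0.0, 0.0] for _ in range(4)]
label_map = predict_label_map(image, patch_size=2, stride=2,
                              classify_patch=mean_brightness)
```

A smaller stride increases the overlap between patches, which smooths the assembled map at the cost of more classifier calls.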

Zhong et al. [39] implemented a CNN model to extract road and building objects from satellite imagery. The model fused low-level fine-grained features with high-level semantic meaning. In addition, further hyperparameters, such as the input image size, number of training epochs, and learning rate, were analyzed to assess the capability of the method in the context of high-resolution remote sensing images. The Massachusetts dataset, with a 1-m spatial resolution and an image size of 1500 × 1500 pixels, containing 1171 images for the road dataset and 151 images for the building dataset, was used for the evaluation. The Massachusetts dataset covers over 2600 square kilometers of the state of Massachusetts, with diverse rural, suburban, and urban areas [43]. By integrating the pretrained FCN method with the output of a novel four-stride pooling layer fed to the last score layer, and by fine-tuning on high-resolution spatial data, the extraction accuracy of the adjusted model improved significantly to over 78%.
Wei et al. [47] applied a technique to aerial images for extracting road classes based on a road structure-refined CNN model, which provided road geometric information and spatial correlation. The proposed model was combined with fusion and deconvolutional layers to obtain structured output. Furthermore, a novel road structure-based loss function was applied to the cross-entropy loss to yield a weight map, using the minimum Euclidean distance of every pixel to the road section, and thereby model the road geometric structure. The Massachusetts road dataset, including 1171 images randomly divided into 49, 14, and 1108 images for testing, validation, and training, respectively, was used to evaluate the proposed technique. The performance measures, namely, F1 score, recall, precision, and accuracy, were 66.2%, 72.9%, 60.6%, and 92.4%, respectively. The outcomes indicated that the suggested model could extract roads effectively and achieve better accuracy than other existing road segmentation methods. However, postprocessing was needed to improve the results. The public Massachusetts dataset and CNN code can be downloaded from https://www.cs.toronto.edu/~vmnih/data/ and https://github.com/AhmedAhres/Satellite-Image-Classification.
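To make the idea of a distance-based weight map concrete, the sketch below weights a per-pixel cross-entropy loss by the minimum Euclidean distance of each pixel to the nearest road pixel. The exact weighting function of Wei et al. [47] is not given in this review, so the linear form `1 + d / d_max` is purely an assumption chosen for illustration, as are the function names. The brute-force distance computation is only suitable for small masks and assumes at least one road pixel is present.

```python
import math
from typing import List

Mask = List[List[int]]


def distance_to_road(mask: Mask) -> List[List[float]]:
    """Minimum Euclidean distance from every pixel to the nearest road pixel."""
    road = [(r, c) for r, row in enumerate(mask)
            for c, v in enumerate(row) if v == 1]
    return [[min(math.hypot(r - rr, c - rc) for rr, rc in road)
             for c in range(len(row))] for r, row in enumerate(mask)]


def weighted_cross_entropy(probs: List[List[float]], mask: Mask,
                           eps: float = 1e-7) -> float:
    """Cross-entropy in which errors far from any road are penalised more."""
    dist = distance_to_road(mask)
    max_dist = max(max(row) for row in dist) or 1.0  # avoid division by zero
    total, n = 0.0, 0
    for r, row in enumerate(mask):
        for c, y in enumerate(row):
            p = min(max(probs[r][c], eps), 1.0 - eps)
            weight = 1.0 + dist[r][c] / max_dist  # assumed weighting, not from [47]
            total += weight * -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
            n += 1
    return total / n


mask = [[1, 0, 0]]
far_error = weighted_cross_entropy([[0.9, 0.1, 0.9]], mask)   # false road 2 px away
near_error = weighted_cross_entropy([[0.9, 0.9, 0.1]], mask)  # false road 1 px away
```

With this weighting, a false road prediction far from the true road (`far_error`) costs more than the same mistake next to the road (`near_error`), which is one way a loss can encode road geometry.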
Alshehhi et al. [48] implemented a patch-based CNN model for extracting road and building parts simultaneously from remote sensing imagery. Fully connected layers were replaced with global average pooling, which takes the mean of the feature maps from the final convolutional layer. Furthermore, the authors applied the simple linear iterative clustering method during postprocessing to integrate CNN features with low-level features, such as the compactness and asymmetry of buildings and roads. This process merged ungrouped building areas, connected disconnected road parts, and improved the performance of the proposed method. The Massachusetts dataset, including 10 images for testing, 137 images for training, and 4 images for validation, and the Abu Dhabi dataset with a spatial resolution of 0.5 m per pixel, including 30 images for testing, 150 images for training, and 30 images for validation, were used for the evaluation. The authors used the common correctness measure to assess the performance of the suggested approach, which was 91.7% for the Massachusetts dataset and 80.9% for the Abu Dhabi dataset. The results showed that the approach was effective in road and building extraction. However, further processing was needed to determine boundaries precisely.
Liu et al. [49] presented an approach for road centerline extraction from high-resolution remote sensing imagery that comprised four major stages. First, a CNN model was used to classify aerial images and learn features from raw images. Second, edge-preserving filtering was applied to the classified images together with the original images to exploit road edges. Third, multidirectional morphological and shape feature filtering was used during postprocessing to obtain reliable roads. Finally, an integrated Gabor filter model and multidirectional nonmaximum suppression were applied to extract road centerlines. The suggested method was applied to two datasets, namely, the EPFL dataset and the Massachusetts road dataset. Three accuracy measures were used to quantify the performance: completeness (95.40%), correctness (89.97%), and quality (86.21%), which indicated the advantage of the proposed method for road centerline extraction. However, certain centerlines produced by the proposed method were not single-pixel wide.
Li et al. [50] employed a CNN-based model to extract roads from high-resolution satellite imagery. First, a CNN model was applied to assign a label to every pixel and predict the probability of each pixel belonging to a road section. Second, a line integral convolution-based method was executed to preserve edge information, close small gaps, and smooth the rough map. Finally, several image-processing operations were implemented to acquire road centerlines. The authors used images from the Pleiades-1A satellite, with a spatial resolution of 0.5 m, and the GeoEye satellite to test their model. The completeness was 80.57%, the correctness was 96.57%, and the quality was 78.27%, which showed that the proposed model achieved high precision for road extraction in terms of correctness. However, the completeness and quality percentages were low, which was related to the complexity of the texture of various features in the images.
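The completeness, correctness, and quality measures quoted throughout this review follow the standard definitions based on true positives (TP), false positives (FP), and false negatives (FN); completeness corresponds to recall and correctness to precision. The sketch below uses hypothetical function names and is included only to show how the measures relate. In centerline evaluation, TP is usually counted within a buffer around the reference road, which is outside the scope of this sketch.

```python
from typing import Dict


def extraction_metrics(tp: int, fp: int, fn: int) -> Dict[str, float]:
    """Standard road-extraction measures computed from pixel or segment counts."""
    completeness = tp / (tp + fn)   # recall: share of reference road that was found
    correctness = tp / (tp + fp)    # precision: share of extracted road that is real
    quality = tp / (tp + fp + fn)   # penalises both kinds of error
    f1 = 2 * completeness * correctness / (completeness + correctness)
    return {"completeness": completeness, "correctness": correctness,
            "quality": quality, "f1": f1}


def quality_from(completeness: float, correctness: float) -> float:
    """Quality implied by completeness and correctness alone."""
    return 1.0 / (1.0 / completeness + 1.0 / correctness - 1.0)


metrics = extraction_metrics(tp=80, fp=10, fn=20)
# Completeness and correctness reported for Liu et al. [49] in the text above.
implied_quality = quality_from(0.9540, 0.8997)
```

Because quality is fully determined by the other two measures, `quality_from` offers a quick consistency check on reported figures; for the values reported for Liu et al. [49], it gives roughly 86.2%, in line with the reported 86.21%.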

Road Extraction Based on the FCNs Model
In contrast to the CNN model, which uses dense layers to obtain a fixed-length feature vector and only accepts images of a fixed size, the FCN model uses an interpolation layer after the final convolutional layer to upsample the feature map and restore it to the input size, and therefore accepts input images of any size. In FCNs, the final dense layers are replaced with convolutional layers, and the output is a label map. A general architecture of the FCN model is presented in Figure 4. In the following, previous research on FCN models for road extraction is described.
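The reason an FCN accepts inputs of any size can be illustrated with a small sketch: a 1 × 1 convolution scores every spatial position independently, and an upsampling step restores the input resolution. The function names are hypothetical, the weights are arbitrary rather than learned, and nearest-neighbour upsampling is used here only for brevity in place of the bilinear interpolation or learned upsampling that FCNs typically use.

```python
from typing import List

FeatureMap = List[List[List[float]]]  # height x width x channels


def conv1x1(feature_map: FeatureMap, weights: List[float],
            bias: float) -> List[List[float]]:
    """Score each position independently; works for any height and width."""
    return [[sum(w * v for w, v in zip(weights, pixel)) + bias for pixel in row]
            for row in feature_map]


def upsample_nearest(score_map: List[List[float]], factor: int) -> List[List[float]]:
    """Enlarge the score map by repeating each value factor x factor times."""
    out = []
    for row in score_map:
        wide = [v for v in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def fcn_head(feature_map: FeatureMap, weights: List[float], bias: float,
             factor: int) -> List[List[int]]:
    """Convolutional scoring followed by upsampling to a dense label map."""
    scores = upsample_nearest(conv1x1(feature_map, weights, bias), factor)
    return [[1 if s > 0 else 0 for s in row] for row in scores]


small = [[[1.0, 0.0], [0.0, 1.0]]]                 # 1 x 2 feature map, 2 channels
large = [[[1.0, 0.0]] * 3 for _ in range(2)]       # 2 x 3 feature map, 2 channels
small_labels = fcn_head(small, [1.0, -1.0], 0.0, factor=2)
large_labels = fcn_head(large, [1.0, -1.0], 0.0, factor=2)
```

The same weights are applied to both feature maps without modification, whereas a dense layer would require a fixed number of input values.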

Varia et al. [51] applied a deep learning technique, namely, FCN-32, to extract road parts from extremely high-resolution UAV imagery. UAV-based imaging systems, which commonly use drones, can be used for real-time assessment in several applications, monitoring tasks, and large-scale mapping, and are managed autonomously by onboard computers or remotely by human operators [52]. UAV-based remote sensing systems are used in various remote sensing applications, such as object recognition [53] and digital elevation model (DEM) generation [54]. Compared with traditional remote sensing systems, UAVs have multiple advantages, including improved security, high speed, low cost, and high flexibility. In addition, high-resolution images taken by drone systems provide improved detail for object extraction and detection. The suggested techniques were evaluated on a UAV image dataset with 189 training and 23 test images. The training time for the FCN-32 was approximately 370 s per image. The authors used quality, correctness, and completeness measures to show the models' efficiency for road extraction and found that the proposed models achieved satisfactory results and are effective for road extraction from UAV images. However, the models misclassified road areas as nonroad areas in certain highly complex regions, thereby resulting in a large number of false negatives (FN) and reducing the completeness and quality of the final output. The suggested models were highly dependent on the number of images fed into them for training. Thus, they should be trained on a large and varied set of images for better training and improved accuracy.
Kestur et al. [55] presented a novel FCN-based architecture called the U-shaped FCN (UFCN) to extract roads from UAV images. The model was applied to a UAV dataset with 109 images, approximately 70% of which were used for training and 30% for testing. The authors applied data augmentation during the training step to increase the dataset size and improve training. Prediction took 1.95, 7.68, 43.87, and 1.09 s per image for the UFCN, SVM, 1D-CNN, and 2D-CNN, respectively. The 1D-CNN model was slower than the UFCN model because of the computationally intensive architecture of the 1D-CNN network. The metrics used to assess classification performance, namely, F1 score, recall, precision, and overall accuracy, were 89.6%, 86.8%, 92.5%, and 95.2%, respectively. The authors also compared their model with a two-dimensional CNN model, a one-dimensional CNN model, and an SVM model. They found that the model outperformed all the aforementioned methods in terms of accuracy and prediction time. Although the result achieved by the proposed model was promising, the dataset should be extended over a larger area before the suggested method is used for road extraction from extremely high-resolution remote sensing imagery.
An FCN-8 network was proposed in [56] for road extraction from SAR images. The method was implemented on the TerraSAR-X dataset, with 20% of the data used for testing and 80% for training. The experimental outcomes indicated that the proposed model was able to extract road parts accurately. Open source code of FCN models for satellite image segmentation can be found at https://github.com/Mattymar/satellite-image-segmentation.

Road Extraction Based on the Deconvolutional Neural Networks (Dense Net)
Deconvolutional networks aim to extract hierarchical features from images and are closely related to a number of deep learning methods from the machine learning community. These models comprise an encoder part and a decoder part: the encoder provides a bottom-up mapping from the input image to the latent feature space, while the decoder maps the latent features back to the input image space. A general architecture of deconvolutional networks is shown in Figure 5. In the following, previous works that used deconvolutional models for road extraction from remote sensing datasets are highlighted.
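One common way to realize the encoder-to-decoder mapping is the pooling and unpooling pair used in SegNet-style networks, in which the encoder records where each maximum came from and the decoder places values back at those positions. The sketch below illustrates only this single mechanism, with hypothetical function names and no learned convolutions, and it assumes even input dimensions; it is not a description of any particular model reviewed here.

```python
from typing import List, Tuple

Grid = List[List[float]]
Indices = List[List[Tuple[int, int]]]


def max_pool_2x2(x: Grid) -> Tuple[Grid, Indices]:
    """Encoder step: halve the resolution and remember where each maximum was."""
    pooled, indices = [], []
    for r in range(0, len(x), 2):
        pooled_row, index_row = [], []
        for c in range(0, len(x[0]), 2):
            window = [(x[r + dr][c + dc], (r + dr, c + dc))
                      for dr in (0, 1) for dc in (0, 1)]
            value, position = max(window, key=lambda item: item[0])
            pooled_row.append(value)
            index_row.append(position)
        pooled.append(pooled_row)
        indices.append(index_row)
    return pooled, indices


def max_unpool_2x2(pooled: Grid, indices: Indices,
                   height: int, width: int) -> Grid:
    """Decoder step: restore the resolution, placing values at recorded positions."""
    out = [[0.0] * width for _ in range(height)]
    for pooled_row, index_row in zip(pooled, indices):
        for value, (r, c) in zip(pooled_row, index_row):
            out[r][c] = value
    return out


features = [[1.0, 2.0, 0.0, 0.0],
            [3.0, 4.0, 0.0, 5.0]]
latent, positions = max_pool_2x2(features)
restored = max_unpool_2x2(latent, positions, height=2, width=4)
```

The restored map is sparse; in a full decoder, subsequent convolutions would densify it into a segmentation map of the input size.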
Panboonyuen et al. [30] presented a technique based on a modified deep encoder-decoder neural network to extract road objects from remote sensing imagery. To improve the model, the authors enhanced certain parts of the approach, including the use of the exponential linear unit (ELU) function in place of the rectified linear unit function. In addition, the authors augmented the training data by incrementally rotating images to eight different angles and used a landscape metrics (LM) method to eliminate false road parts and improve the overall accuracy of the output. The designed model was tested on the Massachusetts dataset containing 49, 14, and 1108 images for testing, validation, and training, respectively. The most common metrics, namely, F1 score, recall, and precision, were used for the performance evaluation and reached 85.7%, 86.1%, and 85.4%, respectively. The results indicated that the suggested approach yields satisfactory results and outperforms state-of-the-art approaches in road extraction from remote sensing imagery in terms of performance metrics.
Wang et al. [57] introduced a semiautomatic technique based on a finite state machine (FSM) and a DNN, including two main steps, namely, training and tracking, for road extraction from high-resolution remote sensing imagery. In the training step, the model was trained to recognize the pattern of an input image. To generate training samples, the authors defined a vector-guided labeling approach that extracted a large number of image-direction pairs from available vector road maps and images. In the tracking step, a fusion strategy was used to determine the size of the detection window, and the trained DNN was used to recognize the extracted image patches. In general, the DNN was used to determine patterns in complicated scenes, and the FSM was used to control the behavior of the trackers and translate identified patterns into states. The model was applied to two datasets, namely, aerial and Google Earth images, which were divided into 60%, 20%, and 20% for training, testing, and validation, respectively. Completeness, correctness, and quality were used for the performance assessment and were 75%, 70%, and 74%, respectively, indicating that the suggested method could effectively extract road classes from high-resolution remote sensing imagery in areas that were not highly complex. However, the proposed method could not operate properly in extremely complicated scenes where roads and occluding objects have roughly equal reflectance characteristics.
Panboonyuen et al. [58] developed a new enhanced deep convolutional encoder-decoder model based on SegNet to segment road classes from high-resolution remote sensing imagery. The ELU activation function was incorporated into the model to improve accuracy. The LM method was applied to remove falsely categorized road classes and identify road patterns. In the final step, the authors used conditional random fields (CRFs) to sharpen the extracted roads. The proposed model was applied to two aerial and satellite datasets: (1) the Massachusetts dataset, including 1171 images divided into 1108, 14, and 49 images for training, validation, and testing, respectively, and (2) the Thailand Earth Observation System (THEOS) dataset containing 855 satellite images. The authors used F1 score, recall, and precision as performance measures, which reached 87.6%, 89.4%, and 85.8%, respectively, for the Massachusetts dataset and 64.9%, 58.4%, and 75.1%, respectively, for the THEOS dataset. The results indicated that the suggested approach outperforms other existing road segmentation techniques. However, this framework only works on extremely high-resolution remote sensing images, and distinguishing road sections in low- and medium-resolution remote sensing imagery remains challenging.
Constantin et al. [59] introduced a modified U-net CNN for extracting road classes from high-resolution remote sensing imagery. The authors trained the model with a novel loss that fuses binary cross-entropy with the Jaccard distance to decrease the number of false positives (FPs) and enhance the accuracy of binary classification. The proposed method was tested on the Massachusetts dataset, including 49 aerial test images, 14 validation images, and 1108 training images, with extra data augmentation to extend the dataset. For the accuracy assessment, overall accuracy, F1 score, recall, and precision were calculated, which were 97.14%, 74.54%, 75.48%, and 74.15%, respectively. Although the proposed model achieved a high overall accuracy of over 97%, its scores on the other metrics were low. Therefore, additional pre- and postprocessing operations are necessary to improve the classification efficiency of the proposed approach for road extraction.
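A fused binary cross-entropy and Jaccard distance loss of the kind attributed to Constantin et al. [59] can be sketched as below. The review does not state how the two terms are combined, so the simple weighted sum, the `jaccard_weight` parameter, and the soft (probability-based) Jaccard formulation are assumptions made for illustration only.

```python
import math
from typing import List


def bce_jaccard_loss(probs: List[float], targets: List[int],
                     jaccard_weight: float = 1.0, eps: float = 1e-7) -> float:
    """Binary cross-entropy plus a soft Jaccard distance over flattened pixels."""
    clipped = [min(max(p, eps), 1.0 - eps) for p in probs]
    bce = -sum(y * math.log(p) + (1 - y) * math.log(1.0 - p)
               for p, y in zip(clipped, targets)) / len(targets)
    intersection = sum(p * y for p, y in zip(clipped, targets))
    union = sum(clipped) + sum(targets) - intersection
    jaccard_distance = 1.0 - (intersection + eps) / (union + eps)
    return bce + jaccard_weight * jaccard_distance  # assumed combination rule


targets = [1, 0, 1, 0]
perfect_loss = bce_jaccard_loss([1.0, 0.0, 1.0, 0.0], targets)
uncertain_loss = bce_jaccard_loss([0.5, 0.5, 0.5, 0.5], targets)
```

The Jaccard term depends on the overlap between predicted and reference road pixels rather than on per-pixel agreement, which makes it less dominated by the large non-road background than cross-entropy alone. This also helps explain how a model can score over 97% overall accuracy while its F1 score stays near 75%.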
Zhang et al. [60] developed a deep residual U-net model, similar to the U-net architecture, for road semantic segmentation from high-resolution remote sensing imagery. The proposed network was built from residual units, which simplify network training. Rich skip connections were also used inside the model, which reduced the number of parameters and facilitated information propagation while achieving improved performance. The authors applied their model to the Massachusetts road dataset, including 1171 images divided into 49, 14, and 1108 images as the test, validation, and training data, respectively. The authors compared the suggested model with the U-net model and two other deep networks for road extraction and found that the suggested technique was more effective in extracting roads from high-resolution remote sensing imagery in terms of precision and recall. However, the introduced approach could not identify road sections in parking lots and under trees.
Hong et al. [61] employed a method based on richer convolutional features (RCFs) for road segmentation from high-resolution remote sensing imagery. The proposed method comprises four principal phases. (1) Training and testing samples were generated by preprocessing the main image. (2) The RCF network was trained on the training samples and applied to the testing images to generate strict road feature maps. (3) Autothreshold segmentation was applied to remove nonroad information and produce a binary road map. (4) Finally, road sections were extracted and vectorized. The authors applied their method to the Massachusetts road dataset, including 865 images. Four metrics, namely, precision, recall, F1 score, and overall accuracy, were used to assess the capability of the proposed method for road extraction, which were 85.8%, 98.5%, 91.5%, and 96.3%, respectively. Although the suggested approach achieved high accuracy for road class extraction from high-resolution remote sensing imagery, it could not obtain precise road width information owing to combined pixel and model structure issues.
Xin et al. [62] applied the DenseUNet model for road extraction from remote sensing images. The DenseUNet model included skip connections and dense connection units that facilitated the merging of features at various scales by joining different network layers. Two main datasets, namely, the Massachusetts and Conghua datasets, were used to evaluate model efficiency. The Conghua dataset had an image resolution of 0.2 m and consisted of three bands: red, green, and blue (RGB). This dataset comprised a total of 47 aerial images, each consisting of 3 × 6000 × 6000 pixels; 80% of the data were used for training and the remaining 20% for model validation. The Massachusetts dataset was separated into 49, 14, and 1108 images for testing, validation, and training, respectively. The authors used precision, recall, F1 score, intersection over union (IOU), and the Kappa coefficient to evaluate the efficiency of the proposed method for road extraction.
Li et al. [63] suggested a new convolutional neural network called the Y-Net, which includes two main modules, for feature extraction and fusion, to extract road parts from high-resolution remote sensing imagery. The feature extraction module comprised a deep downsampling-to-upsampling subnetwork for semantic feature extraction and a convolutional subnetwork without downsampling for detail feature extraction. The authors applied the fusion module to combine these features for segmenting road classes. The proposed technique was tested on the public Massachusetts dataset and a private dataset from the Jilin-1 commercial satellite. Both datasets were split into a training dataset with 12,376 images, a validation dataset with 474 images, and a testing dataset with 531 images. The authors calculated the mean region IOU (mean IOU), the Dice coefficient, mean accuracy, the Matthews correlation coefficient, and pixel accuracy for the accuracy assessment of the proposed model, which were 77.09%, 85.58%, 82.53%, 71.56%, and 97.36%, respectively. The experimental results showed the superiority and potential of the model for road semantic segmentation from remote sensing imagery. However, the proposed approach has several limitations. Road pixels occupy only a small portion of remote sensing imagery; thus, class imbalance is a considerable problem in road segmentation, particularly for narrow road sections, and the method does not perform well in such areas. In addition, the proposed method requires additional time for training, which could be reduced by introducing transfer learning and generative adversarial network (GAN) fusion in the model, thereby also improving accuracy. In general, deep learning models can achieve high accuracy in road extraction from remote sensing imagery compared with other machine learning approaches.
Cheng et al. [64] presented a new deep learning technique called the cascaded end-to-end (CasNet) deep learning model for detecting road classes and extracting road centerlines from extremely high-resolution remote sensing imagery. The suggested model includes two networks. The first is for detecting road regions, and the second is for extracting road centerlines, which are cascaded to the previous one and take full advantage of feature maps provided previously. The authors used a thinning method to achieve a single-pixel width and smooth road centerline. The model was evaluated on Google Earth images with 224 images. The Earth images obtained using Google Earth were in the form of aerial or satellite images with RGB color and different spatial resolutions based on the data source [47]. The dataset was randomly divided into 180, 14, and 30 images for training, validation, and testing, respectively. Several regularization methods and data augmentation approaches were applied to reduce overfitting and increase the size of the dataset. Classification metrics, namely, quality, correctness, and completeness, were introduced to evaluate the road extraction performance of the proposed model, which were 88%, 92%, and 94%, respectively. The results showed that the method is effective for road centerline extraction and road detection. However, the proposed method does not perform well in areas where roads are covered by tree occlusions. Therefore, additional high-level semantic information is needed to improve the performance of the method and to extract obstructions effectively. Xu et al. [65] used a new technique based on a densely connected convolutional network (DenseNet) by introducing local and global road information to segment roads from high-resolution remote sensing images. The method was applied to Google Earth data with a 1.2-m spatial resolution containing 224 images. 
The authors calculated F1 score, accuracy, precision, and recall measurement indicators for the accuracy evaluation, which were 95.72%, 96.3%, 96.30%, and 95.15%, respectively. The results proved that the introduced technique is efficient for road extraction. The experiment results were compared with other semantic segmentation methods, such as the DeepLab V3+, FCN, and U-net models, and showed that the proposed method outperformed the others.
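For reference, the completeness, correctness, and quality measures reported above for CasNet are conventionally defined from the true positive (TP), false positive (FP), and false negative (FN) counts as follows. This is the standard formulation; the matching tolerance used to decide whether an extracted road pixel counts as a TP is not stated here and may differ between studies.

```latex
\begin{align}
\text{completeness} &= \frac{TP}{TP + FN}, \\
\text{correctness}  &= \frac{TP}{TP + FP}, \\
\text{quality}      &= \frac{TP}{TP + FP + FN}
  = \left( \frac{1}{\text{completeness}} + \frac{1}{\text{correctness}} - 1 \right)^{-1}.
\end{align}
```

Completeness and correctness correspond to recall and precision, respectively, and quality has the same form as the IOU of the road class, so it can never exceed either of the other two measures.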
Buslaev et al. [66] developed a deep learning technique based on the U-Net family to extract roads from remote sensing imagery. The authors used an encoder similar to the ResNet-34 network and a decoder based on the vanilla U-Net decoder. The authors also designed a loss function that considers binary cross-entropy and IOU simultaneously. In addition, data augmentation was used to improve the performance of the method. The model was evaluated on a dataset collected by the DigitalGlobe satellite, with a 50 cm pixel resolution and 6226 images. Furthermore, 1243 validation images were provided to calculate the performance of the model. The IOU used for the accuracy assessment of the suggested method was 64%, indicating satisfactory results for road extraction. However, the model can be further improved by preparing high-quality labeled masks and refining data augmentation. Zhou et al. [67] introduced the D-LinkNet model for road semantic segmentation from remote sensing imagery. The proposed model contains an encoder-decoder structure, dilated convolution, and a pretrained encoder for extracting road sections. Dilated convolution is a beneficial alternative to pooling layers: it enlarges the receptive field of feature points and keeps detailed information, such as narrowness, connectivity, and complexity, without reducing the resolution of feature maps. The proposed technique was tested on the DigitalGlobe road dataset with 6226, 1243, and 1101 data items for training, validation, and testing, respectively. The IOU metric was evaluated and showed that the method has road extraction capabilities but retains several issues concerning road connectivity and recognition.
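The property of dilated convolution described above can be illustrated with a one-dimensional sketch: the same three-tap kernel covers a wider neighborhood as the dilation rate grows, while the output keeps the length of the input. This is a simplified illustration with hypothetical function names, not the D-LinkNet implementation.

```python
def dilated_conv1d(signal, kernel, dilation=1):
    """Zero-padded 1-D convolution whose kernel taps are `dilation` samples apart."""
    k = len(kernel)
    span = (k - 1) * dilation  # distance between the first and last tap
    pad = span // 2
    padded = [0] * pad + list(signal) + [0] * (span - pad)
    return [
        sum(kernel[j] * padded[i + j * dilation] for j in range(k))
        for i in range(len(signal))
    ]


def effective_kernel_size(k, dilation):
    """Number of input samples spanned by a k-tap kernel at a given dilation."""
    return k + (k - 1) * (dilation - 1)


signal = [0, 0, 0, 0, 1, 0, 0, 0, 0]  # a single impulse
kernel = [1, 1, 1]
dense = dilated_conv1d(signal, kernel, dilation=1)
dilated = dilated_conv1d(signal, kernel, dilation=2)
print(dense)    # response spans 3 samples
print(dilated)  # response spans 5 samples, same output length
```

With a dilation rate of 2, the three-tap kernel spans five input samples without any pooling or striding, so the feature map resolution is preserved.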
Doshi [68] applied an integrated model based on ResNet and an inception-style encoder, called the residual inception skip net, to extract roads from satellite images. The introduced model was implemented on a dataset with a 0.5-m pixel resolution and 6226 images gathered by DigitalGlobe satellites. The dataset was randomly divided into 85% and 15% for training and testing, respectively. The IOU calculated to assess the accuracy of the model was 61.3%, showing that the suggested combined method can generally exceed the two baseline approaches (i.e., U-Net and DeepLab). However, postprocessing strategies, such as the use of CRFs, could further improve the performance of the suggested method. Xu et al. [69] applied a deep CNN based on deep residual networks to extract roads from WorldView-2 satellite images. A Gaussian filter was first applied as a preprocessing operation to eliminate noise. Next, the M-Res-U-Net model was introduced for road semantic segmentation. The authors calculated precision, recall, and F1 score to assess the classification performance, which were 90.04%, 95.17%, and 92.77%, respectively. The proposed method could extract road classes efficiently and achieve improvements for the assessment factors. However, the approach did not perform well in certain areas wherein objects such as cars and building roofs had colors and spatial distributions similar to those of roads. The authors generated ground truths from vector maps by setting a buffer, so that all road areas were given similar widths, which affected the accuracy of the model. Therefore, generating trustworthy labels and considering topological relationships could improve accuracy. Henry, Azimi, and Merkle [56] used DeepLabV3+ and Deep Residual U-Net to extract road sections from SAR images. 
The authors also used a control variable and mean squared error in the training process over the spatial tolerance of the network to promote the capability of the method. Each road was manually labeled, from major apparent highways to minor detectable roads. The authors applied the proposed approaches on a TerraSAR-X dataset with 80% for training and 20% for testing. For the accuracy evaluation, IOU, precision, and recall indices were calculated, which were 45.46%, 71.69%, and 75.17%, respectively. The results showed that though the FCNN models obtained satisfactory quantitative outcomes, the models missed multiple road sections and predicted unanticipated features, such as forests and hills.
He et al. [70] implemented a transfer learning technique for road segmentation from high-resolution remote sensing imagery. First, the authors applied a deep network based on an improved U-net model for training. Second, cross-modal data were used to fine tune the first two layers of a pretrained network to adjust the local features of the cross-modal data. An autoencoder was used to convert the data into three bands and extract local features for the cross-modal data of various bands. For the evaluation, the proposed method was tested on 6626 WorldView-3 images with a 0.5-m spatial resolution per pixel. The images were split into 6035 and 591 images for training and testing, respectively. F1 score, precision, recall, and IOU indicators were used to evaluate performance, which were 58.03%, 59.23%, 59%, and 42.03%, respectively. According to the results, the suggested model could extract road sections efficiently but could not achieve high accuracy in complex environments where other objects exhibited reflectances similar to road classes. Xia et al. [71] applied a DeepLab architecture for road extraction from high-resolution satellite images. The authors first implemented a semiautomatic approach to produce labeled data. A road benchmark was generated automatically then revised manually based on the construction characteristics and road patterns built by the transportation industry. The authors studied data influenced by color distortion as a type of road. Subsequently, they trained a DCNN model with deep layers to learn different road attributes. The designed method was tested on a GF-2 dataset, with spatial resolutions of 1 and 4 m for the panchromatic and multispectral scanners, respectively. The experiment results illustrated that the suggested approach can recognize road classes from complicated positions with an accuracy of more than 80% in indistinguishable regions. 
However, smoothness estimation for curved lines is not successfully achieved by the proposed approach. Gao et al. [72] introduced a new framework called the refined deep residual CNN to extract roads from high-resolution satellite imagery. The proposed method comprises two main units, namely, residual connected and dilated perception units. The authors applied a postprocessing step based on a tensor-voting technique and math morphology to incorporate split roads and promote the performance of the proposed model. The suggested method was implemented on two datasets: (1) Massachusetts road images with a 1-m spatial resolution per pixel, including 60, 6, and 10 images for training, validation, and testing, respectively, and (2) GF-2 road images with a 0.8-m spatial resolution consisting of 60, 16, and 10 images for training, validation, and testing, respectively. The authors calculated IOU, accuracy, recall, precision, and F1 score indicators to assess the quantitative performance of the suggested approach, which were 65.91%, 98.10%, 77.94%, 83.88%, and 80.58%, respectively. The experimental results confirmed the efficiency advantage of the proposed technique for road extraction from remote sensing imagery. However, further processing is needed to achieve high accuracy in outline boundaries and complex urban areas. Xie et al. [73] applied a new road extraction method using a high-order spatial information global perception framework (HsgNet), which uses LinkNet as its basic network and embeds a middle block between encoder and decoder. The middle block learns to maintain various feature dependencies and channels' information, long-distance spatial relationship and information, and global-context semantic information. 
They implemented the proposed model on the DeepGlobe dataset, which consists of 622 test images, 622 validation images, and 4971 training images with a spatial resolution of 0.5 m and an image size of 1024 × 1024, as well as the SpaceNet dataset, which includes 567 test images and 2213 training images with an image size of 512 × 512. To evaluate the performance of the proposed method for road extraction, they calculated precision, recall, F1 score, and IOU, which were 83%, 82%, 71.1%, and 71.1%, respectively, for the DeepGlobe dataset and 81.6%, 84.5%, 83%, and 71%, respectively, for the SpaceNet dataset. The experimental results showed that the suggested model performed well for road extraction from high-resolution remote sensing imagery. The links to download the public datasets and official code repositories of the aforementioned deep learning models can be found in the online version at https://github.com/robmarkcole/satellite-image-deep-learning, https://github.com/jeradhoy/DeepSatelliteData, and https://github.com/divamgupta/image-segmentation-keras.

Road Extraction Based on the GANs Model
GANs comprise two main models, a generative model and a discriminator model, in which the generative model tries to capture the data distribution and the discriminator estimates the likelihood that a sample comes from the training data rather than from the generative model [74]. The generic architecture of the GANs model is presented in Figure 6. In this section, previous work on applying the GANs model to road segmentation is highlighted.
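The opposing objectives of the two models can be sketched with binary cross-entropy losses computed from the discriminator's output probabilities. The example below is a generic illustration with made-up probabilities; it uses the common non-saturating form of the generator loss, which is an assumption here and not necessarily the formulation used in the studies reviewed in this section.

```python
import math


def discriminator_loss(d_real, d_fake):
    """Binary cross-entropy: real samples should score 1, generated ones 0."""
    real_term = -sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = -sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term


def generator_loss(d_fake):
    """Non-saturating generator loss: low when generated samples fool the discriminator."""
    return -sum(math.log(p) for p in d_fake) / len(d_fake)


# Hypothetical discriminator outputs (probability that a sample is real).
confident = discriminator_loss(d_real=[0.9, 0.8], d_fake=[0.1, 0.2])
confused = discriminator_loss(d_real=[0.5, 0.5], d_fake=[0.5, 0.5])
print(f"discriminator loss, confident: {confident:.4f}")
print(f"discriminator loss, confused:  {confused:.4f}")
print(f"generator loss when detected:  {generator_loss([0.1, 0.2]):.4f}")
print(f"generator loss when fooling:   {generator_loss([0.5, 0.5]):.4f}")
```

For road segmentation, the samples scored by the discriminator would typically be segmentation maps, with reference maps treated as real and generator outputs treated as fake.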
Costea et al. [75] presented a new method named dual-hop generative adversarial networks (DH-GAN) to detect intersections and roads from UAV images at the higher semantic level of road graphs in the first step. They then applied a smoothing-based graph optimization method for pixelwise road segmentation and for finding the road graph. The F1 score, precision, and recall used to evaluate the model were 86%, 89.84%, and 82.48%, respectively, which demonstrated the efficiency of the proposed model for road extraction; the model was also able to minimize memory costs. Varia, Dokania, and Senthilnath [51] applied the GANs model for road extraction from UAV images. They used the U-Net model as the generator, and the model was trained on 189 UAV images and evaluated on 23 test images. Training took 300 s per image for GANs-UNet. They achieved an F1 score of 96.08%, which shows that the proposed model was efficient for road extraction from UAV images. Shi, Liu, and Li [74] implemented the GANs model to attain a smooth road segmentation map from 550 Google Earth images: 320 images were used for training, 100 for validation, and 130 for testing. They also used data augmentation procedures to increase the size of the dataset. An encoder-decoder SegNet model was used as the generator to produce a high-resolution segmentation map. The recall, precision, and F1 score that they achieved were 91.01%, 88.31%, and 89.63%, respectively, which shows the superiority of the proposed model for road extraction. The access link to the GANs model code for image segmentation can be found at https://github.com/eriklindernoren/Keras-GAN/tree/master/pix2pix.
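Because the F1 score is the harmonic mean of precision and recall, the values reported in this section can be related to one another with a short calculation. The sketch below uses the precision and recall figures quoted above; the function name is an illustrative assumption.

```python
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall (inputs and output in percent)."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall quoted above for the DH-GAN [75] and SegNet-based GAN [74].
dh_gan = f1_from_precision_recall(precision=89.84, recall=82.48)
segnet_gan = f1_from_precision_recall(precision=88.31, recall=91.01)
print(f"DH-GAN F1: {dh_gan:.2f}%")
print(f"SegNet-based GAN F1: {segnet_gan:.2f}%")
```

Both results agree with the reported F1 scores of 86% and 89.63% to within rounding.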

Discussion
Several deep learning techniques have been suggested for extracting road classes from high-resolution remote sensing imagery. However, the demand for improved precision in segmented road outputs remains. Compared with other machine learning methods, deep learning techniques have shown notable development in object segmentation from images. However, their efficiency in road extraction depends on the processing power, model complexity, and the size of the training data. This review of existing research indicates that, compared with other machine learning and traditional techniques, deep learning methods have obtained higher precision in extracting road parts from high-resolution remote sensing imagery.

Based on previous studies, we categorize the CNNs into four main models: the patch-based CNN model [40]; the FCN-based model [76,77]; deconvolutional net-based models, such as U-Net [78], SegNet [79], and DeepLab [80]; and the GAN-based model [81]. GANs, which have recently gained considerable attention [82], contain two parts called the generator and the discriminator. The generator strives to produce fake images that resemble real ones, whereas the discriminator strives to distinguish the fake images from actual images. Eventually, a dynamic balance can be reached between the two parts, and an image can be segmented by the generator. In FCN models, each pixel can be inferred end-to-end, rather than through patch-to-pixel prediction. In these models, the final dense layers are replaced by convolutional layers, so that the output of the last convolutional layer is the label map. Deconvolutional net-based models are characterized by deconvolutional layers, which form the decoder section. Finally, in the patch-based CNN model, the image block around a pixel is used as the input for training and for predicting the label of that pixel. The reviewed studies show that deconvolutional networks are the models most commonly applied by researchers for road semantic segmentation from high-resolution remote sensing imagery. We elaborate on the advantages and disadvantages of the discussed approaches to develop a general comparison (Table 1). Table 1 shows that each model has its own limitations and strengths. For example, FCN models use simple interpolation for upsampling, which lowers their precision. However, FCNs support pixel-to-pixel inference and end-to-end learning, in contrast to the patch-based CNN models that inspired them, which need extensive samples, ignore the correlation among neighboring pixels, and require heavy processing to recognize precise road boundaries. 
FCN models encounter problems with road connectivity, cannot make smooth predictions for curved lines, and produce segmentation maps with low spatial consistency. By comparison, the DeconvNet model can obtain higher spatial precision and is more adaptable than FCNs because it uses low-level information in the deconvolutional layers. However, applying this model requires a large amount of storage and memory as well as substantial computation. By contrast, the GANs model is more efficient because it can achieve a consistent segmentation map with road boundary information. However, the model encounters problems with a lack of convergence, vanishing gradients, and complex training. In addition, we compare the accuracy of different deep learning models applied to remote sensing datasets based on the common metrics [83] used to evaluate the efficiency of the proposed approaches for road extraction. Popular evaluation measures are calculated from a confusion matrix comprising four main factors, namely, false negative (FN), true negative (TN), true positive (TP), and false positive (FP) [83,84]. A general comparison of all the methods used in all datasets is provided to identify the most efficient technique for road extraction (Figure 7). The aforementioned works and their corresponding values are plotted on the x-axis and y-axis, respectively. Only the methods for which a dataset and performance results are reported are compared.
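The four confusion-matrix factors named above can be obtained by comparing a predicted road mask with the reference mask pixel by pixel. The following is a minimal sketch on small hypothetical masks, where 1 denotes road and 0 denotes background; the function name `confusion_counts` is an assumption made for illustration.

```python
def confusion_counts(predicted, reference):
    """Count TP, FP, FN, and TN between two binary masks of equal shape."""
    tp = fp = fn = tn = 0
    for pred_row, ref_row in zip(predicted, reference):
        for p, r in zip(pred_row, ref_row):
            if p and r:
                tp += 1  # road predicted, road in the reference
            elif p:
                fp += 1  # road predicted, background in the reference
            elif r:
                fn += 1  # background predicted, road in the reference
            else:
                tn += 1  # background in both
    return tp, fp, fn, tn


# Hypothetical 3 x 4 masks (illustration only).
reference = [
    [0, 1, 0, 0],
    [0, 1, 0, 0],
    [0, 1, 1, 1],
]
predicted = [
    [0, 1, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 1],
]
tp, fp, fn, tn = confusion_counts(predicted, reference)
print(f"TP={tp}, FP={fp}, FN={fn}, TN={tn}")
```

Metrics such as precision, recall, the F1 score, and IOU are all ratios of these four counts.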
We consider the F1 score metric, which is a trade-off measure between recall and precision, to compare the results achieved by different deep learning models for road extraction. The exceptions are models such as U-Net, D-LinkNet, and RISN applied to the DigitalGlobe satellite images, for which the authors reported only the IOU indicator. However, this indicator is only approximately 65% and does not demonstrate high precision. Figure 8 shows that the F1 score is high for the GANs-UNet model, DenseNet method, and FCN-32 applied to UAV and Google Earth images, at 96.08%, 95.72%, and 94.59%, respectively. In the GANs-UNet model, U-Net, an elegant fully convolutional network, served as the generator within the GANs framework to create a more accurate high-resolution segmentation map. In addition, the model was applied to UAV images, which have very high spatial resolution and vary in capture angle, color, shape, and orientation, leading to a highly precise road segmentation map compared with the other deep learning models. Figure 8 illustrates the results achieved for road segmentation from UAV images (Figure 8a,b) with an image dimension of 128 × 128, Google Earth images (Figure 8c) with a spatial resolution of 1.2 m and an image dimension of 256 × 256, and the Massachusetts dataset (Figure 8d) with a spatial resolution of 1 m and an image dimension of 375 × 375, using the FCN-32, GANs-UNet, DenseNet, DeepLab V3+, CNN, and RSRCNN methods. The first and second columns are the original and ground truth images, whereas the third and fourth columns depict the results achieved by the state-of-the-art methods. As can be seen from Figure 8, the GANs model applied to UAV images performed better and predicted fewer FP and FN pixels than the other methods. In addition, this model attains a smooth segmentation map with more detailed boundary information. 
In contrast, the CNN model applied to the Massachusetts dataset was unable to achieve high accuracy in road extraction compared with the RSRCNN method applied to the same dataset. The road parts extracted by the CNN suffer from fuzzy boundaries and the "salt and pepper" phenomenon because the CNN model relies only on texture and spectral features; the mixed pixels at road borders lead to misclassification, whereas the other methods improve classification performance by restraining the effect of mixed pixels through the segmentation process. In models such as DenseNet and GANs, road features are extracted from every convolutional layer and then integrated at multiple scales. Multiscale merging of road features not only uses high-level semantic information to avoid the influence of width changes, curvatures, and shadows and thus achieve precise road boundaries, but also utilizes low-level information to preserve detailed information of road features. By comparison, the CNN model misclassified more nonroad pixels as road, extracting larger road parts than the reference map with low accuracy.

Conclusions
Spatial data, especially for road networks, should be updated regularly owing to rapid changes in artificial and natural features. Providing road data using traditional methods is ineffective, as such approaches are costly and time consuming. By contrast, extracting road types using advanced remote sensing technologies can be economically and practically efficient. Numerous proposed methods for road extraction and road data updates using remote sensing images are described in this review. In this research, we discover that most studies concentrate on using powerful methods to overcome constraints. Therefore, the development of advanced machine learning methods, such as deep CNNs, for feature segmentation and extraction from remote sensing images has encouraged researchers to apply such models to extract road networks from high spatial resolution remote sensing imagery, owing to the considerable efficiency of deep convolutional approaches in different applications.
Although the methods utilized for road extraction used different data, this study can provide the following important outcomes.

1. The capabilities of deep learning methods for road extraction are more effective than those of regular approaches.
2. When the complexity of images is high and various road types are present, the accuracy of the models is low. Therefore, mixing robust pre- and postprocessing techniques is recommended and useful to achieve satisfactory results.
3. The appropriateness of deep learning approaches for road extraction pertaining to different variables, such as architecture, data, and hyperparameters, is determined.
4. The low efficiency of the proposed methods in terms of data quality, training dataset, and model hyperparameters is presented.
5. Occlusions, such as shadows, cars, and buildings, are similar to road features, such as colors, reflectance, and patterns. Road extraction remains challenging owing to such issues.
6. Further research is required to build detailed techniques with high precision. CNNs trained by one dataset may be inconsistent with other scenes. Nonetheless, if training datasets are adequate and a deep learning model can be created effectively, then the model can be implemented properly on most prevalent datasets.
In this review, state-of-the-art deep convolutional models that represent common and newly advanced methodologies are described. In conclusion, introducing several new methods related to road semantic segmentation is important, and research on different proposed techniques with cutting-edge technology is increasing.