Extraction and Calculation of Roadway Area from Satellite Images Using Improved Deep Learning Model and Post-Processing

Roadway area calculation is a novel problem in remote sensing and urban planning. This paper models it as a two-step problem: roadway extraction and area calculation. Roadway extraction from satellite images is a problem that has been tackled many times before. This paper proposes a method that uses pixel resolution to calculate the area of the roads covered in satellite images. The proposed approach uses novel U-net and ResNet variants called U-net++ and ResNeXt. The state-of-the-art model is combined with an efficient post-processing approach to improve the overlap with ground-truth labels. The performance of the proposed road extraction algorithm is evaluated on the Massachusetts dataset, where it outperforms existing solutions that use models from the U-net family.


Introduction
The calculation of the roadway area from openly available satellite images aims to automate the analysis of the total road area covered by satellite images. The task uses open-source satellite images with a specified resolution to calculate the entire road area covered in the image view. We formulate this task as a two-step problem. The first step is the roadway extraction from the given satellite images, while the second step involves calculating the roadway area using pixel resolution.
Zhou et al. [15] proposed an improved variant of U-net, U-net++, which outperformed U-net in medical image segmentation. This paper analyzes the performance of this architecture on the roadway extraction problem. We also highlight the significance of using aggregated residual transformations, such as ResNeXt [16], in place of regular residual networks [7]. Semantic segmentation models such as U-net [15] generate a noisy pixel probability map of the roadway mask, which must be post-processed to obtain a clean binary mask. In this study, we propose an extensive post-processing method that, combined with the deep learning architectures, outperforms previous methods.
We also introduce a simple mathematical formulation to solve the second step of calculating road area from the satellite image. This numerical approach uses pixel ratio and resolution to estimate the final result.

Methodology
This section describes the U-net++ model for roadway extraction in detail. We start with a basic overview of the model, followed by its components, the U-net++ and ResNeXt blocks shown in Figure 1. The training strategy, augmentations, and post-processing pipelines are explained in the later sections.

Model Overview
In order to improve the results of the existing U-net model, Zhou et al. [15] proposed U-net++. The paper introduces a denser U-net model with more skip connections. They hypothesize that feature maps should be enriched with information from lower layers before fusing into a decoder network.
Zhou et al. [15] argue that the dense convolutional blocks, whose number of layers depends on the pyramid level, bring the semantic level of encoder maps closer to that of the feature maps waiting in the decoder, and that the optimizer faces an easier optimization problem when the received encoder feature maps and the corresponding decoder feature maps are semantically similar.
Because of skip pathways, U-net++ produces full-resolution feature maps at various semantic levels, allowing us to supervise the model in the deep layers and prune the model.

ResNeXt: Aggregated Residual Blocks
Xie et al. [16] suggested using an aggregated set of residual transformations instead of plain transformations. Cardinality is defined as the size of this set of transformations. Xie et al. argue that increasing this cardinality is a more effective way of gaining accuracy than going deeper or wider. Consider T, a transformation consisting of multiple weighted neural layers. This is aggregated C times to form F(x) = Σ_{i=1}^{C} T_i(x). The block is then connected residually, y = x + F(x). The residual connection is also visualized in Figure 2.
This mathematical relation can be expressed as a layer diagram, as shown in Figure 2. It is important to note that the topology of all the weight layers on the same level is the same.
The ResNeXt architecture exploits the split-transform-merge strategy, along with ResNet's usual strategy of repeating layers. It outperforms ResNet on the ILSVRC 2016 classification task and the ImageNet dataset.
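As an illustration, the aggregated transformation y = x + Σ T_i(x) can be sketched in PyTorch, where the C parallel paths are realized as a single grouped convolution (groups = cardinality). The channel sizes below are illustrative, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class ResNeXtBlock(nn.Module):
    """Aggregated residual block: C parallel bottleneck transformations
    realized as one grouped 3x3 convolution (groups = cardinality)."""
    def __init__(self, channels=256, bottleneck=128, cardinality=32):
        super().__init__()
        self.transform = nn.Sequential(
            nn.Conv2d(channels, bottleneck, kernel_size=1, bias=False),
            nn.BatchNorm2d(bottleneck), nn.ReLU(inplace=True),
            # grouped conv = C parallel paths with identical topology
            nn.Conv2d(bottleneck, bottleneck, kernel_size=3, padding=1,
                      groups=cardinality, bias=False),
            nn.BatchNorm2d(bottleneck), nn.ReLU(inplace=True),
            nn.Conv2d(bottleneck, channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # y = x + F(x): residual connection around the aggregated transform
        return self.relu(x + self.transform(x))
```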

Post-Processing: Result Refinement
The U-net++ with a ResNeXt backbone outputs a probability map for the road segmentation. This map must be converted to a binary mask to obtain the extraction output. The primitive way is to apply a fixed threshold to generate the mask directly. Instead, the approach presented in Section 2.2.9 describes an extensive refinement pipeline that cleans up noise and faulty predictions. This post-processing pipeline is the most important contribution of this paper and is explained in detail in the later sections.

Experiments and Metrics
This section describes experiment settings and the metrics that were evaluated via experiments.

Dataset
For testing our algorithm on openly available images, we chose the Pix2Pix Maps dataset, which was introduced by Isola et al. [17]. The Pix2Pix dataset was created to train Pix2Pix GAN models. We use the aerial-to-map part of Pix2Pix. This paper also uses the Massachusetts Roads dataset to test the proposed model and show that it outperforms previous models of the U-net type.

Pix2Pix Dataset
The Pix2Pix dataset consists of various datasets, such as labels to street scenes, labels to the facade, black and white to color, aerial-to-map dataset, day to night, and edges to photo. We focus on the aerial-to-map dataset in this study. An example of an aerial-to-map dataset is shown in Figure 3.
The dataset is divided into training and validation sets of 1096 and 1098 images, respectively. The images are scraped from Google Maps over the Manhattan region of New York, which makes them openly available.
The proposed U-net++ model takes the input of the satellite image and outputs the binary mask. Therefore, we convert all the map images into binary masks. The image is first converted to grayscale and then thresholded to obtain a binary mask. Various threshold values are explored to obtain the final mask. This process is visualized in Figure 4. These binary masks are overlaid on the original images to visualize the roads in the images, as shown in Figure 5.
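A minimal sketch of this mask-generation step, assuming roads render as near-white in the map tiles (the threshold value here is illustrative; the paper explores several values):

```python
import numpy as np

def map_to_binary_mask(rgb_map, threshold=200):
    """Convert an RGB map tile to a binary road mask: grayscale
    conversion followed by a global threshold. Roads in map-style
    tiles are near-white, so bright pixels are kept as road."""
    # ITU-R BT.601 luma weights for RGB -> grayscale
    gray = rgb_map @ np.array([0.299, 0.587, 0.114])
    return (gray >= threshold).astype(np.uint8)
```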

Massachusetts Dataset
This paper uses the Massachusetts dataset for testing the roadway extraction model. Much of the previous work on roadway extraction focuses on this dataset [2][3][4][6]. Hence, the proposed model is trained on this dataset. The Massachusetts Roads dataset was built by Mnih et al. [18]. It consists of 1171 images in total: 1108 for training, 14 for validation, and 49 for testing. All images are 1500 × 1500 pixels with a resolution of 1.2 m per pixel.
Massachusetts provides us with binary masks to predict roadways, and we use the same training strategy for this dataset as with Pix2Pix.

Augmentations
Deep neural networks require large datasets to generalize well. Both the Pix2Pix and Massachusetts datasets contain only around 1000 images each, and using them directly does not provide enough variation for our models to generalize. We therefore use on-the-fly augmentation while training: a random augmentation is applied to each image before it is fed to the model, so a different variant is generated every time the dataset is accessed. Although this does not allow exact reproduction of results, it has a clear advantage over augmenting the dataset offline and storing it, as it uses no extra disk space. Training for a large number of epochs with small random augmentations ensures that the model always converges to approximately the same minimum. The following augmentations are explored during training:
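On-the-fly joint augmentation of the image and its mask can be sketched as follows; the specific transforms (flips and 90° rotations) are illustrative choices, not the paper's exact augmentation list:

```python
import numpy as np

def random_augment(image, mask, rng):
    """Apply one random geometric augmentation jointly to the image and
    its road mask, as would be done inside a Dataset's __getitem__ so
    that each epoch sees a different variant of every sample."""
    if rng.random() < 0.5:                       # horizontal flip
        image, mask = image[:, ::-1], mask[:, ::-1]
    if rng.random() < 0.5:                       # vertical flip
        image, mask = image[::-1, :], mask[::-1, :]
    k = int(rng.integers(0, 4))                  # random 90-degree rotation
    image, mask = np.rot90(image, k), np.rot90(mask, k)
    return image.copy(), mask.copy()
```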

Loss and Metrics
This paper uses a combination of Dice loss and binary cross-entropy to train the model. Dice loss is defined as one minus the Dice coefficient:

L_DICE = 1 − (2 Σ_n y_n ŷ_n) / (Σ_n y_n + Σ_n ŷ_n)

The combined training objective is

L_COMB = α · L_BCE + (1 − α) · L_DICE

where L_BCE represents binary cross-entropy and L_DICE represents Dice loss, N is the total number of pixels in the image, y_n is the predicted softmax value of pixel n, and ŷ_n ∈ {0, 1} is the ground-truth label of pixel n (0 for a non-road pixel, 1 for a road pixel). α = 0.75 worked best experimentally. All models are trained using the same cost function.

Along with the loss function, various metrics are tracked during training. Precision is the fraction of correctly classified road pixels among the predicted road pixels, while recall is the fraction of ground-truth road pixels that are matched. F1-score is the harmonic mean of precision and recall, and IoU is the intersection-over-union between the predicted road mask and the ground truth:

Precision = TP / (TP + FP), Recall = TP / (TP + FN), F1 = 2 · Precision · Recall / (Precision + Recall), IoU = TP / (TP + FP + FN)

where TP denotes true positives, FP false positives, and FN false negatives.
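A PyTorch sketch of the combined cost and the tracked metrics, assuming `pred` holds per-pixel road probabilities in [0, 1]; the smoothing term `eps` is an assumption added to avoid division by zero:

```python
import torch

def combined_loss(pred, target, alpha=0.75, eps=1e-7):
    """L_COMB = alpha * L_BCE + (1 - alpha) * L_DICE with alpha = 0.75,
    where L_DICE is one minus the (smoothed) Dice coefficient."""
    bce = torch.nn.functional.binary_cross_entropy(pred, target)
    intersection = (pred * target).sum()
    dice_loss = 1 - (2 * intersection + eps) / (pred.sum() + target.sum() + eps)
    return alpha * bce + (1 - alpha) * dice_loss

def seg_metrics(pred_mask, gt_mask):
    """Precision, recall, F1, and IoU computed from TP/FP/FN pixel counts
    on binary masks."""
    tp = ((pred_mask == 1) & (gt_mask == 1)).sum().item()
    fp = ((pred_mask == 1) & (gt_mask == 0)).sum().item()
    fn = ((pred_mask == 0) & (gt_mask == 1)).sum().item()
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    iou = tp / (tp + fp + fn)
    return precision, recall, f1, iou
```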

Training Strategy
The U-net++ model was implemented in the PyTorch framework with the help of the segmentation_models.pytorch library (https://github.com/qubvel/segmentation_models.pytorch), version 0.2.0. This paper adopts a two-step training strategy. All models were trained on an Nvidia Tesla P100-PCI-E-16GB GPU (NVIDIA Corporation, Santa Clara, CA, USA).

U-net++ takes 0.22 s for inference of a single 256 × 256 image, whereas U-net takes 0.13 s (averaged over 100 runs on an Nvidia Tesla K80 GPU). Hence, U-net++ might not be suitable for real-time applications, but it provides significant gains for accurate road area estimation.

First Step
For the Pix2Pix dataset, we use randomly resized crops of size 512 × 512 to train the model, which allows a maximum training batch size of 8 on our hardware. For the Massachusetts dataset, where the original images are 1500 × 1500, the images are resized to 256 × 256 for training, but inference is performed on the original 1500-sized images padded to 1536 × 1536. The extensive augmentations described in the Augmentations section provide variation to the dataset. The model is initialized with weights pretrained on the ImageNet dataset. The Adam optimizer is used with a learning rate of 1 × 10^−3; the model is trained for 20 epochs, and the learning rate is decreased by a factor of 0.1 at epochs 3, 5, 7, 9, 10, and 12.
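The first-step schedule above can be expressed with PyTorch's MultiStepLR; the single parameter tensor below is a stand-in for the real model parameters:

```python
import torch

# Adam at 1e-3, decayed by a factor of 0.1 at epochs 3, 5, 7, 9, 10,
# and 12 over 20 epochs, as described in the training strategy.
params = [torch.nn.Parameter(torch.zeros(1))]   # stand-in for model.parameters()
optimizer = torch.optim.Adam(params, lr=1e-3)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[3, 5, 7, 9, 10, 12], gamma=0.1)

for epoch in range(20):
    # ... one epoch of training would run here ...
    optimizer.step()      # placeholder for the real update
    scheduler.step()
```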

Second Step
The model is reinitialized with the best weights from the previous step. The augmentations are reduced to random cropping only, and a learning rate of 1 × 10^−5 is used for 20 epochs, decreased by a factor of 0.1 every 2 epochs.
Finally, the weights with the best validation L_COMB from the second step are loaded, and the model output is passed to the post-processing pipeline.

Post-Processing Pipeline
Segmentation models output softmax probabilities that range from 0 to 1, representing the probability of each pixel being part of a road. The ground truths in the dataset are binary images of 0s and 1s. To convert the soft probability map into a binary image, we follow an extensive pipeline that cleans up noisy predictions caused by shadows, buildings, open ground, and similar sources. Our image processing operations are inspired by a paper by Adam Van Etten [19] and the solutions to the SpaceNet challenge [20].
In Figure 6, model predictions are visualized. We can see that cleaning up some noise from the images can improve predictions considerably. The pipeline consists of the following stages.

Median Blurring
Median blurring is a non-linear filtering operation. It replaces each pixel value with the median of the values in the local neighborhood defined by the kernel. Median filters are also edge-preserving [21], which helps retain road boundaries while suppressing noise. Median blurring darkens the non-road pixels and highlights the roads. We use a square kernel of size 15 × 15 in our experiments. The median blurring of a sample prediction from Pix2Pix is visualized in Figure 7.
Figure 7. Median blurring operation: before applying the filter (left) and after applying the filter (right).
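A SciPy sketch of this stage, using `scipy.ndimage.median_filter` as an equivalent of OpenCV's `medianBlur`:

```python
import numpy as np
from scipy.ndimage import median_filter

def median_blur(prob_map, kernel=15):
    """15x15 median filter over the probability map: isolated non-road
    responses are suppressed (the median of a mostly-zero neighborhood
    is zero) while road edges are preserved."""
    return median_filter(prob_map, size=kernel)
```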

Gaussian Adaptive Thresholding
Adaptive thresholding is a technique where a different threshold is calculated for every local region in the image. This technique makes the assumption of an approximately uniform illumination in a smaller image region. The image is divided into sub-images of equal sizes, binarized using a local threshold, and then the results are interpolated to the original image size.
In adaptive Gaussian thresholding, the threshold value is the weighted sum of neighborhood values, where weights are a Gaussian window. We apply the Gaussian threshold on 85 × 85 patches of the image. The Gaussian adaptive thresholding for the sample Pix2Pix image is shown in Figure 8.
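A rough SciPy sketch of Gaussian adaptive thresholding; the sigma derived from the block size and the offset constant `c` are assumptions approximating OpenCV's ADAPTIVE_THRESH_GAUSSIAN_C behavior rather than an exact reimplementation:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def adaptive_gaussian_threshold(img, block=85, c=0.0):
    """Per-pixel threshold = Gaussian-weighted local mean minus a small
    constant `c`, computed over roughly block-sized neighborhoods."""
    local_mean = gaussian_filter(img.astype(float), sigma=block / 6)
    return (img > local_mean - c).astype(np.uint8)
```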

Removal of Small Connected Objects
After adaptive thresholding, there are small blobs that act as noise in the image. There is a need to remove all blobs smaller than a fixed size. The blobs are defined on the basis of connectivity.
"Connectivity" determines which elements of the array are considered neighbors of the central element: elements up to a squared distance of "connectivity" from the center count as neighbors. Here, we define connectivity such that only the non-diagonal elements are neighbors (4-connectivity).
In our work, we remove connected objects that are smaller than an empirically selected value of 4900 pixels. The removal of connected blobs is shown in Figure 9.
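This stage can be sketched with SciPy's connected-component labeling, whose default structuring element uses exactly the 4-connectivity (non-diagonal neighbors) defined above:

```python
import numpy as np
from scipy.ndimage import label

def remove_small_objects(mask, min_size=4900):
    """Drop connected blobs smaller than `min_size` pixels.
    `label` with its default structure uses 4-connectivity, matching
    the paper's definition of neighbors."""
    labeled, _ = label(mask)
    sizes = np.bincount(labeled.ravel())   # pixel count per component
    keep = sizes >= min_size
    keep[0] = False                        # label 0 is the background
    return keep[labeled].astype(np.uint8)
```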

Small Kernel Erosion
Erosion erodes away the boundaries of the roads in the image. A small 2D kernel is slid over the image, and a pixel in the binary image is kept as 1 only if all pixels under the kernel are 1; otherwise, it is set to 0.
We use a small square kernel of a 3 × 3 size, so as not to break any connection between the roads. This small kernel erosion process is visualized in Figure 10.
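A SciPy sketch of this final stage, using a 3 × 3 structuring element as described:

```python
import numpy as np
from scipy.ndimage import binary_erosion

def erode(mask):
    """Erode with a 3x3 square kernel: a pixel survives only if every
    pixel under the kernel is 1, shaving one pixel off road boundaries
    while keeping roads connected."""
    return binary_erosion(mask, structure=np.ones((3, 3))).astype(np.uint8)
```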

Results
We apply our proposed model and post-processing pipelines to the prepared Pix2Pix dataset. The results are reported in Table 1. We also compare our approach to previous works on the Massachusetts dataset in Table 2. Here, PP denotes the post-processing pipeline. We find that using post-processing gives us an improvement in terms of Precision, F1-score, Dice score, and IoU. The recall, on the other hand, does not improve.

Recall Study
Recall measures how well our model identifies true positives: out of all the actual road pixels, how many does the model find? While cleaning up noisy pixels in post-processing, some of the correct road pixels are removed too, which decreases recall slightly; however, precision increases greatly. The net increase in F1-score in Table 1 shows that the gain in precision outweighs the loss in recall.
In Figure 11, we show examples where post-processing decreased the recall of the model. The post-processing pipeline cleans up the noise in the image and produces a cleaner road network. This matters because, for road area calculation, we need to remove as much noise as possible.
Figure 11. Example predictions from the Pix2Pix dataset with and without the post-processing pipeline, along with recall scores.

Area Calculation
In the previous sections, we saw an improved deep learning model for roadway extraction. The next step in our two-step process for area calculation is pixel-area estimation.
The resolution of the images is then taken into account: around 0.5 m/pixel for Pix2Pix and 1 m/pixel for the Massachusetts dataset. The total road area is obtained by multiplying the per-pixel ground area by the road pixel count.
Here, I_res = 0.5 m/px × 0.5 m/px = 0.25 m²/px². R_mask(i, j) is a binary road mask in which 1 denotes road and 0 denotes non-road area. Summing the mask gives the total road pixel area, so the road area is A = I_res × Σ_{i,j} R_mask(i, j).
In Figure 12, the road segment is shown in red. The total number of red pixels is the road pixel area of the image. The sum comes out to 48,498 pixels; hence, the calculated road area is 48,498 px² × 0.25 m²/px² = 12,124.5 m². Another example is shown in Figure 13: here, the road pixel area is 42,453 px², so the calculated road area is 42,453 px² × 0.25 m²/px² = 10,613.25 m². Area calculation depends strongly on the segmentation accuracy of road pixels in the satellite image; misclassification errors in the road mask cause the estimated area to deviate from the actual value. Figure 14 presents some erroneous predictions from the model, attributable to two causes: (a) external factors such as lakes, rivers, trees, and buildings, along with their shadows, occlude the roadway area and reduce interpretability; (b) noisy labels in the dataset limit the model's performance (examples in Figure 15).
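The pixel-area estimation above reduces to a few lines of NumPy:

```python
import numpy as np

def road_area_m2(road_mask, resolution_m_per_px=0.5):
    """Total road area = (number of road pixels) * (ground area per pixel).
    A resolution of 0.5 m/px gives 0.25 m^2 per pixel, as for Pix2Pix."""
    pixel_area = resolution_m_per_px ** 2      # m^2 covered by one pixel
    return int(road_mask.sum()) * pixel_area
```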

Related Work
In the past few years, the problem of roadway extraction has received significant attention, and extensive research solutions have been proposed. Initial approaches included skeleton-based road-line extraction techniques, such as those in references [24,25]. The road centerline can also be obtained with image processing operations such as morphological thinning algorithms [26].
Road extraction from openly available satellite images can be considered an image segmentation or pixel-level classification problem. For example, Song and Mingjun [27] use pixel spectral information and image-segmentation-derived object features for road extraction. Alshehhi and Marpu [23] use morphological filtering followed by graph-based segmentation to extract road networks from high-resolution satellite images.
These described research studies are mainly obtained by data-driven and heuristic methods. The basis of these preliminary studies is unsupervised learning, such as global optimization and graph methods dependent on color features. A drawback of using color features is color sensitivity. If the obtained road maps are of different colors, the unsupervised algorithms will perform poorly. Hence, to accurately extract road networks of various scales from remote-sensing satellite images, new robust methodologies, such as deep learning algorithms, are required.
Recent years have seen great strides in the field of deep learning. Deep neural networks now provide state-of-the-art solutions to many computer vision tasks. Image classification [28], scene recognition [29], and object detection [30] all have deep convolutional networks as their fundamental building blocks.
The field of remote sensing has also seen a rise in the use of deep neural networks. Reference [18] uses synthetic road labels to train neural networks to learn features and generate predictions with broad context. References [22,31] perform object detection and extraction with convolutional neural networks (CNNs). These studies outperform traditional road extraction methods.
Zhong et al. [1] proposed a CNN model and estimated the influence of filter stride, learning rate, input data size, training epoch, and fine-tuning for building and road extraction. They use the Massachusetts Roads dataset for this purpose. Wei et al. [2] proposed a refined CNN for road extraction in aerial images. It consisted of both deconvolutional and fusion layers. Reference [23] used a patch-based CNN for extracting roads and buildings simultaneously. Panboonyuen et al. [3] presented an encoder-decoder deconvolution network using SegNet [4] trained using ELU [5] with eight-times data augmentation.
Zhang et al. [6] proposed a deep residual U-net for road semantic segmentation from high-resolution satellite imagery. The network includes residual connections, which reduce the parameter count and facilitate information propagation to counter the vanishing gradient problem [7]. It was trained and tested on the Massachusetts dataset and showed improved results. Many new approaches for road-space extraction have been proposed using U-net-based architectures. Reference [8] describes road extraction from high-resolution remote-sensing images using richer convolutional features. Reference [9] describes a dense U-net architecture, while reference [10] describes the Y-net architecture. Cascaded end-to-end convolutional neural networks were introduced in reference [11]. References [12,13] use U-net models for road-space extraction.

Conclusions
This paper proposed the extraction and calculation of the roadway area from satellite images using improved versions of U-net and ResNet, namely U-net++ and ResNeXt, for state-of-the-art performance. An extensive training strategy with a hybrid loss was employed, and augmentations were introduced to add variation to the dataset. The proposed model was trained on the modified Pix2Pix dataset and the Massachusetts dataset. Results on the Massachusetts dataset were compared with previous models trained on the same data, showing that the proposed model improves on the existing U-net F1-score baseline by 3%.
This paper also provided a strategy to calculate the area of road covered in the satellite image using pixel-area estimation. The proposed approach made it possible to calculate the roadway area of any satellite image. This framework can be applied to any openly available satellite images to estimate the road area.
The Pix2Pix dataset used to train the model was generated programmatically and hence contains label noise. High-resolution satellite imagery with accurately labeled road areas should be pursued for a more accurate model.
Various deep learning architectures have been proposed since ResNet (apart from ResNeXt), such as EfficientNet [32], NFNet [33], and novel transformer architectures [34]. Using these models as backbones for U-net or U-net++ is a research area that needs further exploration. Segmentation models beyond the U-net family, such as R-CNN-based [35] and GAN-based [36] approaches, remain to be applied to the Pix2Pix dataset. The Pix2Pix dataset uses Google Maps imagery as its source. Further work on extending the area calculation approach to arbitrary samples of images could improve accessibility for experts working in the field of remote sensing.