A Framework Based on Nesting of Convolutional Neural Networks to Classify Secondary Roads in High Resolution Aerial Orthoimages

: Remote sensing imagery combined with deep learning strategies is often regarded as an ideal solution for interpreting scenes and monitoring infrastructures with remarkable performance levels. In addition, the road network plays an important part in transportation, and currently one of the main related challenges is detecting and monitoring the occurring changes in order to update the existent cartography. This task is challenging due to the nature of the object (continuous and often with no clearly deﬁned borders) and the nature of remotely sensed images (noise, obstructions). In this paper, we propose a novel framework based on convolutional neural networks (CNNs) to classify secondary roads in high-resolution aerial orthoimages divided in tiles of 256 × 256 pixels. We will evaluate the framework’s performance on unseen test data and compare the results with those obtained by other popular CNNs trained from scratch.


Introduction
The road transport network is dynamic and complex in nature, and its detection and monitorization has traditionally been a challenging task requiring multiple operators to manually identify objects in aerial images. Since roads are frequently repaired and built, keeping road cartography up to date is a challenging task for public agencies and the process of updating the existing support is often costly and time-consuming because of the very large areas that need to be considered in the process. Furthermore, the relevant authorities responsible for the management of public geographical information periodically updates the existing cartographic support every two to three years.
We believe that recent advances in computer vision can enable the automation of the traditional approach. In the remote sensing field, researchers achieved great results by applying transfer learning from pretrained models on large datasets (e.g., ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [1]) to classify and detect geospatial objects. However, they focused mostly on objects with clearly defined limits that are independent of the background (e.g., airplanes, buildings, ships). What happens when we have to work with more difficult, continuous geospatial objects like secondary roads where the borders are often not delimited (no markings of the edges or the centrelines) and are easily confused with their surroundings ? We decided to approach the task of road extraction by beginning with a road classification subtask in order to make the segmentation operation more efficient. It is known that semantic segmentation is computationally expensive [2], but if we manage to successfully identify the presence of roads beforehand, it will result in a faster and more efficient production model.
In this paper, we focus on classifying secondary roads in aerial orthoimages with a declared spatial resolution of 50 cm. We approach this task by proposing two variants of an ensemble model formed by deep CNNs. The first variant is based on stacking weak learners and applying a combiner algorithm that averages their predictions. The weak learners are modified versions of popular CNNs (VGGNet [3], ResNet [4] and Inception-ResNet [5]) together with a CNN we built especially for this purpose [6]. To find the best configurations of the base models, we trained the networks from scratch (random initialization of weights) and applied transfer learning techniques to re-use the features learned on ILSVRC (initialization from an already trained weight value).
The second variant of the framework comes from one of our previous works [7], where we evaluated the appropriateness of transfer learning techniques for this task and discovered a significant difference in performance metrics (7-10%) when testing the models on areas with different vegetation coverage (drier and greener) separately. We approached this drawback by stacking a specialized model trained on the dataset containing tiles from the area with lower performance metrics, a generic model trained on the entire dataset, and a terrain type classifier, which acts as an arbitrary combiner algorithm that infers the final prediction.
Our contributions are to be applied to geospatial object detection and classification in cartography using remotely sensed data and are summarized as follows: • We design a novel framework based on deep convolutional neural networks to classify secondary roads in high resolution aerial orthoimagery. The solution integrates modified configurations (focusing on improving the computational efficiency) of three of the most popular CNN architectures for computer vision problems, together with a network especially built for this task; • We contrast the performance of state-of-the-art CNN architectures and deep learning techniques in recognizing secondary roads in high resolution aerial imagery under various scenarios; • We study how different architectural tweaks and changes in default hyperparameters affect a base model's performance in the road classification task; • Different from the previous works, we focus on challenging situations, where the borders of the roads are easily confused with the surroundings (secondary transport routes). In addition, the remotely sensed imagery often includes shadows and occlusions, further complicating the classification operation; • We carry our experiments on a new dataset composed of sampled locations with ground truth labels obtained by dividing high resolution aerial imagery in tiles of 256 × 256 pixels and manually annotating them.
The remainder of this article is organized as follows. Section 2 describes works related to road detection and classification using remotely sensed data. Section 3 describes the dataset built for this task. Section 4 introduces the architectures used as weak learners and provides details of our proposed framework. Section 5 describes the experiments carried out to obtain the best possible configurations. In Section 6, we present the performance metrics and conduct an extensive discussion and statistical analysis. Lastly, Section 7 draws the conclusions of our work, the last section being reserved for references.

Related Work
A wave of promising research was incentivised in the remote sensing community following the success of deep CNN architectures in solving computer vision tasks. These networks are designed to receive images as inputs and encode three important architectural features: (1) shared weights-by forcing the neurons of one layer to share weights, the forward pass becomes the equivalent of convolving a filter over the image to produce a new image (the convolution operation), (2) local connectivity-neurons in one layer are only connected to neurons in the next layer that are spatially close due to a smaller kernel size, and (3) representations-by using pooling and ReLU non-linearities, we increase the model's expressive power.
It is important to mention that the weights obtained from training popular CNN architectures on ILSVRC (1.2 million images) are publicly available as pre-trained models. We can apply transfer learning to customize these pre-trained models for a new task and take advantage of the learned feature maps. Transfer learning works by initializing the weights of a network using the weights of an already trained model on a large dataset, ensuring a good convergence most of the time [8]. The level of generality of the representation extracted depends on the depth of the layer in the model: earlier layers extract highly generic feature maps (such as edges, colours, textures), whereas layers that are higher up extract more abstract features. Researchers in the field of remote sensing have successfully applied transfer learning to repurpose the pre-learned features for land use analysis and object detection in satellite imagery data.
A hot research area in this field is related to aerospatial objects detection: In [9], the authors constructed a CNN for airport detection, focusing on harder examples by adding a mining layer to automatically detect examples by their losses, obtaining F1-scores of maximum 0.9567 on the validation set. The authors of [10] propose a framework based on deep CNNs to recognize ten types of aircrafts using key-point annotations of the planes in the training stage, obtaining accuracy levels of maximum 95.6%. In [11], the structure of VGGNet was enhanced by replacing the FC layers with a full CNN to improve the performance metrics in airplane and automobile detection, while reducing the number of parameters.
Another important area of research is related to the task of building detection and mapping using remote sensing imagery. In [12], a fine-tuned version of VGGNet is used to detect buildings and recognize the roof types, obtaining quality rates higher than 83.3%. The authors of [13] evaluated the performance of deep learning for roof segmentation on a dataset containing over 220,000 buildings tagged in remote sensing images with a spatial resolution of 7.5 cm and obtained F1-scores of maximum 0.947. In [14], open-source data is used to conduct a comparative analysis between four different CNN architectures pre-trained on large datasets to extract footprints of buildings across the United States, obtaining precision scores higher than 95%.
Other works, such as [15], focused on transferring knowledge from the image scene classification task to the geospatial object detection task using supervision from scene tags to classify nine different categories (including bridges, ships, tennis courts). The authors of [16] implemented a deep CNN comprised by two convolutional and two max-pooling layers followed by two FC layers to predict labels for images captured by UAVs in real time, achieving accuracy levels of 93.6%. In [17], the authors contrast the capabilities of VGG-16 and ResNet architectures in recognizing land use classes and patterns in different urban environments from satellite imagery, obtaining accuracy levels of maximum 81.0%. The authors of [18] investigated the transfer of activations from CNNs pre-trained on ILSVRC to scene classification tasks by extracting activation vectors from the FC layers and by encoding features into global image features to improve the classification accuracies.
Hutchison et al. were among the first to approach the challenge of road detection in urban environments using deep learning. In [19], they used supervised methods to build a model for this task, and later applied unsupervised training to generate filters that improved the predictions of the model.
In [20], a CNN consisting of five convolution layers followed by a GAP layer is proposed to extract roads from publicly available urban road datasets. The inputs were aerial images of 1500 × 1500 pixels in size with a spatial resolution of one meter, and the performance metrics were as high as 81.6%. In the end, they used spatial features of adjacent pixels to enhance multi-class prediction.
The authors of [21] recognized the complexity of road detection in real-world applications while evaluating the performance of the most popular CNN architectures in road detection and segmentation using satellite imagery. They obtained precision levels of a maximum 71.69% by applying data augmentation, filtering and post-processing techniques.
Inspired by residual learning, the authors of [22] developed a deep-residual network similar to U-Net for the semantic segmentation of roads and tested it on a public roads dataset. The performance metrics increased by 1%, while the number of parameters decreased by 3 4 , when compared to U-Net.
Recently, the authors of [23] proposed RoadNet (composed on three CNN following the VGGNet architecture) to analyse urban scenes and predict road surfaces, edges, and centrelines. The three CNNs were concatenated and used receptive fields of 1 × 1 pixels in size, obtaining F-scores of maximum 93.9%.
In [24], this challenge is approached from a spatio-temporal perspective by adding a temporal processing block on top of existing networks, achieving accuracy levels over 90%. In [25], the authors proposed a model based on VGGNet to learn the features of road boundaries by integrating RGB images street scenes, the semantic contour and the location in a neural network for autonomous driving applications.

Dataset
The variables studied in this paper are pixel values of high-resolution aerial orthophotos. The imagery used was issued by state agencies and has a declared spatial resolution of 0.50 m (one pixel covers an area of 50 × 50 cm on the ground) and a planimetric RMSE ≤ 1 m [26].
These data were captured using calibrated high-resolution digital photogrammetric cameras equipped with 3-band RGB sensors (eight bits per band) and is georeferenced in ERTS89 Geodesic Reference System. The photograms were acquired in optimal meteorological conditions (clear sky with no clouds during specific hours and east-west orientation of the flight to minimize the errors caused by the sun). According to its producer, the imagery is orthorectified to ensure matching between contiguous photograms, radiometrically corrected to make effective use of all the bits (homogeneous intensity and balanced histograms) and has topographic corrections applied using terrestrial coordinates of representative points. Considering the low flight altitude, the effect of the atmosphere conditions is considered negligible; therefore, no atmospheric corrections were applied.
The data needed for the supervised learning process were obtained by labelling these orthoimages divided in tiles using a cartographic viewer based on Web Map Service (WMS) [27]. The task involved performing a visual comparison of the most recent orthoimages available with the existing cartographic support. For consistency reasons, we used the same zoom-level during the operation, each tile covering an area of 128 × 128 meters. We took into account representative areas of Spain (Galicia, Navarre, Balearic Islands, Segovia, Ciudad Real, Albacete, Murcia, Huelva) that may influence a network's capability in identifying secondary roads for the whole national territory. The labelled tiles were merged into the correspondent category (samples extracted from each collection can be seen in Figures 1 and 2). These categories will allow the CNNs to learn about the existence/non-existence of roads in a new given tile.
Remote Sens. 2020, 12, 765 4 of 22 performance metrics increased by 1%, while the number of parameters decreased by ¾ , when compared to U-Net.
Recently, the authors of [23] proposed RoadNet (composed on three CNN following the VGGNet architecture) to analyse urban scenes and predict road surfaces, edges, and centrelines. The three CNNs were concatenated and used receptive fields of 1 × 1 pixels in size, obtaining F-scores of maximum 93.9%.
In [24], this challenge is approached from a spatio-temporal perspective by adding a temporal processing block on top of existing networks, achieving accuracy levels over 90%. In [25], the authors proposed a model based on VGGNet to learn the features of road boundaries by integrating RGB images street scenes, the semantic contour and the location in a neural network for autonomous driving applications.

Dataset
The variables studied in this paper are pixel values of high-resolution aerial orthophotos. The imagery used was issued by state agencies and has a declared spatial resolution of 0.50 m (one pixel covers an area of 50 × 50 cm on the ground) and a planimetric RMSE ≤ 1 m [26].
These data were captured using calibrated high-resolution digital photogrammetric cameras equipped with 3-band RGB sensors (eight bits per band) and is georeferenced in ERTS89 Geodesic Reference System. The photograms were acquired in optimal meteorological conditions (clear sky with no clouds during specific hours and east-west orientation of the flight to minimize the errors caused by the sun). According to its producer, the imagery is orthorectified to ensure matching between contiguous photograms, radiometrically corrected to make effective use of all the bits (homogeneous intensity and balanced histograms) and has topographic corrections applied using terrestrial coordinates of representative points. Considering the low flight altitude, the effect of the atmosphere conditions is considered negligible; therefore, no atmospheric corrections were applied.
The data needed for the supervised learning process were obtained by labelling these orthoimages divided in tiles using a cartographic viewer based on Web Map Service (WMS) [27]. The task involved performing a visual comparison of the most recent orthoimages available with the existing cartographic support. For consistency reasons, we used the same zoom-level during the operation, each tile covering an area of 128 × 128 meters. We took into account representative areas of Spain (Galicia, Navarre, Balearic Islands, Segovia, Ciudad Real, Albacete, Murcia, Huelva) that may influence a network's capability in identifying secondary roads for the whole national territory. The labelled tiles were merged into the correspondent category (samples extracted from each collection can be seen in Figures 1 and 2). These categories will allow the CNNs to learn about the existence/non-existence of roads in a new given tile.  The dataset created contains 18,000 tiles and occupies a disk volume of approximately 2.62 GB. We are working towards making it openly available once we increase its size. Each image has a height of 256 pixels, a width of 256 pixels (for a total of 65,536 pixels = 0.07 Megapixels) and a depth of three colour channels (RGB). Each pixel in the Red/Green/Blue channel is represented by a number between 0 and 255, where 255 represents the maximum brightness (white) and 0 means no brightness (black).
The difference in performance metrics found in [7] made us build additional subsets with the criteria of similar background colours (areas with dense vegetation/Mediterranean areas) for the second variant of the ensemble model (based on specialized models). We also doubled the size of the dataset used in [6] to expose the models to more aspects of the classes, and balanced the classes to avoid the common machine learning problem where a model starts to predict new observations to the majority class to achieve high accuracy. The tiles' distribution can be seen in Table 1.

Framework
Our framework involves model nesting (and more specifically, ensembling) where, given a set of models, we ensemble them using stacking techniques to combine the predictions of the models and build a new model. The reasoning behind using an ensemble is that by stacking multiple models (called base models or weak learners) representing different hypotheses, we can find a better hypothesis that might not be contained within the hypothesis space of the models from which the ensemble is built. In [28], the author identified three main reasons causing this situation: (1) having insufficient input data (statistical), (2) difficulties for the learning algorithm to converge to the global minimum (computational), and (3) representational, where the function ℎ * cannot be represented by any of the hypothesis proposed. The dataset created contains 18,000 tiles and occupies a disk volume of approximately 2.62 GB. We are working towards making it openly available once we increase its size. Each image has a height of 256 pixels, a width of 256 pixels (for a total of 65,536 pixels = 0.07 Megapixels) and a depth of three colour channels (RGB). Each pixel in the Red/Green/Blue channel is represented by a number between 0 and 255, where 255 represents the maximum brightness (white) and 0 means no brightness (black).
The difference in performance metrics found in [7] made us build additional subsets with the criteria of similar background colours (areas with dense vegetation/Mediterranean areas) for the second variant of the ensemble model (based on specialized models). We also doubled the size of the dataset used in [6] to expose the models to more aspects of the classes, and balanced the classes to avoid the common machine learning problem where a model starts to predict new observations to the majority class to achieve high accuracy. The tiles' distribution can be seen in Table 1.

Framework
Our framework involves model nesting (and more specifically, ensembling) where, given a set of models, we ensemble them using stacking techniques to combine the predictions of the models and build a new model. The reasoning behind using an ensemble is that by stacking multiple models (called base models or weak learners) representing different hypotheses, we can find a better hypothesis that might not be contained within the hypothesis space of the models from which the ensemble is built. In [28], the author identified three main reasons causing this situation: (1) having insufficient input data (statistical), (2) difficulties for the learning algorithm to converge to the global minimum (computational), and (3) representational, where the function h * cannot be represented by any of the hypothesis proposed.
A popular form of stacking involves computing the outputs of base models, performing a prediction for each model and averaging their predictions inside the ensemble. In this technique, Remote Sens. 2020, 12, 765 6 of 22 the sub-models contribute equally to a combined prediction, y (x) = 1/M M m=1 y m (x). Firstly, all the algorithms fused into the ensemble are trained using the available data and then a combiner algorithm is used to make a final prediction using the predictions of the other models as inputs. The specific steps are: (1) generate M weak learners, each with its own initial values (by training them separately), and (2) combine these models in an ensemble where we average their predictions for every instance of the dataset to compute y.
We built the ensemble by combining base models as diversely as possible ( Figure 3). We took into consideration the network built in [6], together with modified versions of the most popular CNN architectures for computer vision tasks. This way, we can leverage their strengths and minimize their weaknesses to obtain a classifier with a lower classification error.
Remote Sens. 2020, 12, 765 6 of 22 A popular form of stacking involves computing the outputs of base models, performing a prediction for each model and averaging their predictions inside the ensemble. In this technique, the sub-models contribute equally to a combined prediction, ̅ ( ) = 1/ ∑ ( ) =1 . Firstly, all the algorithms fused into the ensemble are trained using the available data and then a combiner algorithm is used to make a final prediction using the predictions of the other models as inputs. The specific steps are: (1) generate weak learners, each with its own initial values (by training them separately), and (2) combine these models in an ensemble where we average their predictions for every instance of the dataset to compute ̅.
We built the ensemble by combining base models as diversely as possible ( Figure 3). We took into consideration the network built in [6], together with modified versions of the most popular CNN architectures for computer vision tasks. This way, we can leverage their strengths and minimize their weaknesses to obtain a classifier with a lower classification error. Next, we will describe the four base model architectures fused into the ensemble model. Three of these networks are the best performing models on ILSVRC, each based on different architectural patterns (structural blocks and connection schemes). It is important to note that these networks will be modified during the experiments in order to find the best possible configuration for each architecture.

CNN Built from Scratch
For the first base model, we opted for a CNN formed by a stack of convolutional layers containing filters with a receptive field of 3 × 3 followed by max-pooling layers with 2 × 2-pixel windows. We chose this particular architecture for its simplicity, computational efficiency and flexibility.
In a multilayer convolutional network, the input to the second layer is the output of the first layer, and when we use multiple filters on the same image, we carry out the convolution for each of them separately, piling up the results one on top of the other, and combining them into feature maps. The resulting tensor is finally reshaped to flatten out the spatial dimensions into a one-dimensional feature vector that can be fed into the classifier.
At the end of the network, we have two FC layers: the first contains 512 units, while the second contains one unit (where it performs the 2-class classification). In between these FC layers, we added a dropout layer with a rate of 0.5 to avoid overfitting.
In Figure 4, we can see how a tile is processed through this particular base model. Next, we will describe the four base model architectures fused into the ensemble model. Three of these networks are the best performing models on ILSVRC, each based on different architectural patterns (structural blocks and connection schemes). It is important to note that these networks will be modified during the experiments in order to find the best possible configuration for each architecture.

CNN Built from Scratch
For the first base model, we opted for a CNN formed by a stack of convolutional layers containing filters with a receptive field of 3 × 3 followed by max-pooling layers with 2 × 2-pixel windows. We chose this particular architecture for its simplicity, computational efficiency and flexibility.
In a multilayer convolutional network, the input to the second layer is the output of the first layer, and when we use multiple filters on the same image, we carry out the convolution for each of them separately, piling up the results one on top of the other, and combining them into feature maps. The resulting tensor is finally reshaped to flatten out the spatial dimensions into a one-dimensional feature vector that can be fed into the classifier.
At the end of the network, we have two FC layers: the first contains 512 units, while the second contains one unit (where it performs the 2-class classification). In between these FC layers, we added a dropout layer with a rate of 0.5 to avoid overfitting.
In Figure 4, we can see how a tile is processed through this particular base model.

VGGNet-Like Network
The second component of the ensemble is a modified version of VGGNet represented by a stack of convolutional blocks formed by two convolutional layers followed by one max-pooling layer. We propose this configuration for its compact architecture. The chosen kernel size is 3 × 3 and the maxpooling step is performed over a 2 × 2-pixel window. After each convolutional block, we added a dropout layer with a rate of 0.25 for a more aggressive control of the overfitting.
The first component of the classifier is a FC layer with 512 neurons connected to a final single unit layer. In between the classifying layers, we added a dropout layer with a rate of 0.5. We also reduced the number of filters/convolution when compared to the standard network. By applying these changes to the original architecture, we drastically reduced the number of parameters from 169 million to a little over 4.5 million. This architecture is described in Table 2.

Modified ResNet
The Residual blocks (introduced by He et al. in [4]) allow for the training of very deep networks by introducing modules called residual models. These blocks address the degradation problem observed when training deep neural networks by allowing the gradient to flow through the alternate shortcut path, ensuring the movement of information from earlier layers in the model to latter layers.
Our implementation consists of a modified version of ResNet50, where we stack Convolutional and Identity blocks on the residual connections (that allow the flow of information while enabling the learning of an identity function). Compared to the original ResNet50, we reduced the number of filters/convolutions and changed the activation functions from ReLU [29] to PReLU [30] and LeakyReLU [31]. Our top block includes a GAP layer with a 2 × 2-pi×el window (replacing the window area with the average value instead of the maximum value) followed by a Flatten layer and a FC layer with a softmax activation (where the classification is performed). By applying these changes, the number of total parameters dropped from 25.6 million to 5.9 million, resulting in a more efficient computation.
Details about the modified VGG50 network can be seen in Table 3.

VGGNet-Like Network
The second component of the ensemble is a modified version of VGGNet represented by a stack of convolutional blocks formed by two convolutional layers followed by one max-pooling layer. We propose this configuration for its compact architecture. The chosen kernel size is 3 × 3 and the max-pooling step is performed over a 2 × 2-pixel window. After each convolutional block, we added a dropout layer with a rate of 0.25 for a more aggressive control of the overfitting.
The first component of the classifier is a FC layer with 512 neurons connected to a final single unit layer. In between the classifying layers, we added a dropout layer with a rate of 0.5. We also reduced the number of filters/convolution when compared to the standard network. By applying these changes to the original architecture, we drastically reduced the number of parameters from 169 million to a little over 4.5 million. This architecture is described in Table 2. Table 2. Description of the modified VGGNet architecture.

Modified ResNet
The Residual blocks (introduced by He et al. in [4]) allow for the training of very deep networks by introducing modules called residual models. These blocks address the degradation problem observed when training deep neural networks by allowing the gradient to flow through the alternate shortcut path, ensuring the movement of information from earlier layers in the model to latter layers.
Our implementation consists of a modified version of ResNet50, where we stack Convolutional and Identity blocks on the residual connections (that allow the flow of information while enabling the learning of an identity function). Compared to the original ResNet50, we reduced the number of filters/convolutions and changed the activation functions from ReLU [29] to PReLU [30] and LeakyReLU [31]. Our top block includes a GAP layer with a 2 × 2-pi×el window (replacing the window area with the average value instead of the maximum value) followed by a Flatten layer and a FC layer with a softmax activation (where the classification is performed). By applying these changes, the number of total parameters dropped from 25.6 million to 5.9 million, resulting in a more efficient computation.
Details about the modified VGG50 network can be seen in Table 3.

Inception-ResNet
In 2015, Inception-v2 and Inception-v3 [32] proposed a number of upgrades over the original Inception architecture to increase the accuracy and reduce the computational complexity: Inception-v2 introduced factorization (factorizing convolutions into smaller convolutions: traditional 7 × 7 convolution into three 3 × 3 convolutions), while Inception V3 added Batch Normalization-auxiliary (in which the FC layer is also-normalized, not just the convolutions; batch normalization reduces the amount by what the hidden unit values shift around and allows each layer of a network to learn by itself with a higher degree of independency).
Inception-ResNet [5] was introduced the following year, together with the fourth iteration of Inception (v4). Here, the authors focused on making the modules more uniform and simplifying some of them, while adding Residual blocks to enable the gradient flow through shortcuts added to main path. This architecture obtained the highest score at ILSVRC 2016.
Given the increased complexity of this architecture, we only modified the classifier added on top of the convolutional base and applied transfer learning techniques while tweaking the standard hyperparameters. The classifier added on top consists of a GAP layer over a 2 × 2-pixel window, a FC layer with 512 neurons using ReLU activation function, a Dropout layer with a rate of 0.5 and a final FC layer of size 1 using sigmoid activation function.

Ensemble-V2
On the other hand, as stated in the introduction, we wanted to address the significant difference in performance found in [7] when testing the models on data with different vegetation coverage separately. We propose a second variant of the ensemble model ( Figure 5) using a specialized model on the subset only formed by tiles from the area with the lower performance metrics and a generic model trained on the entire dataset. Their predictions will be fed into a terrain type classifier (which will act as an Remote Sens. 2020, 12, 765 9 of 22 arbitrary combiner algorithm) that infers the type of vegetation the tile contains and decides which of the two predictions will be the final one. In other words, the ensemble will use a network to infer the type of vegetation coverage in a tile and, based on that result, will decide the final prediction.
Remote Sens. 2020, 12, 765 9 of 22 network to infer the type of vegetation coverage in a tile and, based on that result, will decide the final prediction.

Experiments
The goal is to learn a classifier that can input an image represented by a feature vector (tensor) X and predict whether the corresponding label is 1 or 0. The learning process is done using convolution operations, which preserve the relationship between pixels by learning image features (moving from discriminating features at a local level in the earlier layers towards generalizing these features at a global level in the later layers). Convolution ( * ) is an operation on two functions [ , ] = ( * )( , ) = ∑ ∑ ( − , − ) ( , ) , where the first argument is referred to as the input (the function I is a 2-D array with two indices ( , ) of the spatial coordinates of pixels), the second argument is referred as the kernel, while the output is referred as a feature map. We can say that is produced by convolving filter of dimensions , across all the input of dimensions , . In the case of RGB images, we have to add one more index of the different colour channels (resulting a 3-D tensor); software implementations usually work in batch mode, so we have to add a fourth axis indexing the samples in the batch (4-D tensors) [33].
In our case, a single training example is represented by a pair ( , ), where is an 3-D feature vector and (the label) is either 0 or 1. Our dataset (table 1) comprises of = 18,000 samples ( (1) , (1) ), . . . , ( ( ) , ( ) ) with dimensions of 256 × 256 × 3. We split our dataset by randomly assigning the tiles into the following three sets: • A full training set of 14,400 tiles (80% of the data) and five training subsets containing 90% of the full training set (12,960 tiles) with their corresponding labels were used to perform the weights initialization. The five subsets represent variations of the training population and were used to repeat the experiments and to conduct a statistical analysis of the performance metrics; • A validation set (10% of the data) was formed by 1800 tiles and used for tuning the model's hyperparameters and comparing how changing them would affect the model's predictive performance; • A test set (10% of the data) containing 1800 tiles to evaluate the performance of the models on unseen data. According to the literature conventions [34,35], we have also considered other data splits, but the initial results were lower (as seen in Table 4) when compared to the 80%-10%-10% split described above, so we decided to carry on the experiments only with the data split allowing for more training data. The other mentioned splits favour higher ratios of validation and testing data and could be the preferable approach when dealing with very large datasets, but are not desired when tackling complex classification tasks with limited datasets.

Experiments
The goal is to learn a classifier that can input an image represented by a feature vector (tensor) X and predict whether the corresponding label is 1 or 0. The learning process is done using convolution operations, which preserve the relationship between pixels by learning image features (moving from discriminating features at a local level in the earlier layers towards generalizing these features at a global level in the later layers). Convolution ( * ) is an operation on two functions where the first argument is referred to as the input (the function I is a 2-D array with two indices (m, n) of the spatial coordinates of pixels), the second argument is referred as the kernel, while the output S is referred as a feature map. We can say that S is produced by convolving filter K of dimensions m, n across all the input I of dimensions i, j. In the case of RGB images, we have to add one more index of the different colour channels (resulting a 3-D tensor); software implementations usually work in batch mode, so we have to add a fourth axis indexing the samples in the batch (4-D tensors) [33].
In our case, a single training example is represented by a pair (x, y), where x is an 3-D feature vector and y (the label) is either 0 or 1. Our dataset (Table 1) comprises of n = 18, 000 samples x (1) , y (1) , . . . , x (n) , y (n) with dimensions of 256 × 256 × 3. We split our dataset by randomly assigning the tiles into the following three sets: • A full training set of 14,400 tiles (80% of the data) and five training subsets containing 90% of the full training set (12,960 tiles) with their corresponding labels were used to perform the weights initialization. The five subsets represent variations of the training population and were used to repeat the experiments and to conduct a statistical analysis of the performance metrics; • A validation set (10% of the data) was formed by 1800 tiles and used for tuning the model's hyperparameters and comparing how changing them would affect the model's predictive performance; • A test set (10% of the data) containing 1800 tiles to evaluate the performance of the models on unseen data.
According to the literature conventions [34,35], we have also considered other data splits, but the initial results were lower (as seen in Table 4) when compared to the 80%-10%-10% split described above, so we decided to carry on the experiments only with the data split allowing for more training data. The other mentioned splits favour higher ratios of validation and testing data and could be the preferable approach when dealing with very large datasets, but are not desired when tackling complex classification tasks with limited datasets. We implemented the framework and the base models presented in Section 4 in the open-source Python deep learning library Keras [36] (with TensorFlow 1.14 [37] as backend) using an NVIDIA 2060 GPU mounted on a system with a 12-core Intel I7-8700 CPU and 16 GB RAM.
In all the experiments, we applied data augmentation techniques on the training set, including random horizontal and vertical flipping of the input images, random rotations, random horizontal and vertical translations and random zooms to avoid overfitting and allow the model to generalize better, given that it never sees the same tile twice. To avoid losing important information around the edges of the tile, we used small augmentation parameters. To fill out the pixels created after these operations outside the boundaries of the input, we used the 'nearest neighbour' interpolation technique. An example of data augmentation applied to a random tile can be seen in Figure 6. It is important to note that the batches of augmented tensor image data used for training were generated in-memory in real-time and were not stored on the disk.   [37] as backend) using an NVIDIA 2060 GPU mounted on a system with a 12-core Intel I7-8700 CPU and 16 GB RAM.
In all the experiments, we applied data augmentation techniques on the training set, including random horizontal and vertical flipping of the input images, random rotations, random horizontal and vertical translations and random zooms to avoid overfitting and allow the model to generalize better, given that it never sees the same tile twice. To avoid losing important information around the edges of the tile, we used small augmentation parameters. To fill out the pixels created after these operations outside the boundaries of the input, we used the 'nearest neighbour' interpolation technique. An example of data augmentation applied to a random tile can be seen in Figure 6. It is important to note that the batches of augmented tensor image data used for training were generated in-memory in real-time and were not stored on the disk. In the experiments, we used stochastic gradient descent to optimize the network's loss (cost) function (a standard binary-class cross-entropy). As activation functions, we generally used ReLU non-linearity for the convolutional layers and the first FC layer and sigmoid for the last FC layer (encoding the probability of a class or the other). In the experiments, we used stochastic gradient descent to optimize the network's loss (cost) function (a standard binary-class cross-entropy). As activation functions, we generally used ReLU non-linearity for the convolutional layers and the first FC layer and sigmoid for the last FC layer (encoding the probability of a class or the other).
For training, we used small learning rates (from 1 × e −6 up to 1 × e −2 ) to ensure a stable learning process and chose a standard number of 100 epochs for an initial observation of the network's behaviour (this number was subsequently increased or decreased depending on the network's convergence). The goal of the experiments was to identify the classifiers with the lowest classification error for each of the architectures proposed.
In the case of the CNN formed by stacked convolutional layers followed by max-pooling layers (presented in Section 4.1), we modified the number of filters/convolutions, the network's depth (to obtain feature maps of different sizes), and we increased the size of the kernel from 3 × 3 to 5 × 5. The optimizer used in all scenarios was Adam [38] (considered to be the fastest to converge [39]) with a learning rate of 1 × e −4 .
In the case of the VGGNet-like architecture (proposed in 4.2), we gradually increased the number of convolutional blocks from three to five and applied a different number of filters/convolution blocks (from [32,64,64] to [32,64,64,64,128]). We also used different optimizers to study their impact on the learning process: Adam with learning rates (lr) of 1 × e −3 /1 × e −4 , the standard SDG [40] (with a learning rate of 1 × e −2 , decay of 1 × e −5 and Nesterov Momentum [41] of 9 × e −1 as proposed by the authors of [3]) and an RMS props optimizer [42] with a learning rate of 1 × e −4 . The classifier was also modified by choosing different number of units in the first FC layer (256 and 512) and by using a GAP layer instead of Flatten before the FC layers [43].
Furthermore, we also applied transfer learning techniques (especially fine-tuning, where we unfreeze a portion of the layers from a frozen convolutional base and retrain them together with the classifier added on top). In [7], we observed that pre-trained models are sensible to changes, their performance changing depending on where we start to update the weights-retraining starting from a network´s early levels lowered the performance (updating the weights damaged the features learnt [44]); therefore, we only fine-tuned the upper convolutional blocks.
In the case of ResNet, we trained the architecture proposed in Chapter 4.3 after modifying the number of layers/convolution and using different activation functions (ReLU, PReLU and LeakyReLU). Given the complex nature of Inception-ResNet, we only applied transfer learning techniques: feature extraction (retraining the classifier added on top) and fine-tuning for the last convolutional blocks with learning rates of 1 × e −3 and 1 × e −6 . Finally, we trained the original networks from scratch for comparison. The training scenarios described above are presented in Table 5.

Ensemble
Variant I: Average Voting Described in Figure 3 -

24
Variant II: Specialized model Described in Figure 5

Results and Discussion
After selecting the weights using the training and validation sets, we evaluated the generalization capacity of the models using the test set containing unseen data (with a support of 900 tiles for each class) and obtained the confusion matrices (or error matrices) of the classification operation (an example can be found in Figure 7). Voting Described in Figure 3 -

24
Variant II: Specialized model Described in Figure 5

Results and Discussion
After selecting the weights using the training and validation sets, we evaluated the generalization capacity of the models using the test set containing unseen data (with a support of 900 tiles for each class) and obtained the confusion matrices (or error matrices) of the classification operation (an example can be found in Figure 7).  From the confusion matrices, we calculated the following performance metrics: precision (the number of true positive predictions divided by the total number of positive class values), recall (the number of true positive predictions divided by the number of true positives and false negatives) and F1 score (indicating the balance between the precision and the recall). The last metric is the Area Under the Receiver Operating Characteristic Curve (ROC-AUC) computed from prediction scores. This measure tends towards 1.0 when only a little precision has to be sacrificed to get a high recall, and towards 0.5 for the opposite case, and shows how much a model is capable of distinguishing between classes.
To statistically analyse the results, we repeated the training and evaluation of the configurations described in Table 5 using the five training subsets (containing 12,960 tiles randomly obtained from the full training set).
After an initial evaluation of the proposed configurations, we selected the best performing ones and stacked them as described in Figures 3 and 5. This way, the first variant of ensemble model averages their predictions of the models resulting from training scenarios 2, 11, 15 and 20. For the second variant of the ensemble, we selected three weak learners: a generic model (configuration 4, trained on the whole dataset) and a specialized model (based on configuration 11 retrained on a dataset containing only tiles from the Mediterranean areas, where it obtained a AUC-ROC score of 0.9697), together with a terrain type classifier trained to infer the ensemble's prediction based on the vegetation coverage in a tile (based on configuration 2, which obtained an accuracy score of 0.9628, with a loss of 0.0989 on the validation set). Here, given the prediction of the specialized model in classifying roads in Mediterranean areas and the prediction of a generic mode, we are left with the one inferred by the classifier trained to detect the type of vegetation coverage in a tile (the generic prediction if the classifier detects green areas or the prediction of specialized weak learner in the opposite case).
The performances of the configurations were afterwards compared using one-way analysis of variance (ANOVA) with the performance metric as the dependent variable and the configurations as the levels of a fixed factor. To test the null hypothesis that the performances of all configurations are the same, the F statistic is computed from the ANOVA table and reported in Table 6 for each of the performance metrics. The alternative hypothesis is that at least one of the configurations has a different performance than the others. Table 6. Mean (M) and standard deviation (SD) of the performance metrics obtained by the configurations described in Table 5. In Table 6, we observe that all p-values are smaller than 0.001; therefore, the configurations are significantly different in each of the performance metrics. However, the analysis of variance F-statistics and their p-values does not reveal which configurations are different from the others when there is a significant difference. To have a detailed comparison of the performance of the configurations in different metrics, Figure 8 shows the boxplots of the performances for different configurations. 0.000 0.000 0.000 0.000 0.000 0.000 1 A p-value smaller than 0.05 implies that null hypothesis is to be rejected at a level of significance of 5% and there is a significant difference in performance between the configurations.

Accuracy
In Table 6, we observe that all p-values are smaller than 0.001; therefore, the configurations are significantly different in each of the performance metrics. However, the analysis of variance −statistics and their p-values does not reveal which configurations are different from the others when there is a significant difference. To have a detailed comparison of the performance of the configurations in different metrics, Figure 8 shows the boxplots of the performances for different configurations.    Configuration 13 has a significantly higher recall and F1-score compared to the other VGGNet architecture configurations. We can see that by fine-tuning VGGNet's pre-trained weights, we obtained the highest single results. One reason for this might be the VGGNet's compact architecture, designed to gradually increase the semantic complexity. It is important to note that VGG16 trained from scratch (configuration 16) only converged when applying He normal (MSRA) initializer [35] to the first FC layer.
Next, to formally analyse the pairwise comparison of performances of the configurations in terms of AUC-ROC (the metric score preferred and considered one of the most appropriate for image classification tasks in [45]), we present the post-hoc test results for AUC-ROC using Tukey's HSD test. Tukey's HSD test statistic compares two configurations at a time using a t-test adjusted for overall variability of the data. The post-hoc tests are designed in such a way that it maintains the level of significance or the probability of type I error at 5% with all pairwise comparisons taken together. Table 7 reports the homogenous subsets of configurations in terms of AUC-ROC score. Configurations within a homogenous subset are not significantly different from each other at a 5% level of significance. For example, configurations 21, 19, 20 and 5 do not have significantly different AUC-ROC scores. However, configurations that are not common in two homogenous subsets are significantly different. For example, configurations 23 and 24 (the ensemble variants) are not significantly different from each other but have significantly higher AUC-ROC compared to all other configurations. These post-hoc test results asserts our observations in Figure 8 boxplots.
On the other hand, by plotting the training time, the number of parameters and the number of epochs before convergence ( Figure 9) and crossing the data with data from not significantly different from each other but have significantly higher AUC-ROC compared to all other configurations. These post-hoc test results asserts our observations in Figure 8 boxplots.
On the other hand, by plotting the training time, the number of parameters and the number of epochs before convergence ( Figure 9) and crossing the data with data from Table 5, we can study the effect of the hyperparameters on the training behaviour. We can observe that the training time was highly dependent on the number of trainable parameters, which in turn is dependent on the depth of the networks, the number filters applied in each layer, and the number of units in the classifier.
Changing the activation functions (configurations [15][16][17] increased the number of parameters from 6 to 17 million, while not significantly improving the results. A higher number of convolutional blocks and a higher number of units in the classifier allowed the networks to better learn the characteristics of the secondary roads with the downside of higher computational needs. Increasing the size of the filters (from 3 × 3 to 5 × 5-configuration 5) did not improve the results and resulted in a more pronounced overfitting behaviour. Using the Adam optimizer resulted in a convergence twice as fast when compared to SDG (configurations 6-9) and 1.5 times faster when compared to RMSprop [46] (configuration 13).
Next, we compared the performances of the configuration grouped on their base architectures using one-way analysis of variance (ANOVA) with the performance metric as the dependent variable and the base architecture as the levels of a fixed factor.
To test the null hypothesis that the performances of all architectures are the same, the -statistic is computed from the ANOVA table and reported in Table 8 for each of the performance metric. The alternative hypothesis is that at least one of the architectures has a different performance than the others. We can observe that all p-values are smaller than 0.001; therefore, the base architectures are significantly different in each of the performance metrics. However, the analysis of variance withstatistics and their p-values do not reveal which architectures are different from the others when there is a significant difference. To have a detailed comparison of the performance of the architectures in different metric, Figure 10 shows the boxplots of the performances for the base architectures. We can observe that the training time was highly dependent on the number of trainable parameters, which in turn is dependent on the depth of the networks, the number filters applied in each layer, and the number of units in the classifier.
Changing the activation functions (configurations [15][16][17] increased the number of parameters from 6 to 17 million, while not significantly improving the results. A higher number of convolutional blocks and a higher number of units in the classifier allowed the networks to better learn the characteristics of the secondary roads with the downside of higher computational needs. Increasing the size of the filters (from 3 × 3 to 5 × 5-configuration 5) did not improve the results and resulted in a more pronounced overfitting behaviour. Using the Adam optimizer resulted in a convergence twice as fast when compared to SDG (configurations 6-9) and 1.5 times faster when compared to RMSprop [46] (configuration 13).
Next, we compared the performances of the configuration grouped on their base architectures using one-way analysis of variance (ANOVA) with the performance metric as the dependent variable and the base architecture as the levels of a fixed factor.
To test the null hypothesis that the performances of all architectures are the same, the F-statistic is computed from the ANOVA table and reported in Table 8 for each of the performance metric. The alternative hypothesis is that at least one of the architectures has a different performance than the others. We can observe that all p-values are smaller than 0.001; therefore, the base architectures are significantly different in each of the performance metrics. However, the analysis of variance with F-statistics and their p-values do not reveal which architectures are different from the others when there is a significant difference. To have a detailed comparison of the performance of the architectures in different metric, Figure 10 shows the boxplots of the performances for the base architectures. Table 8. Mean (M) and standard deviation (SD) of the performance metrics obtained by the configurations described in Table 5  0.000 0.000 0.000 0.000 0.000 0.000 1 A p-value smaller than 0.05 implies that null hypothesis is to be rejected at a level of significance of 5% and there is a significant difference in performance between the configurations.  To formally analyse the pairwise comparison of performances of the architectures in terms of AUC-ROC scores, we present the post-hoc test results using Tukey's HSD test in Table 9. Tukey's HSD test statistic compares two architectures at a time using a t-test adjusted for the overall variability of the data. Architectures within a homogenous subset are not significantly different from each other at 5% level of significance. We can see again that the ensemble architectures do not have significantly different AUC-ROC scores among themselves, but they have significantly higher AUC-ROC scores compared to all other architectures. They are followed by VGGNet, which has a significantly higher AUC-ROC score than ResNet50, Built CNN, VGG-liked CNN and Inception-ResNet, but significantly lower AUC-ROC score compared to Ensemble V1 and V2. VGG-like CNN, CNN and ResNet50 architectures are not significantly different in AUC-ROC among themselves but ResNet50 has a significantly higher AUC-ROC than Inception-ResNet. These post-hoc test results asserts our observations in Figure 10 boxplots.
We can observe significant improvements; the two variants of the ensemble obtain considerably lower error rates when compared to the weak learners evaluated separately. The classifier with the lowest classification errors (Ensemble-V1, based on average voting between four models) obtained an increase in performance metrics by the order of 3-4%. These values are remarkable considering the computational optimization of the weak learners stacked (Figure 10).
Ensemble-V2 ( Figure 5) used a weak learner to detect the vegetation coverage in a tile and, based on that probability, it decided on the use of the prediction provided by the generic base model or the prediction of the base model specialized in classifying roads in Mediterranean areas. This architecture obtained improvements of 2-3% when compared to the single generic model (configuration 11). We believe this value can be further improved by increasing the size of the subset used for training the base model specialized in detecting roads in areas with Mediterranean vegetation coverage (the current 3600 tiles/category can be considered insufficient). These models were exported to hdf5 format and can be later deployed in production.
By constructing an ensemble, the learning algorithms were able to learn about the complexity of the road characteristics by leveraging the weaknesses of each architecture to find a better hypothesis and improve the performance metrics. On the other hand, even though we reduce the risk of choosing the wrong classifier, we noticed a lack of sufficient data, especially when trying to build a specialized model. We consider that, in order to progress correctly in the design and evaluation of CNN architectures, it is essential to have a larger data set taking into account the complexity of the road detection task (geometry, soil types, difference in size).
Regarding the transfer learning operation, we found that by initializing the network from pre-trained weights, the performance metrics were greatly improved (by the order of 8-10%) when compared to random initialization. This enabled a better convergence, even though our task was very different from the original one, and proved the effectiveness of the operation. When using transfer learning it is important to apply stronger regularization to control the overfitting behaviour, which occurs when the model has too many degrees of freedom and starts overcomplicating the true structure of the data [47], resulting in models unable to perform well on testing data.
We have mentioned earlier that the characteristics learned by CNN are a representation of real visual concepts. We can visualize the activations produced by performing several convolutions to understand how the input image was decomposed into the features the network learned. For example, in Figure 11, we can visualize activations produced by one of the configurations and see how it "learned" that a road is probably a straight, continuous line. After running a linear activation through a nonlinear activation function (detector stage, e.g., ReLU), we use a pooling function to modify the output of the layer further and reduce its dimensions. We can see that the networks trained for road classification are able to detect even secondary roads that are almost indistinguishable by humans. In the image below, we can see that the model correctly identified the main road and started to mark a potential second road, that is much more difficult to perceive.
how it "learned" that a road is probably a straight, continuous line. After running a linear activation through a nonlinear activation function (detector stage, e.g., ReLU), we use a pooling function to modify the output of the layer further and reduce its dimensions. We can see that the networks trained for road classification are able to detect even secondary roads that are almost indistinguishable by humans. In the image below, we can see that the model correctly identified the main road and started to mark a potential second road, that is much more difficult to perceive.

Conclusions and Future Lines of Research
Ensembling multiple models proved to be a powerful technique to boost the performance of our deep learning system for continuous objects detection in aerial images. We saw that by nesting two or more weak learners (modified versions of popular CNN architectures), our framework built for the road classification task obtained an increase of 2%-3% in performance (achieving AUC-ROC scores superior to 0.99) when compared to the base architectures. The results prove the effectiveness of model nesting techniques; at the end of the training process, the proposed framework achieved significantly better generalization scores on unseen data. These high performance scores (the first variant of the ensemble reached accuracy levels of 0.9661 when trained on the full dataset) can also be considered an indicator of the appropriateness of approaching the road extraction task with a classification subtask.
The generalization scores show a low level of overfitting and prove that the characteristics needed for recognizing the labels were included in the training dataset. However, we consider that a production model used to detect roads will require a bigger initial dataset with much more variation in the data. We plan to design a crowdsourcing project aimed at generating a larger dataset by deploying a web application in which volunteers can learn about the objective and impact of their contributions.

Conclusions and Future Lines of Research
Ensembling multiple models proved to be a powerful technique to boost the performance of our deep learning system for continuous objects detection in aerial images. We saw that by nesting two or more weak learners (modified versions of popular CNN architectures), our framework built for the road classification task obtained an increase of 2-3% in performance (achieving AUC-ROC scores superior to 0.99) when compared to the base architectures. The results prove the effectiveness of model nesting techniques; at the end of the training process, the proposed framework achieved significantly better generalization scores on unseen data. These high performance scores (the first variant of the ensemble reached accuracy levels of 0.9661 when trained on the full dataset) can also be considered an indicator of the appropriateness of approaching the road extraction task with a classification subtask.
The generalization scores show a low level of overfitting and prove that the characteristics needed for recognizing the labels were included in the training dataset. However, we consider that a production model used to detect roads will require a bigger initial dataset with much more variation in the data. We plan to design a crowdsourcing project aimed at generating a larger dataset by deploying a web application in which volunteers can learn about the objective and impact of their contributions.
Next, we will start working on the segmentation operation, applying it only to the tiles where transport routes were detected. We also plan to develop a post-processing technique to join road geometries from adjacent tiles and apply topological rules to fill out the missing parts and obtain continuous vectors.