Grammar Guided Genetic Programming for Network Architecture Search and Road Detection on Aerial Orthophotography

Featured Application: This system can be applied to automatically design the architecture of deep artificial neural networks for any given labeled dataset.
Abstract: Photogrammetry involves aerial photography of the Earth's surface and subsequently processing the images to provide a more accurate depiction of the area (orthophotography). It is used by the Spanish Instituto Geográfico Nacional to update road cartography, but it requires a significant amount of manual labor due to the need to perform visual inspection of all tiled images. Deep learning techniques (artificial neural networks with more than one hidden layer) can perform road detection, but it is still unclear how to find the optimal network architecture. Our main goal is the automatic design of deep neural network architectures with grammar-guided genetic programming. In this kind of evolutive algorithm, all the population individuals (here, candidate network architectures) are constrained to the rules specified by a grammar that defines valid and useful structural patterns to guide the search process. The grammar used includes well-known complex structures (e.g., Inception-like modules) combined with a custom-designed mutation operator that dynamically links the mutation probability to structural diversity. Pilot results show that the system is able to design models for road detection that obtain test accuracies similar to those reached by state-of-the-art models when evaluated over a dataset from the Spanish National Aerial Orthophotography Plan.


Introduction
Two of the main problems with remote sensing information, as noted by Risojević et al. [1], are that the high volume of data exceeds by far the capabilities of human analysis (by manual revision) and that, at the same time, it is usually crucial to perform classification of said data. Within this category of data, large-scale aerial images processed as orthophotography are a useful source of information for many domains. To give some examples of the broad array of applications, systems have been developed for the detection of coastline changes [2], snow avalanches [3], fires [4], bodies in disaster sites [5], trees [6], seedlings [7], roofs [8], transmission towers [9,10], vehicles [11,12], photovoltaic arrays [13,14], and vegetation and buildings [15]. We focus on road detection in aerial images, an important subject due, among other things, to the need to constantly update road maps. This importance is clear from the volume of research conducted and the applied systems developed in this domain.
As an object of detection, Fortier et al. [16] divided road characteristics into four main groups: spectral properties (those regarding surface characteristics), geometric (such as width and curvature), topological (those derived from their role as part of a transportation network), and contextual (e.g., different types of roads may have different restrictions on their shape, size or materials). In the context of aerial images, we can find many complications related to those groups of properties, some even caused by the geographic context of the roads. One of these problems, as shown in Figure 1a,b, is due to surrounding terrain having spectral properties similar to the secondary roads shown. Another common problem is occlusion due to contextual properties (e.g., secondary mountain roads). We show examples of such occlusion in forest areas (Figure 1c-f): in the first case due to shadows, and in the rest due to the trees themselves covering the road. Other problems, such as natural or artificial structures resembling roads, are shown in Figure 1g-o.
Some road detection systems rely entirely on non-neural models: Wang et al. [17] combined texture information extraction with knowledge-based road extraction and a post-processing step for noise elimination; Mattyus et al. [18] applied Structural Support Vector Machines (S-SVMs); and Tsochantaridis et al. [19] and Zhou et al. [20] used GraphCut [21]. Although we can find many other systems that rely on the analysis of road characteristics [16][17][18]20,[22][23][24][25][26][27][28], most recent research has favored the application of deep artificial neural networks (ANNs). Some systems include ANNs as part of their detection pipeline (e.g., [29][30][31][32][33]) or just preprocess the input images before they are fed to the network, as in [34]. It is frequent to rely on a deep ANN to perform the whole road detection/segmentation process by either developing a novel network or using a preexisting architecture. Some researchers use custom network architectures [35,36], while others customize well-known models [30,[37][38][39][40], use those well-known models with transfer learning [41,42] or combine them into road-detection ensembles [43].
Systems that apply deep learning techniques to detection on such images can use quite different network dimensions and architectures. These range from relatively small convolutional networks, such as the one in [44], to models such as the Inception-V3 architecture [45] combined with transfer learning, as in [14] for solar panel detection. To address the need to find a quasi-optimal ANN architecture automatically, Network Architecture Search (NAS) has been extensively researched using many different approaches, but it has proven to be a complex task given the dimensions of the search space. Some examples of such approaches include the use of reinforcement learning techniques, as seen in [46][47][48], or the application of evolutive algorithms (e.g., [49][50][51][52][53][54][55][56][57][58][59][60]).
Some systems use grammars to define the rules to obtain valid structures that can encode either the neuron connectivity matrix or some other form of network topology [54,56,57], or a higher-level expression of the operations being performed by the network layers and their connectivity [52,53,55,[58][59][60][61]. Some of those systems (e.g., [52,53]) work by generating architecture expressions with high-level operations (such as convolutions, activation functions, dropout, etc.). In [53], the architecture is sequential in terms of layers, while Assunçao et al. [52] allowed more complex connectivities (parallel branches inside the network).
The present work is conducted inside the Cartobot project for the Instituto Geográfico Nacional (IGN, Spain) to perform automatic design of deep neural networks for road detection on aerial images, guiding the search with information about useful network structural patterns. The project is applied to a real dataset constructed (by IGN and Cartobot personnel) from aerial orthoimages of the Spanish National Plan for Aerial Orthophotography (PNOA). The two main research goals are: (a) design a grammar that can successfully encode the most typical high-level architectural blocks of deep ANNs; and (b) build a NAS system capable of designing deep ANNs using that grammar for the purpose of successfully detecting roads on aerial images. We diverge from the approach of Assunçao et al. [52] in that our grammar defines specific high-level blocks, as discussed below, instead of allowing fully free connectivity for the generation of the blocks with parallel branches. We also work on a dynamic mutation operator linked to the structural diversity of the solutions (population) being contemplated at each instant.
Figure 1. Examples of potential issues: (a-f) roads with complex contextual terrain or partial occlusions; and (g-o) images not containing roads that could be hard to classify.
We start by reviewing some of the well-known network models (AlexNet [62], VGG [63], GoogLeNet [64], ResNet [65], Inception-v3 [45], Inception-v4 and Inception-ResNet [66], DPN [67] and ResNeXt [68]) used by top scoring entries in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [69] (editions from 2012 to 2017), where many different architectural patterns can be found. As mentioned, by defining a meta-grammar able to condense some of such structural rules, we aim to guide an evolutive NAS system to find networks capable of solving the road detection problem on aerial orthoimages. The problem of road detection (including secondary roads) on such orthoimages has, as explained above, the added difficulty of natural or artificial structures that can be mistaken for roads or transportation infrastructures. Some samples of those formations can be seen in Figure 1g-o, with various levels of complexity. On the other hand, many different types of roads are being considered and are required to be detected by the system. We show a set of sample orthoimages with their corresponding numerical topographic cartography at a scale of 1:25,000 (BTN25) to allow for easier identification of the roads by the reader (see Figure 2).
Figure 2. Sample images containing roads (a-c,g-i) and their corresponding road cartography (d-f,j-l).

Methods
Our system, VGNet, uses Grammar Guided Genetic Programming (GGGP) [70] to solve the NAS problem. GGGP is a family of evolutionary algorithms that encode solutions to a given problem as tree structures (constrained by a set of structural rules specified with a grammar) and follow the same basic process as a genetic algorithm, with some variations. The reasons for selecting this approach are that evolutionary algorithms easily allow for a high level of parallelism [55] and that the use of a grammar avoids exploring invalid solutions while allowing domain knowledge to be included to guide the search. Systems of this kind can find quasi-optimal solutions for a given problem when finding the global optimum is not possible due to computational complexity.
The general GGGP evolutive process follows these steps:
1. We define a coding system to represent the candidate solutions (individuals) for a given problem and a context-free grammar that defines restrictions to their structure.
2. A population (a set) of initial solutions is created following the grammar rules.
3. We check the fitness (a measure of how well they solve the problem at hand) of the individuals to see if we have found an acceptable solution (given some stop criterion or goal to reach) and should stop the process or continue the search.
4. If the stop criterion is not met, we create a new population as follows:
(a) Select solutions (called parents), usually in pairs, with regard to their fitness value. Better solutions have higher chances of being selected.
(b) With a certain probability, combine the parents (or leave them unchanged) to obtain new solutions (offspring) using a crossover operator. The goal here is to combine the information contained in each parent to try to find better solutions. In our scenario, subtrees are exchanged between the parents to generate the offspring.
(c) With a certain probability, random variations (constrained once again by the grammar rules) are added to the offspring individuals to improve the exploration of new solutions (new areas of the search space).
5. We go back to Step 3 with the new population we created.
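The steps above can be sketched as a compact evolutionary loop. This is a generic illustration, not VGNet's implementation: the `grammar_generate`, `fitness`, `crossover` and `mutate` callables stand in for the grammar-constrained operators described in this paper.

```python
import random

def evolve(grammar_generate, fitness, crossover, mutate,
           pop_size=20, generations=25, p_cross=0.9, seed=0):
    """Minimal GGGP-style loop: generate an initial population from the
    grammar, then repeat evaluation, selection, crossover and mutation."""
    rng = random.Random(seed)
    # Step 2: initial population created by the grammar-constrained generator.
    population = [grammar_generate(rng) for _ in range(pop_size)]
    for _ in range(generations):
        # Step 3: evaluate fitness (a real system would also test a stop criterion here).
        scored = [(fitness(ind), ind) for ind in population]
        offspring = []
        while len(offspring) < pop_size:
            # Step 4a: tournament selection of two parents (tournament size 2 for brevity).
            p1 = max(rng.sample(scored, 2))[1]
            p2 = max(rng.sample(scored, 2))[1]
            # Step 4b: with a certain probability, exchange material between parents.
            if rng.random() < p_cross:
                p1, p2 = crossover(p1, p2, rng)
            # Step 4c: random variations to explore new areas of the search space.
            offspring += [mutate(p1, rng), mutate(p2, rng)]
        # Step 5: replace the population and loop back to evaluation.
        population = offspring[:pop_size]
    return max((fitness(ind), ind) for ind in population)[1]
```

A toy usage would evolve bit strings toward all ones, with one-point crossover and single-position mutation standing in for the subtree operators.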
For VGNet, the individuals (candidate solutions) represent an ANN constrained by a context-free grammar of widely used network architectural building blocks. The system has the internal workflow shown in Figure 3. Each individual generated is a derivation tree and represents a neural network architecture codified using an expression language designed to be easily read. We refer to ANNs with more than one hidden layer as deep learning models, or deep neural networks.
The operators used during the evolution process are:
• Individual generation: based on the Grammar-Based Initialization Method (GBIM) [71]. GROW [72] was used initially but was later discarded in favor of GBIM.
The methods used for initialization and mutation are based around GBIM and GBM with the following modifications. We removed, for all non-terminals, the restriction of having at least one production consisting only of terminal nodes (hereafter, we refer to those as default productions). This affects the guaranteed uniformity of depth level during random generation studied by the authors of the original method [71]. In addition, when creating a derivation tree, a random depth is chosen from the interval [min_depth, max_depth] (original GBIM uses the interval [1, max_depth]). These modifications allow for an easier-to-read meta-grammar. In addition, although a formal study is outside the scope of the present work, we believe that the use of the cited default productions in the meta-grammar can lead to the introduction of a bias in their favor during derivation tree generation. The following sections detail the codification scheme (Section 2.1), with the detailed meta-grammar in Section 2.1.2. The SDF mutation operator we apply is explained in Sections 2.2 and 2.3 and, finally, the fitness function in Section 2.4.

Expression Language
To encode the network structures, we define an expression language capable of simply representing many of the classification architectures found in the state of the art with some exceptions (e.g., DenseNet [76] is not supported due to the syntax we use for parallel processing branches). At this point, we only consider architectures with one input point and one classification output (architectures such as Faster-RCNN [77] with multiple outputs at different levels of the network are not supported in the current version of the system).
The expressions have the following tokens as basic components:
• "in", "out": They represent the network input and output, respectively.
• "[", "]": Start and end of parallel branches of processing that share the same input (branches are separated by ",").
• "direct": Connection between two network points without any operator being applied (residual-like connection).
Both aggregation operators are applied immediately after the closing of parallel processing branches to unify the results.
Some of the traditional operators use a parametric syntax with the format operator[-parameter]{0,n}. The specific number of required parameters varies between operators:
• "conv-i-j-k-l": i (dimension), j (number of generated features), k (stride) and l (normalization used). Currently, only Batch Normalization ("bn") [79] or no normalization are supported by our system. We also performed some tests including both regular ("conv") and separable convolutions ("sepconv").
Some important notes: "flatten" is always present before the first FC layer, and all layers use leaky-ReLU [80] activation (the only exception being the output layer, which uses softmax). Regarding the "in", "softmax" and "out" nodes, the input and output dimensions are not specified on the operators themselves since they are problem dependent and configured at the library level.
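As an illustration of how the expression language can be handled, the small helper below splits a parametric token into its operator name and parameter list. The token forms (`conv-3-128-1`, `dropout-0.4`, `direct`) follow the examples given in the text; the parsing helper itself is ours, not part of the published system.

```python
def parse_operator(token):
    """Split a parametric operator token such as 'conv-3-128-1' into
    (name, params). Numeric parameters become int/float; non-numeric
    trailing parts (e.g. the 'bn' normalization flag) stay as strings."""
    parts = token.split('-')
    name, raw = parts[0], parts[1:]
    params = []
    for p in raw:
        try:
            params.append(int(p))
        except ValueError:
            try:
                params.append(float(p))
            except ValueError:
                params.append(p)
    return name, params

# An illustrative expression built from tokens shown in this paper:
expr = "in ap-3-1 [ conv-3-128-1 dropout-0.4 , direct ] sum flatten softmax out"
tokens = expr.split()
```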

Meta-grammar
To guide the evolutionary search, the allowed network architecture is constrained to:
1. Input layer: dimensions are given by the images of the dataset being used.
2. Sequence of high-level structural blocks, each chosen from the following set:
• Convolutional block: a sequence of convolution, optional batch normalization, dropout and optional pooling.
• Inception-based block: parallel convolutional branches with their outputs concatenated (all branches are constrained to apply the same spatial dimensional reduction or none).
• ResNet-based block: two parallel processing branches, a convolutional sequence and a residual connection, with their outputs aggregated by a zero-padded sum (to allow for a different number of features between the two branches).
• Inception-ResNet-based block: an Inception-based block and a residual connection, with both outputs aggregated via a zero-padded sum.
3. Sequence of fully connected layers (leaky-ReLU activation function) alternated with dropouts. Before the first FC layer, a flatten operation is automatically performed.
4. Output layer: softmax layer with one output per class.
It is worth mentioning that the dropout operator's presence is forced, allowing a rate of 0 to account for it being inactive. That way, the operator block is always present and can easily be turned on or off via mutation or crossover. Dimensions higher than three are not used for the convolutions and poolings in light of the results obtained by Szegedy et al. [45] using factorized convolutions to replace those with higher dimensions. A limit on the number of FC layers exists mainly due to their computational cost compared to the convolutional layers.
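The zero-padded sum used by the ResNet- and Inception-ResNet-based blocks can be illustrated in NumPy: the branch output with fewer feature maps is padded with zero channels before the element-wise addition. This is a sketch assuming channels-last tensors; the actual system builds this operation inside the generated network.

```python
import numpy as np

def zero_padded_sum(a, b):
    """Aggregate two parallel-branch outputs with possibly unequal channel
    counts: the tensor with fewer feature maps is zero-padded along the
    channel axis (assumed to be the last axis) before the element-wise sum."""
    channels = max(a.shape[-1], b.shape[-1])
    def pad(x):
        extra = channels - x.shape[-1]
        if extra == 0:
            return x
        width = [(0, 0)] * (x.ndim - 1) + [(0, extra)]
        return np.pad(x, width)  # constant mode pads with zeros
    return pad(a) + pad(b)
```

The padding only appends zero channels, so the spatial dimensions of the two branches must already match, which is consistent with the grammar constraining all branches to the same spatial reduction.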

Diversity Control via Structural Schema Clusters
Due to the high number of parameters used by the expression language to define the architectures, measuring diversity by checking for unique expressions among population individuals can be misleading. By taking those expressions and ignoring their specific operator parameters, we can group them into clusters of unique abstract structural schemas. Those clusters represent groups of solutions where the connection structure between generic operations is the same. To give a simple example, both of these sub-expressions, "ap-3-1 [ conv-3-128-1 dropout-0.4 , direct ] sum conv-3-256-2 dropout-0.5" and "ap-3-1 [ conv-3-56-1 dropout-0.2 , direct ] sum conv-3-128-1 dropout-0.3", belong to the same generic structural schema: "ap [ conv dropout , direct ] sum conv dropout".
The number of schema clusters gives us an idea of how many structurally different architectures are being explored by the current population.
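The schema abstraction can be sketched as a one-line transformation that drops everything after the operator name in each token. Splitting on "-" is our assumption based on the expression examples above; the published system may abstract parameters differently.

```python
def structural_schema(expression):
    """Strip operator parameters (anything after the first '-') so that
    expressions differing only in parameter values map to the same schema."""
    return ' '.join(token.split('-')[0] for token in expression.split())

def schema_clusters(population):
    """Number of unique abstract structural schemas in a population of
    expressions, i.e. how many structurally different architectures exist."""
    return len({structural_schema(e) for e in population})
```

Applied to the two sub-expressions in the example above, both map to the same schema, so together they form a single cluster.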

Schema Diversity Factor Mutation
We propose a new grammar-based mutation operator based on Grammar-Based Mutation (GBM) [73]. Since GBM chooses only one non-terminal node from the tree for mutation and uses a fixed mutation probability, we propose some modifications to adapt it to our NAS problem. We refer to our variant of GBM as Schema Diversity Factor (SDF) mutation; it differs from GBM in two main aspects: the derivation tree's non-terminal nodes are checked for mutation following a breadth-first search, and the mutation probability (mP) is calculated for each individual derivation tree based on its number of non-terminal nodes and the population diversity (d), following Equations (1)-(3), where mP is the mutation probability for individual i and β is a scaling factor that can be set manually or calculated in relation to the population diversity. We use the population structural diversity (d), measured via unique structural schema clusters, with Equations (2) and (3).
Values of β belong to the interval [β_m, β_M]; during our experiments, the interval was set to [0.5, 2]. This has the effect of increasing the mutation probability when the structural diversity is low, in order to improve exploration, and lowering it when said diversity is high, to allow exploitation. In addition, since we check each node individually for mutation, more than one subtree is allowed to mutate at a time (more than one part of the same network architecture can be changed). Finally, we want a lower probability as the total node number grows, to avoid an excessive number of mutations, which is why we adjust it for each individual depending on its number of non-terminal nodes.
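Since Equations (1)-(3) are not reproduced in this excerpt, the sketch below only captures the qualitative behavior described in the text: β is interpolated within [β_m, β_M] from the structural diversity d, and the per-individual probability decreases with the number of non-terminal nodes. The linear forms used here are our assumption, not the paper's exact equations.

```python
def beta_from_diversity(d, beta_min=0.5, beta_max=2.0):
    """Scaling factor beta: high when structural diversity d (in [0, 1]) is
    low (favoring exploration), low when d is high (favoring exploitation).
    ASSUMPTION: linear interpolation; the paper's Equations (2)-(3) define
    the exact mapping."""
    return beta_max - d * (beta_max - beta_min)

def mutation_probability(n_nonterminals, d):
    """Per-individual mutation probability, inversely related to the number
    of non-terminal nodes so large trees do not accumulate an excessive
    number of mutations. ASSUMPTION: illustrative form only; the paper's
    Equation (1) gives the exact expression."""
    return min(1.0, beta_from_diversity(d) / n_nonterminals)
```

With ten non-terminal nodes, the probability per node ranges from 0.05 (maximum diversity) to 0.2 (no diversity), matching the intended exploration/exploitation balance.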

Fitness
To obtain an individual's fitness, its architecture expression is automatically translated into code and the resulting model is trained from scratch on the training partition of the dataset. For the loss, we use either the Root Mean Square Error (RMSE) or the Categorical Cross-Entropy (CCE) over the validation partition to obtain the fitness using the following equation, where n = 3 is used to smooth the oscillations of the loss value and obtain an averaged estimation.
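The smoothing over the last n validation losses can be sketched as a simple running average. This is an illustrative helper; the paper's exact fitness equation is not reproduced in this excerpt.

```python
def fitness_from_history(val_losses, n=3):
    """Average of the last n validation losses (RMSE or CCE), smoothing the
    epoch-to-epoch oscillations of the loss value. Lower values indicate a
    better model. ASSUMPTION: sketch only; the exact fitness equation is
    given in the paper."""
    tail = val_losses[-n:]  # falls back to the full history if shorter than n
    return sum(tail) / len(tail)
```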

Experimental Setup
Two different servers were used to conduct the evolution tests (Table 1). We restrict the search space of valid architectures for our experiments as follows:
• The maximum number of fully connected layers was set to 1 in order to reduce the memory and computational cost of training the networks during the evolutive design stage. The maximum number of neurons on those layers was kept small for the same reason (8, 16, 32).
• Only convolutions of dimensions (1, 2, 3) were allowed, according to the findings in [45] regarding factorized convolutions. The number of features is a set of typically used values in the range [2, 256].
• We allowed the use of the two most frequently used types of pooling (max-pooling and average-pooling).
• Regarding the stride of convolutions and poolings, we used 1 to keep spatial dimensions and 2 to allow their reduction at certain parts of the architecture (those places are specified inside the meta-grammar restrictions).
• Valid values for the dropout rate parameter were set inside the interval [0, 0.5] with increments of 0.1, where 0.0 is interpreted as an inactive dropout operator. The upper value of 0.5 was set empirically to avoid the TensorFlow-Keras warnings obtained for values above 0.5.
• We allowed either no normalization operation or Batch Normalization (BN), to let the design process choose where to place the BN operations.
• In all cases during design, the network was trained for its evaluation only for a maximum of 20 epochs (with early stopping if there was no improvement on the training metrics for 5 epochs).
For the models labeled with the prefix "v2" (version 2), we allowed the following settings (due to the higher computational capabilities of the server used for those experiments):
• The maximum number of FC layers was 3 instead of 1.
• Bigger dimensions for FC layers (up to 256 neurons).
• Separable 2D convolutions ("sepconv") were used in addition to regular convolutions.
• BN momentum was empirically set to 0.5 (due to the average size of the training batches, determined by GPU number and available memory).
In all of the experiments, the dataset was partitioned into train/validation/test subsets with a ratio of 70:20:10. This partitioning was kept because internal intermediate reports were filled with test metrics from the test subset; thus, despite k-fold cross-validation being used over the whole dataset for the final evaluation, only 90% of the dataset was used at the design step (we report this for full disclosure). Tournament selection used size k = 4 (the selection pressure took values 0.7 or 1 depending on the experiment).
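The 70:20:10 partitioning can be sketched as a shuffled index split. This is an illustrative helper, not the project's actual data pipeline; the dataset size used in the assertion below matches the 17,159 samples reported in the Results section.

```python
import random

def split_indices(n, ratios=(0.7, 0.2, 0.1), seed=0):
    """Shuffle sample indices and split them into train/validation/test
    subsets following a 70:20:10 ratio (any rounding remainder goes to
    the test subset)."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = int(n * ratios[0])
    n_val = int(n * ratios[1])
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]
```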

Results
The dataset consists of 17,159 samples of 256 × 256-pixel RGB images, labeled to indicate whether they contain roads (53.5%) or not (46.5%). The images are from the Spanish National Plan for Aerial Orthophotography (PNOA), with original pixel resolutions of 0.25 or 0.50 m. They were taken from a Web Map Tiled Service (OGC:WMTS) of the IGN at zoom level 18, which corresponds to a scale of 1:2132, so the equivalent pixel size on the ground is 0.597 m. The samples correspond to eight areas of the Spanish geography that can be considered representative: Albacete, the Balearic Islands, Ciudad Real, Galicia, Huelva, Murcia, Navarre and Segovia. We refer to this dataset as IGN-PNOA hereafter.
Figure 4 shows sample results of a designed network's output for three random groups of samples outside of the dataset (the orthophotos were taken from random geographical areas of Spain with no associated cartography available), used to check its predictions over unknown data. Under each image of a group is the network's predictive output for the class "contains road". In all these examples, the somewhat extreme output values given by the network are due to the use of the softmax activation on the output layer.
We selected the nine highest scoring networks (measured on the validation partition) designed by VGNet; all models were trained from scratch using the train and validation examples. To evaluate their test accuracy, k-fold cross-validation (CV) was used with the typical values of k = 5 and k = 10 [81,82]. Regarding k = 10, the authors of [81,83] showed that 10-fold CV obtains results similar to leave-one-out cross-validation (LOOCV) at a lower computational cost. Test metrics are included in Figures 5 and 6 and detailed in Table 2. Their respective accuracies (95% confidence intervals, using the calculation method detailed by Caffo [84]) are shown in Table 3 to indicate the expected accuracy for each model. In all cases, the same parameters used for designing the networks (epochs = 200, alpha = 1e-4, optimizer = Adam and batch size calculated for each model based on available GPU memory) were used to train them at this final stage. Finally, for the two best scoring models in Table 4, additional metrics were calculated on the test partition: precision, recall, F1 score and Area Under the ROC Curve (AUC). To give an example of one of the designed networks, a high-level architectural overview of model "v2m2" is shown in Figure 7, with sequences of operations grouped into blocks for clarity (as detailed in the figure legend). This particular model contains two Inception-ResNet-like sections at the areas with parallel operator branches. Regarding execution times, on average, k-fold (k = 5) evaluation takes 52.8 h on the IGN1 server (4× Tesla V100 GPUs). In addition, in [43], the mean accuracies for the same dataset are reported, among others, for ResNet-50 (91.7%), VGG (93.9%) and for two different ensembles (94.6% and 95.6%, respectively). Here, ensemble refers to the combination of multiple networks into one meta-classifier, namely their results are combined to improve predictions.
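For reference, a confidence interval for a test accuracy can be computed with the standard Wald formula shown below. The paper cites Caffo [84] for its exact calculation method, so this is a common textbook approximation, not necessarily the same one used for Table 3.

```python
import math

def accuracy_ci(correct, total, z=1.96):
    """Approximate 95% Wald confidence interval for a classification
    accuracy: p +/- z * sqrt(p * (1 - p) / n), clipped to [0, 1].
    ASSUMPTION: standard Wald interval; the paper follows Caffo [84]."""
    p = correct / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    return max(0.0, p - half_width), min(1.0, p + half_width)
```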
Finally, to give a point of common reference for other authors to be able to compare results, we ran preliminary tests (with no additional fine-tuning of any kind) on a typical NAS benchmark, the CIFAR-10 dataset (we set the maximum number of FC layers to 3 in these experiments). Here, the best model (selected by validation loss) designed by VGNet obtained test categorical accuracies that oscillate in the range shown in Table 5. The gap between these results and the current state of the art for CIFAR-10 can be explained by the fact that the grammar, as well as other internal hyperparameters, was designed for road detection on the IGN-PNOA dataset (including, as mentioned, some parameters that were set due to our current hardware constraints). This suggests that application to other problem domains, such as CIFAR-10 or other remote sensing tasks, may need further grammar modifications and fine-tuning to achieve optimal results.

Discussion and Future Work
Our system was able to successfully design full networks that, trained from scratch (without any manual fine-tuning, data augmentation or application of ensembles), yielded for the best scoring model a test accuracy of 90.6-93.1% with k-fold (k = 5) and 91-93.8% (k = 10). Therefore, the GGGP NAS approach shows results similar to well-known architectures. The main disadvantage we find is the time it takes to run the evolutive algorithm. We should note that, as mentioned above, some of the meta-grammar parameters were set in our tests according to hardware constraints (e.g., GPU memory) and therefore should not be interpreted as optimal settings in any case. Exploring ways of efficiently incorporating transfer learning and data augmentation, fine-tuning the system, and adapting the grammar for pixel-level semantic labeling architectures are some of the next steps of the Cartobot project. On the other hand, meta-grammars allow new applications of the system as a hypothesis checker, by using different variations of meta-grammars and comparing the results obtained. For example: Is it better to apply dropout only at a certain stage of the architecture? What is the optimal value for the dropout rate?
We are currently working on methods to speed up the design process, from reducing the number of epochs at early design stages to using caching methods for the convolutional block weights or migrating to training on cloud servers. Research is also being conducted on ways to gain reusable knowledge among different evolution experiments or problem domains. Finally, to improve our method's capability of balancing exploration/exploitation during the NAS, future research will further explore enhancements to the presented SDF mutation operator.