Semantic Segmentation for Aerial Mapping

Abstract: Mobile robots commonly have to traverse rough terrain. One way to find the most easily traversable path is to determine the types of terrain in the environment; the result of this process can then be used by path planning algorithms to find the best traversable path. In this work, we present an approach for terrain classification from aerial images using a Convolutional Neural Network at the pixel level. The segmented images can be used in robot mapping and navigation tasks. The performance of two different Convolutional Neural Networks is analyzed in order to choose the best architecture.


Introduction
One of the most important tasks in mobile robotics is navigation. Robot navigation addresses the problem of moving a robot from its current position to a desired goal. Commonly, to achieve this task, the robot uses its onboard sensors to perceive the environment. In [1], the authors presented a method in which Unmanned Aerial Vehicles (UAVs) work as sensors for ground mobile robots, providing information about the environment from a perspective not available from the ground. In Figure 1, we present an example of the information provided by the UAV. An important problem that arises with the use of optical sensors is how to interpret the large amount of data that they provide. Usually, the data provided by an optical sensor are classified as traversable or non-traversable terrain. The resulting maps can also be three-dimensional (3D), as shown in [2][3][4]. In other applications, the difficulty of passing through different types of terrain is described by a cost assigned to areas of the map, as in [5,6].
In terms of path planning, this gives more possibilities to generate different paths: ones that travel the least possible distance, take less time, or consume less energy.
Given the amount of data provided by a visual sensor mounted on a UAV, we require a preprocessing algorithm that extracts the most important information. To solve this problem, we propose an approach based on image segmentation, in which the image is divided into different entities, similar to Figure 2, where a dog is segmented from the image and the rest is considered background. In our approach, a two-dimensional (2D) map is generated from aerial images using semantic segmentation with a Convolutional Neural Network (CNN). To perform this task, we assign a cost to each pixel in the image according to a classification over twelve classes. Our CNN has a U-shaped architecture based on the U-net described in [7], which acts as an encoder-decoder; this architecture has shown great performance with limited datasets [8][9][10]. The semantically segmented image can be used to improve a map, since pixel-level segmentation can generate a more detailed map for mobile robot navigation tasks. The rest of the paper is organized as follows: the related work is presented in Section 2. The architecture of the proposed approach is presented in Section 3. The experimental results of our approach are shown in Section 4. Finally, the conclusions are given in Section 5.

Related Work
The mapping task is commonly related to light detection and ranging (LiDAR) sensors. In some cases, the authors combine these sensors with algorithms to classify sections of the readings, as in [11], where they are combined with Support Vector Machines (SVMs). In [12], the authors construct roadway maps with the KITTI dataset employing LiDAR odometry. In [13], a new LiDAR sensor opens the possibility of mapping natural bodies of water. Recently, Independent Component Analysis (ICA) in conjunction with an SVM has been applied to an input formed by LiDAR readings and high-resolution images to classify the land into seven types of plants plus one class for unclassified terrain, reaching an accuracy of 73.6% using only the LiDAR information, which increased to 74.7% when the RGB images of the terrain were added. In [14], the authors built a semantic map using a 3D LiDAR sensor with a Multi-Layer Perceptron to classify the point cloud.
Other approaches use only images; for example, in [15], the authors used sub-pixel mapping to deal with low-resolution images, a technique that can be combined with other information, as in [16,17]. Many mapping approaches use satellite images to map roadways or terrain, taking advantage of files with multi-spectral bands, as in [18,19]. Similarly, [20] maps a Chinese city using a cascade of minimum distance, maximum likelihood, and SVM classifiers. In [21], the authors map another city with an SVM.
Mapping the environment around the robot can also be performed with RGB-D images that include depth information [22]. Other works add thermal information to generate the map and, later, the path for the mobile robot. In [23], the authors created a point cloud labeled with eight classes using Conditional Random Fields (CRFs) [24].

Convolutional Neural Networks
Certainly, the use of convolutional neural networks is an important part of the success of deep learning in image processing. Although neural networks operate primarily over matrices, they also have a biological analogue: CNNs capture a representation of brain function in which every neuron is stimulated by a part of the visual field, and all of the parts overlap to operate on the whole image, as shown in [25][26][27][28].
Convolution is an integral measuring how much two functions overlap as one passes over the other. In image processing, the first function is the image and the second is the filter or kernel. Because the image is discrete, the convolution can be written as

S(i, j) = (I * K)(i, j) = Σ_m Σ_n I(i + m, j + n) K(m, n),

where S is the map of features, I is the input image, K is the kernel or filter, and m, n are the indexes for rows and columns inside the kernel. Figure 3 shows a graphic description of how the output is generated. One advantage of CNNs in image processing is that they share weights. Weight sharing reduces the number of parameters to learn; therefore, the required memory is lower. Subsequently, the convolutional layer applies the convolution of the kernel over the image, or over the output of another convolutional layer, to produce a linear output that is passed through a nonlinear activation function, such as Sigmoid, Hyperbolic Tangent, Rectified Linear Unit (ReLU), Exponential Linear Unit (ELU), or Leaky ReLU. Other types of layers are the pooling layers, which can be Max or Average; their purpose is to reduce the number of parameters and control overfitting, reducing the size of the network by taking the maximum or the average value of the area over which the kernel passes. A graphic representation is shown in Figure 4, where max and average pooling layers of size 2 × 2 are applied to a 4 × 4 array. In addition to these two most-used layers, there are other types, such as "network in network" [29], Flattened [30], Depthwise separable [31,32], spatial separable with asymmetric convolution [33], Deconvolution [34], etc. Some of them are presented in more detail in Section 3. Given these elements, a CNN can learn features according to the application, in contrast with hand-crafted features, which must be designed manually.
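As an illustration, the discrete convolution and the 2 × 2 pooling operations described above can be sketched in plain NumPy. This is an explanatory sketch, not the implementation used in the paper; the function names are ours.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Discrete 2D convolution in the cross-correlation form used by CNNs:
    S(i, j) = sum_m sum_n I(i + m, j + n) * K(m, n), with 'valid' padding."""
    h, w = image.shape
    f, g = kernel.shape
    out = np.zeros((h - f + 1, w - g + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + f, j:j + g] * kernel)
    return out

def pool2d(image, size=2, mode="max"):
    """Non-overlapping max/average pooling with a size x size window,
    as in the 2 x 2 example of Figure 4."""
    h, w = image.shape
    blocks = image[:h - h % size, :w - w % size].reshape(
        h // size, size, w // size, size)
    reduce = np.max if mode == "max" else np.mean
    return reduce(blocks, axis=(1, 3))
```

For a 4 × 4 input, `pool2d(x, 2)` returns a 2 × 2 array holding the maximum (or average) of each non-overlapping 2 × 2 block, which is exactly the size reduction the pooling layer provides.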
Finally, to enumerate some applications of CNNs: image classification [35,36], object detection [37], image denoising [38], image super-resolution [39], and image segmentation, which is detailed in the next section.

Semantic Segmentation
The main task of semantic segmentation is to understand the image at the pixel level, labeling each pixel with a class. Some classical approaches address this task without CNNs: [40] uses semisupervised learning of the class distributions and then performs the segmentation by means of a multilevel logistic model, while [41] addresses the problem as one of signal decomposition, solving it with an alternating direction method of multipliers to separate text and moving objects from the background. CNN-based approaches normally use an end-to-end architecture, replacing the fully connected layers of common image classification networks with layers that upsample the features, working as an encoder-decoder. The output usually has the same size as the input, with every pixel assigned a class. This represents an improvement over common object detection or recognition, where a bounding box encloses the objects of interest. In terms of applications, semantic segmentation is used in autonomous vehicles, commonly trained with the Cityscapes dataset [42], in human-computer interaction, and in biomedical image analysis [7].
One of the most representative architectures for semantic segmentation is the Fully Convolutional Network (FCN) [43], where the authors take famous classifiers (AlexNet, VGG nets [44], and GoogLeNet [45]) and connect upsampling layers at the end for end-to-end learning by backpropagation from the pixel-wise loss. However, FCN has a problem with the pooling layers, which cause a loss of information. For this reason, SegNet [46] adds to the upsampling layers a layer with the pooling indexes. Another inconvenience is the lack of global context; to overcome this issue, some changes have been implemented, such as atrous convolutions [47][48][49], to obtain different sub-region representations. Recent works include Auto-DeepLab [50], which alternates between optimizing the weights and the architecture of the network; the Dual Attention Network (DANet) [51], which proposes one module for spatial contextual information and another for the channel dimension, finally merging them; and CCNet [52], which introduces the criss-cross attention module to collect contextual information. Most of these approaches use ResNet101 [53] as the backbone encoder; this is summarized in [54]. Furthermore, there are approaches with the objective of performing in real time or on hardware with limited resources, namely ESNet [55] and FPENet [56].
Another popular architecture is U-net [7], which has outstanding performance with datasets containing a small number of images. In the encoder part, it has the typical series of convolution layers followed by pooling layers, whereas, in the upsampling part, it concatenates a cropped copy of the features from the convolutions, providing local information to complement the global information generated in the upsampling. The output layer has as many 1 × 1 filters as there are classes to assign.

Semantic Segmentation Architecture
Our approach is based on the U-Net architecture, as it has shown great performance with limited datasets. The proposed architecture has fewer filters; in our case, the number of filters in each level is reduced by a factor of four compared to the original. From this architecture, we derived two further architectures, one with depthwise separable convolution (U-Net DS), see Figure 5, and another using spatial separable convolution (U-Net SS), with the objective of having fewer parameters to learn, since the intention is to run the network on the limited hardware of a mobile robot, using less memory and speeding up the segmentation. These two types of convolution are explained in the following two subsections. It is important to note that the activation functions were changed from ReLU to ELU, since ELU avoids the dying ReLU problem [57]. The output layer is a 1 × 1 convolutional layer with twelve channels, one channel for every class of interest, with a softmax activation function; the classes are: Background, Tile, Grass, Person, Stairs, Wall, Roof, Tree, Car, Cement, Soil, and Injured person.

Depthwise Separable Convolution
Depthwise separable convolution divides the standard convolution into a depthwise convolution and a 1 × 1 convolution named pointwise. The depthwise part takes an input of h × w × m and applies m filters of size f × f × 1, where f is the height and width of the filter and m is equal to the number of channels in the input; as a result, an intermediate output of (h − f + 1) × (w − f + 1) × m is generated. The pointwise part uses n filters of size 1 × 1 × m to generate an output of (h − f + 1) × (w − f + 1) × n, as shown in Figure 6. The disadvantage of this separation is that the network has fewer parameters to learn, which can mean a loss in accuracy; however, the network is smaller and can thus run faster, making it suitable for limited hardware, such as mobile devices [58].
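The two stages can be sketched directly from the shapes above. This is an illustrative NumPy implementation (ours, not the paper's code): the depthwise stage filters each channel independently, and the pointwise stage mixes channels with 1 × 1 weights. Note the parameter count drops from f·f·m·n for a standard convolution to f·f·m + m·n.

```python
import numpy as np

def depthwise_separable_conv(x, depth_k, point_k):
    """x: (h, w, m) input; depth_k: (f, f, m), one spatial filter per channel;
    point_k: (m, n), the 1 x 1 pointwise weights. Returns (h-f+1, w-f+1, n)."""
    h, w, m = x.shape
    f = depth_k.shape[0]
    oh, ow = h - f + 1, w - f + 1
    inter = np.zeros((oh, ow, m))  # depthwise stage: per-channel convolution
    for c in range(m):
        for i in range(oh):
            for j in range(ow):
                inter[i, j, c] = np.sum(x[i:i + f, j:j + f, c] * depth_k[:, :, c])
    # Pointwise stage: a 1 x 1 convolution is a matrix product over channels.
    return inter @ point_k
```

With f = 3, m = 3, and n = 4, a standard convolution needs 3·3·3·4 = 108 weights, while the separable version needs only 3·3·3 + 3·4 = 39.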

Spatial Separable Convolution
This convolution separates the f × f kernel in two: one vertical and one horizontal, with dimensions of f × 1 and 1 × f, respectively. This division of the filter reduces the number of multiplications. For example, if f = 3, instead of the nine multiplications of the conventional convolution, the separation requires only six: three for the vertical filter and three for the horizontal one. Figure 7 shows a graphic description of the separation and how the output has the same dimensions as if an f × f filter were applied to a one-channel image. The advantage of this separation is that, since the number of operations is reduced, the network runs faster. However, the number of parameters is smaller, and not all filters can be separated. In order not to limit the network, these separations were only added in the second and fourth layers.
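The equivalence is easy to verify numerically: a separable f × f kernel is the outer product of an f × 1 and a 1 × f vector, and applying the two small filters in sequence gives the same output as the full kernel. The sketch below (ours, for illustration) uses the Sobel kernel as a classic separable example.

```python
import numpy as np

def conv_valid(img, k):
    """2D valid cross-correlation of a single-channel image with kernel k."""
    fh, fw = k.shape
    oh, ow = img.shape[0] - fh + 1, img.shape[1] - fw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(img[i:i + fh, j:j + fw] * k)
    return out

# A separable f x f kernel is the outer product of an f x 1 and a 1 x f vector.
v = np.array([[1.0], [2.0], [1.0]])   # vertical 3 x 1 filter
h = np.array([[1.0, 0.0, -1.0]])      # horizontal 1 x 3 filter
k = v @ h                             # the full 3 x 3 kernel (here: Sobel)
```

Applying `v` and then `h` performs 3 + 3 = 6 multiplications per output position instead of the 9 needed by the full 3 × 3 kernel, while producing an identical result.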

Training
The proposed architectures were trained for 100 epochs with a batch size of eight; the loss function was the categorical cross-entropy. The accuracy of the model was measured with the Dice function (F1 score), defined as

Dice = 2 |X ∩ Y| / (|X| + |Y|),

where X is the set of pixels predicted for a class and Y is the set of pixels labeled with that class. The dataset used contains photos taken from low heights with drones, or with cellphones from places such as roofs or bridges, similar to the Inria Aerial Dataset [59], but from a height at which it is easy to detect a mobile robot. It consists of 2647 images, of which 2203 were taken for training. The images do not have a fixed resolution or shape. They were labeled by an expert with the software labelme [60]; see Figure 8. The number of images in which each class is present is shown in Table 1, except for the background class, which is not labeled. The dataset was augmented with the transformations presented in Table 2. The dataset is available to the reader, and the information can be consulted in the Supplementary Material section. Our approach is compared against the typical U-Net; two networks focused on real-time semantic segmentation, the Light-weight Context Guided Network (CGNet) [61] and DABNet [62]; a light version of the high-resolution network (HRNet) [63] with the number of kernels in each block reduced by half; and, finally, ExFuse [64] with ResNet18 as the backbone encoder.
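For binary masks, the Dice score above can be computed as follows (an illustrative sketch in NumPy, not the training code):

```python
import numpy as np

def dice_score(pred, target):
    """Dice coefficient (F1) for binary masks:
    Dice = 2 * |pred ∩ target| / (|pred| + |target|)."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    denom = pred.sum() + target.sum()
    if denom == 0:
        return 1.0  # both masks empty: treat as perfect agreement
    return 2.0 * np.logical_and(pred, target).sum() / denom
```

For a multi-class segmentation, the same function can be applied to the binary mask of each class and the per-class scores averaged.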

Results
The networks were tested on a PC with Windows as the operating system, an AMD Ryzen 7 3750H processor, 16 GB of RAM, and an RTX 2060 GPU. The test dataset contains 444 images.
The results in the F1 score and in another popular metric, the Intersection over Union (IoU) or Jaccard index, defined as

IoU = |X ∩ Y| / |X ∪ Y|,

are presented in Table 3. It is shown that, as occurred in the training, the networks that outperformed our approach were ExFuse and DABNet. ExFuse surpassed ours by 0.0363 in Dice and 0.0915 in IoU, and DABNet got a higher score with a difference of 0.0225 in Dice and 0.0282 in IoU with respect to U-Net DS; however, DABNet has 96% more parameters and ExFuse 3997% more. Concerning frames per second (FPS), the best result was obtained by U-Net DS with 17.2725 FPS against DABNet with 11.0042 FPS, which means that our proposed architecture is 56.96% faster than DABNet and 64.48% faster than ExFuse. The other architectures were outperformed in all four aspects: Dice, IoU, parameters, and FPS. These results are easily visible in Figures 11 and 12.
Table 3. Results on the test dataset in Dice and Intersection over Union (IoU) scores of each network.
Furthermore, a comparison of U-Net DS varying the number of kernels in each level was performed. Let the base number of kernels be C; then, the value in each level is [C, 2C, 4C, 8C, 16C], and in our approach C = 16, as in Figure 5. The comparison was performed against U-Net DS with C = 32 and C = 64. The best scores in Dice and IoU were obtained with C = 32, reaching 0.8292 and 0.7457, respectively, even better than with C = 64, which could be due to overfitting. Taking into consideration that memory usage and speed are the priority, once again U-Net DS with C = 16 had the best trade-off, having just 25.33% of the parameters of U-Net DS with C = 32 and being 8.93% faster, at the cost of losing 6.4% in Dice and 9.1% in IoU scores. Additionally, Table 4 presents the results for each class using the U-net with depthwise separable convolutions, which gives a better understanding of the general results.
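The IoU used above can be computed for binary masks in the same way as the Dice score (again, an illustrative sketch, not the evaluation code). The two metrics are related by Dice = 2·IoU / (1 + IoU), which is why IoU differences in Table 3 are somewhat larger than the corresponding Dice differences.

```python
import numpy as np

def iou_score(pred, target):
    """Jaccard index for binary masks: |pred ∩ target| / |pred ∪ target|."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    union = np.logical_or(pred, target).sum()
    if union == 0:
        return 1.0  # both masks empty: treat as perfect agreement
    return np.logical_and(pred, target).sum() / union
```

As with the Dice score, a multi-class IoU is obtained by averaging the per-class binary scores.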
The low score in the injured person class is due to the dataset lacking a considerable number of examples with injured persons, while classes like tile, cement, and grass share similar textures or colors; in contrast, the car class is easy for the network to learn because the majority of the cars have a similar shape, changing only in color. Additionally, most of the higher scores match the number of examples per class: for instance, the cement class is present in 2117 of the 2203 images of the training dataset, while classes with low scores, such as Injured Person and Stairs, are present in only 10 and 91 images, respectively. Qualitative results are shown in Figure 15, with the input, the output for four of the twelve classes, and, finally, the map with all of the classes colored. The simplicity of the implementation, the low inference time, and the low hardware requirements, such as memory, processor, or graphics card, are some of the advantages of the proposal, achieved thanks to the architecture and the choice of a reduced number of filters combined with the use of depthwise convolution.
Figure 15. Results on the test set: the first column shows the input image; the second column, the output layer corresponding to the car class; the third column, the output layer for the tree class; the fourth column, the pixels classified with the person label; the fifth column, the output for the cement class; and, finally, the output with all of the classes present in the image.

Conclusions
In this paper, the authors proposed a method to extract the most important information from aerial images using a CNN for image segmentation. To solve the problem, two different CNN architectures were proposed. The experimental results show that the U-net with depthwise separable convolutions is the best architecture for this problem, as it had the best trade-off: fewer parameters, which corresponds to less memory usage, and an increase of 50% in FPS, despite a loss of 2% and 4% in the Dice and IoU scores, respectively. This architecture is able to segment the dataset correctly. With these results, a UAV can send aerial images to a mobile robot, which can apply the proposed algorithm to perform the task of mapping the terrain. The output of the algorithm can then be used by the path planning algorithm of the mobile robot to perform navigation tasks.

Conflicts of Interest:
The authors declare no conflict of interest.