A Deep Learning Architecture For 3D Mapping Urban Landscapes

Abstract: In this paper, we present a Deep Learning architecture for the three-dimensional reconstruction of outdoor environments under challenging terrain conditions. The proposed architecture is configured as an Autoencoder; however, instead of the typical convolutional layers, some differences are proposed. The Encoder stage is set as a residual network with four residual blocks, trained to extract feature maps from aerial images of outdoor environments. The Decoder stage, on the other hand, is set as a Generative Adversarial Network (GAN) and is called the GAN-Decoder. The proposed network takes a sequence of 2D aerial images as input. The Encoder stage extracts the vector of features that describes the input image, while the GAN-Decoder generates a point cloud based on the information obtained in the previous stage. By supplying a sequence of frames with a percentage of overlap between them, it is possible to determine the spatial location of each generated point. The experiments show that with this proposal it is possible to build a 3D representation of an area flown over by a drone, using the point cloud generated by a deep architecture that takes a sequence of 2D aerial images as input. In comparison with other works, our proposed system is capable of performing three-dimensional reconstructions of challenging urban landscapes. Compared with the results obtained using commercial software, our proposal generates reconstructions in less processing time, requires a lower overlapping percentage between 2D images, and is invariant to the type of flight path.


Introduction
Three-dimensional reconstruction and visual representation is a broadly studied problem with applications such as object recognition and scene understanding. State-of-the-art 3D reconstruction algorithms show important results and propose solutions to the Structure from Motion (SfM) and Simultaneous Localization And Mapping (SLAM) problems [1][2][3][4][5][6][7][8][9][10][11]. These techniques perform localization, mapping, and 3D reconstruction using active sensors (e.g., LiDAR scanners) and passive sensing (e.g., stereo cameras).
However, none of these methods performs well in practical 3D reconstruction scenarios. Given the ambiguous correspondences between pixels and 3D spatial points, projection from 2D to 3D remains remarkably difficult and unintuitive: these models are typically incapable of producing reliable matches in regions with repetitive patterns, homogeneous appearance, or large illumination changes, a typical problem in photogrammetry [12][13][14][15]. The problem becomes even more challenging when working with aerial images of outdoor environments.
Nevertheless, with current advances in Deep Learning, it is possible to apply different architectures to obtain similar or better results by combining different configurations of deep neural networks (e.g., Autoencoders). For example, using large existing datasets and fusing data obtained from a stereo camera and an active 3D LiDAR [16][17][18], it is possible to perform long-range depth estimation and 3D reconstruction.
In this work, our main interest is to perform three-dimensional reconstructions from input 2D aerial images. Due to the orographic conditions of the state of Oaxaca, Mexico, it is very difficult to obtain data from different areas, so a three-dimensional model would give us information about the areas of interest. In particular, the terrain of the Technological University of the Mixteca, in the highlands of Oaxaca, presents variations in height and several areas with large amounts of homogeneous vegetation. Therefore, a three-dimensional point-cloud model generated by a neural network architecture from 2D images alone would allow us to obtain important information about the university areas.
A precise 3D reconstruction of the campus is important for several projects, ranging from infrastructure expansion, rooftop water recuperation, and the siting of green energy plants to virtual tours for prospective students. With a surface of 1.0 × 10^6 m² and many tall buildings on campus, we need the support of aerial images to cover the extent of the university and, with the information provided from different perspectives, carry out a digital reconstruction.
The motivation that drives us in this direction is the observation that the proposals in the current state-of-the-art do not focus on the use of 2D aerial images. Furthermore, these proposals use information from other sensors such as LiDARs [16][17][18]. On the other hand, current advances indicate that it is possible to combine different Deep Learning architectures to perform digital 3D reconstructions, as in [19,20]. Therefore, in this work, we propose to use an Autoencoder [21][22][23][24][25] architecture and Generative Adversarial Networks [26][27][28][29][30], with a sequence of 2D aerial images as input to the proposed neural network architecture, to obtain as output a three-dimensional reconstruction with point clouds.
Our contribution is summarized in two aspects:

1. Based on an Autoencoder architecture, we propose a deep residual neural network for the Encoder stage and a GAN network for the Decoder stage. This configuration generates a point cloud using a sequence of 2D aerial images as input. The proposed methodology does not need information from other sensors, such as LiDARs, to deliver reliable results similar to those of commercial software.

2. This proposal works at different altitudes (100-400 m) and at low overlapping percentages between images (30-80%), and is independent of the flight path used to capture the image sequences of the target area.

Related Works
We reviewed the current state-of-the-art and found different methods to perform 3D structure inference of an object using a single image. These works also attempt to solve the SfM and SLAM problems [31][32][33][34].
Recent works combine Deep Learning techniques to perform three-dimensional reconstructions using data from stereo cameras, mono-LiDAR, and stereo-LiDAR setups, merging the data from these sensors to obtain better results [35][36][37][38][39]. However, these proposals focus on solutions for individual objects, and many of them only address the reconstruction of structured environments. On the other hand, only a few works [40][41][42][43][44] focus on the reconstruction of environments with challenging conditions such as high altitudes, homogeneous textures, etc. In addition, these works do not contemplate the limitations of working only with monocular aerial 2D images, and only a few of them use Deep Learning techniques.
Commercial software such as Pix4DMapper, Agisoft Photoscan, and DroneDeploy is able to perform 3D reconstruction of outdoor environments, but often requires specific configurations to guarantee results, for example, a minimum flight height and a minimum overlapping percentage between the images of a sequence.

Network Architecture
The model proposed to infer a complete 3D shape of a landscape, and of the objects present in the terrain, from a sequence of 2D aerial images is shown in Figure 1. It consists of an autoencoder configuration whose main parts, the Encoder, the Bottleneck, and the Decoder, are described in detail below. The Encoder is configured as a Residual Network (ResNet) composed of four Residual Convolutional Blocks. This configuration allows obtaining dense feature maps from the input image sequence. At the output of the Encoder, we obtain the correspondence data needed to generate point clouds in the next stage.
On the other hand, and unlike a classic autoencoder architecture, in this proposal the decoder is based on a Generative Adversarial Network (GAN), composed of a Generator network and a Discriminator network, capable of generating a point cloud from the input image sequence and the correspondences produced in the previous stage. This stage is called the GAN-Decoder.

Encoder
The Encoder stage is set up with four residual blocks. Each one is designed with two convolutional layers, each followed by a batch-normalization layer and a Parametric ReLU [45] activation function. The network's layers are shown in Figure 2. These layers are used to extract dense feature maps.
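A minimal Keras sketch of one such residual block is shown below. The paper fixes only the layer types (two convolutions, batch normalization, Parametric ReLU, skip connection); the filter count, kernel size, and input resolution used here are illustrative assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters=64, kernel_size=3):
    """One residual block as described in the text: two convolutional
    layers, each followed by batch normalization and a Parametric ReLU,
    plus a skip connection. Filter count and kernel size are assumed."""
    shortcut = x
    y = layers.Conv2D(filters, kernel_size, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.PReLU(shared_axes=[1, 2])(y)
    y = layers.Conv2D(filters, kernel_size, padding="same")(y)
    y = layers.BatchNormalization()(y)
    y = layers.Add()([shortcut, y])
    return layers.PReLU(shared_axes=[1, 2])(y)

# Stack the four residual blocks of the Encoder stage.
inputs = layers.Input(shape=(128, 128, 64))
x = inputs
for _ in range(4):
    x = residual_block(x)
encoder = tf.keras.Model(inputs, x)
```

Sharing the PReLU slope across spatial axes keeps the parameter count per activation small, which matters when four of these blocks are stacked.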
To correctly generate the geometry, a fully connected layer and an Image Retrieval layer are appended to obtain key points and point correspondences between the images of each input sequence. Furthermore, a max-pooling layer and a fully connected layer are added to apply geometric correction.
Compared to traditional convolutional layers, the residual convolutional layers allow a more efficient extraction of dense features from aerial images, useful to better describe the objects present in the target scene. The vector of features extracted by the Encoder, the Dense Feature Vector (DFV), has a size of 1 × 1 × 1024 and is reshaped and concatenated with the point cloud for the training stage.

GAN-Decoder
The proposed decoder is based on a GAN network, consisting of a Generator network and a Discriminator network. During training, the feature vector from the Encoder stage is concatenated with each point of an initial, uniformly spaced point cloud, forming a new feature vector. Figure 2 shows this concatenation process. The new vector is fed to the GAN-Generator. After three FC layers, each followed by a ReLU, the Generator ends with a fully connected layer and a max-pooling layer that predicts the final point cloud with a 1024 × 3 shape. The difference between our proposal and other proposed networks is how the models are reconstructed: by starting from an initial point cloud enriched with real feature maps from aerial images, the GAN-Decoder produces better point cloud inferences.
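The shape flow of this concatenation step can be sketched in NumPy. The sizes (1024 points, a 1024-dimensional DFV, a 1024 × 3 output) come from the text; the flat grid used as the initial cloud and the random projection standing in for the trained FC layers are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_points, feat_dim = 1024, 1024

# Dense Feature Vector from the Encoder (reshaped from 1 x 1 x 1024).
dfv = rng.standard_normal(feat_dim)

# Initial uniformly spaced point cloud: a flat 32 x 32 grid (assumed form).
gx, gy = np.meshgrid(np.linspace(-1, 1, 32), np.linspace(-1, 1, 32))
initial_cloud = np.stack([gx.ravel(), gy.ravel(), np.zeros(n_points)], axis=1)

# Concatenate the DFV with every point -> one enriched vector per point.
per_point = np.concatenate([initial_cloud, np.tile(dfv, (n_points, 1))], axis=1)

# Stand-in for the trained FC layers: a random projection back to xyz.
w = rng.standard_normal((per_point.shape[1], 3)) * 0.01
predicted_cloud = per_point @ w  # shape (1024, 3), as in the text
```

The point of the sketch is that every output point sees the same global image descriptor, so the Generator deforms a fixed template cloud conditioned on the scene.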

Generator Network
The core of the proposed Generator network, illustrated in Figure 2, uses ten residual blocks identical to those of the Encoder stage. To improve the precision of the generated point cloud vector, two fully connected and two max-pooling layers are added in the sub-pixel convolution block and trained according to the model proposed by Shi et al. [46], although in that model a point cloud database is used to increase the knowledge of the proposed model.

Discriminator Network
To discriminate real point clouds from generated ones, the Discriminator Network follows the architectural guidelines summarized by Ledig et al. [47] and Goodfellow et al. [48]: it uses a LeakyReLU activation function (α = 0.3) and avoids max-pooling throughout the network. Finally, we use the sigmoid function to normalize the output of the module. With this configuration, we reduced the complexity of the model and improved processing time.
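A minimal Keras sketch of a discriminator under these guidelines follows. The paper fixes only the activation choices (LeakyReLU with α = 0.3, a sigmoid output) and the absence of max-pooling; the layer widths here are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Layer widths are illustrative; only the activations and the absence of
# max-pooling follow the guidelines cited in the text.
discriminator = tf.keras.Sequential([
    layers.Flatten(),                        # a candidate 1024 x 3 point cloud
    layers.Dense(512),
    layers.LeakyReLU(0.3),                   # LeakyReLU with slope 0.3
    layers.Dense(256),
    layers.LeakyReLU(0.3),
    layers.Dense(1, activation="sigmoid"),   # probability the cloud is real
])

# Score a batch of two (here all-zero) candidate point clouds.
scores = discriminator(tf.zeros((2, 1024, 3)))
```

Strided or dense layers replace pooling here, which preserves learnable downsampling while keeping the model small, consistent with the processing-time argument above.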

Implementation Details
The development of the proposed model, the fine-tuning, and the transfer learning were performed in Keras and TensorFlow [49]. The Encoder was trained with images from a previously generated dataset that includes 2000 aerial images, distributed as shown in Table 1. It is important to note that the aerial images in this database were captured on a circular path (Circular Mission) around the target area. With multiple viewpoints of the area flown over by the drone, a 360° perspective of the area of interest is expected. In this way, it is possible to perform the most detailed reconstruction of the area of interest, with multiple objects and complex backgrounds. The model is trained until the validation accuracy stops increasing. To perform fine-tuning and transfer learning, 1500 images were used for training and 500 for validation. Furthermore, we trained for 50 epochs with a batch size of 20. The training was carried out on a machine with two NVIDIA RTX 2080Ti graphics cards, the Ubuntu 19.04 operating system, and 32 GB of RAM.
First, we trained the Encoder for approximately 24 h and obtained a training and validation loss of 0.623 and 0.219 (see Figure 3a), respectively, and training and validation accuracies of 80.25% and 93.75%, respectively (see Figure 3b).
In the second step, the GAN-Decoder is trained using the Adam optimizer [50], alternately updating the Generator and Discriminator networks. Furthermore, as the Generator uses convolutional blocks with skip connections, similar to those of the ResNet model and identical to those used in the Encoder stage, we decided to reuse those blocks together with the weights obtained after training the Encoder.
Additionally, following the work in [29], the Discriminator Network is trained using the key points obtained from the aerial images by the Encoder stage and the maximization function shown in Equation (1).
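Equation (1) itself is not reproduced in this version of the text; based on the description that follows and the reference to Goodfellow et al. [48], it corresponds to the standard adversarial minimax objective:

```latex
\min_{\theta_G}\max_{\theta_D}\;
\mathbb{E}_{P\sim p_{\mathrm{data}}}\!\left[\log D_{\theta_D}(P)\right]
+\mathbb{E}_{\hat{P}\sim p_G}\!\left[\log\!\left(1-D_{\theta_D}(\hat{P})\right)\right]
\tag{1}
```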
where p_G is the generator distribution over the input data P, and G_θG is the generator with its specific weights and biases denoted by θ_G. D represents the discriminator, with D_θD(P) representing the probability that a point came from the data rather than from p_G. The discriminator D is trained to maximize the probability of assigning the correct label to the samples taken from G and, in consequence, to minimize the failure probability on samples generated by G. Moreover, to constrain the range of the discriminator output, we propose to use a sigmoid activation at the end; we found it useful to stabilize training in our experiments between the residual input points and the generated point cloud [51][52][53][54].
With the above configuration, we were able to reduce the complexity of the model and improve processing time. After training the Discriminator in the GAN-Decoder, we obtained a final loss of 0.647 in training and 0.338 in validation (see Figure 3c), and an accuracy of 78.04% in training and 90.63% in validation (see Figure 3d).
Finally, in the third step, we train the complete architecture and obtain a training and validation loss of 0.673 and 0.237, respectively (see Figure 3e), and a training and validation accuracy of 76.68% and 96.88%, respectively (see Figure 3f).

Experimental Results
The proposal was evaluated using quantitative and qualitative measures, which show how effective the model is for the 3D reconstruction of urban landscapes. As Pix4DMapper is used in professional applications, its results were used as ground truth and compared with those obtained using our proposed model.
Taking into account that most commercial software, such as Pix4DMapper, DroneDeploy, and DJI GO 4, has similar requirements to generate valid reconstructions, we decided to compare our results using two experiments with different overlapping and height configurations. The first configuration uses a Circular Mission path with 80% overlapping and the second uses a Grid Mission path with 50% overlapping. For both experiments, the images were taken at 4K resolution and at a height of 150 m.
The results for the first experiment are shown in Figure 4, where the first column shows the results obtained with Pix4DMapper and the second one shows the ones obtained with the proposed methodology. Each row contains the reconstructions of different target areas. The point clouds obtained show similar results, but the Pix4DMapper software generates more accurate point clouds using this configuration.
Analyzing the results, it is possible to observe that the proposed model generates a point cloud that represents both the compound shapes and the textures of objects such as trees, hallways, vehicles, and buildings. The generated point cloud is uniformly distributed and covers a considerable part of the selected area and the objects present in it. The results show that it is possible to recognize the features of each object in the scene. There are, however, clear spaces separating the areas, which could be relatively important depending on the application; even so, the results are visually acceptable and, compared to those obtained by commercial software, valid enough.
Furthermore, a quantitative analysis of the generated point cloud is performed using the Chamfer distance [55,56], Equation (2), and the Earth Mover's Distance (EMD) [57,58], Equation (3), as similarity metrics. This evaluation allows recognizing the similarity of the two reconstructions (the generated one and that of the commercial software) through the distance between points. In Equations (2) and (3), P̂ represents the point cloud generated with our proposal and P represents the one generated using the commercial software. To obtain the Chamfer distance, we find the sum of the squared distances between each point and its closest neighbor. The Chamfer distance is smooth and piecewise continuous, and the nearest-neighbor search is independent for each point; the lower the value, the closer the similarity between the two point clouds. In the case of the EMD, a bijection φ : P̂ → P is employed, so that each point from P̂ corresponds to one unique point in P; in this way, it enforces a point-to-point assignment between the two point clouds. Table 2 shows the distances between the point cloud generated by the proposed methodology and the one generated by the commercial software; the low d_Chamfer distances indicate their close similarity.

Table 2. Comparison between the reconstruction results obtained using Pix4DMapper and the results obtained using our proposal. The metrics are computed on 1024 points. The results are computed using 1400, 700, and 300 aerial images with 80%, 50%, and 30% overlapping percentages, respectively. Distances with no value indicate an overflow of the data.

[Table 2 columns: Selected Area, d_Chamfer, d_EMD, t_proposal, t_Pix4DMapper; rows grouped by overlapping percentage (80%, 50%, 30%).]

The results of the three-dimensional reconstruction obtained with Pix4DMapper and with our proposal are shown in Table 2. To determine the similarity distances d_Chamfer and d_EMD between the results of the commercial software and those of our proposal, samples of 1024 points are used, with 1400, 700, and 300 aerial images at overlapping percentages of 80%, 50%, and 30%, respectively, at a height of 150 m.
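Equations (2) and (3) are not reproduced in this excerpt; the sketch below implements both metrics as they are commonly defined in [55-58], assuming the symmetric (two-directional) form of the Chamfer distance and equal-size clouds for the EMD bijection.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def chamfer_distance(p_hat, p):
    """Sum of squared distances from each point to its nearest neighbour
    in the other cloud, accumulated in both directions."""
    d = cdist(p_hat, p, metric="sqeuclidean")
    return d.min(axis=1).sum() + d.min(axis=0).sum()

def earth_movers_distance(p_hat, p):
    """EMD through an optimal bijection phi: p_hat -> p, so that each
    point of p_hat is matched to exactly one point of p."""
    d = cdist(p_hat, p)                    # Euclidean cost matrix
    rows, cols = linear_sum_assignment(d)  # optimal point-to-point matching
    return d[rows, cols].sum()

# Synthetic stand-ins for the generated and reference clouds (1024 points).
rng = np.random.default_rng(0)
generated = rng.standard_normal((1024, 3))
reference = generated + rng.normal(scale=0.01, size=(1024, 3))
```

With these definitions, `chamfer_distance(generated, reference)` is a small positive number for nearly identical clouds and grows as the reconstructions diverge, matching the "lower is better" reading of Table 2.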
The results show that the similarity is very close at 80% overlapping (the minimum percentage that commercial software requires to guarantee results). For lower overlapping percentages, however, the commercial software cannot perform a three-dimensional reconstruction, while the proposed methodology still can; this causes the similarity metrics to begin to increase (see Figure 5).
Moreover, the processing time of the proposed methodology (t_proposal) is considerably reduced compared to that of the commercial software (t_Pix4DMapper). With this proposal, the point cloud generated from aerial images provides a high-quality reconstruction of the textures, meshes, and volumetric structures of the objects present in the urban landscape, similar enough to those obtained by Pix4DMapper but with less processing time.
In the second configuration, we used the same target areas as in the previous configuration, this time with the Grid Mission path; the results obtained are shown in Figure 5. The first column shows the results obtained with Pix4DMapper and the second column shows those obtained using our proposed method.
In contrast to the previous configuration, the results of this experiment show a clear improvement in the 3D reconstruction when using our proposed methodology. Pix4DMapper and other commercial software require special configurations, such as particular flight paths and camera settings; if these configurations are slightly changed, as shown in our tests, the results become unfavorable. In comparison, the methodology presented in this paper is robust to these factors and does not depend on any special configuration to generate clear and legible point clouds.

It is worth noting that our proposal is able to reconstruct areas that were not in the training dataset (areas outside the university campus); Figure 6 shows an example. The performance of the proposal is shown by obtaining three-dimensional reconstructions using point clouds: the first column shows the results from Pix4DMapper and the second column shows the results from the proposed methodology.

On the other hand, we carried out tests at heights varying from 300 m to 500 m. The results show that it is necessary to improve the architecture to be able to perform three-dimensional reconstructions at heights greater than 300 m. Figure 7 shows the reconstruction of an area at 150 m (Figure 7a) and at 400 m (Figure 7d). From these results, we can see that, using the images taken at 400 m, the proposed method does not generate enough points to perform a 3D reconstruction of the target area. These results show that our methodology performs 3D reconstructions at different altitudes; however, at altitudes greater than or equal to 300 m, it presents some difficulties.

Discussion and Conclusions
In this work, a novel deep neural network architecture was presented for the generation of point clouds from aerial images of urban and natural landscapes. A notable contribution is the Autoencoder configuration: the classical architecture was adapted with a residual network in the Encoder stage and a GAN network in the Decoder stage, which we have called the GAN-Decoder. Using this architecture, it was possible to obtain results similar to those obtained using commercial software and, in some aspects, even superior.
The proposed methodology is robust to variations in the flight configurations used for image acquisition. Fundamentally, it does not depend on, nor need, a special flight over the zone of interest for the acquisition of information. Moreover, it obtains valid results with a lower overlapping percentage in the acquired images and in less processing time. The results of this proposal were compared with those obtained with commercial software; having a Pix4DMapper license allowed us to validate and compare our results.
Additionally, most works presented in the literature focus on the three-dimensional reconstruction of controlled environments or of solid, individual objects, and fuse stereo images with point clouds from LiDAR sensors. In comparison, our proposal focuses on the three-dimensional reconstruction of urban landscapes using only a sequence of aerial images. In addition, we take advantage of the potential of GANs to distinguish true from false data without heavily annotated training data: improving the discriminator network forces the generator network to improve its data generation process. We thoroughly trained the discriminator, using the Adam optimizer [50] and the maximization function (1), until the GAN could no longer distinguish true from false data, which addresses a known problem of GAN models: the degree to which their output data can be trusted.
However, working in extreme conditions such as heights above 300 m and exploring areas with homogeneous textures remains a challenge.

Data Availability Statement:
The data that support the findings of this study are available from the corresponding author, upon reasonable request.