Deep Learning Classification of 2D Orthomosaic Images and 3D Point Clouds for Post-Event Structural Damage Assessment

Efficient and rapid data collection techniques are necessary to obtain transitory information in the aftermath of natural hazards, which is useful not only for post-event management and planning but also for post-event structural damage assessment. Aerial imaging from unpiloted (gender-neutral, but also known as unmanned) aerial systems (UASs), or drones, permits highly detailed site characterization with minimal ground support, particularly in the aftermath of extreme events, to document the current conditions of the region of interest. However, aerial imaging produces a massive amount of data in the form of two-dimensional (2D) orthomosaic images and three-dimensional (3D) point clouds, and both types of datasets require effective and efficient processing workflows to identify the various damage states of structures. This manuscript introduces two deep learning models, based on 2D and 3D convolutional neural networks, to process the orthomosaic images and point clouds for post-windstorm classification. Specifically, 2D convolutional neural networks (2D CNNs) were developed based on transfer learning from two well-known networks, AlexNet and VGGNet, while a 3D fully convolutional network (3D FCN) with skip connections was developed and trained on the available point cloud data. The datasets for this study were created from data collected in the aftermath of Hurricanes Harvey (Texas) and Maria (Puerto Rico). The developed 2D CNN and 3D FCN models were compared quantitatively using several performance measures, and the 3D FCN proved more robust in detecting the various classes. This demonstrates the value and importance of 3D datasets, particularly the depth information, in distinguishing between instances that represent different damage states in structures.


Introduction
One of the emerging approaches for aerial image collection is to utilize the unpiloted (or unmanned) aerial system (UAS), commonly known as a drone [1][2][3]. Following natural hazard events, data collection is often limited by time constraints and by site accessibility restrictions imposed by precarious structures, debris, road closures, curfews, and other factors. However, UAS imagery enables first responders and emergency managers to perform effective logistical planning, loss estimates, and infrastructure assessment for insurance adjusters, engineers, and researchers [4]. A UAS with an onboard camera enables assessors to efficiently collect numerous images over large areas, as well as to reconstruct the three-dimensional (3D) scene via three steps: Scale Invariant Feature Transform (SIFT), Structure-from-Motion (SfM), and Multi-View Stereo (MVS). Here, the SfM reconstruction is generated from two-dimensional (2D) aerial images [5]. The SfM derived point cloud has relative accuracy at the

Studies Using 2D Images for Detection and Classification
The task of object detection or classification in a set of images has been investigated in various studies using CNNs with different architectures. Among the proposed methods, transfer learning has become one of the most popular techniques. Transfer learning refers to the process of fine-tuning the upper layers of a pre-trained model on a new dataset for a newly proposed task [8]. Models developed with a transfer learning strategy have demonstrated not only improved performance in comparison to other models, but also that such models can be developed more efficiently. In an early study within the area of deep learning, Bengio discussed transfer learning algorithms and their effectiveness in classifying new instances based on pre-trained models, demonstrating the process through numerous examples [8]. One of the most referenced studies was performed by Krizhevsky et al. [9], who developed a CNN model trained on a subset of the ImageNet database [10]. Krizhevsky et al. designed the fully connected layers to accommodate classification of the target labels [9]; the network architecture contained eight layers and was trained to maximize the probability of the correct label under the prediction distribution. The authors reported top-5 and top-1 test error rates of 17.0% and 37.5%, respectively, on the datasets used. Another application of transfer learning was studied by Oquab et al. [11], who developed CNN models for visual recognition based on transfer learning from the pre-trained ImageNet model [10]. In that study, the training images mainly comprised centered objects with clear backgrounds, and the authors reported that the model classified images with a high level of accuracy after an extended training process. Different CNNs perform differently as the network architecture varies. Accordingly, Shin et al. compared several popular image classification networks for transfer learning, such as CifarNet, AlexNet, and GoogLeNet, on a set of medical images [6]. They concluded that transfer learning was consistently beneficial for their classification experiments.
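For readers unfamiliar with the mechanics, the following is a minimal sketch of this fine-tuning pattern, written with the Keras API; none of the cited studies used this exact code, and the class count and head layout are illustrative assumptions.

```python
# Minimal transfer-learning sketch: freeze a pre-trained feature extractor
# and fine-tune a new classifier head on a new task (illustrative only).
import tensorflow as tf

NUM_CLASSES = 7  # hypothetical label count for the new task

# Load a network pre-trained on ImageNet and drop its original classifier head.
base = tf.keras.applications.VGG16(weights="imagenet", include_top=False,
                                   input_shape=(224, 224, 3))
base.trainable = False  # freeze the previously learned feature extractors

# Attach a new head that is fine-tuned on the new dataset.
model = tf.keras.Sequential([
    base,
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(4096, activation="relu"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="sgd", loss="categorical_crossentropy",
              metrics=["accuracy"])
```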
Various studies have investigated the application of CNN models for post-event assessments using aerial images. For example, Hoskere et al. proposed post-earthquake inspections based on UAS imagery and CNN models [12], developing a fully convolutional network to semantically segment images into three classes of pixels. The developed model was able to segment the images with an average accuracy of 91.1%. More recently, Xu et al. studied the post-earthquake scene classification task using three deep learning methods: a Single Shot MultiBox Detector (SSD), a post-earthquake multiple scene recognition (PEMSR) model based on transfer learning from SSD, and a Histogram of Oriented Gradients with Support Vector Machine (HOG+SVM) approach [13]. Within the proposed method, the aerial images were classified into six classes: landslide, houses, ruins, trees, clogged, and ponding. The dataset was created from web-searched images of the 2014 Mw 6.5 Ludian earthquake (China), which were preprocessed, downscaled to 300 × 300 pixels, and manually assigned to the six classes. The authors reported that the PEMSR model demonstrated higher efficiency, with a processing time of 0.4565 s compared to 8.3472 s for HOG+SVM, as well as higher accuracy. In their work, the transfer learning strategy also improved the overall accuracy and performance, although the average processing time was slightly higher than that of the SSD method. Moreover, in addition to the effect of transfer learning on the accuracy and performance of 2D CNN models, Simonyan and Zisserman pointed out that CNN performance can be improved by increasing the network depth [14]. Building on this, Gao and Mosalam developed a deep 2D CNN based on transfer learning from the VGGNet model for Structural Health Monitoring (SHM) and rapid post-event damage detection [15]. Their 2D image-based SHM approach used red, green, and blue (RGB) information and obtained 90% accuracy for binary classification.

Studies Using 3D Point Clouds for Detection and Classification
With the rapid development of technologies to collect remotely sensed 3D point clouds and the growing application of these data in various fields of civil engineering, many researchers have proposed methods to analyze 3D point clouds, in particular for routine inspections or post-event data collection and analyses [16,17]. The datasets considered here are non-temporal, i.e., single post-event datasets that do not rely on change detection against a baseline (pre-event) dataset. For example, Axia et al. proposed a workflow to classify an aerial 3D point cloud into damaged and undamaged classes [18]. Within the proposed workflow, they estimated a normal vector for each point in the point cloud as the key damage-sensitive feature and identified the variation of these normal vectors with respect to a global reference vector. The study then used a region-growing approach based on the variation of the normal vectors to classify the point cloud. Axia et al. reported that while the proposed method can classify collapsed structures, it may misclassify partially damaged structures.
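A hedged sketch of this normal-vector idea is given below, using the Open3D library; the search radius, neighbor count, function name, and the vertical reference vector are assumptions, as the original study's implementation details are not reproduced here.

```python
# Sketch: per-point normal estimation and deviation from a global reference
# vector, a damage-sensitive feature of the kind described above.
import numpy as np
import open3d as o3d

def normal_deviation_deg(pcd_path, radius=0.5, max_nn=30):
    """Angle (degrees) between each point's estimated normal and vertical."""
    pcd = o3d.io.read_point_cloud(pcd_path)
    pcd.estimate_normals(
        search_param=o3d.geometry.KDTreeSearchParamHybrid(radius=radius,
                                                          max_nn=max_nn))
    normals = np.asarray(pcd.normals)
    up = np.array([0.0, 0.0, 1.0])  # assumed global reference vector
    cos = np.clip(np.abs(normals @ up), 0.0, 1.0)  # abs: normals are unoriented
    return np.degrees(np.arccos(cos))  # large deviations may flag damaged roofs
```

Deviation values could then seed a region-growing pass that groups neighboring points with similar normal variation, along the lines of the workflow described above.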
In general, one of the main steps in point cloud analysis workflows is to classify the points into a set of predefined classes. Accordingly, multiple workflows have been introduced to classify point clouds through machine learning and, more recently, deep learning techniques. Hackel et al. introduced one of the most successful workflows for classifying dense point clouds of urban areas into multiple classes, including building façades, ground, cars, motorcycles, traffic signals, and pedestrians [19]. The authors extracted a series of features for each point over various neighborhood sizes using eigendecomposition, point heights, and first and second statistical moments, and used a random forest learning algorithm to classify each point. The proposed method achieved an overall accuracy of 95%. More recently, Xing et al. used the Hackel et al. workflow as a basis and developed a more robust workflow by adding a series of features computed from the difference of normal vectors for better identification [19,20]. Their study demonstrated a 2% improvement on average.
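The eigendecomposition features in this family of workflows can be illustrated as follows. This sketch computes the standard linearity, planarity, and sphericity measures for a single neighborhood size, whereas the published workflow uses several neighborhood sizes and additional features; the function name and the value of k are assumptions.

```python
# Sketch: per-point covariance eigenvalue features over k-nearest neighborhoods.
import numpy as np
from scipy.spatial import cKDTree

def eigen_features(points, k=20):
    """points: (N, 3) array; returns (N, 3) linearity/planarity/sphericity."""
    tree = cKDTree(points)
    _, idx = tree.query(points, k=k)
    feats = np.zeros((len(points), 3))
    for i, nbrs in enumerate(idx):
        cov = np.cov(points[nbrs].T)                 # 3 x 3 local covariance
        w = np.sort(np.linalg.eigvalsh(cov))[::-1]   # eigenvalues l1 >= l2 >= l3
        l1 = max(w[0], 1e-12)                        # guard degenerate clusters
        feats[i] = [(w[0] - w[1]) / l1,              # linearity
                    (w[1] - w[2]) / l1,              # planarity
                    w[2] / l1]                       # sphericity
    return feats  # e.g., input to sklearn.ensemble.RandomForestClassifier
```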
Recently, deep learning applications have become more widespread for analyzing 3D datasets in science and engineering, and various deep learning-based workflows have been developed to classify 3D point cloud datasets. The main advantage of deep learning algorithms over more traditional learning algorithms (e.g., artificial neural networks) is their capability to learn feature extractors directly from the input data. Therefore, deep learning algorithms, in particular CNN architectures, eliminate the need to engineer feature extractors based on the geometry of the objects within the dataset and the background. One of the early studies investigating deep learning for 3D point cloud classification was performed by Prokhorov [21], who proposed a 3D network architecture similar to a CNN to classify point clouds of various objects by converting the point cloud data into 3D grid representations. The developed 3D CNN had one convolutional layer, one pooling layer, and two fully connected layers, followed by a two-class output layer. The weights within the convolutional layers were pre-trained using lobe component analysis and updated using the stochastic meta-descent method [22]. Following this study, Maturana and Scherer proposed a 3D CNN for object recognition similar to that of Prokhorov [22,23]. The proposed network had two tandem convolutional layers, one max-pooling layer, and one fully connected layer, followed by the output layer. In contrast to the study by Prokhorov [22], Maturana and Scherer did not pre-train the developed network, yet it performed on par with or better than Prokhorov's network. This highlights that the developed 3D CNN was able to extract features effectively during the training process.
Recently, Hackel et al. introduced a point cloud classification network based on a 3D CNN architecture. The proposed network accepts five occupancy grid models with different resolutions for each instance as input and processes them in parallel through five convolutional branches organized similarly to VGGNet, followed by a series of fully connected layers and one output layer [14,24]. The authors reported a maximum overall accuracy of 88% and an intersection over union value of 62% for datasets collected from urban environments. This work classifies the scene into natural terrain, high vegetation, low vegetation, buildings, hardscape, vehicles, and human-made terrain. More recently, Zhang et al. proposed a model for semantically segmenting point clouds that consists of three distinct networks [25]. The first network encodes the point cloud into 2D instances. The second network consists of a series of fully connected and max-pooling layers, followed by convolutional layers. The third network converts the 2D encoded data back into 3D grid models, semantically classifying the voxels in the grid and creating a bounding box for each detected object. The authors reported that the experimental results demonstrate an overall accuracy improvement of 10% in comparison to the network developed by Maturana and Scherer [25].

Knowledge Gap
Previous studies have explored the application of CNNs to post-natural hazard event assessment using aerial images. Both deep learning-based methods and unsupervised learning have been implemented for 2D and 3D datasets, yet the difference between 2D and 3D datasets in deep learning has not been fully understood through quantitative comparisons. As reviewed, the majority of methods developed to analyze 3D point clouds for post-event applications are based on traditional techniques. In contrast, deep learning models developed through transfer learning for 2D aerial images have been investigated in various studies. However, due to the lack of depth information, limitations in damage and structural component recognition remain. As a result, this study investigates the application of deep learning-based models using 2D images and 3D SfM derived point clouds corresponding to the same post-event scenes.

Introduction to Hurricanes Harvey and Maria
Within this study, three orthomosaic image and point cloud datasets were collected in the aftermath of Hurricanes Harvey and Maria. Hurricane Harvey made landfall on 25 August 2017 on the coastline of Texas. Hurricane Harvey was a Category 4 hurricane that produced wind gusts over 215 km/h and storm surges as high as 3.6 m. This event resulted in partial damage to more than 15,000 structures and the destruction of 25,000 residential and industrial structures, as well as other critical infrastructure in coastal communities, including the towns of Rockport and Port Aransas [26]. Hurricane Maria made landfall on 20 September 2017 in Puerto Rico. Hurricane Maria was classified as a Category 5 hurricane and produced wind gusts over 280 km/h and storm surges as high as 2.3 m, making it the most severe natural hazard event recorded in history to affect Puerto Rico and the other islands in the region [27]. As a result of this extreme event, the power grid of Puerto Rico was significantly damaged, a major dam for the Guajataca reservoir sustained critical structural damage, and more than 60,000 buildings were damaged [28].

Data Collection Method
To carry out the data collection for the selected areas, a medium-size drone with an onboard camera was deployed: a DJI Phantom 4 UAS collected high-resolution aerial images. The flight paths were flown fully autonomously using the Pix4Dcapture application on a handheld tablet. The data collection in Puerto Rico produced 4077 images in 7 flights, covering approximately a 1.75 km² area with a 53.5 m elevation change. The Texas Salt Lake dataset contained 1379 images from 2 flights, covering a 0.75 km² area with an elevation range of 9.3 m. The Texas Port Aransas site had 1424 images collected from 4 flights, covering a 0.88 km² area with an elevation range of 1.9 m. The collected images were further processed using the SfM workflow, which uses a series of two-dimensional images with sufficient overlap to generate the 3D point cloud and the orthomosaic datasets of the surveyed area [1]. The SfM derived point clouds for the three sites are shown in Figures 1-3, and other key characteristics of these datasets are presented in Table 1.


Dataset Classes
Within this study, each dataset was segmented manually into one of the following seven classes: undamaged structures, partially damaged structures, completely damaged structures, debris, roadways, terrain, and vehicles. An earlier study by Mohammadi et al. informs the classification used here [7]; however, the scope of damaged structures is expanded by splitting those instances into two damaged structure classes according to the level of damage sustained during the event. A partially damaged structure is any building that does not exhibit visible physical changes but whose roof is covered by tarps, which are typically blue or red. Completely damaged structures are buildings that underwent physical changes due to the event, such as roof damage without tarp coverings, with visible structural components such as beams, columns, or walls. If a structure has collapsed such that no structural component can be identified, it is classified as debris. The debris class consists of everything that is not in its native state; debris generally comprises rooftop shingles, fallen trees, downed utility or light poles, and other wind-blown objects. Terrain incorporates any region comprising grass, low-height vegetation, water, sand, trees, exposed soil, fences, or utility poles. Note that any non-building structural objects represented by a cylindrical shape (e.g., utility and light poles) are considered terrain [29]. Lastly, the vehicle class corresponds to objects used for the transportation of people or goods, including cars, SUVs, trucks, carts, recreational vehicles, trailers, construction vehicles (e.g., excavators), and any water-borne vessels that can be propelled by oar, sail, or engine.
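For reference, the seven-class scheme can be written as a simple label map; the integer codes below are hypothetical and only illustrate the encoding assumed in the sketches that follow.

```python
# The seven manually assigned classes as a label map (codes are hypothetical).
CLASS_LABELS = {
    0: "undamaged structure",
    1: "partially damaged structure",   # e.g., tarp-covered roofs
    2: "completely damaged structure",  # exposed beams, columns, or walls
    3: "debris",                        # shingles, fallen trees, downed poles
    4: "roadway",
    5: "terrain",                       # vegetation, water, sand, soil, fences
    6: "vehicle",                       # cars, trucks, trailers, vessels
}
```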

Dataset Preparation for 2D Images
The process of creating image instances started with creating an orthomosaic image of the entire scene using Pix4Dmapper. Afterward, the orthomosaic image was segmented into a series of 256 × 256 pixel images. This produced approximately 18,000 images from the Puerto Rico dataset, 60,000 images from the Salt Lake dataset, and 120,000 images from the Port Aransas dataset. The next step in preparing the image instances was to assign a label to each 256 × 256 image based on the seven classes described in Section 3. Within this study, the image class is determined by the most prominent object visible in the image. The Salt Lake and Puerto Rico datasets were used for model development, and the Port Aransas dataset was used to test and validate the developed models.
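A minimal sketch of this tiling step is shown below, assuming the orthomosaic can be loaded as a NumPy array; the function name and output naming are illustrative, and edge remainders smaller than a full tile are simply dropped here.

```python
# Sketch: cut an orthomosaic into non-overlapping 256 x 256 image instances.
import numpy as np
from PIL import Image

Image.MAX_IMAGE_PIXELS = None  # orthomosaics routinely exceed PIL's default cap
TILE = 256

def tile_orthomosaic(path, out_prefix="tile"):
    mosaic = np.asarray(Image.open(path).convert("RGB"))
    h, w, _ = mosaic.shape
    count = 0
    for r in range(0, h - TILE + 1, TILE):        # partial edge tiles dropped
        for c in range(0, w - TILE + 1, TILE):
            Image.fromarray(mosaic[r:r + TILE, c:c + TILE]).save(
                f"{out_prefix}_{r}_{c}.png")
            count += 1
    return count  # number of 256 x 256 instances written
```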

2D Convolutional Neural Network Architecture
Pre-trained CNNs offer advantages due to their relative stability during training, their efficiency, and their strong performance across diverse tasks. Among the various networks available for transfer learning, AlexNet and VGGNet were selected as the basis for the 2D CNN models, which were developed in MATLAB 2020a. These two networks were pre-trained on millions of images covering 1000 classes, and they represent distinctly different architectures. AlexNet, developed in 2012, was the first CNN model to perform well on the ImageNet database and still performs consistently well on diverse datasets [9,30]. The network contains five convolutional layers, interleaved with max-pooling layers, followed by fully connected layers, as illustrated in Figure 6. The model developed from AlexNet had an architecture identical to the original network; however, within the fully connected layers, the dropout regularization method was applied to combat overfitting during training [31]. The input images were also augmented through rotation and reflection to reduce the generalization error of the models. The second CNN model was developed based on transfer learning from VGGNet, introduced in 2014. The VGGNet model stacks convolutional and max-pooling layers (16 weight layers in total), followed by the fully connected layers, as shown in Figure 7. The small filter sizes in VGGNet (i.e., 3 × 3 kernels) capture and learn the fine details of the input instances, while stacking them yields effective receptive fields comparable to larger filters (e.g., 5 × 5), permitting the network to extract features corresponding to larger regions. Developing the networks through transfer learning made it possible to adapt the previously learned feature extractors to the new task using a smaller number of training images and epochs [30]. During training of the transfer learning models, the 256 × 256 image instances were rescaled to 227 × 227 and 224 × 224 for AlexNet and VGGNet, respectively. The batch size, which represents the number of images input into the network at once, was set to 64 [15]. While the number of epochs was originally set as high as 2000, training was terminated when the computed losses reached a plateau to combat overfitting. The learning rate was set to 0.01 for both networks, and the remaining hyperparameters were kept identical to the original networks [32]. Note that the training images for AlexNet and VGGNet were identical in order to compare the results; because of the augmentation process, over 10,000 images across the seven classes were used in training. Training performance was evaluated by computing the losses and validation accuracy. Training AlexNet took approximately 300 iterations, while VGGNet required approximately 500 iterations. The two networks achieved optimized seven-class accuracies of 88.7% (AlexNet) and 91.0% (VGGNet). Figure 8 shows the confusion matrices for the developed networks, which demonstrate that both networks were able to detect the majority class, terrain, with a high level of accuracy.
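The reported training configuration (input rescaling, rotation/reflection augmentation, batch size 64, learning rate 0.01, and early termination on a loss plateau) can be sketched as follows with Keras for illustration; the study itself was implemented in MATLAB 2020a, so the layer names and patience value here are assumptions.

```python
# Sketch of the reported 2D CNN training configuration (Keras, illustrative).
import tensorflow as tf

INPUT_SIZE = 224  # 224 x 224 for VGGNet; 227 x 227 for the AlexNet-based model

augment = tf.keras.Sequential([
    tf.keras.layers.Resizing(INPUT_SIZE, INPUT_SIZE),       # rescale 256 -> 224
    tf.keras.layers.RandomFlip("horizontal_and_vertical"),  # reflection
    tf.keras.layers.RandomRotation(0.25),                   # rotation
])

optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)
# Training stops when the loss plateaus rather than running all 2000 epochs.
early_stop = tf.keras.callbacks.EarlyStopping(monitor="loss", patience=20)
# model.fit(train_ds.map(lambda x, y: (augment(x), y)),
#           epochs=2000, callbacks=[early_stop])  # batch size 64 in train_ds
```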
The evaluation of the results indicated that the models struggled to learn the differences between some of the original classes, particularly those related to structural damage assessment. Consequently, selected classes were merged to reduce the total number of classes from the original seven to five and then four. This was done to test whether the models could distinguish a structural class in general, and an improvement was noted. However, neither network was able to learn the remaining damage-related classes, including partially damaged structures, completely damaged structures, and debris, due to the significant similarities between partially damaged and completely damaged structures within the segmented orthoimages.

Dataset Preparation for 3D Point Clouds
Raw, unstructured point clouds are typically incompatible with CNN architectures because, unlike images, point clouds generally lack a grid structure. Consequently, the raw point cloud instances were converted into volumetric or occupancy grid models, which are 3D arrays. Occupancy grid models provide a data structure for point clouds that is suitable for robust CNN learning models. To convert the point cloud instances to occupancy grid models, the method proposed by Mohammadi et al. was used [7]. Initially, the point cloud instances were created by slicing the labeled point cloud dataset into roughly 10 m × 10 m segments. Then, the coordinates within each segment, which consisted of objects with various labels, were shifted to have only positive values and normalized [7]. Afterward, each segment was downsampled based on the selected occupancy grid dimensions. Within this study, an occupancy grid of 64³ cells was used, as it results in a sampling of 10 to 16 cm for 10 m × 10 m segments, which is a sufficient resolution to perform per-building damage assessment in the aftermath of windstorm events [17]. Lastly, an extra label corresponding to the empty cells within the 3D arrays was assigned to each instance and denoted as neutral. This allowed the network to learn not only the instance labels but also the geometry of the output based on the input instances, since occlusions or gaps in point clouds are common.
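A hedged sketch of this occupancy-grid conversion is given below; the helper name, the rule of keeping the last point per cell, and the integer codes (which follow the hypothetical map above, with 7 as the neutral label) are assumptions.

```python
# Sketch: shift a ~10 m x 10 m labeled segment to positive coordinates,
# normalize, and bin its points into a 64^3 label grid.
import numpy as np

GRID = 64
NEUTRAL = 7  # extra label assigned to empty cells

def to_occupancy_grid(points, labels):
    """points: (N, 3) coordinates; labels: (N,) integer class codes per point."""
    shifted = points - points.min(axis=0)        # positive values only
    norm = shifted / max(shifted.max(), 1e-9)    # normalize to [0, 1]
    idx = np.minimum((norm * GRID).astype(int), GRID - 1)
    grid = np.full((GRID, GRID, GRID), NEUTRAL, dtype=np.int64)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = labels  # last point in a cell wins
    return grid
```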


3D Fully Convolutional Network Architecture with Skip Connections
The model developed to learn the 3D point cloud instances was guided by the previous work of Long et al. and by the approach discussed in Mohammadi et al. [7,33]. However, the authors reported that the developed 3D FCN required a large number of training iterations to achieve an acceptable level of accuracy. As a result, the 3D FCN architecture was modified within this study with skip connections, such that the network can recover the most useful features at a faster rate during training [25,34]. The 3D FCN was implemented in TensorFlow v1.15, and the developed model has an overall architecture similar to that presented in Mohammadi et al. [7]. In summary, the network comprises an input layer that accepts three 3D arrays corresponding to the red, green, and blue channels, followed by an encoding part and a decoding part. The encoder comprises six 3D convolutional layers, and the decoder consists of six 3D transpose convolutional layers; the network does not use any max-pooling layers. Lastly, the output layer is a single occupancy grid model, in which each cell represents the label of the input point cloud instance (Figure 9). The skip connections add the output of the convolutional layers within the encoder to the corresponding inputs of the transpose convolutional layers in the decoder. Conceptually, the skip connections help the network recover fine details in the prediction and reduce gradient vanishing issues. Figure 9 illustrates the skip connections with arrows.
Figure 9. The developed 3D fully convolutional network with skip connections pipeline.
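A structural sketch of this architecture is given below using the Keras API (the study itself used TensorFlow v1.15); the filter counts, strides, and kernel sizes are assumptions, and only the layer arrangement, the absence of pooling, and the additive skips follow the description above.

```python
# Sketch: 3D FCN with six Conv3D encoder layers, six Conv3DTranspose decoder
# layers, no pooling, and additive skip connections (illustrative sizes).
import tensorflow as tf
from tensorflow.keras import layers

NUM_LABELS = 8  # seven scene classes plus the neutral (empty-cell) label

def build_3dfcn(grid=64):
    inp = tf.keras.Input(shape=(grid, grid, grid, 3))   # RGB occupancy grids
    x, skips = inp, []
    for filters in [16, 16, 32, 32, 64, 64]:            # six-layer encoder
        x = layers.Conv3D(filters, 3, strides=2, padding="same",
                          activation="relu")(x)
        skips.append(x)
    skips.pop()  # the deepest feature map is the decoder's own input
    for filters, skip in zip([64, 32, 32, 16, 16], reversed(skips)):
        x = layers.Conv3DTranspose(filters, 3, strides=2, padding="same",
                                   activation="relu")(x)
        x = layers.Add()([x, skip])                     # additive skip connection
    x = layers.Conv3DTranspose(16, 3, strides=2, padding="same",
                               activation="relu")(x)    # sixth decoder layer
    out = layers.Conv3D(NUM_LABELS, 1, activation="softmax")(x)  # per-cell labels
    return tf.keras.Model(inp, out)
```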
The developed 3D FCN with skip connections was optimized using stochastic gradient descent, and the cells containing labels other than the neutral class were weighted by a factor of 2.0 while updating the learnable parameters, in order to boost training and reduce the convergence time. The model was trained on instances from the Salt Lake and Puerto Rico datasets. To further improve generalization, the training instances were augmented by randomly rotating each instance twice, which resulted in a total of 10,958 training instances. In addition, it was observed that network convergence improved as the mini-batch size increased from 64 to 256; therefore, the model was trained with a mini-batch size of 256. To evaluate the training process, three performance measures were calculated in addition to the loss, namely precision, recall, and cell accuracy, as shown in the equations below:

$$\mathrm{precision}_i = \frac{C_{ii}}{C_{ii} + \sum_{j \neq i} C_{ji}}, \qquad \mathrm{recall}_i = \frac{C_{ii}}{C_{ii} + \sum_{j \neq i} C_{ij}}, \qquad \mathrm{cell\;accuracy} = \frac{\sum_{i} C_{ii}}{\sum_{i} \sum_{j} C_{ij}}$$

where $C_{ii}$ represents the diagonal of the confusion matrix, which corresponds to true predictions, $\sum_{j \neq i} C_{ij}$ denotes the false negatives, $\sum_{j \neq i} C_{ji}$ denotes the false positive predictions, $\sum_{i} C_{ii}$ represents the total count of true predictions, and $\sum_{i} \sum_{j} C_{ij}$ represents the total count of all predictions. Table 4 presents these performance measures for the developed model during training for a total of 2500 epochs, and Figure 10 shows the training losses, measured as the mean squared error (MSE). Lastly, Figure 11 shows the confusion matrix for the trained model. The training results demonstrated that while the model learned the geometry of the input instances with a high level of accuracy (cell accuracy of 98.1%), it could not distinguish between partially damaged structures, completely damaged structures, debris, and vehicles.
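These measures follow directly from a confusion matrix, as in the sketch below, where C[i, j] counts cells of true class i predicted as class j (this row/column convention is an assumption).

```python
# Sketch: precision, recall, and cell accuracy from a confusion matrix.
import numpy as np

def performance_measures(C):
    C = np.asarray(C, dtype=float)
    eps = 1e-12                               # guard against empty rows/columns
    tp = np.diag(C)                           # true predictions, C_ii
    recall = tp / (C.sum(axis=1) + eps)       # denominator: C_ii + false negatives
    precision = tp / (C.sum(axis=0) + eps)    # denominator: C_ii + false positives
    cell_accuracy = tp.sum() / (C.sum() + eps)
    return precision, recall, cell_accuracy
```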

2D CNN Experiment
The developed 2D CNN networks demonstrated a significant difference between training and testing performance. The network accuracy during training reached 88.7% and 91.0% for AlexNet and VGGNet, respectively, while lower accuracy was observed in testing. This could be caused by the limitation of 2D CNN classification being based only on RGB information, lacking depth information, and it indicates that the networks were not able to learn features useful for distinguishing between the different classes. To investigate this further, the model developed via transfer learning from VGGNet was retrained using five and four classes, where the structure-related classes were grouped together; since VGGNet demonstrated better performance, it was selected for this more detailed investigation. The combined classes represent more general object categories than the original seven classes. To reduce the classes to five, the completely damaged and partially damaged classes were merged into a class named damaged. Similarly, to reduce the total to four classes, the completely damaged, partially damaged, and undamaged structure classes were combined into a general structure class. Identical parameters and architecture were used to train the new networks on the reduced numbers of classes. Training accuracy improved to 92.0% and 94.6% for the five- and four-class cases, respectively. The original confusion matrix for seven classes is shown in Figure 12; the testing confusion matrices for the merged five and four classes are shown in Figures 13 and 14, respectively. In the end, the VGGNet transfer learning with four classes showed a significant improvement in both training accuracy and testing performance, as expected. However, this model is not ideal for the targeted structural damage classification following natural hazard events, because the structural classes were combined and the VGGNet training (in all models) could not reliably distinguish between undamaged, partially damaged, and completely damaged structures. Instead, the general object classification of structures, roadways, terrain, and vehicles proved to perform well. The improved performance when the classes were combined demonstrates that the depth information within 3D point clouds is critical for automatically distinguishing damaged from undamaged structures.
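The class-merging step can be expressed as a simple label remapping, sketched below. Note one loudly flagged assumption: debris is folded into the merged categories here so that the stated five- and four-class counts work out, since the text above leaves the treatment of debris implicit; the integer codes follow the hypothetical seven-class map given earlier.

```python
# Illustrative label remapping for the five- and four-class experiments.
# ASSUMPTION: debris (code 3) is folded into the merged classes so that the
# stated class counts of five and four are obtained.
SEVEN_TO_FIVE = {0: "undamaged structure", 1: "damaged", 2: "damaged",
                 3: "damaged", 4: "roadway", 5: "terrain", 6: "vehicle"}
SEVEN_TO_FOUR = {0: "structure", 1: "structure", 2: "structure",
                 3: "structure", 4: "roadway", 5: "terrain", 6: "vehicle"}

def remap_labels(labels, mapping):
    """Apply a class-merging map to seven-class integer labels."""
    return [mapping[y] for y in labels]
```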


3D FCN Experiment
Similar to the 2D CNN network, the 3D FCN model was developed and trained based on Salt Lake and Puerto Rico instances and was tested on the Port Aransas instances. To create the testing dataset, a procedure similar to that of creating the training dataset was followed; however, the testing instances were not augmented. Figure 15 shows the confusion matrix for testing on the Port Aransas dataset, and Table 5 provides the performance measures for each class.
The 3D FCN predictions on the test dataset demonstrated performance measures overall similar to those observed during the training process. The overall cell accuracy of the network was 97.8%. The network was able to predict the terrain class with a high level of accuracy, which was unexpected, as the terrain within the testing dataset differs in texture and geometry from that in the training dataset. This suggests that the model learned features that generalize well between datasets with moderate to low similarity. A pattern similar to that seen in training was observed in detecting the classes of partially damaged structures, completely damaged structures, debris, and vehicles. The authors expect that with extended training and more learnable parameters, the network would learn features to distinguish between these classes with a higher level of accuracy.


Comparison of 2D CNN and 3D FCN
The detection accuracies of the 2D CNN models were consistently lower than those obtained with the 3D FCN network: 91.0% for the best 2D model versus 97.8% for the 3D model. Comparing structural damage detection performance, the 3D FCN demonstrated a clear advantage over the 2D models developed with various numbers of classes. The key advantages of 2D CNNs and images are the smaller number of learnable parameters and the reduced data sizes in comparison to the 3D FCN model. While 2D CNN performance improved from 92.0% to 94.6% for general object classification (e.g., structures, terrain, roadways, or vehicles), this basic detection was not adequate for distinguishing between structural damage classes such as completely damaged structures, partially damaged structures, and debris. These results demonstrate a significant classification limitation when relying solely on RGB information (2D images) in comparison to RGB with depth information (3D point clouds). Consequently, the 3D FCN delivers a marked improvement in structural damage detection compared to the 2D CNN.

Conclusions
Aerial image data collection provides an efficient technique to collect perishable data following a natural hazard event. Both 2D orthomosaic images and 3D point clouds can be obtained and processed for analysis and automated classification. This study compared post-event site damage classification using 2D and 3D datasets collected following two separate 2017 hurricanes, employing a 2D CNN and a 3D FCN. The 2D CNN models were developed via transfer learning from two pre-trained networks, AlexNet and VGGNet; they take segmented 2D images as input and output a label for each image segment. The 3D FCN was developed using aerial image-derived point clouds; within the FCN method, point clouds are semantically segmented into the various classes. To keep the parameters consistent, both the 2D CNN and 3D FCN initially used identical classes. To further examine the 2D CNN classification performance, reduced and combined class sets were used for performance evaluation; the combination was intended to remove the distinction between damage classes by merging the damaged and undamaged structure classes together.
With the reduced numbers of classes for 2D CNN training, the accuracy improved at the cost of reducing and eliminating the classes corresponding to structural damage. This accuracy improvement demonstrates that 2D deep learning classifiers are well suited for general object detection (e.g., terrain, structures, vehicles, roadways), but they demonstrated limited capability to distinguish structural characteristics among undamaged, partially damaged, and completely damaged structures, as well as debris. This limitation was overcome by using a 3D point cloud dataset in deep learning, which contains both RGB and depth information. The model developed on 2D data was only able to learn the dominant class (i.e., terrain) effectively, resulting in lower precision and accuracy for the other classes in both the training and testing phases. In contrast, the model developed on 3D point clouds was able to learn the other classes in addition to the dominant class. Classification for damage detection is a known class-imbalance scenario, in which the instances representing damage or debris are often minority classes that follow random and unique geometric and color patterns.
Comparing training durations, the 2D CNN requires significantly less time (from a few hours to a day), while the 3D FCN requires several days. The 2D CNN training accuracy reached 88.7% and 91.0% for seven classes, with the highest accuracy of 94.6% achieved by VGGNet trained on four classes, while the 3D FCN training accuracy was as high as 97.8%. When it comes to testing accuracy, however, the 2D CNN performed significantly worse than the 3D FCN. The accuracy decrease for the 2D dataset is expected due to the lack of depth information: classification of 2D images is based on RGB alone, which can be influenced by object surface reflections, sunlight, shadows, and similar factors. Although 3D dataset preparation and network development are more time consuming, they yield higher accuracy and reliability, especially when classifying the location and severity of damage following natural hazard events.