UAV-Based Bridge Inspection via Transfer Learning

As bridge inspection becomes more advanced and more ubiquitous, artificial intelligence (AI) techniques such as machine and deep learning could offer suitable solutions to the nation's backlog of overdue bridge inspections. AI coupled with the various data that can be captured by unmanned aerial vehicles (UAVs) enables fully automated bridge inspections. The key to the success of automated bridge inspection is a model capable of detecting failures from UAV data such as images and videos. In this context, this paper investigates the performance of state-of-the-art convolutional neural networks (CNNs) used via transfer learning for crack detection in UAV-based bridge inspection. The performance of different CNN models is evaluated via a UAV-based inspection of Skodsberg Bridge, located in eastern Norway. The high-level, task-specific features are extracted in the last layers of the CNN models, and these layers are trained using 19,023 crack and non-crack images. There is always a trade-off between the number of trainable parameters that CNN models need to learn for each specific task and the number of non-trainable parameters that come from transfer learning. Selecting the optimized amount of transfer learning is therefore a challenging task and, as there is not enough research in this area, it is studied in this paper. Moreover, UAV-based bridge inspection images require specific attention to establish a suitable dataset as the input of CNN models that are trained on homogeneous images. In real implementations of CNN models on UAV-based bridge inspection images, however, there are always heterogeneities and noise, from natural and artificial effects such as different luminosities, spatial positions, and colors of the elements in an image. In this study, the effects of such heterogeneities on the performance of CNN models via transfer learning are examined.
The results demonstrate that, with a simplified image cropping technique and minimal image preprocessing effort, CNN models can distinguish crack elements from non-crack elements with 81% accuracy. Moreover, the results show that the heterogeneities inherent in UAV-based bridge inspection data significantly affect the performance of the CNN models, with an average accuracy decrease of 32.6%. It is also found that deeper CNN models do not provide higher accuracy than shallower ones when the number of images available for adaptation to a specific task, in this case crack detection, is not large enough; with the 19,023 images used in this study, the shallower models outperform the deeper ones.


Introduction
The importance of bridges as a key element of civil infrastructure is receiving more attention due to several recent bridge collapses [1,2]. Conventional procedures for the regular inspection of large numbers of bridges are unsafe, costly, time-consuming, and labor-intensive [3,4]. UAVs are among the best alternatives to conventional inspections, predominantly due to higher worker safety, lower cost, and tangible technological improvements. Several studies have shown the successful implementation of UAVs in bridge inspection [2,5–8]. Moreover, recent advances in the field of artificial intelligence (AI), especially in machine and deep learning, have offered suitable solutions in many practical fields, including bridge inspection. AI coupled with the various data acquired by unmanned aerial vehicles (UAVs) could enable bridge inspections to become automated. The key to the success of automated bridge inspection is a model capable of detecting failures from UAV images and videos.
UAV-based bridge inspection images are usually examined by human inspectors to identify any possible failure. This process is time-consuming, costly, and dependent on the knowledge and experience of the inspectors, and is therefore prone to subjectivity and inaccuracy. To address these issues, a great deal of research has been devoted to developing methods and techniques for extracting information about possible failures from visual inputs. These methods mostly fall into two categories: image processing and deep learning-based failure detection. A wide variety of image processing methods are used in crack detection, including morphological approaches, digital image correlation, wavelet transforms, median filtering, threshold methods, random structured forests, photogrammetric and recognition techniques, and edge detectors such as the Canny, Sobel, Gabor, and Prewitt filters [9–16]. Recently, convolutional neural networks (CNNs) have gained a lot of attention and are widely used for the crack detection task [15,17–20]. For instance, Shengyuan Li and Xuefeng Zhao designed a CNN architecture with binary-class output for crack detection by modifying the AlexNet architecture [17]. More recently, CNN models have been used via transfer learning, that is, learning a new task through the transfer of knowledge from a related task that has already been learned. Transfer learning is used in a wide range of classification tasks, such as synthetic aperture radar (SAR) target classification [21], plant classification [22], and molecular cancer classification [23]. Transfer learning has also been used in crack detection; for example, Kasthurirangan Gopalakrishnan et al. [24] used the pre-trained VGG16 model for crack detection and compared its performance with other machine learning classifiers such as random forest (RF) and support vector machine (SVM).
Cao Vu Dung and Le Duc Anh [25] used three different pre-trained CNN models with a VGG16-based encoder for crack segmentation and classification.
Although these studies show the usefulness of transfer learning in a wide range of classification tasks, there are still specific challenges regarding how to use it. For example, there is a huge difference between the number of images these CNN models were originally trained on (more than 14 million images in the ImageNet dataset [26]) and the number of images the last part of these CNN models was trained on for adaptation to the specific task, in this case the 19,023 images for crack detection. This difference is what makes transfer learning important; if there were no such difference, transfer learning would be irrelevant, as the CNN models could be trained directly on a large crack dataset. Thus, there is a trade-off between the number of trainable parameters that CNN models need to learn for each specific task and the number of non-trainable parameters that come from transfer learning. Selecting the right number of trainable parameters is therefore a challenging task, and as there is not enough research in this area, it is studied in this paper. Moreover, in the previously mentioned studies, CNN models for crack detection are mostly built from scratch or on top of well-known architectures such as AlexNet and VGG, and the potential of existing state-of-the-art CNN models in UAV-based bridge inspections via transfer learning is rarely exploited. In this study, state-of-the-art CNN models are used via transfer learning to explore their real potential for UAV-based bridge crack detection. UAV-based bridge inspection images require specific attention to establish a suitable dataset as the input of CNN models that are trained on homogeneous images, but in real implementations of CNN models on UAV-based bridge inspection images there are always heterogeneities, from natural and artificial effects such as different luminosities, spatial positions, and colors of the elements in an image.
Therefore, in this study the effects of such heterogeneities on the performance of the CNN models via transfer learning are examined as well. This research attempts to answer the following main questions:
• How can drone-based bridge inspection benefit from CNN models used via transfer learning?
• How much do heterogeneities and noise between the training and testing datasets affect CNN models used via transfer learning?
• How much transfer learning should be used?
To answer these questions, the weight of transfer learning in eight different CNN models was adjusted and evaluated on different datasets from drone-based bridge inspection, taking the noise in the images into account.
The remainder of this paper is structured as follows: in Section 2, the methodology and the specific steps that help to identify the most suitable CNN model for UAV-based bridge inspection via transfer learning are discussed in detail; the results are presented in Section 3; Section 4 provides the discussion; and finally, Section 5 delivers the conclusions of this paper.

Methodology
The specific steps that help to identify the most suitable CNN model for UAV-based bridge inspection through transfer learning are shown in Figure 1 and explained in the following sections. Moreover, these specific steps establish the methodology to answer the research questions presented in Section 1 of this study.

STEP 1-Establishing Databases Regarding Bridge Failures and the CNN Models' Evaluation Metrics
In this section, three datasets are established: (1) the SDNET2018 dataset, (2) the heterogeneous UAV-based bridge inspection dataset, and (3) the homogeneous UAV-based bridge inspection dataset. These datasets differ in the amount of heterogeneity and noise they contain, as will be explained.

STEP 1.1-Establishing Training, Validation, and Test Datasets
SDNET2018 is an annotated image dataset of more than 56,000 cracked and non-cracked concrete images for training, validating, and benchmarking artificial intelligence-based crack detection algorithms [27]. Since there are many more non-cracked images than cracked ones, to have a nearly balanced dataset in this study, 8484 crack images and 10,540 non-crack images (19,024 in total) were used for training the trainable part of each CNN model, with 20% of them held out, split equally between validation (1902 images) and testing (1902 images). Moreover, various transformations (such as rotation, zooming, cropping, and flipping) were performed on the original images. This process, known as data augmentation, produced the transformed images used for training, validation, and testing of the CNN models. A learned model may be more robust and accurate because it is trained on different variations of the same image, although the generated images are not fundamentally different from the originals. In this study, the number of training images was 15,219, with 5 different transformations of each original image (shear range = 0.2, zoom range = 0.2, width shift range = 0.2, height shift range = 0.2, and horizontal flipping), so the total number of unique images seen over the whole training run across all epochs (not per epoch) increased to roughly 76,000; the same applies to the validation and testing data.
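The five transformation parameters listed above match the argument names of Keras's ImageDataGenerator, which is presumably the tooling used here. As a framework-free illustration, two of the transforms (horizontal flipping and width shifting) can be sketched in plain NumPy; the function names and the toy image below are illustrative only, not the paper's pipeline.

```python
import numpy as np

def horizontal_flip(img):
    """Mirror the image left-to-right (one of the five augmentations)."""
    return img[:, ::-1, :]

def width_shift(img, fraction):
    """Shift the image horizontally by a fraction of its width,
    filling the vacated columns with zeros (black)."""
    h, w, c = img.shape
    shift = int(round(w * fraction))
    out = np.zeros_like(img)
    if shift >= 0:
        out[:, shift:, :] = img[:, :w - shift, :]
    else:
        out[:, :w + shift, :] = img[:, -shift:, :]
    return out

# Toy 4 x 4 RGB "image"
img = np.arange(4 * 4 * 3).reshape(4, 4, 3)
flipped = horizontal_flip(img)
shifted = width_shift(img, 0.25)  # shift right by 1 of 4 columns
```

Each transform leaves the image shape unchanged, so augmented images can be fed to the same model input as the originals.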

STEP 1.2-Establishing UAV-Based Bridge Inspection Datasets
The output of this step is a set of images categorized into two datasets: one containing images with more noise and heterogeneity, the other containing images with less noise. More details about establishing these UAV-based bridge inspection datasets are presented in the following sections. A DJI Matrice 100 with a Zenmuse Z3 aerial zoom camera with 7× zoom capacity was used to carry out the UAV-based bridge inspection; for more information, see [20]. As a result, images with 3000 × 4000 resolution were obtained; these high-resolution images allow for the capture of high-quality cropped images (see the images in Figure 3).
UAV-based bridge inspection images require specific attention to establish a suitable dataset as the input of CNN models that are trained on homogeneous images. However, in real implementations of CNN models on UAV-based bridge inspection images, there is always noise and heterogeneity. Some sources of these heterogeneities are the edges of a single element with uniform material, the conjunction of different elements or different materials of the bridge, the presence of backgrounds and shadows from surrounding objects, and the shadows of other elements of the bridge construction, as shown in Figure 4. A simple cropping strategy was used: the elements with identified cracks were categorized as the crack class, and cropped images from each element were captured and resized to the required input size of each CNN model. This simple cropping strategy was done with minimum effort; the resulting images thus contain more noise, and this UAV-based bridge inspection dataset is therefore called the "heterogeneous dataset". In this way, 308 images belonging to 2 categories, crack and non-crack (154 images per category), were obtained; a sample of this dataset is shown in Figure 5c,d. The cropping was then repeated with more attention and effort to reduce the heterogeneities in the images, removing all backgrounds and separations between different materials. Shadows and edges were impossible to remove through cropping alone; thus, a few images with shadows and edges remain in this second dataset, which is called the "homogeneous dataset". In this way, 400 images belonging to the 2 categories, crack and non-crack (200 images per category), were obtained; a sample of this dataset is shown in Figure 5a,b.
This allows for the evaluation of whether there is a meaningful difference between the performance of the CNN models on the UAV-based bridge inspection datasets and their performance on images similar to those they were trained on.
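The cropping itself was done manually, but the crop-then-resize step described above can be sketched programmatically. The NumPy sketch below crops a square patch from a high-resolution inspection image and downsamples it by nearest-neighbour indexing to a 224 × 224 input, a common input size for several of the model families used here; the function name, coordinates, and patch size are illustrative assumptions.

```python
import numpy as np

def crop_and_resize(img, top, left, size, out_size=224):
    """Crop a square patch from a high-resolution inspection image and
    resize it (nearest-neighbour) to the CNN input resolution."""
    patch = img[top:top + size, left:left + size, :]
    idx = np.arange(out_size) * size // out_size  # nearest source row/col
    return patch[idx][:, idx]

# Toy stand-in for a 3000 x 4000 UAV inspection image
image = np.zeros((3000, 4000, 3), dtype=np.uint8)
patch = crop_and_resize(image, top=500, left=800, size=600)
```

In practice a library resampler with interpolation (e.g. bilinear) would normally be used instead of the nearest-neighbour indexing shown here.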

STEP 1.3-The CNN Models' Performance Metrics
The following metrics were used in this study to compare the performance of the different CNN models. Accuracy is the ratio of correctly predicted observations to the total observations:

Accuracy = (TP + TN) / (TP + TN + FP + FN) (1)

where true positives (TP) are correctly predicted positive values, true negatives (TN) are correctly predicted negative values, false positives (FP) occur when the actual class is negative but the predicted class is positive, and false negatives (FN) occur when the actual class is positive but the predicted class is negative. Precision is the ratio of correctly predicted positive observations to the total predicted positive observations:

Precision = TP / (TP + FP) (2)

Recall (sensitivity) is the ratio of correctly predicted positive observations to all observations in the actual positive class:

Recall = TP / (TP + FN) (3)

The F1 score is the harmonic mean of precision and recall and takes both false positives and false negatives into account:

F1 = 2 × (Precision × Recall) / (Precision + Recall) (4)

Since the dataset was slightly unbalanced, with 8484 crack images and 10,540 non-crack images (there are 0.805 times as many cracked images as non-cracked ones), each performance metric was weighted accordingly:

Weighted Score = 0.805 × Cracked class score + 0.195 × Non-cracked class score (5)

where the cracked class score is the score the crack class obtains in terms of accuracy, precision, recall, or F1 score, and the non-cracked class score is defined analogously. The macro average, in contrast, is unweighted:

Macro Score = 0.5 × Cracked class score + 0.5 × Non-cracked class score (6)

Moreover, the performance of the CNN models on the test datasets can be presented by a confusion matrix that, for the two classes crack and non-crack, consists of two rows and two columns reporting the numbers of false positives, false negatives, true positives, and true negatives.
This allows for a more detailed analysis than the mere proportion of correct classifications, and all the metrics above can be calculated from these confusion matrices.
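These definitions translate directly into code. The sketch below implements Equations (1)–(6), treating the crack class as positive, and evaluates them on the DenseNet201 confusion matrix on the SDNET2018 test set reported later in the paper (797 of 848 crack and 1040 of 1054 non-crack images predicted correctly).

```python
def metrics_from_confusion(tp, fn, fp, tn):
    """Accuracy, precision, recall, and F1 score (Equations (1)-(4)),
    treating the crack class as the positive class."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

def weighted_score(crack_score, noncrack_score, w_crack=0.805):
    """Class-weighted average of a per-class metric, Equation (5)."""
    return w_crack * crack_score + (1 - w_crack) * noncrack_score

def macro_score(crack_score, noncrack_score):
    """Unweighted macro average of a per-class metric, Equation (6)."""
    return 0.5 * (crack_score + noncrack_score)

# DenseNet201 on the SDNET2018 test set:
# 797 of 848 crack and 1040 of 1054 non-crack images predicted correctly
acc, prec, rec, f1 = metrics_from_confusion(tp=797, fn=51, fp=14, tn=1040)
```

With these counts the accuracy comes out at about 0.97, matching the figure reported for DenseNet201.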

STEP 2-Selecting and Training the CNN Models for Crack Detection
Before selecting a pre-trained CNN model and training its trainable part, two issues need to be considered. Firstly, adopting complete transfer learning for the specific task, in this case crack detection, without any trainable parameters leaves the CNN models over-parameterized for the task, to the point that they merely interpolate the data. Secondly, the low number of UAV-based bridge inspection images encourages the use of data augmentation, such as rotation, vertical and horizontal flipping, and zooming, to create a bigger dataset. However, because so many images are produced by augmenting a small number of UAV-based bridge inspection images, the resulting images are very similar to one another; after fitting the models a few hundred times, each time on a different realization of the same image, the CNN models are not able to correctly predict genuinely new inputs. In other words, these issues cause overfitting and lead to extremely high-variance predictions, which can be seen when training accuracy reaches 100% while accuracy on the validation and test datasets is much lower. To deal with these issues, and since the high-level, task-specific features are extracted in the last layers of the networks, the last part of each CNN model was trained on the established training dataset. As a result, the models did not saturate during training, and the pre-trained models were adapted for crack detection in UAV-based bridge inspections.
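The division between a frozen, pre-trained part and a small trainable part can be illustrated without any deep learning framework. In the toy NumPy sketch below, a fixed random projection stands in for the frozen convolutional base (its weights are never updated), and only a logistic classification head is trained on top of it; the data, dimensions, and learning rate are illustrative assumptions, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Frozen" feature extractor: a fixed random projection standing in for
# the pre-trained convolutional base, whose weights are never updated.
W_frozen = rng.normal(size=(64, 16))

def extract_features(x):
    return np.tanh(x @ W_frozen)  # non-trainable part of the model

# Trainable classification head: logistic regression for 2 classes
w_head = np.zeros(16)
b_head = 0.0

# Toy stand-ins for crack / non-crack samples: two Gaussian blobs
x = np.vstack([rng.normal(-1.0, 1.0, size=(100, 64)),
               rng.normal(+1.0, 1.0, size=(100, 64))])
y = np.array([0] * 100 + [1] * 100)

feats = extract_features(x)  # computed once; the frozen base never changes
lr = 0.5
for _ in range(200):  # gradient descent on the head ONLY
    p = 1.0 / (1.0 + np.exp(-(feats @ w_head + b_head)))
    grad = p - y
    w_head -= lr * feats.T @ grad / len(y)
    b_head -= lr * grad.mean()

train_acc = ((feats @ w_head + b_head > 0) == (y == 1)).mean()
```

Because the features are fixed, only the small head is fitted, which is the mechanism that keeps the number of trainable parameters manageable for a small task-specific dataset.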

STEP 2.1-Selecting the CNN Models
Architecture engineering is a significant part of neural network development, starting from the seminal work of AlexNet [28], which is often regarded as the starting point of deep learning (DL). Since then, many improvements have been made in the development of neural networks, such as ZFNet [29], VGG [30], Inception [31], ResNet [32], DenseNet [33], network in network (NiN) [34], wide residual networks [35], networks with stochastic depth [36], Xception [37], fractal networks [38], SqueezeNet [39], and so on. A survey of recent architectures of deep convolutional neural networks can be found in [40]. These architectures were developed manually by human experts, but recently the architecture engineering of neural networks has been crossing into the area of dynamic architecture engineering. Here, neural networks build themselves from basic building blocks (different combinations of layers such as dense layers, convolutional layers, de-convolutional layers, max pooling, batch normalization, and so on) based on the number of input images and their resolution, mostly through reinforcement learning. NASNet [41], MnasNet [42], and EfficientNet [43] have used neural architecture search (NAS), which utilizes a recurrent neural network (RNN) as a controller to search a variable-length architecture space and compose neural network architectures. Using dynamic architecture engineering to build CNN models for crack detection is time-consuming and expensive; for instance, in NASNet the search through several convolutional cell candidates took several days on 500 GPUs, amounting to thousands of GPU-hours [41]. Therefore, already-developed CNN models can instead be used via transfer learning for real problem solving such as crack detection in UAV-based bridge inspections.
The pre-trained CNN models chosen in this study, all trained on more than 14 million images from ImageNet, are shown in Table 1 along with some of their specifics (the architectures of some of these CNN models are presented in Appendix A). As can be seen from Table 2, ResNet152V2 and NASNetLarge have a larger total number of parameters than the other CNN models; in this study, these two are called heavy models and the rest are called non-heavy models. The CNN models were initially trained with the same number of trainable parameters, and then the effects of transfer learning and of the noise were evaluated to find the optimized amount of transfer learning. Moreover, the number of floating point operations (FLOPs) is a measure of how many addition and multiplication operations are needed to run a CNN model. It can be considered a proxy for CNN model speed regardless of the specifics of the system on which it is trained. In general, the FLOPs increase with the input shape, the number of input channels, and the number and size of convolution filters, but have an inverse relation to the stride [32,44]. For instance, the first layer of VGG16, which applies 64 filters of size 3 × 3 with a stride and padding of 1 to an input image of size 224 × 224 with three channels (RGB), requires almost 3.7 billion FLOPs; in total, VGG16 requires 30.7129 billion FLOPs, almost 9 times more than ResNet50 [32,44] (see Table 1 for a FLOPs comparison of the CNN models).
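As a rough illustration of how such figures arise, the FLOPs of a single convolutional layer can be estimated from its configuration. The sketch below counts a multiply and an add separately (2 FLOPs per multiply-accumulate); note that published FLOP counts vary with the counting convention and with what is included (biases, activations, padding handling), so figures from different sources are not directly comparable.

```python
def conv2d_flops(h_out, w_out, c_in, c_out, k):
    """FLOPs for one convolution layer, counting multiplies and adds
    separately (2 FLOPs per multiply-accumulate)."""
    macs = h_out * w_out * c_out * (k * k * c_in)  # one MAC per kernel tap
    return 2 * macs

# First layer of VGG16: 64 filters of 3 x 3 over a 224 x 224 x 3 input,
# stride 1, padding 1 -> 224 x 224 output
first_layer_flops = conv2d_flops(224, 224, 3, 64, 3)
```

Under this particular convention the layer comes to roughly 0.17 GFLOPs; conventions that count additional operations, or that define a FLOP differently, yield larger figures.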

STEP 2.2-Training the CNN Models
The basic notion is to keep the number of trainable parameters the same for each model in order to first evaluate how the CNN models adapt to a specific task, in this case crack detection. Moreover, as the pre-trained models were based on ImageNet and aimed to identify 1000 classes, the classification layers were adapted in this study for the two classes (cracked and non-cracked) and were the same for all CNN models. Thus, the number of parameters differs from the original architectures due to this adaptation of the classification layers. In addition, an Adam optimizer with the same parameters was used for all CNN models. All the CNN models were trained on a MacBook Pro 16-inch with a 2.4 GHz 8-core 9th-generation Intel Core i9 processor, 64 GB of 2666 MHz DDR4 RAM, and an AMD Radeon Pro 5600M graphics card with 8 GB of HBM2 memory. Figure 6 shows the accuracy of the CNN models on the training and validation datasets over 100 epochs. The difference between training and validation accuracy was not significant for any of the CNN models, and none of the models overfits or saturates during training. The accuracy of the CNN models on the validation dataset is a more reliable indicator of how they perform on unseen data. Looking at the validation accuracy, the four CNN models with the highest accuracy on the training dataset also have the highest accuracy on the validation dataset, though in a different order. EfficientNetB4 and DenseNet201 have the highest accuracy, 96%, followed by ResNet50 and VGG16 with 95% accuracy. Interestingly, the worst validation accuracies belong to the two heavy CNN models, ResNet152V2 and NASNetLarge. Regardless of the number of images available for training and the specifics, CPU, and GPU power of the system the CNN models are trained on, the number of parameters (trainable and non-trainable), the FLOPs, and the number of layers are three key factors affecting the training time of CNN models.
As can be seen from Table 2, NASNetLarge is the slowest CNN model, with the highest number of parameters, layers, and FLOPs. The ResNet models (ResNet50 and ResNet152V2), however, appear to be faster due to their skip connections, which let the gradient signal travel directly back to the early layers. ResNet50 has the lowest training time, 71,700 s for 100 epochs of training, and the second-fastest CNN model is Xception with 77,915 s.

STEP 2.2.1-The Performance of the CNN Models on the SDNET2018 Test Dataset
The results confirm that the four CNN models with the highest validation accuracy (EfficientNetB4 and DenseNet201 with 96% accuracy and ResNet50 and VGG16 with 95% accuracy) also have the highest accuracy on the SDNET2018 test dataset (DenseNet201 with 97% accuracy and EfficientNetB4, ResNet50, and VGG16 with 95% accuracy). Figure 7 shows the confusion matrices for each CNN model; it can be seen that DenseNet201 correctly predicts 1040 of the 1054 non-crack images and 797 of the 848 crack images, while it incorrectly predicts 51 crack images as non-crack. More details about the performance of each CNN model on the SDNET2018 test dataset, such as the precision, recall, and F1 score of each model, can be found in Appendix B.

STEP 2.2.2-The Performance of the CNN Models on the UAV-Based Bridge Inspection Datasets
In this section, the performance of these CNN models is examined on the two UAV-based bridge inspection datasets to see how they perform in practice. Table 3 summarizes the recall metric comparison of the CNN models on both UAV-based bridge inspection datasets. More details about the performance of each CNN model on both datasets, such as the precision, recall, and F1 score of each model, can be found in Appendix B. On the heterogeneous dataset, VGG16 and ResNet152V2 performed better than the other models, with an average accuracy of 69%, followed by ResNet50 with 63% average accuracy and NASNetLarge with 61%. As can be seen from Table 4, the CNN models' performance decreased substantially on the heterogeneous dataset compared to the SDNET2018 test dataset, with an average accuracy decrease of 32.6% across all the CNN models. This is due to the heterogeneities present in the UAV-based bridge inspection dataset. One might expect the accuracy of the CNN models to be much higher on the homogeneous UAV-based bridge inspection dataset; however, their performance does not increase substantially on this dataset compared to the heterogeneous one, with a 4.875% increase on average across all the CNN models. The best-performing CNN models on the heterogeneous UAV-based bridge inspection dataset were VGG16, ResNet152V2, and ResNet50 with 69%, 69%, and 63% accuracy, respectively. The best-performing CNN models on the homogeneous dataset were Xception, ResNet50, and ResNet152V2 with 74%, 70%, and 69% accuracy, respectively. Moreover, the best-performing CNN models on the SDNET2018 test dataset were DenseNet201 with 97% accuracy, and VGG16, ResNet50, and EfficientNetB4 with 96% accuracy.
These results show that the performance of the CNN models differed in each specific dataset; for example, DenseNet201, which outperformed the other CNN models on the SDNET2018 test dataset, was not successful on the UAV-based bridge inspection datasets, especially the heterogeneous one. For the datasets in this study, therefore, deeper networks did not provide higher accuracy, and shallower networks like VGG16, ResNet50, and Xception performed better than heavier or deeper networks such as NASNetLarge with its 1039 layers.

STEP 2.3-Selecting the Best Model for Future Inspections
Due to the lack of research on selecting the optimized amount of transfer learning, a well-established methodology for finding it is missing. Therefore, we found the range of the optimized amount of transfer learning by trial and error, as explained in the next section. After finding the optimized amount of transfer learning for the available training dataset and evaluating the performance of the CNN models, the best model can be selected for the crack detection task in future drone-based bridge inspections. Finally, the images from each UAV-based bridge inspection can be added to the training dataset, so that the training dataset becomes richer over time and more failure modes of the bridges can be identified. For example, in each inspection, failures other than cracks can be identified and classified, and the CNN models can thereby be trained to classify more and more failure types.

Results
This study shows that, with the same number of trainable parameters for the available training dataset, shallower networks like VGG16, ResNet50, and Xception perform better on the UAV-based bridge inspection datasets. To evaluate the effects of transfer learning and to find its optimized amount, further experiments were conducted; the results are presented in the following sections.

The Effects of Transfer Learning
Because of their architectures, it is not possible for the CNN models to have exactly the same number of trainable parameters, and it is not obvious whether a difference of 2–3 million trainable parameters significantly affects their accuracy. To explore such effects, the number of trainable parameters of ResNet50 was increased from 10,580,322 to 17,540,706 (a 65.8% increase). Table 5 shows that ResNet50 with more than 65% more trainable parameters had a 0.0007 (0.07%) higher accuracy in training but a 0.0053 (0.55%) lower accuracy in validation. Thus, the effect of 65% more trainable parameters on the training and validation accuracy of ResNet50 was negligible. The same was done for ResNet152V2 as a representative of the heavy CNN models; as Table 5 shows, the effects are negligible for this heavy model as well. Thus, it can be concluded with high confidence that changes in the range of three million trainable parameters have a negligible effect on the training and validation accuracy of both heavy and non-heavy models. Moreover, a lower number of trainable parameters means a lower training time: ResNet50 with 17,540,706 trainable parameters took 209.72 min longer to train. However, changes in the number of trainable parameters need to be made carefully and verified on the unseen UAV-based inspection and test datasets, as discussed in more detail in the following sections. The performance of these models was also evaluated on the UAV-based datasets, and the results are shown in Table 6.
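The relative parameter increases quoted here are straightforward to verify; the numbers below are those reported above for ResNet50.

```python
def pct_increase(old, new):
    """Relative increase in the number of trainable parameters, in percent."""
    return 100.0 * (new - old) / old

resnet50_increase = pct_increase(10_580_322, 17_540_706)  # the reported 65.8%
```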
Decreasing the amount of transfer learning increased the precision of ResNet50 significantly, from 63% to 79%, on the heterogeneous UAV-based bridge inspection dataset. Conversely, the precision of ResNet152V2 decreased significantly, from 70% to 60%, when the number of trainable parameters was increased. Interestingly, this difference is not as pronounced on the homogeneous UAV-based bridge inspection dataset: the precision of ResNet50 increased from 72% to 77%, while the precision of ResNet152V2 decreased by only 2%, from 70% to 68%, compared to a 10% decrease on the heterogeneous dataset. Therefore, it cannot be concluded that increasing the number of trainable parameters (reducing transfer learning) increases the accuracy of CNN models. In the case of ResNet152V2, performance decreases on both UAV-based bridge inspection datasets; it can therefore be concluded that transfer learning is relevant in UAV-based bridge inspection.

The Effects of Transfer Learning and Heterogeneities
The performance of these CNN models can now be evaluated with respect to both decreasing the transfer learning part (by increasing the number of trainable parameters) and decreasing the noise in the UAV images. Decreasing the heterogeneities and noise in the UAV images gave a 7% increase in accuracy for ResNet50, while there was no gain in accuracy for ResNet152V2, as shown in Table 7. On the other hand, decreasing the noise in the UAV images did not increase the accuracy of ResNet50 with 17,540,706 trainable parameters, whereas ResNet152V2 with 16,942,434 trainable parameters gained 7% in accuracy when the heterogeneities were decreased. The results confirm that ResNet152V2, as a representative of the heavy CNN models, and ResNet50, as a representative of the non-heavy CNN models, behave completely differently with respect to noise in the datasets and the number of trainable parameters. As was shown, the performance of ResNet152V2 deteriorated with an increasing number of trainable parameters, with a stronger negative effect on the heterogeneous dataset (a 9% decrease in accuracy) than on the homogeneous dataset (only a 2% decrease). On the other hand, decreasing the transfer learning weight had a positive effect for ResNet50: a 15% increase in accuracy on the heterogeneous UAV-based bridge inspection dataset and a 7% increase on the homogeneous one. One might therefore speculate that transfer learning is irrelevant for non-heavy CNN models. To test whether transfer learning is relevant for non-heavy CNN models, two of these models, ResNet50 and VGG16, were trained on the training dataset without transfer learning; the results are shown in Table 8. The results show that transfer learning is very important for ResNet50 on the UAV-based bridge inspection datasets, as its accuracy drops significantly, by 21% and 15% on the heterogeneous and homogeneous datasets, respectively.
Therefore, it can be concluded that transfer learning improves the performance of CNN models with a number of parameters similar to ResNet50. On the other hand, VGG16, the shallowest CNN model with only 19 layers, does not benefit from transfer learning: without it, its accuracy improves by 12% and 15% on the heterogeneous and homogenous datasets, respectively. It is also clear that, in transfer learning, the more trainable parameters there are, the longer the training takes; for instance, training without transfer learning takes 134,707 s (37.42 h) longer for ResNet50 and 196,252 s (54.51 h) longer for VGG16.
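The trade-off between frozen (transfer-learned) and trainable parameters discussed above can be sketched as a simple bookkeeping exercise. The per-layer parameter counts below are illustrative placeholders, not the actual ResNet50 or VGG16 values:

```python
# Sketch: how moving the "freeze point" shifts parameters between the
# frozen (transfer-learned) part and the trainable part of a CNN.
# Layer parameter counts are hypothetical, not real model values.

def split_parameters(layer_param_counts, n_frozen_layers):
    """Return (frozen, trainable) parameter totals when the first
    n_frozen_layers keep their pre-trained weights."""
    frozen = sum(layer_param_counts[:n_frozen_layers])
    trainable = sum(layer_param_counts[n_frozen_layers:])
    return frozen, trainable

# Hypothetical per-layer parameter counts for a small CNN.
layers = [9_408, 73_728, 294_912, 1_179_648, 4_718_592, 2_049_000]

for k in range(len(layers) + 1):
    frozen, trainable = split_parameters(layers, k)
    print(f"freeze first {k} layers -> frozen={frozen:,}, trainable={trainable:,}")
```

Sweeping the freeze point like this is how the different trainable-parameter configurations reported in Tables 7 and 8 can be generated in practice.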
Due to the probabilistic nature of the CNN models' predictions, each prediction was repeated 10 times to check for significant variance; this confirmed that there was no meaningful variance in the CNN models' predictions. For instance, the variance in the prediction accuracy of ResNet152V2 on one of the UAV-based bridge inspection datasets was 0.000068, which is negligible (for more details, see Appendix C).
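The stability check described above amounts to computing the variance over repeated evaluation runs; a minimal sketch follows, with hypothetical accuracy values rather than the paper's raw data:

```python
# Sketch: run-to-run variance of repeated model evaluations.
# The accuracy values below are hypothetical, not measured results.
from statistics import mean, pvariance

accuracies = [0.781, 0.779, 0.780, 0.782, 0.778,
              0.780, 0.781, 0.779, 0.780, 0.781]  # 10 repeated runs

print(f"mean accuracy: {mean(accuracies):.4f}")
print(f"variance:      {pvariance(accuracies):.6f}")  # tiny value => stable predictions
```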

Selecting the Optimized Amount of Transfer Learning
As shown for ResNet152V2, going from 10,574,178 to 16,942,434 trainable parameters decreased its accuracy by 5.5% on average across both UAV-based bridge inspection datasets. Thus, the optimized number of trainable parameters for ResNet152V2 would be equal to or less than 10,574,178. For ResNet50, increasing the number of trainable parameters from 10,580,322 to 17,540,706 increased the accuracy by 11% on both UAV-based bridge inspection datasets, while a further increase to 24,124,770 trainable parameters decreased its accuracy by 18% on both datasets. Thus, the optimized number of trainable parameters for ResNet50 lies between 17,540,706 and 24,124,770. For VGG16, on the other hand, the optimized number of trainable parameters is 14,862,498, that is, without transfer learning.
The maximum performance of VGG16 is 81% and 83% on the two UAV datasets with no transfer learning, while ResNet50 reached its highest performance between 17,540,706 and 24,124,770 trainable parameters. Thus, ResNet50, and other CNN models whose trainable parameters are not at the maximum possible number, can reach at least the same performance as VGG16 without transfer learning. It can therefore be concluded that, for small training datasets (19,023 images in this study), the shallowest CNN model, VGG16, without transfer learning performs as well as ResNet50 with transfer learning, with 81% and 83% accuracy on the heterogeneous and homogenous UAV-based bridge inspection datasets, respectively.
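Selecting the optimized amount of transfer learning reduces to picking the configuration with the best validation accuracy. A minimal sketch follows; the accuracy values are hypothetical, only patterned on the trends reported above, not exact measurements:

```python
# Sketch: choosing the trainable-parameter configuration with the
# highest validation accuracy. Accuracies are hypothetical values
# patterned on the trends reported in the text.

def best_configuration(results):
    """Return the (trainable_params, accuracy) pair with the highest accuracy."""
    return max(results.items(), key=lambda item: item[1])

resnet50_results = {
    10_580_322: 0.66,   # mostly frozen: accuracy suffers
    17_540_706: 0.77,   # more trainable parameters help...
    24_124_770: 0.59,   # ...until too few layers stay frozen
}

params, acc = best_configuration(resnet50_results)
print(f"best: {params:,} trainable parameters at {acc:.0%} accuracy")
```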

Visualizing the Performance of the CNN Models
Examples of true positive (TP), true negative (TN), false positive (FP), and false negative (FN) images for the ResNet50 10,580,322 and ResNet152V2 10,574,178 models are shown in Figure 8 (for the remaining CNN models in this study, see Appendix D). As can be seen, ResNet152V2 incorrectly predicted six crack images as non-crack images, while ResNet50 correctly identified five of them as cracks and failed only on one crack image that included edges and both dark and light backgrounds; such images cannot be handled by most CNN models. Moreover, ResNet152V2 appears to mistake shadows for cracks: it incorrectly classified three images of non-crack elements with shadows as crack images, while ResNet50 classified two of them correctly. ResNet50, in turn, incorrectly identified a 90° edge without shadows as a crack; the presence of lines in conjunction with some parts of the concrete might be the reason. A feature map, or activation map, is the set of output activations for a given filter. Going deeper in the network reduces the size of the filters, so more detailed features are captured from the input. The first feature map could, for example, look for curves, the next could look for combinations of curves that form circles, and a later one could detect an object from line and circle features.
Table 9 shows the feature maps of the CNN models from the first, middle, and last parts of the network. For example, DenseNet201 has 707 layers (including all layers, such as concatenation, ReLU, batch normalization, and convolutional layers): 64 feature maps of layer 67 (conv3-block2-concatenation) are shown as the first part, 64 feature maps of layer 477 (conv4-block48-concatenation) as the middle part, and 64 feature maps of layer 705 (conv5-block32-concatenation) as the last part of the network. The input image includes most of the heterogeneities, such as the presence of edges, different materials, and dark and light backgrounds, and most of the CNN models were unable to handle this many heterogeneities and classified it incorrectly.
Table 9. Feature maps of the first, middle, and last parts of each model.
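The filter-response idea behind feature maps can be sketched with a minimal 2D convolution in plain Python. The 3x3 vertical-edge kernel and the tiny image are illustrative only, not taken from any of the models in Table 9:

```python
# Sketch: how a single convolutional filter produces a feature map.
# A 3x3 vertical-edge kernel is slid over a tiny grayscale "image";
# large responses mark vertical intensity changes, loosely analogous
# to a crack-like edge being picked out by an early CNN layer.

def conv2d(image, kernel):
    """Valid 2D convolution (no padding, stride 1) -> feature map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + di][j + dj] * kernel[di][dj]
                 for di in range(kh) for dj in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]

# Dark vertical stripe (a crude stand-in for a crack) on a bright background.
image = [[9, 9, 0, 9, 9],
         [9, 9, 0, 9, 9],
         [9, 9, 0, 9, 9],
         [9, 9, 0, 9, 9]]

vertical_edge = [[-1, 0, 1],
                 [-1, 0, 1],
                 [-1, 0, 1]]

for row in conv2d(image, vertical_edge):
    print(row)  # strong responses flank the dark stripe
```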

Discussion
This study can be categorized as transfer learning from pre-trained models by fine tuning, among the homogenous transfer learning methods in the categorization of Weiss et al. [45]. The source domain models were trained on ImageNet for object detection, which is similar to the target task of this study, crack detection. In this category, research in a wide range of tasks has achieved high accuracy using transfer learning. For example, Prajapati et al. used 251 images to train a VGG16 pretrained model for dental disease classification and achieved an accuracy of 88.46% [46]. In related research, a dataset consisting of 582 crack images and 458 non-crack images was used by Kucuksubasi and Sorguc to train an InceptionV3 pretrained model in GPS-denied environments for autonomous UAV crack detection [47]. Their results confirm that InceptionV3 can reach 97.382% accuracy on an unseen test dataset. These results agree with the results of this study, which showed that, for small datasets (fewer than 20,000 images), non-heavy models perform better once the optimized amount of transfer learning is found. These studies confirm the importance of transfer learning for each specific task, in line with the research questions proposed in Section 1:
• How can drone-based bridge inspection benefit from CNN models via transfer learning?
• How much do heterogeneities and noises between training and testing datasets affect CNN models via transfer learning?
• How much transfer learning must be used?
Beyond showing the usefulness of transfer learning in drone-based bridge inspection, this study attempted to evaluate the effects of heterogeneities and noises between training and testing datasets and to find the exact amount of transfer learning required for each specific task and the available datasets. Therefore, with different noises and different available training datasets, further experiments can help to find the thresholds at which shallow CNN models outperform deeper ones, in order to select the most suitable CNN model for real applications.

Conclusions
The lack of UAV-based bridge inspection images makes transfer learning important for automatic failure detection in bridges, a key element of civil infrastructure. It was shown that there is a relationship between the number of training images available for each task and the optimized amount of transfer learning. Based on the 19,023 images available in this study, the optimized range of transfer learning was obtained for several CNN models: for ResNet152V2 it would be equal to or less than 10,574,178 trainable parameters, and for ResNet50 it lies between 17,540,706 and 24,124,770. VGG16, with only 19 layers, performs better without transfer learning. By choosing the optimized amount of transfer learning, at least 81% accuracy is achievable via transfer learning in non-heavy models such as ResNet50 or InceptionV3. On the other hand, the results confirm that the noises inherent to UAV-based bridge inspection significantly affect the performance of CNN models, with an average accuracy decrease of 32.6% across all models. With a simple cropping strategy applied to UAV-based bridge inspection images, with minimum effort and without preprocessing, CNN models can distinguish crack elements from non-crack elements with 81% accuracy; with a better cropping strategy and a little more effort, they reach 83% accuracy. Therefore, simply by choosing the optimized amount of transfer learning, the advantages of CNNs can be utilized with minimum effort to reduce human involvement and increase the accuracy of conventional visual bridge inspections. An automated cropping strategy combined with transfer-learned CNN models enables UAV-based bridge inspections to be partially automated. As more images become available, they can be used to train the CNN models further, boosting their performance and covering more bridge failure modes.
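The simplified cropping strategy mentioned above amounts to slicing a large inspection image into fixed-size tiles that a CNN classifier can consume. A minimal sketch follows; the nested-list image representation and tile size are illustrative, not the paper's actual pipeline:

```python
# Sketch: splitting a large inspection image into fixed-size tiles.
# The image is a nested list of pixel values; dimensions are illustrative.

def crop_tiles(image, tile_h, tile_w):
    """Split an image (list of rows) into non-overlapping tiles,
    discarding partial tiles at the right/bottom edges."""
    tiles = []
    for top in range(0, len(image) - tile_h + 1, tile_h):
        for left in range(0, len(image[0]) - tile_w + 1, tile_w):
            tiles.append([row[left:left + tile_w]
                          for row in image[top:top + tile_h]])
    return tiles

# A 4x6 "image" split into 2x2 tiles -> 2 rows x 3 columns = 6 tiles.
image = [[r * 6 + c for c in range(6)] for r in range(4)]
tiles = crop_tiles(image, 2, 2)
print(len(tiles))   # 6
print(tiles[0])     # [[0, 1], [6, 7]]
```

Each tile would then be labeled crack/non-crack and fed to the classifier, which is what makes the crop-then-classify pipeline straightforward to automate.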

Conflicts of Interest:
The authors declare no conflict of interest.