Data-Driven Structural Health Monitoring and Damage Detection through Deep Learning: State-of-the-Art Review

Data-driven methods in structural health monitoring (SHM) are gaining popularity due to recent technological advancements in sensors, as well as high-speed internet and cloud-based computation. Since the introduction of deep learning (DL) in civil engineering, particularly in SHM, this emerging and promising tool has attracted significant attention among researchers. The main goal of this paper is to review the latest publications in SHM using emerging DL-based methods and provide readers with an overall understanding of various SHM applications. After a brief introduction, an overview of various DL methods (e.g., deep neural networks, transfer learning, etc.) is presented. The procedure and application of vibration-based and vision-based monitoring, along with some of the recent technologies used for SHM, such as sensors and unmanned aerial vehicles (UAVs), are discussed. The review concludes with prospects and potential limitations of DL-based methods in SHM applications.


Introduction
Civil infrastructures are prone to a significant loss of functionality due to structural deficiencies that are primarily caused by material deterioration and loadings from earthquake, wind, vehicle, or ambient vibrations. In the United States, on a grade scale of A (excellent condition) to F (unacceptable condition), the overall score was as low as D+ for infrastructures, and C+ for bridges with an estimated $123Bn for retrofitting [1]. The report states that 7.5% of bridges are rated structurally deficient and mostly below standard, with many elements approaching the end of their service life. Furthermore, more than 30% of the approximately 617,000 highway bridges in the US need immediate attention due to deteriorating conditions [2].
During recent decades, ensuring life safety and the need to reduce inspection costs have emerged as the top priorities for practicing engineers and researchers. Therefore, the significance of cost-effective structural health monitoring (SHM) to ensure long-term structural integrity and safety levels has been highlighted on many platforms [3][4][5][6][7]. In addition to conventional inspection and non-destructive evaluation techniques (utilizing impact echo, ultrasonic surface wave, ground-penetrating radar, electrical resistivity, etc.), various types of emerging SHM methods have the potential to streamline periodic inspections and minimize the direct and indirect costs that are associated with undesired failure of aging infrastructure [8]. At the center of any SHM method and application lie sensors and sensor data (observable response). Recent advancements in sensor and communication
Various alternative DL models have been recently proposed, such as Deep Convolutional Neural Networks [25], Deep Boltzmann Machines [26], Deep Belief Network [10], Recurrent Neural Networks [27], Auto-encoders [28], and Generative Adversarial Networks (GANs) [29]. The number of new ML algorithms has been increasing; a mind map of the frequently used algorithms, including deep learning (shown in dark shaded color), is presented in Figure 2.
Inspired by the significant advances in computer vision, researchers have recently attempted to solve civil engineering problems by adapting the vision-based deep learning methods. DL-based SHM techniques have been used for: general SHM [30], multi-level damage detection, corrosion detection [31], concrete surface bughole recognition [32], concrete crack detection [33], pavement crack detection [34], acoustic emissions source detection [35], etc. One common objective of these proposed approaches is to avoid traditional visual inspections by providing modern, economic, safe, fast, and autonomous methods that are suitable for any type and scale of structures [3,36,37]. Several disadvantages have been reported for direct methods in image-processing for crack detection. The main problem with the majority of the algorithms in the available literature is that the models are custom-made for certain datasets, which may have lower performance in real-world applications due to challenging circumstances that involve weather, temperature, camera position and quality, shadow and light, etc. [38]. In addition, such approaches highly depend on the selected pre-processing methods, including edge detection.
The following sections present the state-of-the-art applications of DL in SHM to provide a baseline for future studies. Investigations with similar goals and tools are compared in terms of model architecture, datasets, as well as their performance.

Vibration-Based SHM through DL
Numerous vibration-based damage assessment methods have been developed with particular applications in SHM while using ML techniques. DL has introduced new horizons in vibration-based data-driven SHM for large-scale structures and facilitated the acquisition and processing of large sets of data from different types of sensors [3,4,39]. Most of the conventional methods to localize damage, such as radiography or ultrasonic methods, require prior knowledge of the approximate location of damage [36,40,41]. Identifying candidate damage locations can be time-consuming, costly, and difficult. However, the vibration-based methods are founded on the premise that damages (physical changes) cause corresponding changes in vibration characteristics (especially modal shapes, frequencies, and damping) [40], and they may be used to identify the location of damage from measured response data. Kong et al. provide a detailed review of these techniques [42].
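The premise that physical damage shifts vibration characteristics can be illustrated with the simplest possible case, an undamped single-degree-of-freedom oscillator with natural frequency f = sqrt(k/m) / (2π). The sketch below is only an illustration under that assumption; the stiffness and mass values are hypothetical and are not taken from any of the cited studies.

```python
import math

def natural_frequency_hz(stiffness, mass):
    """Undamped natural frequency f = sqrt(k/m) / (2*pi) of an SDOF oscillator."""
    return math.sqrt(stiffness / mass) / (2.0 * math.pi)

def frequency_drop_ratio(stiffness_loss):
    """Relative frequency drop caused by a fractional stiffness loss (0 <= loss < 1)."""
    return 1.0 - math.sqrt(1.0 - stiffness_loss)

# Hypothetical values chosen only for illustration.
k_intact = 2.0e6   # stiffness in N/m
mass = 1.0e3       # mass in kg

f_intact = natural_frequency_hz(k_intact, mass)
f_damaged = natural_frequency_hz(0.9 * k_intact, mass)  # 10% stiffness loss
relative_drop = (f_intact - f_damaged) / f_intact
```

Because frequency scales with the square root of stiffness, a 10% stiffness loss lowers the frequency by only about 5%, which is one reason why noise and environmental effects make frequency-based detection difficult in practice.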
In general, vibration-based methods can be classified into two broad categories, namely model-based (parametric) and non-model-based (non-parametric). Model-based techniques usually require computational models and associated assumptions about the structural system as one key element. They usually yield good accuracy, but, in real-world applications, the essential and accurate information about the structural system might not necessarily be available. The difficulties in developing reliable computational models lead to the next category, so-called, non-parametric methods. These methods essentially perform post-processing of response (sensor) data to identify damages without any prior assumption regarding the structural system.
ML techniques contributed to both of these categories. They are usually used to extract the modal parameters in the scope of non-model-based methods [43,44]. The traditional ML methods involve two phases in non-model-based methods. The first phase is feature extraction, in which sensor data (e.g., acceleration) are used to extract effective features, thereby eliminating the cumbersome manual feature extraction process. The second phase is a classification procedure that identifies the location and/or level of damage [45]. Support Vector Machines (SVM), Probabilistic ANNs (PNN) [46][47][48], Fuzzy ANNs (FNN) [49], and Extreme Learning Machine Networks (e.g., online sequential) [50] are some of the popular methods that are used for vibration-based SHM.
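The two-phase structure described above (feature extraction followed by classification) can be sketched in a few lines. This is a minimal illustration, not any cited method: the hand-picked RMS and peak features and the nearest-centroid classifier are assumptions that stand in for the feature extractors and for classifiers such as SVM or PNN, and the toy records are invented.

```python
import math

def extract_features(signal):
    """Phase 1: reduce a raw acceleration record to (RMS, peak) features."""
    rms = math.sqrt(sum(x * x for x in signal) / len(signal))
    peak = max(abs(x) for x in signal)
    return (rms, peak)

def fit_centroids(records, labels):
    """Phase 2 (training): average the feature vectors of each class."""
    sums, counts = {}, {}
    for record, label in zip(records, labels):
        feats = extract_features(record)
        acc = sums.setdefault(label, [0.0] * len(feats))
        for i, value in enumerate(feats):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: tuple(v / counts[label] for v in acc) for label, acc in sums.items()}

def classify(record, centroids):
    """Phase 2 (inference): return the label of the closest class centroid."""
    feats = extract_features(record)
    return min(centroids, key=lambda label: math.dist(feats, centroids[label]))

# Invented toy records: low-amplitude "intact" and high-amplitude "damaged" responses.
train = [[0.1, -0.1, 0.1, -0.1], [0.12, -0.09, 0.1, -0.11],
         [0.5, -0.6, 0.55, -0.5], [0.6, -0.5, 0.52, -0.58]]
labels = ["intact", "intact", "damaged", "damaged"]
centroids = fit_centroids(train, labels)
prediction = classify([0.11, -0.1, 0.09, -0.1], centroids)
```

DL-based approaches differ mainly in that the first phase is learned from data rather than designed by hand.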
Abdeljaber et al. [51] presented an approach for damage identification using output-only response data. Training data were generated for various simulated damage (loose bolt) cases from measured acceleration response. After training separate CNNs for each damage case, the probability of damage (PoD) indicator was defined. Examining undamaged, single damage, and multiple damage cases, it was shown that specific cases were accurately identified with a 0.54% average error. In a similar study, three-dimensional wireless sensors were used to record the acceleration response [52]. However, this approach required a large amount of response data associated with various permutations of loose connections, which rendered the application impractical. To alleviate these drawbacks, Abdeljaber et al. [53] introduced a new approach that only required two states of damage, namely undamaged and fully-damaged cases. The computational procedure was similar to the previous investigation, but this method could only determine the general condition of the structure. Lin et al. [54] trained a CNN on data from a simulated FE model of a simply supported beam, considering both noisy and noise-free states. It was demonstrated that the response frequency bands, vibration modes, and their combination were learned by a deep neural network as essential characteristics that were identifiable from sensor data. Wang and Cha [55] proposed an unsupervised method using an acceleration signal that was obtained from an intact laboratory-scale three-dimensional (3D) steel bridge. The response signal vectors were normalized, and then the continuous wavelet transform (CWT) and Fast Fourier Transform (FFT) were applied; the results were fed to a two-dimensional (2D)-CNN autoencoder to extract essential features. Ten One-Class Support Vector Machines (OC-SVM) were used as novelty detectors corresponding to the sensors. 
The location of sensors with the highest novelty rates was considered to be the approximate location of loose-bolt damage. Wavelet packet transform (WPT) of vibration signals, as well as vision data, are efficient in damage localization, according to the studies that were conducted by Pan et al. [20,21] and Pan and Yang [56]. A parallel configuration of CNN (Figure 3) can be used for a robust damage localization and damage intensity estimation, where the time-domain data are processed using the one-dimensional (1D) CNN (upper branch) and the time-frequency-domain (WPT) or vision data are processed using 2D CNN (lower branch), and the feature maps are concatenated in the end for classification or regression. Details of the CNN configuration in Figure 3 are available from Azimi and Pekcan [57].
Figure 3. A multi-headed deep neural network for different input data.
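The parallel configuration described above, with a 1D branch for time-domain data and a 2D branch for time-frequency or vision data whose features are concatenated at the end, can be sketched structurally without a DL framework. The following is only a shape-level illustration: the kernels are fixed, hypothetical values rather than trained weights, each branch has a single convolution followed by ReLU and global max pooling, and none of it reproduces the configuration of the cited work.

```python
def conv1d_valid(signal, kernel):
    """Valid 1D cross-correlation, as used in CNN layers."""
    n, k = len(signal), len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k)) for i in range(n - k + 1)]

def conv2d_valid(image, kernel):
    """Valid 2D cross-correlation over a list-of-lists image."""
    kh, kw = len(kernel), len(kernel[0])
    rows, cols = len(image) - kh + 1, len(image[0]) - kw + 1
    return [[sum(image[r + a][c + b] * kernel[a][b] for a in range(kh) for b in range(kw))
             for c in range(cols)] for r in range(rows)]

def relu_global_max(values):
    """ReLU followed by global max pooling, giving one feature per kernel."""
    return max(max(v, 0.0) for v in values)

def two_branch_features(signal, image, kernels_1d, kernels_2d):
    """Concatenate features from the 1D branch and the 2D branch."""
    upper = [relu_global_max(conv1d_valid(signal, k)) for k in kernels_1d]
    lower = [relu_global_max([v for row in conv2d_valid(image, k) for v in row])
             for k in kernels_2d]
    return upper + lower  # joint feature vector for a classification or regression head

# Hypothetical inputs and kernels, for illustration only.
signal = [0.0, 1.0, 0.0, -1.0, 0.0]
image = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
kernels_1d = [[1.0, -1.0], [0.5, 0.5]]
kernels_2d = [[[1.0, 0.0], [0.0, 1.0]]]
features = two_branch_features(signal, image, kernels_1d, kernels_2d)
```

The concatenated vector has one entry per kernel across both branches, so either input type can contribute evidence to the final damage estimate.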
Bao et al. [30] proposed a framework for anomaly detection that is inspired by human vision and thinking. First, the measured acceleration response data were transformed to grayscale and then fed into deep convolution neural networks (DCNNs) after manual labeling. The method reached a global accuracy of 87.0% for one-year data testing that can be used for real-time SHM, alarming systems, and off-line assessment. Tang et al. [58] enhanced this method by working with different forms of response data. First, two images of time and frequency domains were stacked with, respectively, red and green channels into dual-channel images. 
In contrast to the previous work, which used imbalanced data (i.e., with an unequal number of samples for different classes), the data were balanced here and the impact of balancing was investigated. The method achieved 93.5% accuracy and outperformed the former approach. Wu and Jahanshahi [59] presented a study on the application of CNN for linear and nonlinear structural dynamic response estimation and identification. They examined single- and multi-degree of freedom systems and indicated that, in some cases, trained convolution kernels and convolution layers can be interpreted as numerical integration or dominant frequency extraction operators, respectively. Furthermore, they compared the results that were obtained by the MLP technique with CNN and showed that the latter approach is more accurate and robust against noise-contaminated input data. Oh et al. [60] proposed a CNN model to estimate the response of tall buildings under wind excitations. They showed that integration can be approximated by convolutional layers without a max-pooling layer. They provided a physical interpretation of the trained convolutional layers, indicating their ability to filter noise, eliminate irrelevant information, and preserve the dominant frequency. Recently, Khodabandehlou et al. [61] applied 2D CNN for the overall assessment of concrete bridges using shake-table tests of a one-fourth scale highway bridge. They implemented a 2D CNN for damage classification that was trained using 40 sets of experimental acceleration records and tested on eight new sets. Four system-level damage states, namely intact, minor, moderate, and extensive, were quantified and accurately predicted by the proposed model. Li and Sun [62] applied CNN to the damage detection of an experimental cable bridge model and compared its performance with those of random forest, support vector machine, k-nearest neighbor, and decision tree methods, reporting an accuracy advantage of at least 15%. 
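The observation that convolution kernels can act as numerical integration operators can be made concrete with a hand-built kernel. In the sketch below, a causal convolution with a constant kernel of value dt reproduces rectangle-rule integration of acceleration into velocity. The kernel is constructed by hand as an assumption for illustration; it is not a kernel learned in the cited studies, and the signal values are hypothetical.

```python
def causal_conv(signal, kernel):
    """Causal 1D convolution: y[n] = sum_k kernel[k] * signal[n - k]."""
    out = []
    for n in range(len(signal)):
        taps = min(len(kernel), n + 1)  # only past and present samples contribute
        out.append(sum(kernel[k] * signal[n - k] for k in range(taps)))
    return out

dt = 0.1                                # sampling interval in seconds (hypothetical)
accel = [2.0] * 10                      # constant acceleration in m/s^2
kernel = [dt] * len(accel)              # constant kernel acts as a rectangle-rule integrator
velocity = causal_conv(accel, kernel)   # running integral of the acceleration
```

With constant acceleration the output grows linearly, as expected of velocity. Note that no pooling step is involved, which is consistent with the remark that integration is approximated by convolutional layers without max-pooling.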
In general, these methods demonstrated that CNNs require a large amount of data for training. These necessary data can be generated from finite element (FE) simulations, or from the measured response (acquired via sensors when available). In FE simulations, an extensive amount of response data associated with different damage states can be easily generated, which yields accurate and high-level damage identification subject to the accuracy of the simulation models. However, it should be taken into consideration that these data differ from the recorded sensor data because of the uncertainties and noise. The recorded response data are usually utilized for level 1 (presence of damage) identification through an unsupervised scheme. Pathirage et al. [63] introduced an approach to remove these drawbacks based on autoencoders. The first-order sensitivity-based method was used for matching the FE model and the real model [64]. Subsequently, the calibrated FE model was utilized to extract frequencies and mode shapes as training data. They proposed a two-step framework comprising dimension reduction and relationship learning DNNs. In the first step, a deep autoencoder was used to extract salient features, and its outputs were fed into a damage identification network. A pre-training scheme was used to find the optimal weights. They considered the uncertainty effect in the FE model as well as noise. It was shown that the proposed approach was more accurate than the traditional ANN. Pathirage et al. [28] added a pre-processing stage and introduced a three-step method with data pre-processing, sparse dimensionality reduction, and relationship learning steps. This framework was similar to the former method and it demonstrated efficient and acceptable performance. Recently, Teng et al. [65] similarly utilized the simulation data of modal strain for training a CNN and verified its performance using experimental response data from a steel frame. 
They achieved 100% accuracy in damage localization of several single- and multi-damage scenarios. In most of the studies, the geometric location of sensors was not considered in the input data structure, although providing this information to the network is important enough to influence accuracy. For this purpose, Sajedi and Liang [66] developed a grid environment methodology for the real-time damage segmentation in large-scale civil infrastructures. They used a fully convolutional encoder-decoder neural network that was trained with cumulative intensity measures as the input and damage states of nodes as the output. Their proposed approach yielded global accuracies of 96.3% and 93.2% for the detection of damage location and severity in an FE model, respectively.
Several researchers attempted to employ other types of sensor data or use alternative features since the acceleration response signal is highly prone to noise. Li et al. [67] collected deflection data of a scaled-down model bridge through a fiber-optic gyroscope. They fed these data as input to a 1D-CNN to classify its damage as four classes comprising an intact class and other three damaged states. They examined the performance by cross-validation and demonstrated that the CNN had at least 15.3% accuracy advantage over other traditional techniques, such as random forest, support vector machine (SVM), k-nearest neighbor (KNN), and decision trees (DT). Lopez-Pacheco et al. [68] proposed a new frequency domain convolutional neural network (FDCNN) for damage identification based on the Bouc-Wen hysteretic model to increase robustness against noise. The FDCNN utilized spectral pooling operator, attenuated the noise in measurements, and was trained four times faster than similar time-domain networks. Moreover, it was demonstrated that energy dissipation could be captured by FDCNN, which allowed for higher diagnosis accuracy.
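The source does not detail the spectral pooling operator used in the FDCNN, so the following is only a generic sketch of spectral pooling as commonly defined: transform to the frequency domain, keep the lowest-frequency coefficients, and transform back to a shorter signal. The function names, the naive DFT, and the restriction to odd output lengths are assumptions made to keep the example short.

```python
import cmath
import math

def dft(x):
    """Naive O(n^2) discrete Fourier transform, adequate for short illustrative signals."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def spectral_pool(x, m):
    """Downsample x to m samples (m odd, m <= len(x)) by keeping its m lowest frequencies."""
    n = len(x)
    if m % 2 == 0 or m > n:
        raise ValueError("m must be odd and no larger than len(x)")
    half = (m - 1) // 2
    spectrum = dft(x)
    # Keep DC, the first `half` positive frequencies, and the matching negative ones.
    kept = spectrum[:half + 1] + spectrum[n - half:]
    # Inverse DFT of length m; dividing by n (not m) preserves the signal amplitude.
    return [sum(kept[k] * cmath.exp(2j * cmath.pi * k * t / m) for k in range(m)).real / n
            for t in range(m)]

# Hypothetical signal: one low-frequency and one high-frequency cosine over 8 samples.
n = 8
signal = [math.cos(2 * math.pi * t / n) + math.cos(2 * math.pi * 3 * t / n) for t in range(n)]
pooled = spectral_pool(signal, 5)
```

In this example the pooled signal retains one cycle of the low-frequency cosine while the high-frequency component is discarded, which illustrates how pooling in the frequency domain can attenuate high-frequency noise.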
Hung et al. [69] developed a hybrid framework combining 1D-CNN and Long Short-Term Memory (LSTM) network for damage detection. This network directly receives raw time-series data and determines the presence of damage. It was shown that, with a low level of noise, the proposed network could provide accurate detections. Ding et al. [70] created a sparse Deep Belief Network (DBN) based on Restricted Boltzmann Machines (RBM) and trained by incomplete modal data that were extracted from FE simulations. The introduced network could successfully predict the damage location and severity with acceptable accuracy, even in multi-damage cases and in the presence of noise. Although their method showed better performance than swarm intelligence techniques, it is noted that the latter techniques require fewer finite element simulations and they are able to adapt to different types of structural systems. Therefore, instead of directly identifying the damage attributions, DNN can be used for other purposes, such as denoising, to enhance the performance of other available techniques. Fan et al. [71] introduced a modified version of Residual Convolutional Neural Network (ResNet) with dropout, skip connection, and sub-pixel shuffling modules to denoise acceleration response signals. They tested the trained network on extensively contaminated data that were measured from a TV Tower and observed that the suggested approach could successfully identify the modal properties of the structure.
Sensors 2020, 20, 2778

Vision-Based SHM through DL
The majority of the conventional infrastructure inspection techniques are based on visual assessments (i.e., crack existence, location, and width) that rely on experts' insight and experience, which may not always be reliable [72]. Besides, such techniques are costly and time-consuming [73]. The vision-based inspections can be performed by inspecting raw images (by human and without post-processing), by image enhancement or applying basic image processing filters to magnify and detect edges to accelerate the inspection, and by autonomous image processing tools that require computers and ML algorithms [74,75].
The main goal of the studies in the computer vision field is concerned with information extraction from image data to automatically recognize the real world (as the visual cortex does). Such efforts were initiated decades ago with the aim of detecting edges, and they have been continuously extended to problems with complex image patterns, such as facial recognition and vehicle and pedestrian detection [39]. Jahanshahi and Masri [76] showed that morphological operations are not the only techniques in image processing-based damage detection. Other approaches include: binarization [77], image correlation [78,79], edge detectors [36,80,81,82], percolation model [83], fractal analysis [84], etc.
Most of the efforts in computer vision are concentrated on developing end-to-end learning algorithms using artificial intelligence, particularly through DCNNs, which are capable of achieving more than 95% accuracy on image-based classification problems [85]. In addition, the application of CNNs has been extended to pixel-level labeling within an image to detect and localize different objects of interest using nonlinear filters and feature maps [27]. The recent advances in computer vision have brought more attention to such technologies in SHM as one of the most effective tools in the non-contact assessment of deflection [86,87], corrosion [88], concrete spalling [89,90], concrete and pavement cracks [33,91,92], fatigue detection [93], and surface and subsurface damages [94,95]. Dorafshan et al. carried out a comprehensive study [96] to discuss the performance of vision-based image processing techniques in SHM that employ artificial intelligence.

Crack Detection through Vision-Based DL
Infrastructures, particularly aging concrete structures, are prone to the formation of cracks due to changing loading conditions, corrosion, etc. Cracks in concrete or road pavements usually appear as lines with random orientations and intensity. Usually, these lines are darker and connected, and a simple crack detection can be carried out using properly prescribed thresholds. Generally, two approaches have been used by researchers in vision-based SHM, particularly in crack detection: the image binarization method [97] and the sequential image processing method [83]. Binarization techniques for transforming images into black and white pixels, or cracked and sound pixels [98], as well as mathematical morphology [99], can facilitate and improve the accuracy of the detection process due to the nature of cracks.
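Since cracks usually appear as darker, connected lines, the simplest form of binarization marks every pixel below a prescribed gray level as a crack candidate. The sketch below assumes an 8-bit grayscale image stored as a list of rows; the image values and the threshold of 100 are hypothetical.

```python
def binarize_dark_pixels(gray, threshold):
    """Mark pixels darker than `threshold` as crack candidates (1), others as sound (0)."""
    return [[1 if pixel < threshold else 0 for pixel in row] for row in gray]

def crack_pixel_ratio(mask):
    """Fraction of pixels flagged as crack candidates in a binary mask."""
    total = sum(len(row) for row in mask)
    return sum(sum(row) for row in mask) / total

# Hypothetical 4 x 4 grayscale patch with a dark diagonal line.
gray = [[200, 190, 40, 210],
        [205, 45, 195, 200],
        [50, 198, 202, 207],
        [199, 201, 204, 206]]
mask = binarize_dark_pixels(gray, threshold=100)
ratio = crack_pixel_ratio(mask)
```

A fixed threshold like this is sensitive to lighting and surface texture, which is part of the motivation for the adaptive and learning-based methods discussed in this section.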
Prior to the introduction of DL in crack detection, the traditional approaches relied on pixel groups with similar color levels. The earlier generation of heuristic methods for vision-based crack detection in concrete structures was based on edge detection algorithms by applying filters, such as Roberts, Prewitt, Sobel, and LoG (in the spatial domain), or Butterworth and Gaussian (in the frequency domain) [100]. Figure 4 compares the results of applying different filters for detecting cracks on a concrete surface. Abdel-Qader et al. [80] showed that the fast Haar transform has higher accuracy (86%) as compared to the other filters, such as Canny and Sobel, with accuracies of 76% and 68%, respectively. The image dataset that was used in this study, as well as the classification criteria, were further improved by Dorafshan et al. [96]. In general, major ML-based problems include three techniques: classification, localization, and segmentation. Figure 5 illustrates the frequent crack detection approaches: classification [25], object localization [23], and pixel-level segmentation [101]. Using the classification method, the dataset is labeled as cracked or non-cracked (sound). In the crack localization method, the cracks within each input image are labeled with bounding boxes. In the pixel-level segmentation method, the pixels are classified as cracked and non-cracked [102]. Dorafshan et al. [96] compared different edge detection methods and the performance of different filters.
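As an illustration of the spatial-domain edge filters mentioned above, the following sketch applies the standard 3 x 3 Sobel kernels to a small grayscale image and returns the gradient magnitude at interior pixels. The test image is hypothetical, and border pixels are simply left at zero as a simplifying assumption.

```python
import math

SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]

def sobel_magnitude(gray):
    """Gradient magnitude at interior pixels using the 3 x 3 Sobel kernels."""
    rows, cols = len(gray), len(gray[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            gx = sum(SOBEL_X[a][b] * gray[r - 1 + a][c - 1 + b]
                     for a in range(3) for b in range(3))
            gy = sum(SOBEL_Y[a][b] * gray[r - 1 + a][c - 1 + b]
                     for a in range(3) for b in range(3))
            out[r][c] = math.hypot(gx, gy)
    return out

# Hypothetical 5 x 5 image: bright surface (200) with a dark vertical line (50) in column 2.
image = [[200, 200, 50, 200, 200] for _ in range(5)]
edges = sobel_magnitude(image)
```

The filter responds on both sides of the dark line rather than on the line itself, so a one-pixel-wide crack produces two parallel edge responses that later processing has to merge.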
Based on ANN-based image processing methods, several researchers highlighted potential applications of autonomous crack detection techniques. Jahanshahi and Masri [103] proposed ML-based models using SVMs for concrete crack detection, based on morphological features. The crack width was measured by identifying the centerline of cracks in their study. Using the abovementioned techniques, an automated vision-based crack detection framework was proposed by Yeum and Dyke [104] for bridge inspection.
Only a limited number of studies have attempted to compare the performance of recently developed crack detection methods by other researchers [96,97,105]. In addition, most of the recent studies have not clearly described the accuracy and classification criteria, including true positive (TP) metrics, which are needed for reproducibility of the results. Furthermore, comparisons of several studies across broadly different datasets [106], as well as comparisons using small datasets or idealized datasets that were collected in laboratory conditions [96], do not reflect the merits of one method over another. Dorafshan et al. [37] and Talab et al. [83] proposed automatic crack detection methods using the Otsu threshold [107] and image filtering. Such methods were later improved by implementing terrestrial laser scanning, which has three main steps: shading correction, crack detection, and mapping [108], and could be implemented in an automated manner using robotic systems yielding up to 95% accuracy [109].
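Otsu's method chooses the gray level that maximizes the between-class variance of the two pixel groups it separates, which removes the need to prescribe a threshold by hand. The sketch below is a plain implementation of that criterion for integer gray levels; the pixel values are hypothetical, and it does not reproduce the filtering steps of the cited studies.

```python
def otsu_threshold(pixels, levels=256):
    """Return the gray level t that maximizes between-class variance (Otsu's method).

    Pixels with value <= t form one class and the remaining pixels form the other.
    """
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0, sum0 = 0, 0
    for t in range(levels):
        w0 += hist[t]            # number of pixels with level <= t
        sum0 += t * hist[t]      # sum of gray levels of those pixels
        if w0 == 0:
            continue
        w1 = total - w0
        if w1 == 0:
            break
        mu0 = sum0 / w0
        mu1 = (sum_all - sum0) / w1
        between = w0 * w1 * (mu0 - mu1) ** 2  # proportional to between-class variance
        if between > best_var:
            best_var, best_t = between, t
    return best_t

# Hypothetical flattened image: three dark crack pixels and five bright surface pixels.
pixels = [40, 45, 50, 190, 195, 200, 205, 210]
threshold = otsu_threshold(pixels)
crack_pixels = [p for p in pixels if p <= threshold]
```

When several thresholds give the same separation, this implementation returns the lowest one, which here is the brightest crack pixel.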
Deep CNNs have been consistently developed by researchers in the computer vision field; Rawat and Wang present more details regarding the background of the CNN developments in image classification [110]. Following such significant achievements, several studies adapted CNN to detect surface and subsurface cracks in pavement and concrete. Earlier studies used DCNN to classify concrete or pavement surfaces by sliding-window method, but, recently, semantic segmentation through an end-to-end pipeline using fully convolutional networks (FCN) has attracted attention [111,112]. This approach has been employed to tackle challenging classification problems in different fields, including SHM. Recently, Chen and Jahanshahi [113] developed an enhanced CNN-based crack detection method while using a Naïve Bayes data fusion scheme for the extracted data from video frames. Kim et al. [73] proposed a faster CNN-based model to determine pixel-wise location while using image binarization.
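The sliding-window approach mentioned above splits a large image into fixed-size patches and classifies each patch independently. The sketch below shows only the windowing logic; the `is_cracked` callable is a hypothetical stand-in for a trained CNN classifier, and the tiny image and dark-pixel rule are invented for illustration.

```python
def sliding_windows(height, width, size, stride):
    """Yield (top, left) corners of size x size patches that fit inside the image."""
    for top in range(0, height - size + 1, stride):
        for left in range(0, width - size + 1, stride):
            yield top, left

def classify_patches(image, size, stride, is_cracked):
    """Apply a patch classifier (any callable) to every window; return flagged corners."""
    height, width = len(image), len(image[0])
    flagged = []
    for top, left in sliding_windows(height, width, size, stride):
        patch = [row[left:left + size] for row in image[top:top + size]]
        if is_cracked(patch):
            flagged.append((top, left))
    return flagged

# Hypothetical 4 x 4 image with one dark pixel; the lambda stands in for a CNN.
image = [[200] * 4 for _ in range(4)]
image[2][3] = 30
flagged = classify_patches(image, size=2, stride=2,
                           is_cracked=lambda patch: any(p < 100 for row in patch for p in row))
```

Patch-level classification localizes cracks only to the resolution of the window, which is one reason pixel-level semantic segmentation with FCNs has attracted attention.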
Detecting cracks in tunnels is of vital importance. They might be a sign of a hidden danger that can pose serious threats to users or even trigger a catastrophic collapse. On the other hand, identifying tunnel cracks is challenging, because there are many noise patterns in the tunnel images. Therefore, developing accurate automated methods for monitoring tunnel surfaces can effectively enhance safety and decrease the potential costs. Li et al. [114] created a database of 60,000 tunnel crack images for training, testing, and comparing different crack segmentation networks. According to their study, a U-net augmented with clique block and attention mechanisms can significantly outperform the basic U-net, fully convolutional networks (FCN), SegNet, and multi-scale fusion crack detection (MFCD) for detecting cracks in noisy tunnel images.
Soloviev et al. [115], Li et al. [116], Tong et al. [117], and Fan et al. [118] demonstrated the use of DCNNs to detect and recognize cracks as defects with quantifiable properties in applications for crack detection on pavement surfaces (e.g., crack length and size). Fan et al. [119] proposed a CNN-based multi-label classifier by improving the positive-to-negative ratio of samples. In another study by Wang et al. [120], they proposed a CNN model with three blocks of convolutional layers followed by two FC layers, consisting of 1,246,240 trainable parameters in total, which could detect surface cracks from the subdivided images of asphalt pavement. Tong et al. [117] developed another two-stage CNN-based model to also detect asphalt pavement crack length. A fast pavement crack detection network (FPCNet) was developed by Liu et al. [121] using encoder-decoder configurations.
The majority of studies in the literature validated the performance of DL models under laboratory conditions with image datasets of intact and cracked surfaces, an approach that is still limited in addressing real-world conditions. The acquired surface images may be contaminated with noise, shadow, dust, or extra brightness, which requires more robust and intelligent techniques for classification. Depending on the application, such practical challenges have been addressed in several studies. Kim and Cho [122] defined a five-class crack detection problem using a large volume of images collected from the Internet as well as their augmentations. Their study considered field conditions to tackle real-world limitations that are associated with several uncertainty factors, as well as the inability to employ contextual information, such as the nature of materials, structural components, and the region of interest (ROI). Cha et al. discussed the feasibility of autonomous DL-based methods for crack detection, and Cha and Choi [123] proposed CNN-based classifiers and applied a sliding window method on 256 × 256 RGB images of concrete surfaces to detect cracks. Their proposed methods achieved an accuracy of 97% for concrete image datasets when considering different light intensities associated with variable weather conditions. Jang et al. developed a DL-based crack detection method [124] using hybrid images that combine vision and infrared thermography of macro- and micro-cracks. They observed that the hybrid images made the network robust against varying operational conditions such as shadow, dust on the surface, rust, etc. Moreover, they developed a sticking-type UAV that can be utilized in the inspection of large reinforced concrete civil infrastructures. Jang et al. [125] devised a ring-type robot for crack evaluation of circular bridge piers in a controlled manner. This robot provides fast scanning and high-quality raw images for crack detection. Using these images, they trained a CNN that was able to precisely segment the crack maps on the piers. According to the experimental results, the proposed system achieved 97.47% recall and 90.92% precision.
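For readers less familiar with the recall and precision figures quoted above, the following sketch shows how the two metrics are obtained from detection counts. The counts are hypothetical, and the F1 score is an added illustration that is not reported in the cited study.

```python
from typing import Tuple


def precision_recall(tp: int, fp: int, fn: int) -> Tuple[float, float]:
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 90 crack pixels found, 10 false alarms, 30 missed cracks.
p, r = precision_recall(tp=90, fp=10, fn=30)

# Harmonic mean of the precision and recall values quoted in the text.
quoted_f1 = f1_score(precision=0.9092, recall=0.9747)
```

A high recall with a lower precision, as in the quoted result, means that few cracks are missed at the cost of some false alarms.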
In typical region-based classification, or object detection, a bounding box is created around the region of interest (e.g., cracks, spalling, components, etc.) [96]. For example, Ali et al. [126] proposed a modified cascade face detection method that uses the Viola-Jones algorithm for crack detection on concrete walls while using bounding boxes around the region of the crack. This method was modified by Ramana et al. [127,128] to automatically detect loosened bolts in steel structures with higher efficiency when compared to the earlier studies using hand-crafted features [129].
Yeum et al. used region-based CNNs (R-CNN) [130] for post-event evaluation of buildings with an accuracy of nearly 60%; however, this technique requires further developments to also include multiple damage scenarios. Xu et al. proposed a fast R-CNN approach [131] to detect different damage types in concrete structures as well as damage locations using bounding boxes. Fast R-CNN and Faster R-CNN were developed by Girshick [132] and Ren et al. [133], respectively. Another newly-developed region-based segmentation technique is Mask R-CNN [134] that segments images into objects, which can be used for crack detection, concrete spalling, and rebar detection. Cha et al. proposed a faster R-CNN for detecting multiple damage types, and Cha et al. developed the method to localize multiple damage types, including steel and bolt corrosion and delamination. One of the main drawbacks of the regular CNN approaches for detecting cracks is their deficiency in specifying out-of-plane cracks. Deng et al. [135] recently embedded deformable modules into various R-CNN and fully convolutional networks to overcome this drawback. When comparing the suggested technique with regular networks, they observed that the modified approach not only improves the detection accuracy of out-of-plane cracks, but also enhances the accuracy for other cases.
Other studies proposed pixel-level classification methods [33] to provide more precise information regarding the path and intensity of cracks. In most of the published research, the binary classification problems include distinguishing 'crack' and 'non-crack' regions or pixels. For more precise classification, Dung and Anh [33] proposed semantic segmentation to also identify path and density. The typical object detection models attempt to fit a bounding box around the ROI [85], whereas semantic segmentation methods [136] or pixel-level classification [101] should be used to precisely delineate damage level, shape, and location. For pavement crack detection problems, Zhang et al. [27,137] proposed CrackNet, an efficient model based on R-CNNs. Xu et al. [93] developed a DL-based fatigue crack identification technique for long-span steel box girder bridges using deep CNN, as well as a framework for steel crack detection while using restricted Boltzmann machine [138] with high accuracy. Hoskere et al. proposed a pixel-wise DCNN with a parallel configuration and a fully CNN (FCN) [139,140] to localize and classify different damages, including concrete cracks, spalling, exposed rebars, corrosion, fatigue cracks, and asphalt cracks.
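The difference between a bounding box and a pixel-level mask can be made concrete with a small sketch. The 5 × 5 diagonal crack mask below is a toy assumption; the point is only that a thin, inclined crack occupies a small fraction of the box that encloses it.

```python
from typing import List, Optional, Tuple

Mask = List[List[int]]  # 1 = crack pixel, 0 = background


def bounding_box(mask: Mask) -> Optional[Tuple[int, int, int, int]]:
    """Return (row_min, col_min, row_max, col_max) of the crack pixels, or None."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    if not rows:
        return None
    return rows[0], cols[0], rows[-1], cols[-1]


def box_area(box: Tuple[int, int, int, int]) -> int:
    """Number of pixels covered by an inclusive bounding box."""
    r0, c0, r1, c1 = box
    return (r1 - r0 + 1) * (c1 - c0 + 1)


# Toy mask: a one-pixel-wide diagonal crack across a 5 x 5 image.
mask = [[1 if r == c else 0 for c in range(5)] for r in range(5)]

box = bounding_box(mask)
crack_pixels = sum(sum(row) for row in mask)
fill_ratio = crack_pixels / box_area(box)  # share of the box that is actually crack
```

Here the box covers 25 pixels while the crack covers 5, which illustrates why segmentation is preferred when crack path, width, or density must be quantified.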

Structural-Component Recognition and Change Detection through Vision-Based DL
It is essential to perform a global-level inspection, as well as the structural component recognition process, before moving closer to the details, to understand the relationship between damage and safety of structures [86]. However, recent studies in DL-based SHM have not fully addressed this concern. Even the video-based crack detection models do not interpret the impact of damage in a global context. Yeum et al. [141] proposed a CNN-based technique to classify civil infrastructure images by recognizing regions of interest. Gao and Mosalam used similar object-detection techniques [142] to classify structural components as well as damage types. A Faster R-CNN algorithm was used by Liang [143] to automatically detect structural components of the RC bridge system using bounding boxes.
Narazaki et al. [144] used both global and close-up views to train two recurrent neural networks (RNNs) while using a single image-based pre-trained FCN for structural component recognition from video image data. The simple RNN and ConvLSTM units in their models could learn memories of the focus region of the video. The ground-truth labels for the video frames of a flying UAV were synthetically created using a game engine [85]. Their overall goals were associated with the recognition, localization, and structural component classification from complex scenes. Each input image is automatically or semi-automatically down-sampled using convolutional layers and then up-sampled to generate the segmented image that is similar to the ground truth, as shown in Figure 6.
Alcantarilla et al. proposed street-view (ground-level) change detection [145] using deconvolutional networks. Using a CNN model, Stent et al. [146] proposed a change detection method for tunnels. The main assumptions in these studies were that the cracks are connected slender and darker lines on concrete surfaces [83].

Applications of UAVs and Portable Smartphones for DL-Based SHM
Deep learning networks have facilitated the damage identification task by automating the process and achieving acceptable levels of accuracy. However, in some cases, inspectors do not have access to all parts of structures to acquire image data (for vision-based approaches) or sensor data (for vibration-based methods). This is one of the main difficulties in structures such as tall buildings, bridges, and heritage structures [147]. Drones were proposed as tools for inspecting such structures to overcome these difficulties [148]. Drones, also referred to as Unmanned Aerial Vehicles (UAVs) or Unmanned Aerial Systems (UASs), are classified based on their level of autonomy, size, and other capabilities. They minimize the need for physical labor in addition to being time-saving, cost-effective, safe, available, and accurate. In recent years, different studies have been conducted to provide a framework for using UAVs, show their applicability, and address some of their disadvantages. Given the rapid developments in the drone industry, utilizing intelligent UAVs for wireless data acquisition is no longer considered a future technology.
The main developers of UAVs for bridge and other structural inspections are the departments of transportation (DOT) and the universities in the USA [85,149–152]. Along with the developments in wireless data transmission techniques, several studies have utilized UAS technologies to broaden vision-based inspection in SHM [14,153,154], as well as vibration-based techniques [155].
Since DL-based SHM is an emerging technology, the capability of UAVs for robust real-time evaluation is still a challenge. Kim et al. [122] demonstrated the feasibility of UAV-based inspection of a concrete retaining wall. They analyzed videos (two frames/second) that were recorded from approximately 2 m from the concrete surface. By embedding image processing and computer vision systems in UAVs, instant crack detection tasks could be done safely, at minimum cost, and with greater accuracy [156]. For example, Jang et al. [157] mounted a hybrid image scanning system (HIS) combining laser thermography cameras and vision sensors in order to detect concrete cracks.
For bridge inspections, Kim et al. [158] deployed a high-resolution camera on a commercial UAV to collect images for crack detection and to generate a damage map of a concrete bridge.
Kang and Cha [14,159] developed an autonomous UAV system for SHM while using ultrasonic beacons to replace the role of GPS that performs poorly in partially covered places, such as under bridge decks. In addition, Huynh et al. [160] used a UAV for the quasi-real-time inspection of connection bolts on a full-scale girder bridge. Dorafshan et al. [75] examined the performance of different UASs for detecting cracks in steel bridges and concluded that instability in GPS-denied and windy environment might pose major challenges for UAS-assisted inspections.
UAVs were also employed to create image datasets for masonry heritage structures [161]. The images collected by UAVs are sometimes noisy or have relatively low contrast; in addition, the unavailability of GPS signals in indoor environments or under bridges interrupts their performance [162]. Duarte et al. [163] discussed the performance of the networks when considering multi-resolution images derived from satellite, manned, and unmanned aerial vehicles. Hoskere et al. [140] proposed a framework to convert UAV data to DL-based condition-aware models for automating and accelerating post-earthquake inspections. They trained three networks for building information, the presence of damage, and the types of damage. Moreover, they suggested an approach for modal identification of the structures while using videos recorded from different parts of a structure through a divide-and-conquer strategy [164]. Because of the growing attention to smartphone applications, several studies used them as inexpensive tools for SHM [102]. Images taken by smartphones have been utilized for the identification of various types of damage, such as pavement damage [165], bolt loosening [160,166], volumetric damage [167], and concrete cracks. Zhao et al. [168] developed a mobile-based method for measuring the forces in cables of cable-stayed bridges.
Using the Core ML framework and Xcode, Li and Zhao [169,170] integrated a trained CNN model into a smartphone application in order to detect the presence of concrete surface cracks on a bridge with 99% accuracy. Furthermore, using a CNN, Wang et al. [171] developed a real-time efflorescence and spalling detection method for historic brick masonry buildings. Based on 99 × 99 RGB images that were acquired by low-cost smartphone sensors, Zhang et al. [34] developed a deep CNN model with six convolution layers for automatic crack detection on road surfaces, which was a binary classification task. Pauly et al. proposed a deeper CNN model [172] to enhance the performance of CNNs based on the 99 × 99 RGB images. Maeda et al. [173] prepared a large-scale dataset using a smartphone installed on a car dashboard and used the images to develop an end-to-end public application called 'RoadDamageDetector', which classifies different types of road damage.

Transfer Learning (TL) through Pre-Trained Models
When the dataset is relatively small and a network pre-trained on a larger dataset is available, an efficient approach is to fine-tune the existing network for the new, similar classification task. Using transfer learning (TL) techniques, the training time can be minimized by transferring the coefficients from the base model instead of starting with randomly assigned weights. The depth of the network has a direct relationship with the number of training parameters. Therefore, deep networks require a considerably long time and a large amount of data for training. TL can alleviate this issue by providing prior knowledge (weights) that was obtained from a similar problem [174]; therefore, fine-tuning can be carried out with a lower computational cost and fewer data samples. Figure 7 illustrates four typical TL strategies that are based on the target data size and its similarity to the source domain. For a CNN model with convolutional layers in series, followed by fully connected (FC) layers, a common practice is to fine-tune the last FC layers or replace them with new layers. In that case, the convolutional layers are frozen and their weights are not updated during training.
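The practice of freezing the feature-extraction layers and training only a new output layer can be sketched without any DL framework. In the toy example below, `PRETRAINED_WEIGHTS`, the four samples, and the logistic-regression head are all assumptions made for illustration; a real application would load an actual pre-trained CNN instead.

```python
import math
from typing import List, Tuple

# Hypothetical "pre-trained" layer weights (3 features from 2 inputs); kept frozen.
PRETRAINED_WEIGHTS = [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]]


def frozen_features(x: List[float], weights: List[List[float]]) -> List[float]:
    """Frozen layer: fixed linear map followed by ReLU. Its weights are never updated."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in weights]


def train_head(features: List[List[float]], labels: List[int],
               epochs: int = 200, lr: float = 0.5) -> Tuple[List[float], float]:
    """Train only the new output layer (logistic regression) by gradient descent."""
    head = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        for f, y in zip(features, labels):
            z = bias + sum(h * fi for h, fi in zip(head, f))
            p = 1.0 / (1.0 + math.exp(-z))
            grad = p - y  # derivative of the cross-entropy loss w.r.t. z
            head = [h - lr * grad * fi for h, fi in zip(head, f)]
            bias -= lr * grad
    return head, bias


def predict(x: List[float], head: List[float], bias: float) -> int:
    f = frozen_features(x, PRETRAINED_WEIGHTS)
    return int(bias + sum(h * fi for h, fi in zip(head, f)) > 0.0)


# Toy target task: label is 1 when the first input exceeds the second.
samples = [[0.9, 0.1], [0.8, 0.3], [0.2, 0.7], [0.1, 0.9]]
labels = [1, 1, 0, 0]

weights_before = [row[:] for row in PRETRAINED_WEIGHTS]
feats = [frozen_features(x, PRETRAINED_WEIGHTS) for x in samples]
head, bias = train_head(feats, labels)
predictions = [predict(x, head, bias) for x in samples]
```

Only four parameters (three head weights and one bias) are learned here, which mirrors why fine-tuning needs fewer samples and less computation than training the whole network.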
The application of TL is an emerging area in SHM, and novel studies are being carried out in classification problems in vision-based SHM [142]. Pre-trained CNN models can even be used for new problems with completely different output classes. In pavement crack detection, TL has been proven to be an efficient approach for improving the accuracy of the classification problem [175]. For example, Gopalakrishnan et al. [176] were able to perform crack detection on Hot-Mix Asphalt, as well as Portland Cement Concrete, through the TL technique. They used the VGG16 network that was trained on ImageNet data. Dorafshan et al. [96] compared the performance of a fully trained AlexNet model with the same AlexNet, but in transfer learning and no-training modes for concrete crack detection tasks. Perez et al. [177] applied a pre-trained VGG-16 for localization of building deteriorations that stem from dampness such as stain, peeling, and crazing. Özgenel and Sorguç [178] conducted a comprehensive study with an emphasis on the dataset size, number of training epochs, number of convolution layers, as well as the trainability and transferability features of each CNN-based pre-trained model. Wu et al. [179] designed an efficient DCNN that was developed using transfer learning (of VGG16 and ResNet18) and Taylor expansion-based network pruning. The network pruning technique refers to removing the least important neurons and filters of a network. They showed that the proposed approach reduces memory demands and inference time. This technique can be applied to decrease the need for a huge amount of training data without losing performance in damage detection. Table 1 provides further examples of TL applications using the popular pre-trained DNNs for SHM problems.
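The idea behind Taylor expansion-based pruning can be sketched with a commonly used first-order criterion, in which the importance of a filter is approximated by the absolute mean of the product of its activation and the gradient of the loss with respect to that activation. This is an assumption about the general technique and is not claimed to be the exact formulation of Wu et al.; the activation and gradient values below are hypothetical.

```python
from typing import List


def taylor_importance(activations: List[List[float]],
                      gradients: List[List[float]]) -> List[float]:
    """First-order Taylor score per filter: |mean(activation * gradient)|."""
    scores = []
    for acts, grads in zip(activations, gradients):
        products = [a * g for a, g in zip(acts, grads)]
        scores.append(abs(sum(products) / len(products)))
    return scores


def prune_least_important(scores: List[float], n_prune: int) -> List[int]:
    """Return the indices of the filters kept after removing the n_prune lowest scores."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i])
    removed = set(ranked[:n_prune])
    return [i for i in range(len(scores)) if i not in removed]


# Hypothetical activations and loss gradients for four filters over two samples.
activations = [[1.0, 2.0], [0.1, 0.1], [2.0, 2.0], [0.5, 0.5]]
gradients = [[0.5, 0.5], [0.1, -0.1], [-1.0, -1.0], [0.2, 0.2]]

scores = taylor_importance(activations, gradients)
kept_filters = prune_least_important(scores, n_prune=2)
```

Filters whose removal is estimated to change the loss the least are dropped first, which is how pruning lowers memory demand and inference time while trying to preserve accuracy.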
The most popular pre-trained DNNs that have been frequently used for SHM problems are summarized in the following:
AlexNet: AlexNet [18], one of the earlier DL models, was developed to classify objects in images, and it won the ImageNet [180] classification competition in 2012. It has five convolutional layers, some of which are followed by max-pooling layers, three fully-connected layers, and a 1000-way softmax output layer (25 layers in total). When considering the concrete surface cracks as objects, AlexNet can be fine-tuned for crack detection purposes through transfer learning [96,122]. AlexNet can be loaded into MATLAB or Python using the dedicated toolboxes.
VGG: VGG [17] is a deeper network than AlexNet and comes in six configurations, namely A, A-LRN, B, C, VGG16, and VGG19. VGG16 and VGG19 are the popular versions, with 16 and 19 layers and 138 and 144 million parameters, respectively.
Inception: the kernel size is related to the distribution of salient information. The large and small kernel sizes are suitable for global and local distribution of information, respectively. Using different filter sizes in parallel can resolve the problem of choosing suitable sizes. Different versions of inception modules, so-called V1, V2, V3, and V4, have evolved iteratively [181].
ResNet: ResNet50 [182] is a deep network that implements residual learning. It was introduced by Microsoft Research. Although it provides significantly high accuracy, it requires considerable processing time because of the significant depth of the network. ResNet50 has 50 main layers and 177 in total, and ResNet101 has 101 main layers with 347 layers in total [19].
GoogleNet: GoogleNet was proposed by Szegedy et al. [19]. It has 22 main layers (144 in total) [182] but 12 times fewer parameters than AlexNet. Using weighted Gabor filters [183] with various sizes in the inception sparse architecture allows for a deeper and wider network without increasing the computational budget.
ZFNet: ZFNet is a fine-tuned version of AlexNet and was the winner of ILSVRC 2013 in image classification. The architectures of AlexNet and ZFNet are similar; the differences lie mainly in the filter size and stride, which resulted in reduced error rates [184].
CrackNet: the first version of CrackNet (CrackNet I) was developed for crack detection on three-dimensional (3D) asphalt surfaces with the explicit objective of pixel-perfect accuracy [27,137]. This CNN does not use max-pooling layers and it achieved 90.13% precision and 87.63% recall. The second version (CrackNet II) enhanced the CrackNet I in terms of robustness against noise and increasing performance speed by removing the handcrafted feature generation, adding learnable parameters, and increasing the depth of the network. This CNN achieved 90.20% precision and 89.06% recall, which are better than the original version [185].
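The parameter counts quoted for VGG16 and VGG19 in the list above can be checked with simple arithmetic. The sketch below assumes the commonly cited VGG layout, which is not spelled out in the text: 3 × 3 convolutions with biases, a 224 × 224 RGB input reduced to 7 × 7 × 512 by five pooling stages, and three FC layers of 4096, 4096, and 1000 units.

```python
from typing import List, Tuple


def conv_params(c_in: int, c_out: int, k: int = 3) -> int:
    """Weights plus biases of a k x k convolution."""
    return k * k * c_in * c_out + c_out


def fc_params(n_in: int, n_out: int) -> int:
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


def vgg_params(conv_blocks: List[Tuple[int, int]], num_classes: int = 1000) -> int:
    """conv_blocks lists (number of conv layers, output channels) for each block."""
    total = 0
    c_in = 3  # RGB input
    for n_layers, c_out in conv_blocks:
        for _ in range(n_layers):
            total += conv_params(c_in, c_out)
            c_in = c_out
    # Assumed 224 x 224 input: five 2 x 2 poolings leave a 7 x 7 x 512 feature map.
    total += fc_params(7 * 7 * c_in, 4096)
    total += fc_params(4096, 4096)
    total += fc_params(4096, num_classes)
    return total


VGG16_BLOCKS = [(2, 64), (2, 128), (3, 256), (3, 512), (3, 512)]
VGG19_BLOCKS = [(2, 64), (2, 128), (4, 256), (4, 512), (4, 512)]

vgg16_total = vgg_params(VGG16_BLOCKS)
vgg19_total = vgg_params(VGG19_BLOCKS)
```

Under these assumptions the totals are about 138.4 and 143.7 million, consistent with the rounded figures above, and most of the parameters sit in the first FC layer rather than in the convolutional layers.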
Sensors 2020, 20, x FOR PEER REVIEW
Pre-trained model | Applications | References
 | Post-earthquake assessment | [32,34,102,141,144,145,176,178,186–194]
Inception (Inception-V2, V3, V4) [19,181,195] | Crack detection; damage detection of historic masonry buildings | [33,169,176,196,197]
ResNet (ResNet-20, 50, 101, 152) [185] | Crack detection; bridge component extraction; structural inspection | [33,139,144,171,178,190,191,194,198–200]
AlexNet [18] | Crack detection; comprehensive maintenance and inspection; sUAS-assisted structural inspections; post-earthquake assessment | [73,96,122,124,143,178,193,196,201–203]
GoogleNet [19] | |

Databases for DL-Based SHM
Data are an essential part of all data-driven techniques in SHM. During recent years, the importance of reliable data for SHM applications has become obvious along with the emerging DL-based algorithms that can process and interpret massive amounts of data [15]. To date, no open-access and standard dataset for DL-based SHM has been published; however, individuals have created datasets generated under laboratory-controlled environments for limited applications. For example, Maguire et al. [206] created a database, called SDNET2018, which contains 56,000 images of bridge decks, walls, and pavements and is suitable for DL-based applications. The dataset by Hoskere et al. [139] consists of images from different types of structures, such as buildings, bridges, dams, and pavements, as well as laboratory test images. For small datasets, augmentation techniques can be applied, including geometry transformation, color conversion, and adding noise or blurring, which account for the camera position and image quality as well as the light and weather conditions in the real world [122]. More sophisticated techniques, such as a generative adversarial network (GAN), have been used for data augmentation for structural images [207]. Tables 2 and 3 summarize the vibration-based and vision-based datasets that have been recently used in DL-based SHM. The goal of the studies, as well as the nature of the datasets, size, and other characteristics, are also included.
Table 2. Examples of vibration-based datasets for deep learning (DL)-based SHM.

Reference(s) | Goal | Dataset
Zhang et al. [208] | Vibration-based structural state identification | 8595, 14,465, and 4800 raw acceleration data (9 Ch. × 10,000) for each of the bridges
Pathirage et al. [28] | Damage identification by making a deep mapping between the modal characteristics and structural damage | 20,000 data samples containing the first three frequencies and mode shapes obtained by Eigen analysis of finite element model
Avci et al. [52] | Wireless vibration-based bolt loosening detection | 330 signals each containing 245,760 samples of velocity
Pathirage [63] | Vibration-based damage detection and finding the stiffness reduction of elements | Modal information of 10,300 damage cases that include the first seven frequencies (7 arrays) and the regarding mode shapes at 14 beam-column joints (98 arrays)
Tang et al. [58] | Data anomaly detection and classification | 10,014 time and frequency response of a long-span cable-stayed bridge stacked in two channels with the resolution of 100 × 100
Wang and Cha [55] | Vibration-based loosened bolt localization | 6800 frequency domain 50 × 50 matrices calculated by Fast Fourier Transformation (FFT) of acceleration signals of a lab-scale bridge
Yu et al. [209] | Damage identification and localization of buildings controlled with smart devices | 1900 group of 5 × 2832 matrices of power spectral density
Lin and Nie [54] | Vibration-based feature extraction for damage detection | 459 set of vertical acceleration signals collected from nine nodes in 1024 × 9 matrices
Bao et al. [30] | Vision-based anomaly detection and classification in a long-span cable-stayed bridge | 333,792 of acceleration signals plotted in 100 × 100 one channel images
Abdeljaber et al. [53] | Bolt loosening localization on a lab-scale steel grandstand simulator | 749 × 12 vectors of acceleration signals with 128 × 1 dimension
Table 3. Examples of vision-based datasets for DL-based SHM.

Reference(s) | Goal | Dataset
Gulgec et al. [210] | Robust damage detection and localization of steel connections | 30,000 damaged and 30,000 healthy strain distribution matrices in 28 × 56 dimension
[171,196] | Spalling detection for historic masonry structures |
[189] | Crack detection of gusset plate welding in steel bridges | 12,896 images with 64 × 64 pixels of cracks and the same number of non-cracks
Liu and Zhang [192] | Image-driven low cycle fatigue-induced damage identification for post-hazard inspection | 8259 images with 224 × 224 pixels extracted from a video taken during the experimental test
Ni et al. [198] | Concrete thin crack identification and width measurement | 65,319 crack and 64,681 non-crack 224 × 224 RGB images for GoogleNet and 60,000 images for ResNet
Hoskere et al. [140] | Rapid and autonomous post-earthquake inspections including identification of damage presence and damage type | A set of 665 images with 288 × 288 pixels containing post-earthquake damage scenarios such as concrete cracks, spalling and exposed rebar
Kumar et al. [215] | Automated CCTV inspection of sewer pipelines | 12,000 images of cracks, root intrusions, deposits, and in pipelines with the dimension of 256 × 256
Narazaki et al. [144,190,199] | Structural bridge components recognition | 39,081 images of a concrete girder bridge with a size of 240 × 320
Karaaslan et al. [186] | Mixed reality inspection including detection and segmentation of cracks and spalls | 51,300 concrete crack, road damage and bridge inspection images with the size of 300 × 300
Oliveira et al. [216] | Damage classification in an aluminum plate | 720 frames of 128 × 128 greyscale representation of electromechanical impedance
Hoskere [139] | Damage localization and classification for post-earthquake structural assessment | 1695 images of concrete spalling, exposed rebar, steel corrosion, concrete cracks, fatigue cracks and asphalt cracks with a resolution of 288 × 288
Narazaki et al. [191] | Pixel
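The basic augmentation operations mentioned earlier in this section for small datasets (geometry transformation and adding noise) can be sketched as follows. The 2 × 2 toy image, the noise level, and the fixed seed are assumptions for illustration; practical pipelines would apply such operations to full-size images.

```python
import random
from typing import List

Image = List[List[float]]  # grayscale image with intensities in [0, 1]


def flip_horizontal(image: Image) -> Image:
    """Mirror the image left to right."""
    return [row[::-1] for row in image]


def rotate_90(image: Image) -> Image:
    """Rotate the image clockwise by 90 degrees."""
    return [list(col) for col in zip(*image[::-1])]


def add_noise(image: Image, sigma: float, rng: random.Random) -> Image:
    """Add zero-mean Gaussian noise and clip the result to [0, 1]."""
    return [[min(1.0, max(0.0, p + rng.gauss(0.0, sigma))) for p in row] for row in image]


def augment(image: Image, sigma: float = 0.05, seed: int = 0) -> List[Image]:
    """Return three augmented variants of a single image."""
    rng = random.Random(seed)
    return [flip_horizontal(image), rotate_90(image), add_noise(image, sigma, rng)]


original = [[0.1, 0.2], [0.3, 0.4]]
flipped, rotated, noisy = augment(original)
```

Each labeled image yields several training samples that imitate changes in camera position and image quality, which is why augmentation is attractive when field data are scarce.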

Software Frameworks for DL Applications
Various frameworks have been developed by research centers to perform deep learning tasks. Table 4 summarizes the most popular ones (Pouyanfar et al. [220]), which continue to evolve.
TensorFlow: TensorFlow [221] is a powerful platform developed by Google and dedicated to deep learning applications. It can deploy computation to multiple CPUs or GPUs through the same Application Programming Interface (API), which has increased its popularity among DL researchers [222]. No framework is superior to all the others; however, some researchers compared performance on a single GPU and showed that, despite its flexibility, TensorFlow might not outperform Theano, Torch, and Neon [105,223]. On the other hand, it is faster when Long Short-Term Memory (LSTM) units are used as the core of the model [222]. It is worth noting that TensorFlow can also run models on mobile platforms.
Torch/PyTorch: Torch is a simple, open-source, and extensible DL framework that is dedicated to building fast and efficient GPU-based ML algorithms [222]. According to Bahrampour et al. [105], its GPU-based training outperforms CPU-based training. PyTorch, which is developed by Facebook, is considered to be the primary software tool for deep learning after TensorFlow.
Keras: one of Python's most popular and high-level libraries in DL that is capable of running on top of TensorFlow [224]. Keras is extremely user-friendly and, due to the modularity and extensibility features, it is attractive for both novice and experienced researchers in DL.
Caffe: developed at the University of California, Berkeley, Caffe is one of the fastest and easiest platforms for training CNNs for computer-vision tasks, with the capability of processing 60 million input images per day [225].
Theano: originally developed as a CPU/GPU compiler in Python rather than as a DL framework, Theano is a Python library for fast numerical computation that is particularly suited to non-standard DL models [105]. For relatively shallow CNN and LSTM models, Theano might outperform Torch and TensorFlow [105]. Theano has not been actively developed since 2017.

Summary and Prospects
The advantages of machine learning (ML) in solving structural health monitoring (SHM) problems are evident, and the number of deep learning (DL) applications has grown exponentially in recent years. However, every proposed method has its limitations, and some of the major challenges are summarized as follows:

1. Despite remarkable advances since the introduction of DL-based SHM, current techniques in the literature cannot be considered fully automated, and human perception is not easy to replicate through vibration-based or vision-based DL algorithms [226]. Such capabilities should be addressed in future studies, including the significance of damage with respect to the type of structural components, materials, locations, and other environmental conditions.
2. The number of available image databases of structural systems and other infrastructure components is very limited for SHM purposes, which lowers the performance of the available trained models when new conditions arise, such as changes in texture, joints, lighting, environment, and pollution. In addition, environment-related issues cannot be perfectly simulated via generalized numerical models; therefore, larger datasets can only be formed by acquiring data from the real world.
3. The nature of damage, as well as its significance, may differ from one structural component to another when considering the global structural context. Therefore, comprehensive hierarchical approaches should be devised in which structural component recognition is included as an essential first step before using image data. Such an approach has the potential to bring context to the assessment, evaluation, and interpretation of damage as part of SHM applications.
4. Laboratory conditions are idealized in the available studies, and further research is needed for in-situ DL-based SHM. For example, the presence of wind and varying light may disrupt UAV-based vision measurements. A clear example is the dynamic vibrations and deflections of a long suspension bridge under traffic and wind loads. In addition, damage occurs gradually with small changes [145,146].
Along with the rise of DL-based methods, the importance of data is more evident than ever before. Vision and vibration data from host structures both need to be automatically processed and stored efficiently, which requires robust, real-time signal and image processing models that are capable of identifying anomalies in the data. Response data compression and response reconstruction [227] for long-term monitoring purposes remain challenges that require further development to avoid information loss. The near-future directions of DL-based technologies can be grouped into the following categories when considering the limitations and advantages of the recently proposed methods.

DL Applications in Vision-Based and Vibration-Based SHM
DL-based damage detection techniques are capable of learning implicit relations between features at a lower computational cost. They achieve significantly higher accuracy than conventional ML methods, which rely on hand-crafted features. Along with the advances in computer vision techniques, vision-based SHM is expected to play a pivotal role in the next generation of SHM. In addition, it has been shown that vibration data can be arranged as grid-like images to train deep CNN models [61], which expands DL applications to real-world problems. Furthermore, DL models can be used to predict or minimize the response of complex structures without the need for complex finite element models.
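The idea of presenting vibration data to a CNN as a grid-like image can be illustrated with a minimal sketch. The function name `signal_to_grid`, the 8 × 8 grid size, and the synthetic sinusoidal record are assumptions made for illustration only; the example does not reproduce the preprocessing used in [61], where other arrangements (e.g., time-frequency maps) may equally apply.

```python
import math


def signal_to_grid(signal, rows, cols):
    """Arrange the first rows*cols samples of a 1-D signal as a 2-D grid scaled to [0, 1]."""
    n = rows * cols
    if len(signal) < n:
        raise ValueError("signal is shorter than rows * cols")
    window = signal[:n]
    lo, hi = min(window), max(window)
    span = (hi - lo) or 1.0  # avoid division by zero for a constant signal
    scaled = [(s - lo) / span for s in window]
    # Row-major layout: consecutive samples fill one row before moving to the next.
    return [scaled[r * cols:(r + 1) * cols] for r in range(rows)]


# Hypothetical acceleration record: a 2 Hz sinusoid sampled at 64 Hz for one second.
fs = 64
signal = [math.sin(2.0 * math.pi * 2.0 * k / fs) for k in range(fs)]
image = signal_to_grid(signal, rows=8, cols=8)
```

Each row of `image` holds eight consecutive samples, so the grid can be passed to a 2-D convolutional layer in the same way as a single-channel greyscale image.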

DL Applications in SHM Using Self-Powered Sensors
During the last decade, wireless sensor networks (WSNs) have emerged as a leading alternative to traditional wired networks. In addition, the development of self-powered sensors that harvest energy from the sensed vibrations has recently been at the center of attention. Such technology has transformed data acquisition and promoted energy-efficient data interpretation [228]. Despite the significant advances in data-driven approaches, the majority of ML-based SHM methods are combinations of hand-crafted formulas and empirical models, which still depend on the knowledge of the inspector as well as the capability of computers and algorithms [22]. As the predominant methods in SHM, ML-based approaches require a large amount of data from the host structure. Combining wireless networks of self-powered sensors with deep learning technologies will increase efficiency and minimize the cost of maintenance through continuous monitoring. DL-based methods are reliable and efficient end-to-end training models with higher prediction accuracy, which are able to learn nonlinear interrelations without the need for manual feature extraction.

DL-Based SHM as Part of IoT and Smart Cities
Emerging wireless data transmission and cloud-based computation have created a new paradigm known as the Internet of Things (IoT) [22]. Nowadays, it is practically feasible to mount low-cost wireless sensors in large numbers on infrastructure to efficiently monitor structural health [229]. With the advantages of powerful cloud-based computing and DL platforms, future studies would integrate DL and IoT for SHM to extract information from the large amount of data that is constantly received from networks of sensors. Such studies would further expand the boundaries of SHM and IoT at a large scale through remote sensing and monitoring, as well as by learning from previous events to make decisions in the future. Smart and sustainable cities have been at the center of attention since the introduction of the IoT concept. Interpreting a large amount of distributed data requires efficient DL approaches that ultimately preserve the integrity of the whole system automatically and at minimum cost. Future studies would introduce new protocols for data acquisition, transmission, storage, and DL-based interpretation for SHM as part of IoT.

DL Applications in SHM through Transfer Learning/Synthetic Simulation Data
It is obvious that data play a key role in deep learning, and, as explained in previous sections, data collection is not always possible. One option is using transfer learning to shortcut the learning process by reusing knowledge from earlier studies. An emerging technique for increasing the training data size is generating synthetic data with finite element analysis packages (for vibration-based data) or game engines (for vision-based data). Such data can be used for pre-training a network before fine-tuning with real-world data.
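The pre-train and fine-tune workflow can be sketched in a few lines. For brevity, the sketch substitutes a logistic-regression classifier for a deep network, and both datasets are randomly generated stand-ins: `synthetic` plays the role of simulation-generated data and `real` plays the role of scarce field data. All function names, feature definitions, sample sizes, and learning rates are hypothetical choices for illustration, not values taken from the reviewed studies.

```python
import math
import random


def sigmoid(z):
    z = max(-60.0, min(60.0, z))  # clamp so math.exp cannot overflow
    return 1.0 / (1.0 + math.exp(-z))


def mean_loss(weights, data):
    """Mean binary cross-entropy of a logistic model on (features, label) pairs."""
    eps = 1e-12
    total = 0.0
    for x, y in data:
        p = sigmoid(sum(w * xi for w, xi in zip(weights, x)))
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(data)


def train(weights, data, lr, epochs):
    """Full-batch gradient descent; returns an updated copy of the weights."""
    w = list(weights)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, y in data:
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x))) - y
            for j, xj in enumerate(x):
                grad[j] += err * xj
        w = [wi - lr * g / len(data) for wi, g in zip(w, grad)]
    return w


def make_samples(n, shift, rng):
    """Hypothetical two-feature samples; label 1 (damaged) if a noisy feature sum is positive."""
    samples = []
    for _ in range(n):
        f1, f2 = rng.gauss(0, 1), rng.gauss(0, 1)
        label = 1 if f1 + f2 + shift + rng.gauss(0, 0.3) > 0 else 0
        samples.append(([f1, f2, 1.0], label))  # trailing 1.0 is the bias input
    return samples


rng = random.Random(0)
synthetic = make_samples(500, shift=0.0, rng=rng)  # stand-in for simulated data
real = make_samples(20, shift=0.5, rng=rng)        # stand-in for scarce field data

# Step 1: pre-train from scratch on the large synthetic set.
pretrained = train([0.0, 0.0, 0.0], synthetic, lr=0.5, epochs=200)
# Step 2: fine-tune the pre-trained weights on the small real set with a smaller step.
fine_tuned = train(pretrained, real, lr=0.1, epochs=50)
```

The key design choice is that fine-tuning starts from `pretrained` rather than from zeros, so the small real dataset only has to correct the mismatch between the simulated and measured conditions. With a deep network, the same pattern typically freezes early layers and updates only the later ones.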

DL-Based SHM Using Portable Smartphones, UAVs and GNSS
In the current era, cell phones have become an integral part of our lives. These small programmable devices have a camera, memory, and good computing and network capabilities, which enable offline and cloud-based real-time assessments. Smartphones are generally equipped with different sensors, namely a magnetometer, gyroscope, accelerometer, and GPS. These capabilities make them a potential device for structural health monitoring, particularly when they are paired with UAVs to obtain and process data using DL-based algorithms. Moreover, Global Navigation Satellite System (GNSS) technology is capable of acquiring massive amounts of accurate structural vibration response data. These data can lead to noticeable gains in the operational efficiency of SHM. As this modern technology and DL are linked together, the need for sensors will decrease, which can reduce the cost of SHM and improve its reliability [230].

DL-Based Seismic Vibration Control for Smart Structures
Time delay and sensor failure are among the major issues that directly impact the performance of vibration control systems during an earthquake [231]. By implementing recurrent neural networks (RNNs), such as long short-term memory (LSTM) networks, it may be possible to predict the response of structures and thereby address delay problems. Furthermore, signal failures or the presence of damage can be effectively detected, which leads to the design of stable and robust controllers [232]. Model-free deep reinforcement learning techniques for seismic vibration control are novel, and they can incorporate deep learning models for training controllers by learning from experiments, which is suitable for active and semi-active control problems [233]. With the recently developed open-source Python packages, such as OpenSeesPy [234] and OpenAI Gym [235], it is possible to build, train, and validate a deep reinforcement learning algorithm for nonlinear structures under seismic excitations.
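To make the reinforcement learning setting concrete, the following sketch frames a linear single-degree-of-freedom structure as an environment with a reset/step interface that loosely mirrors the convention popularized by OpenAI Gym. It imports neither Gym nor OpenSeesPy; the class name `SDOFControlEnv`, the sinusoidal ground acceleration, the reward definition, and all parameter values are assumptions for illustration. The "policy" shown is a hand-written velocity feedback law, not a trained agent.

```python
import math


class SDOFControlEnv:
    """Single-degree-of-freedom structure with a control force and a Gym-like interface."""

    def __init__(self, mass=1.0, freq_hz=1.0, damping_ratio=0.02, dt=0.01, duration=10.0):
        self.m = mass
        self.omega = 2.0 * math.pi * freq_hz
        self.k = mass * self.omega ** 2
        self.c = 2.0 * damping_ratio * mass * self.omega
        self.dt = dt
        self.n_steps = int(round(duration / dt))
        self.reset()

    def reset(self):
        self.t_index = 0
        self.x = 0.0  # displacement relative to the ground
        self.v = 0.0  # relative velocity
        return (self.x, self.v)

    def ground_accel(self, t):
        # Hypothetical excitation: unit-amplitude sinusoid at the natural frequency.
        return math.sin(self.omega * t)

    def step(self, force):
        t = self.t_index * self.dt
        # Equation of motion: m*a + c*v + k*x = -m*a_g + u
        a = (-self.c * self.v - self.k * self.x + force) / self.m - self.ground_accel(t)
        self.v += a * self.dt       # semi-implicit Euler: update velocity first,
        self.x += self.v * self.dt  # then displacement with the new velocity
        self.t_index += 1
        reward = -self.x ** 2       # penalize displacement
        done = self.t_index >= self.n_steps
        return (self.x, self.v), reward, done, {}


def run_episode(env, policy):
    """Run one episode and return the peak absolute displacement."""
    obs = env.reset()
    peak, done = 0.0, False
    while not done:
        obs, reward, done, _ = env.step(policy(obs))
        peak = max(peak, abs(obs[0]))
    return peak


uncontrolled_peak = run_episode(SDOFControlEnv(), lambda obs: 0.0)
damped_peak = run_episode(SDOFControlEnv(), lambda obs: -2.0 * obs[1])
```

In a deep reinforcement learning study, the lambda policies would be replaced by a neural network trained to maximize the cumulative reward, and the linear oscillator could be replaced by a nonlinear structural model.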