Deep Learning-Based Applications for Safety Management in the AEC Industry: A Review

Safety is an essential topic in the architecture, engineering and construction (AEC) industry. However, traditional methods for structural health monitoring (SHM) and jobsite safety management (JSM) are both inefficient and costly. In the past decade, scholars have developed a wide range of deep learning (DL) applications for automated structure inspection and on-site safety monitoring, such as the identification of structural defects, deterioration patterns, unsafe workforce behaviors and latent risk factors. Although numerous studies have examined the effectiveness of the DL methodology, to date there has been no comprehensive, systematic, evidence-based review of the individual articles that investigate the effectiveness of DL in SHM and JSM, nor has this body of evidence been examined with regard to its methodological problems. Therefore, the objective of this paper is to present the state of the art of current research and to determine the relevant gaps, challenges and future work. Methodologically, CiteSpace was employed to summarize the research trends, advancements and frontiers of DL applications from 2010 to 2020. Next, an application-focused literature review was conducted, which led to a summary of research gaps, recommendations and future research directions. Overall, this review provides insight into SHM and JSM and aims to help researchers formulate more types of effective DL applications that have not yet been sufficiently addressed.


Introduction
The architecture, engineering and construction (AEC) sector is a significant driver of economic activity around the world [1]. Structure- and workplace-related safety accidents have the potential to be life-threatening [2]. Unfortunately, such safety issues are often among the most overlooked in the sector. In the United States, around 40% of bridges are over 50 years old, and more than 9% of them are rated as structurally deficient, with an estimated total bridge rehabilitation cost of around $123 billion [3]. In addition to the need to design more robust structures under various loads [4,5], efficient structural monitoring is also important for aging infrastructure. Accurate structural health assessments are the basis for decision-making on infrastructure maintenance, repair and rehabilitation. Typically, structural health monitoring (SHM) involves different approaches, such as conducting regular visual inspections or relying on structural monitoring sensors [6]. Visual inspections require experienced inspectors to carry inspection instruments to the structure surface and conduct the inspection, a process that can be labor-intensive, time-consuming and sometimes risky. Sensor-based monitoring can identify defects both on the structure surface and in its interior, and it is more reliable when the sensors are functional [7,8]. As time goes by, however, the accuracy may be compromised due to changing environments or [...]
[...] OR unsafe* OR fatigue* OR concrete* OR "computer vision*" OR "Natural Language Processing" OR NLP OR integration*). The time span of the search was from 2010 to 2020. The relevance of papers was ensured by reviewing each article's title and abstract and excluding the irrelevant ones. Eventually, 527 papers remained.

Literature Analysis
In this study, CiteSpace [30] was used to provide a comprehensive understanding of the research hotspots and development trends for DL-based SHM and JSM. CiteSpace is a dynamic visual analysis tool which can draw diverse knowledge maps via clusters, network connectivity diagrams, nodes and so forth [31]. The uniqueness of this analysis is that emerging trends can be determined based on indicators derived by CiteSpace without domain experts' intervention or prior knowledge of the topic. Additionally, CiteSpace can expand a data set by collecting the most-cited references. This makes the data set more robust than defining the researched field with a list of predefined keywords. Figure 1 shows a keyword co-occurrence network which consists of 367 keyword nodes and 1787 links generated from the literature database. The font size of each keyword is proportional to its co-occurrence frequency. Table 1 summarizes each keyword's occurrence. The trending research themes are shown in Figure 2, as reflected in the keyword bursts.
Co-citation analysis is an effective method for identifying the relationship of citations by annotating their citation and co-citation footmarks [32]. The literature co-citation network consisted of 492 nodes and 2075 links (Figure 3). Figure 4 identifies the citations with the strongest bursts. The references were sorted by starting time, and this order indicated the development of the research trend. Figure 3 shows the co-citation network clusters based on the log-likelihood ratio (LLR) test. The order number of clusters reflects the number of relevant papers published in an area. As can be seen from Figure 3, the major research results are mainly focused on cluster #0 body posture, cluster #1 DL-based roadway crack classification and cluster #2 infrastructure construction site. Therefore, the articles included in these clusters were analyzed primarily in order to identify emerging trends. Besides that, the latent semantic indexing (LSI) test and the mutual information (MI) test were also used for cluster labeling. The top-ranked clusters derived from these three algorithms are summarized in Table 2. Table 3 lists the journals that were frequently cited by the acquired literature.
Notably, Automation in Construction, Journal of Computing in Civil Engineering, Lecture Notes in Computer Science, Advanced Engineering Informatics and Computer-Aided Civil and Infrastructure Engineering were the most-cited journals in the ML, DL, SHM and JSM domains. Based on the clusters and keywords derived from CiteSpace, this study classified the collected papers into four categories using a systematic and manual process, namely vision-based damage detection, vibration-based damage detection, workers' unsafe behavior detection and the analysis of construction safety documents. In the next section, the state-of-the-art ML- and DL-based technologies and applications are discussed in the corresponding categories.
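To make the notion of a keyword co-occurrence network concrete, the following minimal sketch counts keyword frequencies (nodes) and keyword-pair co-occurrences (links) from per-paper keyword lists. It is not CiteSpace's implementation; the keyword lists and the function name `cooccurrence` are hypothetical and chosen only for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical keyword lists, one per paper (not taken from the reviewed database).
papers = [
    ["deep learning", "crack detection", "computer vision"],
    ["deep learning", "computer vision", "safety"],
    ["crack detection", "deep learning"],
]

def cooccurrence(keyword_lists):
    """Count keyword frequencies (nodes) and keyword-pair co-occurrences (links)."""
    nodes = Counter()
    links = Counter()
    for keywords in keyword_lists:
        unique = sorted(set(keywords))          # ignore repeated keywords within a paper
        nodes.update(unique)
        links.update(combinations(unique, 2))   # every unordered pair in the same paper
    return nodes, links

nodes, links = cooccurrence(papers)
for pair, count in links.most_common():
    print(pair, count)
```

In such a network, the node count is the number of distinct keywords and the link count is the number of distinct keyword pairs that appear together in at least one paper.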

Review of DL and ML Safety Applications
In recent years, researchers have used computer vision-based methods for the visual inspection of surface defects, and these methods have shown considerable merit [33][34][35]. They are primarily based on image processing techniques (IPTs), such as histogram transformation, texture recognition and edge detection [1,36]. However, these methods are vulnerable to changes in lighting conditions and to image distortion.
To enhance the performance of IPT-based approaches for defect detection, researchers have integrated ML algorithms [37][38][39]. Technically, ML algorithms can efficiently classify different damage features extracted by IPTs. ML-based methods mostly focus on identifying typical structural defects such as cracks [40][41][42][43][44][45][46][47], rusting [48,49], spalling [50,51] and loose bolts [52]. Nevertheless, these methods require defect features to be clearly defined and extracted using proper classifiers. Overall, these methods lack efficiency, feasibility and accuracy. Rapidly developing DL techniques are expected to solve the problems mentioned above. The CNN, as an end-to-end model, can improve the efficiency of defect detection and localization significantly because it learns the defect features automatically from the labeled defects in the training samples. Normally, the process of using a CNN to find defects in images is as follows: a fixed-size sliding window scans the image and separates it into small patches, and a well-trained CNN then detects the defects on each patch separately. Because the scales and shapes of defects vary, it is difficult in practice to find a window size that fits all of them. To overcome this drawback, a region-based CNN (R-CNN) [53] was proposed to replace the sliding window method. The R-CNN is a two-stage detector. First, it employs a selective search approach [54,55] to generate region proposals. Then, the defect features are extracted from the regions for classification and highlighted by bounding boxes. However, the proposed regions may overlap, which increases the computational burden [56].
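The fixed-size sliding window step described above can be sketched as follows. This is a minimal illustration that only produces the patches; the classifier that would be applied to each patch is omitted, and the function name, window size and stride are assumptions made for the example.

```python
def sliding_window_patches(image, window, stride):
    """Yield (top, left, patch) for every full window that fits inside the image.

    `image` is a list of equal-length rows; `window` and `stride` are in pixels.
    """
    height, width = len(image), len(image[0])
    for top in range(0, height - window + 1, stride):
        for left in range(0, width - window + 1, stride):
            patch = [row[left:left + window] for row in image[top:top + window]]
            yield top, left, patch

# Toy 4x6 "image" whose pixel values simply encode their position.
image = [[r * 10 + c for c in range(6)] for r in range(4)]
patches = list(sliding_window_patches(image, window=2, stride=2))
print(len(patches))  # number of patches a classifier would have to process
```

The sketch also shows why the window size matters: a defect larger than `window` is never seen whole in any single patch, which is the limitation that motivated region-based detectors.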
To increase the efficiency of object detection, Fast R-CNN [57] and Faster R-CNN [58] were proposed successively. Researchers have employed these two-stage detectors to detect surface damage of structures. For instance, Cha et al. [56] used a Faster R-CNN to identify five classes of structural defects in both steel and concrete structures. A Faster R-CNN architecture modified by Li et al. [59] could detect small concrete defects even against a complex image background. Although two-stage detectors can provide high accuracy, the generation of region proposals slows detection. Therefore, two-stage detectors are difficult to use for real-time detection.
Due to the above limitations, single-stage detectors (e.g., the single shot multibox detector (SSD) [60] and you only look once (YOLO) [61]) were proposed, which combine object classification and localization into a single convolutional network. The removal of the proposal generation step is the main feature of single-stage detectors. Besides that, they can predict multiple bounding boxes simultaneously. Hence, single-stage detectors can be significantly faster than two-stage detectors. For example, YOLOv3 was used by Zhang et al. [5] to detect cracks, pop-outs, delamination and exposed rebar on bridges with relatively high accuracy.
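Because both region proposals and predicted bounding boxes can overlap, detection pipelines commonly quantify the overlap of two boxes with the intersection over union (IoU). The reviewed text does not name this measure; the sketch below is a generic illustration, assuming boxes given as (x_min, y_min, x_max, y_max) in pixel coordinates.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    # Width and height of the intersection are clamped at zero for disjoint boxes.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))
```

An IoU close to 1 indicates two boxes covering nearly the same region, whereas an IoU of 0 indicates disjoint boxes.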
However, using bounding boxes to identify damage may not be suitable, as rectangular boxes cannot accurately delineate the boundaries of crack textures. In this case, pixel-level semantic segmentation, which assigns a classification label to each image pixel instead of generating bounding boxes, allows for better defect localization and analysis [2]. A fully convolutional network (FCN) [19] is a pixel-by-pixel network for semantic segmentation. It predicts the class (crack or non-crack) of each pixel by using a deconvolution layer to upsample the last convolutional layer. Compared with the methods above, the pixel-level method can completely separate the damage feature from the background by highlighting each pixel of the defect. At present, the FCN method has already been used to identify concrete cracks [55,62] and other structural defects [63][64][65]. For example, Zhang et al. [66] proposed an FCN-based, pixel-level asphalt pavement crack detector. Similarly, Bang et al. [67] employed an FCN as part of their network for road crack detection based on digital images produced by black box cameras. The results indicated that an FCN is well suited to segmenting defect information. Besides segmenting the crack, it can also provide valuable damage information, such as crack widths, for damage assessment. When annotating defects in an image sample, a small portion of non-cracked surface is likely to be labeled as defective. Such random errors are inevitable and not easy to measure. To reduce the impact of this uncertainty, Tong et al. [68] combined an FCN and a Gaussian-conditional random field for pavement defect detection. The developed framework can address the uncertainty of defect labeling.
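As a minimal sketch of what pixel-level output makes possible, the example below thresholds a per-pixel crack probability map into a binary mask and counts crack pixels per image row. The probability values are invented, the 0.5 threshold is an assumption, and the per-row count is only a crude width proxy for a roughly vertical crack; it is not the measurement procedure of any cited study.

```python
def to_mask(prob_map, threshold=0.5):
    """Label each pixel as crack (1) or non-crack (0) from its crack probability."""
    return [[1 if p >= threshold else 0 for p in row] for row in prob_map]

def row_widths(mask):
    """Number of crack pixels in each image row (a crude width proxy in pixels)."""
    return [sum(row) for row in mask]

# Hypothetical network output: probability that each pixel belongs to a crack.
prob_map = [
    [0.1, 0.8, 0.2, 0.0],
    [0.0, 0.9, 0.7, 0.1],
    [0.2, 0.6, 0.9, 0.6],
]
mask = to_mask(prob_map)
widths = row_widths(mask)
print(mask)
print(widths)
```

Converting pixel counts into physical widths would additionally require the image scale (e.g., millimetres per pixel), which is outside the scope of this sketch.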

Vibration-Based Damage Detection
Although a pixel-level representation of structural defects is beneficial for SHM, it can only identify the damage level on the structure surface and cannot infer the performance of internal structural components, which may already have deteriorated [69]. Vibration data are the main source of data used in SHM. Technically, any structural damage changes the stiffness and mass distributions of the structure and leads to differences in the natural frequencies and mode shapes [70,71]. Hence, vibration-based SHM methods have the potential to detect internal structural damage by analyzing the abnormal data acquired from sensors (e.g., accelerometers). Previous research on vibration-based SHM mainly focused on setting up a physical model to imitate the status of a real structure. Basically, this model-driven method employs mathematical modeling and physical laws to represent the monitored structure [72]. Hence, the level and location of the damage can be determined accurately by analyzing and solving the model. Nevertheless, it is challenging to build and solve such a model as the complexity of the monitored structure increases and environmental factors are considered. The most critical drawback of the model-driven approach is that modeling usually requires expertise and is time-consuming. Consequently, model-driven methods have been progressively replaced by data-driven methods [73,74]. Unlike the model-driven method, the data-driven method can identify anomalous data directly from the measurements collected by the sensors. Most data-driven methods are based on the ML paradigm [75]. As appropriate sensor layouts can improve the efficiency and accuracy of data collection and transmission, ML algorithms, such as a genetic algorithm (GA) [76], have also been used to determine optimal sensor layouts.
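The link between stiffness loss and natural frequency can be illustrated with the simplest possible case, an undamped single-degree-of-freedom system with natural frequency f = sqrt(k/m) / (2π). The stiffness and mass values below are illustrative assumptions, and real structures have many modes, so this is a sketch of the principle rather than a monitoring method.

```python
import math

def natural_frequency_hz(stiffness, mass):
    """Undamped natural frequency of a single-degree-of-freedom system, in Hz."""
    return math.sqrt(stiffness / mass) / (2.0 * math.pi)

k_intact = 4.0e6  # N/m, illustrative value
mass = 1.0e3      # kg, illustrative value

f_intact = natural_frequency_hz(k_intact, mass)
f_damaged = natural_frequency_hz(0.9 * k_intact, mass)  # assume 10% stiffness loss
relative_drop = 1.0 - f_damaged / f_intact

print(f"intact: {f_intact:.3f} Hz, damaged: {f_damaged:.3f} Hz, drop: {relative_drop:.2%}")
```

Because the frequency scales with the square root of the stiffness, a 10% stiffness loss lowers the frequency by only about 5%, which helps explain why small-scale damage is hard to separate from other sources of frequency variation.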
However, when applying vibration-based SHM methods in practice, the natural frequencies of the structure are easily affected by environmental factors (e.g., temperature) [77,78]. For example, if a structure has some small-scale damage, the resulting changes in its natural frequencies may be masked by those environmental variables. Some scholars have analyzed the evolution of structural properties and their relationship with changes in environmental parameters [79,80]. Among them, the monitoring of the Z24 bridge is emblematic for addressing this issue [81]. Although significant efforts have been made in this regard, such analysis requires comprehensive expertise and is time-consuming [82].
DL methods can potentially remedy this issue, as they can fully utilize the sensor data by automatically extracting data features. Therefore, even a subtle anomaly can be perceived without expert knowledge. Recently, Ni et al. [83] presented a 1D CNN-based algorithm along with autoencoder data compression to identify anomalous data in a long-span suspension bridge. The results showed that the developed algorithm could achieve a precision of 97.53%. Similarly, Avci et al. [84] presented a 1D CNN-based method to detect structural damage on a steel frame by using wireless sensor data. Azimi and Pekcan [85] introduced a novel CNN-based approach which could detect and locate damage in a large-scale structure. Lin et al. [86] trained a deep CNN by feeding it unrefined sensor data and applied it to identify the damage of a simulated beam structure. The results revealed that the trained CNN could detect structural damage with high accuracy, even if the test data were noisy. Zhang et al. [2] proposed a CNN to detect structural stiffness and mass changes. The developed CNN achieved good results on both the in-lab structure and the in-service bridge. Gulgec et al. [87] used a DL-based method for steel fatigue assessment. Compared to the traditional method, which is costly and laborious, their proposed method could achieve a high detection accuracy at a low cost.
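The core operation of a 1D CNN layer, sliding a short kernel along a sensor signal and applying a nonlinearity, can be sketched in a few lines. The signal values and the hand-picked difference kernel are assumptions for illustration; in the cited studies the kernel weights are learned from data and many layers are stacked, which this sketch does not attempt to reproduce.

```python
def conv1d(signal, kernel):
    """Valid-mode 1D cross-correlation, the operation used in CNN convolution layers."""
    n = len(signal) - len(kernel) + 1
    return [sum(signal[i + j] * kernel[j] for j in range(len(kernel))) for i in range(n)]

def relu(values):
    """Rectified linear unit applied element-wise."""
    return [max(0.0, v) for v in values]

# Hypothetical acceleration samples with one abrupt jump between indices 3 and 4.
signal = [0.0, 0.1, 0.0, 2.0, -2.0, 0.1, 0.0]
kernel = [1.0, -1.0]  # hand-picked difference filter; a trained network would learn its weights

features = relu(conv1d(signal, kernel))
peak_index = max(range(len(features)), key=features.__getitem__)
print(features, peak_index)
```

The feature map responds most strongly where the signal changes abruptly, which is the kind of local pattern a trained 1D CNN can learn to associate with anomalous sensor data.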
Vibration-based methods integrated with DL have been shown to perform well in damage detection, and their costs are relatively low. However, these methods still require much effort in data labeling, which is tedious. One novel application uses an unsupervised learning method, which does not require data labeling. Guo et al. [88] presented an unsupervised learning method called a sparse coding algorithm for SHM. Sparse coding was employed to learn feature representations from unstructured vibration data to improve the performance of damage detection. Different degrees of damage were simulated, and the results showed that the proposed method outperformed other ML methods, such as logistic regression and decision trees, with a precision of 98%. However, this application was only validated on a simulated model. As its performance on real-world structures has yet to be verified, this may become a notable topic for future research.

Workers' Unsafe Behavior Detection
On-site surveillance videos and images have been used for automated unsafe behavior detection in recent years. Objects such as hard hats, safety vests and workers can be detected by using certain computer vision techniques (e.g., a background subtraction algorithm [89], the histograms of oriented gradients (HOG) method [90] and the scale-invariant feature transform (SIFT) [91]). Nowadays, such methods, which require much work for feature extraction, are gradually being replaced by DL.
Mneymneh et al. [92] developed a CNN-based framework that could determine whether workers (even when moving) were wearing hard hats on the construction site. Xie et al. [93] modified a CNN to detect workers' hard hats, and the model produced excellent results in terms of mean average precision (mAP). Similarly, the Faster R-CNN [94] and SSD methods [95] were also employed to detect hard hats.
Fang et al. [96] modified the Faster R-CNN to identify whether workers were wearing harnesses properly. Kolar et al. [97] employed a VGG-16 model to detect whether safety guardrails were installed correctly to prevent workers from falling from heights. Siddula et al. [98] integrated a Gaussian mixture model (GMM) with CNNs to detect roofers on roof construction sites. This research can help to alleviate fall risks on roof sites.
In the area of unsafe activity identification, Ding et al. [19] coupled a long short-term memory (LSTM) model [99] with CNNs to identify whether workers climbed a ladder unsafely. Kim et al. [100] developed an image-based risk prevention system to display the safety-related information of each construction worker on a wearable augmented reality (AR) device. Luo et al. [101] utilized a Faster R-CNN to determine workers' activities based on construction site images. Considering that temporal information is necessary for detecting dynamic activities, Luo et al. [102] later improved the framework for video-based worker activity recognition by making use of temporal information. Some researchers have also investigated construction vehicle detection using DL. Kim et al. [103] employed a region-based FCN to detect construction vehicles. Fang et al. [104] used a Faster R-CNN to identify the spatial relationship between workers and excavators on construction sites. This study provided a basic prototype of a site safety alert system, which can prevent workers from being hit by heavy equipment. Son et al. [105] used a Faster R-CNN to identify on-site workers in diverse poses against complex backgrounds.

Analysis of Construction Safety Documents
On-site managers can benefit from analyzing safety reports, as they can acquire details about the events and circumstances that result in safety accidents. Hence, corresponding actions can be taken to prevent similar accidents in the future [106]. With the development of safety management, scholars have proposed various text classification methods to classify and analyze accident causation [107]. More recently, researchers have tried to apply various ML algorithms to construction-related accident analysis. Tixier et al. [108] developed a predictive model which used random forest (RF) [109] and stochastic gradient tree boosting (SGTB) methods [110]. This model can forecast the different types of injuries recorded in on-site injury reports. Later, Tixier et al. [111] presented an improved NLP method to extract outcome variables and injury precursors from unstructured injury-related text. This method can reduce the labor cost of text analysis. Chokor et al. [112] adopted an unsupervised learning approach for injury report classification based on K-means clustering. The text mining approach presented outstanding results regarding recall and precision. Goh et al. [113] applied six ML approaches to classify near-miss accident reports. The results illustrated that ML algorithms performed better than traditional text classification methods. However, ML requires hand-crafted feature engineering, which limits the generalization of the classifier and may also affect its adaptability.
DL-based methods perform excellently in text mining and classification when compared with traditional ML [114]. Word2Vec [115] is a popular DL method that adopts word embedding technology. When used for text classification, Word2Vec can be trained on public semantic resources such as Wikipedia [116]. It avoids the manual feature engineering process and augments knowledge concurrently. However, there is relatively limited research focusing on this topic. Future research can be conducted by combining Word2Vec and computer vision in JSM. For example, a CNN can be used to detect unsafe events in construction site images taken by surveillance cameras. Then, image captioning technology built on Word2Vec embeddings can be used to describe the event in text form. In this way, an automated safety report generation system based on construction images can be developed.
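Word embeddings represent each word as a numeric vector so that related words lie close together, and closeness is typically measured with cosine similarity. The sketch below uses hand-written three-dimensional vectors that are purely hypothetical; real Word2Vec embeddings are learned from a large corpus and usually have hundreds of dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional embeddings, hand-written for illustration only.
embeddings = {
    "fall":     [0.9, 0.1, 0.0],
    "slip":     [0.8, 0.2, 0.1],
    "concrete": [0.0, 0.2, 0.9],
}

sim_related = cosine_similarity(embeddings["fall"], embeddings["slip"])
sim_unrelated = cosine_similarity(embeddings["fall"], embeddings["concrete"])
print(round(sim_related, 3), round(sim_unrelated, 3))
```

A classifier operating on such vectors can treat reports that use different but related words as similar, which is what removes the need for hand-crafted text features.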

Limitations of the Data Set and Weak Generalization of DL Models
To achieve a high detection or classification accuracy, a massive number of labeled images is normally needed to train a deep neural network well. However, there are few civil engineering-related data sets available to the public, and it is not easy to obtain such data. To this end, it is common to pre-train a CNN model with an extensive common object dataset (e.g., ImageNet, MNIST and CIFAR-10). Normally, most of the popular pre-trained CNN models (e.g., VGG-16, ResNet50, Inceptionv3 and YOLOv3) are available online. Such models with pre-trained weights have already learned to extract the basic features of images. Researchers can download these models as the backbone of their DL structures for conducting specific tasks. This process is called transfer learning [20,117], and many DL-based applications have employed it in practice. However, a pre-trained model still has to learn the nuanced features of the specific targets. Given the lack of available datasets for training, researchers in construction are normally required to create their own by labeling images manually. This process is time-consuming, tedious and expensive. Since the specific database has to be created, it tends to be limited and, thus, can hinder the generalization ability of the model. To this end, data augmentation methods can be employed to enlarge the dataset synthetically. In general, those methods include spatial flipping, cropping, bending and other deformations of the images [118]. Typically, the semantic meaning of the labels is not changed by these augmentation approaches. Therefore, the newly generated images are highly correlated with the original ones. As a result, the generalization of the DL model may not be improved effectively.
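The flipping and cropping operations mentioned above can be sketched on a toy image stored as a list of rows. The function names and the 3x3 image are assumptions for illustration; practical pipelines apply the same idea to full-size images and transform the labels accordingly.

```python
def flip_horizontal(image):
    """Mirror the image left to right."""
    return [row[::-1] for row in image]

def flip_vertical(image):
    """Mirror the image top to bottom."""
    return [row[:] for row in image[::-1]]

def crop(image, top, left, height, width):
    """Cut out a height x width region whose top-left corner is (top, left)."""
    return [row[left:left + width] for row in image[top:top + height]]

image = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
augmented = [flip_horizontal(image), flip_vertical(image), crop(image, 0, 1, 2, 2)]
for sample in augmented:
    print(sample)
```

Every augmented sample is built from the same pixel values as the original, which illustrates why such images remain highly correlated with their source.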
In order to tackle these issues, a generative adversarial network (GAN) may be a possible option. GANs were first introduced by Goodfellow et al. [119] and can be used as data generation models. A GAN consists of a generator and a discriminator. The principle of a GAN is to let the discriminator evaluate the new data produced by the generator. As the iterative training progresses, the performance improves gradually. This kind of unsupervised learning has been proven to be powerful in many applications. In recent years, scholars have started using GANs to produce synthetic images. Compared to data augmentation methods, the images generated by GANs are more diverse and distinct [5]. In addition, researchers have used GANs to generate face images. For example, Jin et al. [120] applied a GAN to anime character creation. The GAN demonstrated stable performance in the innovative face creation of anime characters. Zhang et al. [5] designed a GAN which achieved good outcomes on natural scene images. Besides those examples, Kitchen and Andy [121] leveraged a GAN in the health sector. They applied a deep GAN to produce synthetic prostate lesion images and utilized these images to enlarge their training dataset. However, relatively few GAN-related studies have been conducted in the AEC safety industry. Implementing a GAN in practice might be a possible solution for tackling the construction-related data set challenge.
Similarly, there are few public vibration data sets available for researchers to train their classifiers. Hence, future research should focus on finding alternative methods to train those classifiers when sensor data are insufficient. One recommended method for building up training data sets is to combine real-life data from an undamaged structure with simulated data for the damage scenarios collected from an in-lab model. Besides that, as new algorithms continue to emerge, unsupervised learning that can run on a small amount of labeled and unlabeled sensor data may become a hot topic in the future.

Directions of Future Studies
With the advances of powerful cloud-based computing and the DL platform, the importance of data is more obvious than ever. However, there are still many types of data from SHM and JSM that have not been fully utilized by ML and DL technologies. Considering the development of emerging technologies and underutilized data types, the recent development direction of DL-based SHM and JSM research can be divided into the following categories.

DL-Based Seismic Vibration Control for SHM
Seismic vibration control is, in conjunction with SHM, another very important field for safety management [122][123][124]. In order to improve the stability of a structure when an earthquake occurs, active controllers and passive base-isolated systems are generally integrated [125]. Some scholars have started to use ML algorithms to evaluate the failure probability of base-isolated systems [122] and design active control schemes [123]. However, sensor malfunction and time delay are among the major concerns that impact a vibration control system's performance under seismic vibration [59,126,127]. The LSTM model has been proven to be effective in capturing long-term dynamical dependencies over sequential frames [99]. Therefore, it is possible to use LSTM networks to predict the response of structures and thereby overcome signal breakdown and time delay issues, which could support the development of robust and efficient control systems.

Visual-Based On-Site Fatigue Monitoring
In the field of JSM, fatigue monitoring is also important. Given the increasing complexity of on-site dynamics, heavy equipment operators, as well as their operations and judgment, play an essential role in ensuring safety and productivity on the construction site. Nevertheless, as the work becomes more intense, their cognitive awareness may be impaired, which may lead to a safety hazard. It is noteworthy that Tam and Fung [128] revealed that approximately 60.5% of crane operators would continue to work even when feeling fatigued after long working hours under a tight construction schedule, and about 52.6% of the crane operators experienced a lack of breaks, as they found it inconvenient to frequently move in or out of the narrow workspace. Hence, automated fatigue monitoring and warnings can provide timely support for this cohort. As the fatigue condition of heavy equipment operators and workers is relatively easy to detect by analyzing their facial expressions, the real-time monitoring of their faces through cameras can be an effective, feasible and non-invasive method to identify drowsiness and avoid accidents [129]. Currently, CNNs have been used in the real-time fatigue monitoring of on-road driving [130][131][132]. Some researchers have employed CNNs to extract video-level features and then fed them into LSTM models to analyze the temporal information for fatigue identification [133,134]. However, the potential of a CNN-LSTM model for the fatigue monitoring of heavy equipment operators is far from being comprehensively explored. Therefore, these DL-based applications have the potential to be leveraged to detect operator fatigue in the construction field.

Possible Integrations with Other Digital Technologies
Future research can integrate DL with other novel technologies, such as AR and virtual reality (VR) [135,136], building information modeling (BIM) [137][138][139][140], 5G technology and the Internet of things (IoT) [64], into real-time SHM and JSM to extract information from the massive amounts of data continuously received from wireless sensor networks. Such research can further expand the scope of SHM and JSM through various advanced sensors to provide accurate decision-making assistance. Being capable of interpreting unstructured data in large volumes, DL technology can be employed to support the entire integrated system cost-effectively and intuitively. Future research can propose such a new paradigm in relation to IoT-based sensor data collection, transformation and visualization as part of DL-based SHM and JSM applications. Nowadays, unmanned aerial vehicles (UAVs), mobile phones and AR devices (e.g., HTC VIVE Focus Plus, Microsoft Hololens and Google Glass) have become an integral part of human activities. These compact devices can normally collect digital data and have good computing and networking capabilities. Moreover, most of these devices are programmable and thus have the potential to achieve real-time and cloud-based assessments. Currently, some lightweight CNNs (e.g., MobileNet-SSD and YOLOv3-Lite) have been proposed. These networks do not require high computing power and can maintain relatively stable performance. Therefore, it is possible to conduct on-site safety inspection through these mobile devices. With the advent of the 5G era, low-latency collaboration and intelligent optimization become possible. For example, UAVs can be paired with mobile phones to obtain and process image data while embedding lightweight DL algorithms. In addition, AR technologies have been broadly exploited in construction by superimposing virtual elements onto the construction site [84,141,142].
By integrating these advanced digital devices and technologies with DL, it is possible to establish a real-time hazard alerting system. First, site images or videos are transmitted continuously from UAVs to a mobile phone. Then, the mobile phone, with DL algorithms installed, can be used to detect hazardous activities based on these image data. When a worker enters a hazardous area (e.g., working too close to construction vehicles), the AR device worn by the worker can provide a timely alert by popping up a warning in the worker's field of view. However, current lightweight DL algorithms are only qualified for detecting and localizing objects with apparent features in the images; they may not be applicable to targets with subtle features, such as cracks and minor structural defects. Therefore, an optimized version of the DL structure is still needed in the future.
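The decision step of such an alerting system, checking whether a detected worker is too close to a detected vehicle, can be sketched as follows. The bounding boxes, the pixel threshold and the function names are hypothetical, and the image-plane distance used here is a simplifying assumption; a deployed system would need camera calibration to convert pixels into real-world distances.

```python
import math

def box_center(box):
    """Center point of a box given as (x_min, y_min, x_max, y_max)."""
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2.0, (y_min + y_max) / 2.0)

def too_close(worker_box, vehicle_box, min_distance_px):
    """Return True when the box centers are closer than the given pixel distance."""
    wx, wy = box_center(worker_box)
    vx, vy = box_center(vehicle_box)
    return math.hypot(wx - vx, wy - vy) < min_distance_px

# Hypothetical detector outputs for one video frame.
worker = (100, 200, 140, 300)
excavator = (180, 150, 400, 350)

alert = too_close(worker, excavator, min_distance_px=200)
print("alert" if alert else "clear")
```

In this example the centers are 170 pixels apart, so the 200-pixel threshold triggers an alert; the warning itself would then be pushed to the worker's AR device.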

Conclusions
DL-based applications in the AEC safety industry are becoming more and more widespread [143,144]. The purpose of this paper was to summarize the past decade of research on ML and DL applications in SHM and JSM and to offer possible solutions for current research challenges. First, this study started with a scientometric analysis using CiteSpace to visualize the knowledge map of ML and DL applications in the AEC safety industry. Second, this study reviewed the related state-of-the-art literature and identified the main challenges of current research. Additionally, possible suggestions for future research directions were provided. It is believed that this comprehensive review can inspire researchers to develop more types of practical DL applications in the future.