Current Status and Future Directions of Deep Learning Applications for Safety Management in Construction

The application of deep learning (DL) for solving construction safety issues has achieved remarkable results in recent years that are superior to traditional methods. However, there is limited literature examining the links between DL and safety management and highlighting the contributions of DL studies in practice. Thus, this study aims to synthesize the current status of DL studies on construction safety and outline practical challenges and future opportunities. A total of 66 influential construction safety articles were analyzed from a technical aspect, such as convolutional neural networks, recurrent neural networks, and general neural networks. In the context of safety management, three main research directions were identified: utilizing DL for behaviors, physical conditions, and management issues. Overall, applying DL can resolve important safety challenges with high reliability; therein the CNN-based method and behaviors were the most applied directions with percentages of 75% and 67%, respectively. Based on the review findings, three future opportunities aiming to address the corresponding limitations were proposed: expanding a comprehensive dataset, improving technical restrictions due to occlusions, and identifying individuals who performed unsafe behaviors. This review thus may allow the identification of key areas and future directions where further research efforts need to be made with priority.


Introduction
Construction is a large, dynamic, and complex field offering a large number of job opportunities for millions of people worldwide [1]. In addition, construction sites also contain various risks (e.g., struck-by accidents [2] and fall accidents [3]), and the accident rate continues to rise over time. According to global statistical data, the construction industry's accidental death and injury rates are three and two times higher than those of other industries, respectively [4]. The number of fatal injuries in this industry in the United States increased by 16%, from 781 in 2011 to 908 in 2014 [5], and its injuries and accidents in 2015 were 50% higher than those in any other industry [3]. These percentages reached 40% of the total accidents in Japan, 25% in the United Kingdom, and 50% in Ireland [6]. Although various countries have put effort into construction safety-related laws, regulations, and management systems over the past decades, their safety performance in construction is still unsatisfactory [7]. Thus, it is essential to apply an appropriate method to assist safety management in the construction industry.
To prevent occupational accidents, Sarkar and Maiti (2020) [8] investigated and reported several existing approaches, such as survey-based qualitative analysis, conventional statistical analysis, and data-driven machine-learning-based analysis. By reviewing publications examining the application of machine learning (ML) approaches in accident analysis, they also illustrated that ML outperforms its traditional counterpart, owing to its several potential benefits, including the capability to deal with large dimensional data, flexibility in recreating data generation structures regardless of complexity, and predictive and interpretive potential by extracting relationships/rules among attributes in data [8]. In support of this observation, Xu and Saleh (2021) [9] argued that ML has the potential to provide new insights and opportunities to address critical challenges in safety applications. However, one of the challenges of ML is that ML problems become extremely difficult for high-dimensional data [10]. Compared to traditional ML, deep learning (DL) algorithms can deal with high-dimensional input data, and they become highly efficient in resolving the issue of data sources such as images and videos when equipped with convolutional layers [9]. Moreover, the rapid development of graphics processing units (GPUs) has dramatically improved the computing capacity for processing ML algorithms, leading to an increase in the number of DL applications [11]. Therefore, Xu and Saleh (2021) [9] emphasized that in all applications to date, DL has considerably outperformed shallow ML algorithms. In this context, researchers in the construction industry have made considerable efforts to keep up with the pace of DL applications [12]. The amount of research on DL in construction has grown exponentially over the past few years, and the applications have spread over many construction areas since their inception [13]. 
For example, Akinosho et al. (2020) [12] showed that DL has been applied to prevalent construction challenges, such as structural health monitoring, construction site safety, building occupancy modeling, and energy demand prediction. In the context of construction safety, DL has also proven its potential for safety management. DL can be used to extract information from different types of data, such as images, videos, text, and signals, to reduce construction accidents by detecting on-site damage conditions [14], detecting unsafe behaviors [15], and analyzing construction safety documents [16].
DL is a subset of ML, and can theoretically deal with all categories of ML [9]. For example, different types of DL techniques used in real-time object detection help develop new helmet detection systems with higher accuracy and less training time [17]. A previous study [18] demonstrated that DL can be used to automatically extract unstructured safety data from accident reports. As a result, managers become better positioned to make informed and timely decisions about how to ensure construction safety [18]. With these prominent and widespread applications of DL in construction safety, researchers need to understand which types of data suit which methods (e.g., convolutional neural networks and recurrent neural networks) to achieve high performance. Moreover, given the extremely rapid advancement of DL algorithms, a review of recent literature can play an important role in understanding the research status of DL studies and exploring opportunities to apply DL for further enhancement of construction safety. However, there is limited literature examining the theoretical links between DL and safety management. For example, several review studies, such as [19,20], have mainly focused on construction safety without a detailed review of DL techniques. Hou et al. (2021) [21] reviewed the relevant papers on applications of DL for safety management in the architecture, engineering, and construction (AEC) industry; however, a comprehensive linkage between safety and DL methods (e.g., data types and quantities, DL algorithms and their performance, safety factors) was not fully investigated. Moreover, how the results of DL studies can be applied in safety management practice was not clearly presented and discussed by Hou et al. (2021) [21].
By addressing those issues, researchers and managers in the field of construction safety may better understand which methods have achieved highly accurate results, which types and amounts of data have been used for a given safety task, and which actions managers can take based on the results of DL models to improve safety management. This study aims to fill these gaps by comprehensively reviewing DL studies in the construction safety area. Specifically, this literature review is performed to (1) identify and summarize the current status of recent DL studies in the construction safety area to show how DL has been applied in previous studies; (2) analyze the links between data type and quantity, the DL models applied and newly proposed, and three main research directions of construction safety (i.e., behaviors, physical conditions, and management issues) to understand how to apply DL models in different safety-related tasks; (3) review the contributions of DL results to safety management practice; and (4) outline practical challenges and future opportunities for improving and fully exploiting the contributions of DL to safety. This review may thus allow the identification of key areas and future directions where further research efforts need to be made with priority. The remainder of this paper is organized as follows. The paper first presents the research methodology used in this review (Section 2). An overview of DL algorithms commonly used for construction safety is then presented from a technical aspect (Section 3). Subsequently, this paper summarizes the current status of safety-related papers for an in-depth understanding of DL applications for safety management (Section 4). Along with a comprehensive review, this study discusses the contributions, practical challenges, and future opportunities of applying DL approaches to practice (Sections 5 and 6). Finally, the major findings are summarized to present the significance of this study (Section 7).

Research Methodology
To analyze the current status of DL studies in safety, that is, how well DL methods have been applied to safety management and how distinct DL models address safety issues with specific types of data, this study adopts a content-analysis-based review method. This is a systematic and structured technique "for compressing many words of text into fewer content categories based on explicit coding rules" that is used to identify key research themes in a literature review [22]. Content analysis is a research tool utilized to determine the presence of certain words, themes, or concepts within given qualitative data (i.e., text). Using content analysis, researchers can analyze and quantify the presence, meanings, and relationships of certain words, themes, or concepts [23]. This method has been well recognized and widely used for reviewing and synthesizing literature, and for rationalizing outcomes in the research field of engineering/construction management [22,24-26]. The review process based on this method consists of three phases: literature search, title- and abstract-based literature selection, and full-paper-based literature selection, as described in Figure 1. In the literature search, an exhaustive search was carried out with keywords regarding DL and construction safety to find all articles related to the field of review. The title- and abstract-based literature selection was then conducted to filter papers applying DL to handle safety issues based on reading titles and abstracts. After that, an overall screening was performed in the full-paper-based literature selection phase to identify the articles relevant only to construction safety and DL by reading the full paper. In this way, the most significant DL studies on construction safety were collected and reviewed to provide fit and quality research materials for this study.

Literature Search
The first step of the review was an exhaustive search in Scopus and Google Scholar. Keywords and the Boolean operators AND and OR were used to ensure that all relevant literature from 2014 to 2021 was captured. According to Akinosho et al. (2020) [12], DL became popular with the achievements of CNNs in the 2012 ImageNet Large Scale Visual Recognition Challenge (ILSVRC2012), and its applications in the construction industry have achieved significance since around 2014. Thus, the search period was chosen to follow the DL revolution. The search strings used were "deep learning" OR "computer vision" OR "CNN" OR "RNN" OR "neural networks" AND "construction safety" OR "construction hazard" OR "construction accident" OR "safety management". Initially, 387 documents were identified. To limit the scope of the search results, these documents were further screened by including only journal articles published in English, which left 145 papers. Moreover, we chose articles with the highest level of relevance to the research scope, namely, engineering, computer science, materials science, and management. After this screening, a total of 126 documents, including articles and conference papers, were selected as the literature sample.
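The search string above is written without parentheses, so how the AND and OR operators group depends on the database. The following minimal sketch shows one way to make the intended grouping explicit, assuming that the intent is "any DL keyword AND any safety keyword". The helper names `or_group` and `build_query` are hypothetical, and the exact syntax accepted by Scopus or Google Scholar may differ.

```python
def or_group(terms):
    """Join quoted terms with OR and wrap the group in parentheses."""
    return "(" + " OR ".join(f'"{t}"' for t in terms) + ")"


def build_query(dl_terms, safety_terms):
    """Require at least one DL term AND at least one safety term (assumed intent)."""
    return or_group(dl_terms) + " AND " + or_group(safety_terms)


# Keywords exactly as listed in the text.
DL_TERMS = ["deep learning", "computer vision", "CNN", "RNN", "neural networks"]
SAFETY_TERMS = ["construction safety", "construction hazard",
                "construction accident", "safety management"]

query = build_query(DL_TERMS, SAFETY_TERMS)
print(query)
```

Grouping the two keyword sets explicitly avoids a query in which AND binds only the two neighboring terms ("neural networks" AND "construction safety").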

Title- and Abstract-Based Literature Selection
This stage of document screening was conducted to identify articles relevant to construction safety and DL for further analysis. These documents from the literature search were manually screened by reading and exploring the titles and abstracts to identify and extract relevant articles. Publications that did not include keywords regarding construction safety and deep learning in titles or abstracts were screened out. The total number of documents remaining after this phase was approximately 98.

Full-Paper-Based Literature Selection
This phase aims to remove irrelevant papers by examining the contents of the articles. The remaining documents from the previous phase were screened by reading the full paper to identify articles relevant only to construction safety and DL. For example, articles (e.g., [27]) that only mentioned "deep learning" but did not focus on DL methods were removed. Several articles, such as [28], were also removed because they did not focus on safety in construction, although the term "construction safety" appeared in their abstracts. Similarly, articles that applied DL in manufacturing, structural assessment, and crack and defect detection were removed, as they did not focus on safety issues in the construction industry. After the third screening, a total of 66 papers remained for an in-depth review and analysis.
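The screening funnel described in this section can be summarized with simple arithmetic. The sketch below uses only the document counts reported in the text; the stage labels are shorthand introduced here for illustration, and the count of 98 is reported in the text as approximate.

```python
# Document counts as reported in the text, in screening order.
# Stage labels are illustrative shorthand, not terms from the paper.
STAGES = [
    ("initial search", 387),
    ("English journal articles", 145),
    ("subject-area relevance", 126),
    ("title and abstract screening", 98),   # reported as "approximately 98"
    ("full-paper screening", 66),
]


def retention(stages):
    """Return (stage, count, share of the initial pool) for every stage."""
    initial = stages[0][1]
    return [(name, n, n / initial) for name, n in stages]


for name, n, share in retention(STAGES):
    print(f"{name:30s} {n:4d}  {share:6.1%}")
```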

Results
Following the final paper selection, a total of 66 papers, published in the journals shown in Figure 2, were identified for further analysis. Figure 3 shows the number of publications by year, which illustrates the growth of DL applications in construction safety in recent years. The number of studies using DL increased from 2018 to 2021 and is likely to continue to rise in the coming years. Figures 4-7 present an overview of the reviewed papers. In addition to extracting information related to DL models and safety factors, which is the purpose of this study, we also present the types of data and accidents to provide a comprehensive overview of what types of accidents researchers have attempted to reduce. Overall, these figures show that the CNN-based method and behaviors were the most applied directions, with percentages of 75% and 67%, respectively; images were the most used data in these models (73%); and struck-by and other general accidents were the two types of accidents that DL studies have focused on most, with percentages of 36% and 38%, respectively.
[Figures 2-7 are not reproduced here. Recoverable panel titles: "Percentages of publications by data" and "Percentage of publications by accident types". Recoverable labels: Year; CNN 75%; Behaviors, Physical conditions, Management issues; Videos 12%; Struck-by 36%; General accident.]

Overview of Deep Learning Architectures
DL is a set of ML algorithms that attempt to learn features at multiple levels of abstraction [29]. The levels in these learned models correspond to different levels of concepts, where the same lower-level concepts can support many higher-level concepts [29]. Thus, a DL architecture can be defined as an artificial neural network (ANN) with two or more hidden layers to enhance prediction accuracy [29,30]. Three important reasons for the popularity of DL today are the drastic increase in chip processing abilities (e.g., GPUs), the significant increase in the size of data used for training, and the recent algorithmic advances in ML and signal/information processing studies [29,31]. These advances have enabled DL methods to exploit complex, compositional nonlinear functions and to use both labeled and unlabeled data effectively [29]. Therefore, unlike the architectures of shallow ML, DL networks are capable of processing nonlinear information [32] and support both supervised and unsupervised training [33]. With their outstanding ability to process various types of data, including images, videos, text, speech, and signals, DL networks and techniques have been implemented widely in various fields such as image classification [34], object detection [35], object tracking [36], activity recognition [37], information extraction [38], text classification [39], and speech recognition [40].
According to Khallaf and Khallaf (2021) [13], DL is called "deep" due to the number of layers available in the network model. Generally, the DL architecture is composed of three types of layers: an input layer, hidden layers, and an output layer; the typical architecture of DL is shown in Figure 8. Data are received in an input layer, features are extracted from the datasets via hidden layers depending on the purpose of their application, and the resulting features are passed to the output layer for prediction. In the network, the output of the previous layer is used as the input of the next layer. There are different types of DL architectures [13], and for safety management, the most commonly used types of DL include convolutional neural networks (CNNs), recurrent neural networks (RNNs), and general neural networks (GNNs).
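The layer-by-layer flow described above (input layer, hidden layers, output layer, with each layer's output feeding the next) can be illustrated with a minimal sketch. The toy weights, the ReLU activation, and the function names are assumptions chosen for illustration and do not correspond to any model in the reviewed studies.

```python
def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(W x + b), computed row by row."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def relu(z):
    return max(0.0, z)


def identity(z):
    return z


def forward(x, layers):
    """Feed x through the layers; each layer's output is the next layer's input."""
    for weights, biases, activation in layers:
        x = dense(x, weights, biases, activation)
    return x


# Hypothetical toy network: 2 inputs -> 2 hidden units (ReLU) -> 1 output.
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0], relu),
    ([[1.0, 2.0]], [0.1], identity),
]
prediction = forward([2.0, 1.0], layers)
```

A deeper network is obtained simply by appending more `(weights, biases, activation)` entries to `layers`.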

Convolutional Neural Networks
For DL, the term "deep" is derived from the many hidden layers in the ANN structure [41]. Unfortunately, this structure is sensitive to translation and shift deviation, which may adversely affect classification performance [42]. To eliminate these drawbacks, an extended ANN version, the CNN, was developed, which can ensure spatial translation and shift invariance [43]. The CNN is a supervised DL architecture mainly used for image analysis applications [30,44,45]. Similar to the ANN, the network consists of multiple hidden layers between an input layer and an output layer (Figure 9). However, the hidden layers comprise convolutional, pooling, and fully connected layers. The convolution filter acts as a feature extractor by learning hidden patterns from different input signals [41] and generating relevant feature maps through kernels or filters [30]. The calculation of convolution is defined as

$$O_{x,y} = \sum_{c}^{C_{in}} \sum_{i}^{K} \sum_{j}^{K} w_{i,j} \times I_{x+s \times i,\, y+s \times j} + b_{x,y}$$

where $I_{x+s \times i,\, y+s \times j}$ is the value of the input feature at the point $(x + s \times i,\, y + s \times j)$, $C_{in}$ is the number of input channels, $K$ is the kernel size, $s$ is the stride of the convolutional layer, $w_{i,j}$ is the weight in the kernels, $b_{x,y}$ is the bias, and $O_{x,y}$ is the value of the output feature at the point $(x, y)$. This convolutional layer thus allows the detection of low-level features, such as lines and edges, as well as high-level features, such as shapes and objects [46]. In this process, the convolutional layer can enhance the input data features and reduce noise [32]. The convolutional layer is typically followed by a nonlinear mapping function (e.g., the rectified linear unit (ReLU)) and a pooling layer [47]. An appropriate pooling layer reduces the input dimension without losing essential information [47]. Different types of pooling methods exist, such as global pooling, average pooling, and max pooling [30]. In particular, for extracting features from images, max pooling performs better than average pooling [48].
Max pooling splits the input image into multiple rectangular regions based on the size of the filter, and its output is the maximum value of each region [49]. The output of the max pooling layer can be calculated as

$$N^{out}_{x,y} = \max_{0 \le m,\, n < i} N^{in}_{x+m,\, y+n}$$

where the max pooling layer takes the maximum value from an $i \times i$ region of the input as the output, $N^{in}_{x+m,\, y+n}$ is the value of the input at the point $(x + m,\, y + n)$, and $N^{out}_{x,y}$ is the value of the output at the point $(x, y)$. This process is known as downsampling or subsampling [30]. After these layers, the fully connected layer connects all neurons from the previous layer to every single neuron [32]. Thus, this layer computes a weighted sum of all the previous layer outputs to determine a specific target output [41].
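The convolution and max pooling operations described above can be sketched in plain Python. This is a minimal illustration under stated assumptions rather than the formulation of any cited study: the sketch uses the common convention in which the window for output position (x, y) starts at input position (s·x, s·y), it keeps a separate K × K weight block per input channel, it applies no padding, and it uses a single scalar bias. The function names `conv2d` and `max_pool2d` are hypothetical.

```python
def conv2d(image, kernel, bias=0.0, stride=1):
    """Strided convolution (cross-correlation) without padding.

    image:  list of C_in channels, each an H x W list of lists
    kernel: list of C_in channels, each a K x K list of lists
    """
    k = len(kernel[0])
    h, w = len(image[0]), len(image[0][0])
    out_h = (h - k) // stride + 1
    out_w = (w - k) // stride + 1
    out = [[bias] * out_w for _ in range(out_h)]
    for x in range(out_h):
        for y in range(out_w):
            for c in range(len(image)):      # sum over input channels
                for i in range(k):           # sum over kernel rows
                    for j in range(k):       # sum over kernel columns
                        out[x][y] += (kernel[c][i][j]
                                      * image[c][stride * x + i][stride * y + j])
    return out


def max_pool2d(feature, size):
    """Non-overlapping max pooling: largest value in each size x size region."""
    h, w = len(feature), len(feature[0])
    return [[max(feature[x + m][y + n]
                 for m in range(size) for n in range(size))
             for y in range(0, w - size + 1, size)]
            for x in range(0, h - size + 1, size)]


# Toy single-channel 4 x 4 input and a 2 x 2 kernel that sums the diagonal.
image = [[[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]]
kernel = [[[1, 0], [0, 1]]]
features = conv2d(image, kernel, stride=1)
pooled = max_pool2d(image[0], 2)
```

With a stride of 2, the same call returns a 2 × 2 feature map instead of a 3 × 3 one, which shows how the stride controls the output resolution.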

The variations of CNN methods include region-based CNN (R-CNN), fast R-CNN, faster R-CNN, and you only look once (YOLO). As discussed above, DL methods with convolutional networks are widely used for image processing tasks. Among the various applications of CNNs, object detection frameworks, which combine classification and localization to detect and draw boxes around objects in images, have developed markedly in recent years [50]. According to Koirala et al. (2019) [50], early CNN-based object detection frameworks used a sliding window approach at evenly spaced locations over the image, generating many patches and classifying each patch as containing an object or not. Feeding all of these patches to a CNN for multiscale detection made the object detection framework slow [50]. R-CNN replaced the sliding window method by proposing a set of candidate boxes for the image and then analyzing whether each box contains a target [51]. The complete R-CNN pipeline uses three models: a CNN for feature extraction, a linear SVM classifier for object identification, and a regression model to tighten the bounding boxes [52]. The drawbacks of R-CNN are therefore its multistage training, its disk-space requirement, and its long training time [53]. Fast R-CNN was developed to improve the detection speed of R-CNN [50]. Instead of the three separate models of R-CNN, fast R-CNN [54] employs a single model to extract features from the different regions. However, fast R-CNN still relies on selective search [55]; for example, approximately 2000 region proposals are extracted per image [52], which may increase its running time [52]. In contrast, faster R-CNN uses a convolutional network to generate the proposal boxes and shares this network with the object detection network, which reduces the number of proposals, for example, from approximately 2000 to approximately 300 [56]. However, although faster R-CNN is faster than fast R-CNN, it is still too slow for real-time video streaming [50]. To address this limitation, YOLO was developed as a one-step process that combines detection and classification [57]. YOLO differs from the other systems in that bounding box predictions and class predictions are performed simultaneously [57], making YOLO one of the fastest object detection methods [50].
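To give a sense of why the sliding window approach is slow, the following sketch counts how many patches a detector would have to classify when windows are placed at evenly spaced locations and at several scales. The frame size, window sizes, and stride ratio are hypothetical values chosen for illustration; each patch would require its own CNN forward pass, in contrast to the roughly 2000 or 300 region proposals mentioned above.

```python
def count_windows(height, width, window, stride):
    """Number of window x window patches at evenly spaced locations."""
    if window > height or window > width:
        return 0
    rows = (height - window) // stride + 1
    cols = (width - window) // stride + 1
    return rows * cols


def count_multiscale(height, width, windows, stride_ratio=0.5):
    """Total patches over several window sizes.

    The stride is a fraction of the window size (an assumption of this sketch).
    """
    return sum(count_windows(height, width, w, max(1, int(w * stride_ratio)))
               for w in windows)


# Hypothetical 480 x 640 frame scanned at three window sizes.
patches = count_multiscale(480, 640, windows=[32, 64, 128])
print(patches)
```

Halving the stride roughly quadruples the patch count, which is why dense multiscale scanning quickly becomes impractical for video.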

Recurrent Neural Networks
The neurons of a fully connected network or a CNN are connected across layers but not within the same layer; each layer processes signals independently and then propagates them to the next layer [48]. As a result, this architecture cannot capture relationships between successive inputs [32]. RNNs can be considered another class of DL networks, used on sequential data for supervised and unsupervised learning [29]. An RNN can "remember" past information and utilize the knowledge learned from the past to make its present decision [58]. In RNNs, the output of the previous step is stored and utilized to calculate the current output (Figure 10), which means that the network's input contains both the data from the input layer and the output of the previous hidden layers [32]. The output of the RNN model can be calculated as

$$h_t = f(U x_t + W h_{t-1} + b_h)$$

$$o_t = V h_t + b_o$$

where $h_t$ and $o_t$ are the hidden state and the output at step $t$, $U$ is the weight matrix from the input $x_t$ to the hidden layers, $W$ is the recurrent weight matrix shared across time steps, $V$ represents the hidden-to-output weight matrix, $f$ is a nonlinear activation function, and $b_h$ and $b_o$ are the biases added to the hidden and output layers, respectively. Thus, the RNN is extremely powerful for modeling sequence data (e.g., speech or text) [29].
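The recurrence described above can be sketched as a short forward pass. This is a minimal illustration that assumes the common form h_t = f(U x_t + W h_{t-1} + b_h) and o_t = V h_t + b_o, with tanh as the activation f and a zero initial hidden state; these choices and the function names are assumptions, since the text does not fix them.

```python
import math


def matvec(matrix, vector):
    """Matrix-vector product for list-of-lists matrices."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def rnn_forward(xs, U, W, V, b_h, b_o, f=math.tanh):
    """Run a simple recurrent network over the sequence xs.

    h_t = f(U x_t + W h_{t-1} + b_h),   o_t = V h_t + b_o
    """
    h = [0.0] * len(b_h)                  # initial hidden state (assumed zero)
    outputs = []
    for x in xs:
        pre = [u + w + b
               for u, w, b in zip(matvec(U, x), matvec(W, h), b_h)]
        h = [f(p) for p in pre]           # hidden state carries past information
        outputs.append([o + b for o, b in zip(matvec(V, h), b_o)])
    return outputs, h


# Toy one-dimensional network: the second output still depends on the first input.
outputs, last_h = rnn_forward([[1.0], [0.0]],
                              U=[[1.0]], W=[[1.0]], V=[[2.0]],
                              b_h=[0.0], b_o=[0.5])
```

Because `h` is reused from one step to the next, the output at the second step differs from what the same input would produce in isolation, which is the "memory" the text refers to.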

Recurrent Neural Networks
The neurons of a fully connected network or a CNN are fully connected in different layers but disconnected in the same layer; each layer processes signals independently and then propagates to the next layer [48]. In this regard, this architecture cannot resolve the problem of relationships between input data [32]. RNNs can be considered as another class of DL networks that are used for sequential data for supervised and unsupervised learning [29]. An RNN can "remember" past information and utilize the knowledge learned from the past to make its present decision [58]. In RNNs, the output of the previous step is stored and utilized to calculate the current output ( Figure 10), which means that the network's input contains both the data from the input layer and the output of the previous hidden layers [32]. The output of the RNN model can be calculated as where U is the weights matrix of the input xt to the hidden layers, W is the duplicated recurrent weight matrix, V represent sts the hidden to output weight matrix, f is a nonlinear activation function, and bh and bo are the biases added to the hidden and output layers, respectively. Thus, the RNN is extremely powerful for modeling sequence data (e.g., speech or text) [29]. Despite the promising performance of RNN, vanishing gradient is a significant problem in the conventional RNN because it makes the gradient easily vanish (e.g., the previous information is lost through multiple layers), and the model learning process becomes much more difficult [59]. One solution to solve this problem is to use long short-term memory (LSTM) networks, which can store sequences for a long time, as well as using gated recurrent units (GRUs) [60,61]. The LSTM algorithm combines a memory block with three gates: input, output, and forget gates [41]. 
Despite the promising performance of RNNs, the vanishing gradient is a significant problem in the conventional RNN: gradients shrink as they are propagated back through many steps, so earlier information is lost and the model learning process becomes much more difficult [59]. One solution is to use long short-term memory (LSTM) networks, which can retain information over long sequences, or gated recurrent units (GRUs) [60,61]. The LSTM algorithm combines a memory block with three gates: input, output, and forget gates [41]. The input gate determines what new information is saved and updated in the cell state, the output gate determines what information is utilized based on the cell state, and the forget gate is used to delete the unimportant information from the cell state. Thus, the difference from the conventional RNN is that LSTM can determine what information is useful through the cell, which can avoid the disappearance of the gradient to some extent [48]. The learning capacity of the LSTM cell is also superior to that of a conventional recurrent cell [62]. However, the additional parameters increase the computational burden [62]. To reduce the number of parameters, the GRU combines the input and forget gates of the LSTM model into an update gate and uses a reset gate in place of the LSTM output gate [63]. Thus, the GRU is a simplified variant of LSTM, which achieves a performance comparable to that of LSTM but uses fewer parameters and trains faster [64].
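To make the parameter saving concrete, the sketch below counts the trainable parameters of a single LSTM layer and a single GRU layer. It assumes the textbook formulation in which each gate or candidate block has one input weight matrix, one recurrent weight matrix, and one bias vector; library implementations may count biases differently. The function names and the layer sizes are hypothetical.

```python
def recurrent_params(input_dim, hidden_dim, n_blocks):
    """Parameter count of a gated recurrent cell with n_blocks weight blocks.

    Assumption: each block has an input matrix (hidden x input), a recurrent
    matrix (hidden x hidden), and a single bias vector (hidden).
    """
    per_block = hidden_dim * input_dim + hidden_dim * hidden_dim + hidden_dim
    return n_blocks * per_block

def lstm_params(input_dim, hidden_dim):
    # 4 blocks: input gate, forget gate, output gate, candidate cell state
    return recurrent_params(input_dim, hidden_dim, 4)

def gru_params(input_dim, hidden_dim):
    # 3 blocks: update gate, reset gate, candidate hidden state
    return recurrent_params(input_dim, hidden_dim, 3)

# Hypothetical layer sizes, chosen only for illustration.
n_lstm = lstm_params(64, 128)
n_gru = gru_params(64, 128)
```

Under these assumptions the GRU needs three quarters of the LSTM's parameters for the same input and hidden sizes, which is consistent with the faster training noted above.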

General Neural Networks
In addition to the two common methods of DL (i.e., CNN and RNN), bidirectional encoder representations from transformers (BERT) [39] (Figure 11) and other deep learning models for natural language processing (NLP) (Figure 12) and computer vision (CV) [65] (Figure 13) have also been applied in safety management. Unlike other recent language representation models, BERT is designed to pretrain deep bidirectional representations from unlabeled text by jointly conditioning on both left and right contexts in all layers [66]. BERT's execution for tasks consists of two phases: pretraining for language understanding and fine-tuning for a specific task such as text classification or text summarization [67]. A pretrained language model can be regarded as a black box containing prior knowledge of natural language [68]. The BERT model of Devlin et al. (2018) [66] uses transformer encoders as the substructure for pretraining models for NLP tasks. Specifically, the BERT-based model is pretrained using two unsupervised tasks: (1) the masked language model (MLM) predicts randomly masked tokens in the input to train the bidirectional encoder, and (2) next sentence prediction (NSP) predicts whether a second sentence follows the input sentence in order to learn sentence relationships, which makes the pretrained BERT model more suitable for other NLP applications [69]. BERT can be fine-tuned using a dense layer of neural networks for different classification tasks [68]. The advantages of BERT include its ability to extract contextual information owing to its bidirectional design and its faster training capabilities [67]. With these characteristics, the BERT model demonstrated state-of-the-art performance in many NLP tasks [70], including exceptional results in 11 natural language understanding (NLU) tasks [66].
However, BERT still has drawbacks. For example, BERT-large is made up of 24 transformer encoder layers with a total of 340 million parameters, which makes it computationally expensive [67].
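The masked language model pretraining task can be illustrated with a short sketch of the token-corruption step. It follows the masking scheme described by Devlin et al. [66] (about 15% of tokens are selected; of those, 80% become `[MASK]`, 10% become a random token, and 10% are left unchanged), with the simplification that each token is selected independently. The function name, the toy vocabulary, and the example sentence are hypothetical.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (corrupted_tokens, labels) for masked-language-model training.

    labels[i] holds the original token at selected positions, else None.
    Simplification: each position is selected independently with mask_prob.
    """
    rng = random.Random(seed)
    corrupted, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue                          # position not selected
        labels[i] = tok                       # the model must predict this token
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK               # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return corrupted, labels

# Hypothetical accident-report sentence and vocabulary.
sentence = "worker fell from scaffold without harness".split()
vocab = ["worker", "fell", "from", "scaffold", "without", "harness", "ladder"]
corrupted, labels = mask_tokens(sentence, vocab)
```

During pretraining, the encoder sees `corrupted` and is trained to recover the tokens stored in `labels`, using context on both sides of each selected position.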


Deep Learning Applications for Construction Safety Management
According to Reason's model [71], on-site safety management is the last layer of management for preventing accidents and requires considerable emphasis. In this context, we focus on construction safety aspects based on a safety management system (SMS). An SMS integrates activities and functions to identify accidents and manage risks in the workplace [72]. Construction safety management can be divided into preconstruction and construction phases [73]. In the preconstruction phase, potential safety accidents are normally identified based on the experience of safety officers or project managers and eliminated through safety training and safety planning [74]. During construction, hazards are prevented by monitoring workers and the environment at construction sites [75]. In general, therefore, an SMS approach focuses on three main aspects: behaviors, physical conditions, and management issues [76,77]. Figure 5 shows the percentage of publications based on these safety factors. The common types of behaviors identified on construction sites [77,78] are (1) pose and gesture, (2) action, (3) interaction, (4) activity, and (5) personal protective equipment (PPE) and safety compliance. The factors that influence the physical conditions on construction sites [76] include (1) site condition (SC), (2) work environment (WE), and (3) site layout (SL). Finally, management issues are discussed [79,80] under the following subcategories: (1) safety management plan, (2) accident investigation and analysis, and (3) hazard identification and risk management. The general applications of DL in construction safety are shown in Figure 14.


Behaviors
Unsafe worker behavior is a significant cause of workplace accidents [81]. It has been reported that 88% of accidents are caused by workers' unsafe behavior [82]. According to Fam et al. (2012) [83], unsafe behavior occurs when an employee fails to respect safety rules, standards, instructions, procedures, and specified project criteria. In general, unsafe behaviors are factors related to workers' awareness, unsafe actions, and noncompliance attitudes that cause dangerous consequences (e.g., injury). Because human behaviors vary in their level of abstraction and complexity, Edwards et al. (2016) [78] proposed a five-level classification system for workers' behaviors, which included pose, gesture, action, interaction, and activity. Likewise, Guo et al. (2021) [77] proposed a six-level hierarchical framework of safety behavior that adds a safety compliance factor. Following these studies and the applications of DL in construction safety, the unsafe behaviors causing accidents in construction are categorized as (1) pose and gesture, (2) action, (3) interaction, (4) activity, and (5) personal protective equipment (PPE) and safety compliance. Table 1 summarizes the DL studies on behaviors in the construction industry.

Figure 14. DL applications in construction safety.

Table 1. Construction safety studies about behaviors.

Note: The mAP represents mean average precision, and WMSDs represent work-related musculoskeletal disorders. The accident types in parentheses were judged by the authors' assessment and not specified in the paper.

Pose and Gesture
Posture-related safety risks are a significant concern in construction projects [90]. Pose and gesture are defined as the spatial arrangement of a human body at a single temporal instance, or a temporal pose series or action primitives on a subaction scale [77]. A worker's safety risk level can be assessed from the worker's current posture by calculating its similarity to identified hazardous postures [88]. Several methods can be employed to represent human posture: images, text descriptions, or skeleton data [88]. The goal of human pose estimation is to specify the position of human joints from images or from skeleton data provided by motion-capturing hardware [123]. Text description is a user-friendly way to facilitate human understanding, but it removes the objective and quantitative features of the postures [88]. Based on this, researchers have utilized DL methods for detecting unsafe postures using different types of data (e.g., videos [92], images [86], and signals [91]).
DL has been widely and successfully applied for detecting workers' unsafe postures in different typical states, including standing still, climbing down, standing on a ladder, and bending. In [87], a real-time smart surveillance system based on the YOLOv2 detection approach was developed to detect people and the status of excavators in hazardous areas. The results showed that the developed system could provide immediate feedback concerning unsafe behavior and thus enable appropriate actions to be taken to prevent reoccurrence.
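The idea of scoring a posture by its similarity to known hazardous postures can be sketched as follows. This is only an illustration under stated assumptions: postures are represented as short vectors of joint angles in degrees, similarity is measured with cosine similarity, and a fixed threshold flags a risky posture. The feature layout, the reference postures, the threshold, and the function names are all hypothetical and are not the method of [88].

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def posture_risk(posture, hazardous_postures, threshold=0.95):
    """Return (is_risky, best_label, best_similarity) for one posture."""
    best_label, best_sim = None, -1.0
    for label, reference in hazardous_postures.items():
        sim = cosine_similarity(posture, reference)
        if sim > best_sim:
            best_label, best_sim = label, sim
    return best_sim >= threshold, best_label, best_sim

# Hypothetical joint-angle features: [trunk flexion, shoulder elevation, knee flexion]
hazardous = {
    "deep_bend": [90.0, 30.0, 10.0],
    "overreach": [10.0, 160.0, 20.0],
}
risky, label, sim = posture_risk([88.0, 32.0, 12.0], hazardous)
upright_risky, _, _ = posture_risk([5.0, 5.0, 5.0], hazardous)
```

Cosine similarity on raw angles is a crude measure; a practical system would more likely compare normalized skeleton coordinates or learned embeddings.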

Action
Falls are highly frequent accidents in the construction industry, and occupational injuries and fatalities caused by falls from height pose a severe public problem worldwide [124]. According to the fall-prevention strategies proposed by Huang and Hinze (2003) [3] and Chi et al. (2005) [125], fatal occupational falls on-site were closely associated with serious on-site risk factors, including poor work practices and bodily actions. Thus, it is essential to improve unsafe action recognition to ensure construction safety. In the study by Guo et al. (2021) [77], action is defined as a series of gestures that form a contextual event; more specifically, an action in construction is a single activity executed by a subject, such as ladder climbing, walking, or running. The pattern and pace of actions vary from individual to individual as well as from time to time [126]. Consequently, different action categories can have similar postures, and one action category can have a variety of postures [127]. According to Gong et al. (2011) [127], an action is classified as either an action at a single moment, as depicted in an image, or an action over a time period, as shown in a sequence of images. Based on this, studies have used DL to recognize actions on construction sites from images and videos. Ding et al. (2018) [15] developed a hybrid DL model that integrates a CNN and LSTM to automatically recognize workers' unsafe actions from videos. By extracting visual features from videos with a CNN model and learning the feature sequences with LSTM models, the model exceeded the accuracy of state-of-the-art descriptor-based methods for detecting safe/unsafe actions conducted by workers on-site. Likewise, an automatic computer-vision approach that utilizes an R-CNN-based model was proposed by Fang et al. (2019) [93] to detect individuals traversing structural supports from photographs during construction. By automatically identifying the presence of people and recognizing the relationship between people and concrete/steel supports, the developed model could accurately detect people traversing concrete/steel supports during construction; thus, the proposed approach could be used by site managers to automatically identify unsafe behavior and provide feedback to individual workers about their likelihood of falling from heights.

Interaction
In several cases, whether an action is safe depends on the status of other objects [77]. In support of this concept, Zhang et al. (2020) [99] showed that constant interaction and random movement increase the risk of worker injury. One type of accident caused by inappropriate interactions between entities on construction sites is the struck-by accident, which led to 804 fatalities from 2011 to 2015 [37]. Therefore, to recognize unsafe behavior, researchers not only recognize the involved objects (e.g., workers, crane, and load) in terms of their identity, location, and movement direction but, more importantly, also attempt to understand the interactions between these objects. Interaction is a pairwise or reciprocal action committed by two or more entities. In the context of construction safety, entities can be humans (workers, managers, etc.) or objects (excavators, dump trucks, etc.). Each entity has a single action that reflects its state relative to the other entity. For example, earthmoving activities involve interactions between dump trucks and excavators.
Recognizing ongoing activities and related working groups is crucial because it allows the comprehension of jobsite context, which in turn enables the interpretation of worker intentions, the prediction of their movement, and the detection of inappropriate interactions that are counterproductive and may cause harmful consequences [37]. Applications of DL to the interaction assessment of on-site entities cover three interaction types: human-to-human, human-to-object, and object-to-object interaction. Human-to-human interaction is an action committed by two people or groups of people (workers and managers), human-to-object interaction is an action committed directly by people to one or more objects, and object-to-object interaction is an action committed by two objects or groups of objects. The interaction between construction workers and equipment is a crucial cause of on-site safety hazards [96]. Therefore, the risks posed by this interaction have received significant attention in current DL studies. For example, various studies have identified and evaluated the spatial relationship between construction workers and equipment from images to prevent struck-by hazards, based on DL algorithms such as faster R-CNN [97,99,102] and YOLO [2]. Moreover, by extracting information from images, studies have proposed CNN-based models not only for automatically predicting potential safety hazards by detecting construction workers and equipment and identifying hazardous zones [96], but also for tracking and analyzing spatial-temporal interactions on construction sites for real-time detection [98].
Likewise, to demonstrate that the sequence-to-sequence method could better predict trajectories and avoid error accumulation compared to conventional predictions, Cai [37] proposed an LSTM method using construction videos that integrates both personal movement and workplace contextual information (e.g., movements of neighboring entities, workgroup information, and potential destination information). Studies have also focused on monitoring equipment interactions and crew relationships using DL methods. For example, based on historical motion data from camera videos and activity attributes, Luo et al. (2021) [95] proposed a GRU-based RNN framework for predicting future poses of construction excavators and trucks and for monitoring one-to-one or group interactions of construction machines during earthmoving tasks. Similarly, Xiong et al. (2019) [100] developed an automated hazard identification system (AHIS) based on the CNN method to detect visual relationships between objects, including site components or crews. The results demonstrated that the proposed visual relationship detection method had the potential to enrich the semantic representation of operation facts, which could lead to better automation in construction hazard detection.
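Once a detector such as faster R-CNN or YOLO has produced bounding boxes, the spatial relationship between workers and equipment can be evaluated with simple geometry. The sketch below assumes axis-aligned boxes `(x1, y1, x2, y2)` in image pixels, defines a hazardous zone as the equipment box enlarged by a fixed margin, and flags a worker whose foot point (bottom center of the box) lies inside that zone. The margin, the identifiers, and the function names are hypothetical; the cited studies use their own zone definitions.

```python
def expand(box, margin):
    """Grow a box (x1, y1, x2, y2) by margin pixels on every side."""
    x1, y1, x2, y2 = box
    return (x1 - margin, y1 - margin, x2 + margin, y2 + margin)

def foot_point(box):
    """Bottom-center point of a worker box, a rough proxy for ground position."""
    x1, _, x2, y2 = box
    return ((x1 + x2) / 2.0, y2)

def in_box(point, box):
    x, y = point
    x1, y1, x2, y2 = box
    return x1 <= x <= x2 and y1 <= y <= y2

def struck_by_alerts(workers, equipment, margin=50):
    """Return (worker_id, equipment_id) pairs where a worker is in a hazard zone."""
    alerts = []
    for w_id, w_box in workers.items():
        for e_id, e_box in equipment.items():
            if in_box(foot_point(w_box), expand(e_box, margin)):
                alerts.append((w_id, e_id))
    return alerts

# Hypothetical detections for a single frame.
workers = {"w1": (100, 200, 140, 300), "w2": (600, 200, 640, 300)}
equipment = {"excavator": (160, 150, 400, 320)}
alerts = struck_by_alerts(workers, equipment)
```

Pixel margins ignore perspective, so a deployed system would normally project detections to ground coordinates before measuring distances.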

Activity
Information on basic actions may not be sufficient for safety analysis and schedule assessment; therefore, in recent years, researchers have attempted to recognize actions with a higher level of abstraction and complexity [77]. Guo et al. (2021) [77] showed that various on-site human activities are characterized by a complex spatial and temporal composition of objects and actions. According to the definition proposed by Turaga et al. (2008) [128], an activity is a complex series of actions performed by several people who could interact with each other in a constrained manner over a longer duration than an action. Therefore, activity in construction safety can be defined as a group of actions and/or interactions that are executed to carry out high-level work such as roofing, formwork, and scaffolding. Each action and interaction can be considered a subactivity event in such scenarios [78]. In the context of construction safety, DL has been applied to activity recognition for different events such as scaffolding activity [103], earthmoving activity [27], and concrete pouring activity [104].
Scaffolding-related falls are an important potential threat at the job site, causing a significant number of accidents annually [129]. According to Khan et al. (2021) [103], the fatality rate due to falls from scaffolds, ladders, working platforms, and roof edges was 60%. Therefore, the detection of unsafe behaviors during scaffolding activities has received attention from researchers. For example, Khan et al. (2021) [103] proposed a deep neural network, mask R-CNN, for monitoring mobile scaffold safety and detecting workers' unsafe behaviors from an image dataset comprising 703 training and 235 validation images, with an overall accuracy of 0.86. DL has also been applied to monitor other construction activities. Using temporal and spatial CNNs to recognize basic actions during concrete pouring tasks, the hierarchical statistical method proposed by Luo et al. (2019) [104] recognized workers' activities with an average accuracy of 0.84. Similarly, Lin et al. (2021) [36] analyzed consecutive image sequences to automatically identify and visualize irregular operations during earthmoving work. Therein, faster R-CNN was adapted with transfer learning to detect workers and pieces of construction equipment on the jobsite, and a hybrid model integrating CNN and LSTM was employed for action recognition. The results illustrated that the proposed framework could aid field managers in efficiently identifying potential abnormal activities, providing opportunities for further investigations and appropriate adjustments.

PPE and Safety Compliance
Safety rules outline safety guidelines for people and activities in the workplace to ensure construction safety. Safety compliance involves following these rules in construction, adhering to safety procedures, and carrying out work safely. One of the regulations on construction sites is the use of protective equipment. Personal protective equipment (PPE) is equipment designed to protect people against personal injury while performing tasks at the workplace. PPE includes helmets for avoiding head injuries, gloves for hand protection, safety glasses for eye protection, vests, boots, harnesses, and respirators [130]. A survey conducted by the US Bureau of Labor Statistics (BLS) suggested that 84% of workers who had suffered head injuries were not wearing head protection equipment [131]. It has also been reported that 75.1% of decedents from falls from height did not use personal fall arrest systems (PFAS) [110]. The "fatal four" (i.e., fall, struck-by object, electrocution, and caught-in/between) accounted for nearly 60% of all fatalities in construction in 2017, and the majority of these fatalities could have been prevented by wearing appropriate PPE [109]. However, construction workers often ignore regulations [113], and not all construction workers are aware of the importance of wearing hard hats [106]. In practice, many workers tend to take off their hard hats because of religious values [132], discomfort due to weight, or the need to cool off at high temperatures [106]. In addition, some frequent accidents are closely related to workers who are not certified to perform specific tasks. In support of this observation, the study in [112] showed that fewer accidents occur when workers are qualified and their qualifications are appropriately certified.
Previous studies have utilized DL methods to detect behaviors that do not follow construction safety rules, thereby preventing serious injuries. As discussed above, one of the most significant forms of noncompliance with construction safety regulations is the failure to wear appropriate PPE. In this regard, detecting workers without PPE has received considerable attention in recent studies. For example, by extracting information from images, various researchers have proposed PPE detection algorithms to identify the proper use of hard hats on human objects using DL methods such as faster R-CNN [106,111,114], YOLO [105,107,115,117,119-122], and CNN-based algorithms [109,110,113,118]. In addition, according to Wu et al. (2019) [108], the colors of hard hats can signify different roles on construction sites, providing an accessible way to improve construction safety management. Thus, in addition to detecting hard hats, researchers have identified their corresponding colors with a mean average precision (mAP) of at least 0.84 [108,116]. Moreover, accidents are less likely when workers are qualified and their qualifications are properly certified [133]. Hence, DL was also applied to check whether a site worker is working within the constraints of their certification [112]. A faster R-CNN model was used to detect common objects, combined with the latest face detection and face recognition methods. The experimental results demonstrated the reliability and accuracy of the DL-based method in detecting workers carrying out work for which they are not certified, thereby facilitating safety inspections and monitoring.
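Because many of the PPE detection studies report mean average precision, a minimal sketch of how this metric is commonly computed may help when comparing their results. It assumes that each detection has already been labeled as a true or false positive (for example, by an IoU match against ground-truth boxes) and uses all-point interpolation of the precision-recall curve; individual papers may use other variants. The function names and the toy detections are hypothetical.

```python
def average_precision(detections, n_ground_truth):
    """AP for one class from (confidence, is_true_positive) pairs."""
    if n_ground_truth == 0:
        return 0.0
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    tp = fp = 0
    recalls, precisions = [], []
    for _, is_tp in ranked:
        if is_tp:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / n_ground_truth)
        precisions.append(tp / (tp + fp))
    # Precision envelope: make precision non-increasing as recall grows.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    ap, prev_recall = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_recall) * p  # area of one step under the curve
        prev_recall = r
    return ap

def mean_average_precision(per_class):
    """per_class maps a class name to (detections, n_ground_truth)."""
    aps = [average_precision(d, n) for d, n in per_class.values()]
    return sum(aps) / len(aps) if aps else 0.0

# Hypothetical detections for two hard-hat color classes.
per_class = {
    "yellow": ([(0.9, True), (0.8, False), (0.7, True)], 2),
    "white": ([(0.9, True), (0.8, True)], 2),
}
m_ap = mean_average_precision(per_class)
```

In this toy case the single false positive ranked above a true positive lowers the AP of the first class, while the second class is detected perfectly.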

Physical Conditions
According to the accident causation model [82], unsafe conditions and unsafe actions are the two direct causes of accidents. Therefore, safety performance can be improved if one can moderate people's unsafe behavior and improve their work conditions [134]. According to [25], a hazardous working environment is a workplace with unusual hazards that violate the prevailing safety standards and is thus considered unsuitable for work. In the context of construction safety, unsafe conditions can include poor lighting, temporary structure instability, unsecured equipment, etc., which can cause accidents at construction sites. According to the extant literature [76], the common types of physical conditions identified include (1) site condition (SC), (2) work environment (WE), and (3) site layout (SL). These conditions were also research directions of previous DL studies, and a summary of these studies is presented in Table 2.
Note: The mAP represents mean average precision. The accident types in parentheses were judged by the authors' assessment and not specified in the paper.

Work Environment (WE)
The nature of the construction working environment poses both health and safety risks to workers. According to a report by the Occupational Safety and Health Administration (OSHA), approximately 40% of all construction fatalities are caused by falls from heights, followed by struck-by objects, electrocution, and caught-in/between [141]. In support of this, Kolar et al. (2018) [14] showed that "fall protection, construction" was at the top of the list of the most frequently violated OSHA standards. In addition, the results of Arditi et al. (2007) [142] indicated that the safety risks at nighttime could be five times higher than those in the daytime due to several significant factors, including lower illumination and the fatigue of workers and machine operators. Therefore, managing, monitoring, and improving the work environment, including guarding systems, structural defects, functional defects, lighting, and noise, plays an important role in reducing accidents at construction sites. Passive fall prevention approaches, such as guardrails, warning lines, and fall arrest systems, often act as on-site measures for reducing the risk of falling [14].
With the development of DL, researchers have developed models for monitoring construction safety under different work environments. For example, Kolar et al. (2018) [14] developed a safety guardrail detection model based on a CNN to check whether the guardrail system is set up appropriately. The proposed model achieved a high accuracy of 0.97 and thus has the potential to improve construction site conditions. Similarly, studies have demonstrated that CNN-based models can help reduce the number of injuries and fatalities by detecting structural defects such as crane cracks [135] and concrete diaphragm wall (CDW) deflections [136]. In addition, considering that poor lighting conditions can impair the visibility needed for monitoring construction safety, the authors of [138] proposed a vision-based method for automatically tracking construction machines at night by integrating DL illumination enhancement. With a multiple-object tracking accuracy (MOTA) of 0.95 and a multiple-object tracking precision (MOTP) of 0.76, the proposed methodology could help accomplish automated monitoring tasks during nighttime construction to improve safety performance.
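For readers unfamiliar with the tracking metrics quoted above, the following sketch shows how MOTA and the tracking-precision score are commonly defined in the CLEAR MOT framework: MOTA penalizes missed objects, false positives, and identity switches relative to the number of ground-truth objects, and the precision score averages the overlap of correctly matched pairs. It assumes the overlap-based (IoU) variant of the precision score; a distance-based variant also exists. The counts and overlaps below are hypothetical and are not the data of [138].

```python
def mota(false_negatives, false_positives, id_switches, n_ground_truth):
    """Multiple-object tracking accuracy, summed over all frames."""
    if n_ground_truth <= 0:
        raise ValueError("n_ground_truth must be positive")
    errors = false_negatives + false_positives + id_switches
    return 1.0 - errors / n_ground_truth

def motp(matched_overlaps):
    """Tracking precision as the mean overlap (e.g., IoU) of matched pairs."""
    if not matched_overlaps:
        raise ValueError("no matched pairs")
    return sum(matched_overlaps) / len(matched_overlaps)

# Hypothetical totals over a short night-time video.
accuracy = mota(false_negatives=40, false_positives=50, id_switches=10,
                n_ground_truth=1000)
precision = motp([0.9, 0.7, 0.8])
```

MOTA can become negative when the tracker makes more errors than there are ground-truth objects, so it is not bounded below by zero.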

Site Layout (SL)
Construction is characterized by its dynamics, such as multiple construction workers, diverse types of equipment and materials, and continuously changing working environments [19]. Quickly changing and complex workplace conditions were identified as the direct cause of more than 30% of construction accidents [143]. Therefore, proper site layout management, including arrangement, storage, and positioning of agents (e.g., construction vehicles, heavy machines, and materials), is an urgent requirement to avoid hazardous issues such as site congestion and failure to properly locate utilities. However, activities involving multiple pieces of equipment and workers often take place in a unique, complex, and dynamic environment, which makes monitoring the site layout challenging. DL has shown the ability to assist in managing a safe layout at construction sites effectively. For example, the study in [139] proposed a CNN-based end-to-end approach for precisely detecting dense multiple construction vehicles using images from an unmanned aerial vehicle (UAV). The results illustrated that the proposed method could help ensure the safety of construction sites by accurately identifying many dense vehicles with an average precision (AP) of 0.99.
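To illustrate what an AP score such as the one above summarizes, the following sketch computes average precision for a single object class from ranked detections. It assumes the all-point interpolated definition (area under the precision-recall curve after making precision non-increasing); the evaluation protocol used in [139] may differ, and the function name and input format are hypothetical.

```python
def average_precision(scored_hits, num_ground_truth):
    """Area under the interpolated precision-recall curve for one class.

    scored_hits: list of (confidence, is_true_positive) pairs, one per detection.
    num_ground_truth: number of annotated objects of this class.
    """
    if num_ground_truth <= 0:
        raise ValueError("num_ground_truth must be positive")
    # Rank detections from most to least confident.
    ranked = sorted(scored_hits, key=lambda pair: pair[0], reverse=True)

    precisions, recalls = [], []
    true_positives = 0
    for rank, (_, is_tp) in enumerate(ranked, start=1):
        if is_tp:
            true_positives += 1
        precisions.append(true_positives / rank)
        recalls.append(true_positives / num_ground_truth)

    # Interpolate: precision at each rank becomes the best precision
    # achieved at that rank or any later (higher-recall) rank.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])

    # Sum rectangle areas over the recall increments.
    ap, previous_recall = 0.0, 0.0
    for precision, recall in zip(precisions, recalls):
        ap += (recall - previous_recall) * precision
        previous_recall = recall
    return ap
```

A detector whose every detection is correct and that finds every annotated vehicle scores 1.0; false positives ranked above true positives pull the score down.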

Site Condition (SC)
Site conditions, including weather, temperature, and geographical conditions, considerably affect safety during the construction process. Awolusi et al. (2018) [144] showed that both health and safety risks of workers are posed by the construction work environment. This is partly because most of the activities are performed outdoors, significantly exposing workers to weather elements [144]. In addition, Mahmoodzadeh et al. (2021) [140] showed that other natural environmental conditions, such as groundwater inflows during tunnel construction, were among the most common and challenging issues faced by constructors and designers in karst regions. Sudden and unexpected significant water inflow at the heading often damages construction machinery and leads to worker fatalities [140]. For example, a large-scale water inflow accident occurred in the Yesanguan tunnel of the Yichang-Wanzhou railway in China on 5 August 2007 [145]. Therefore, applying DL to predict the influence of natural conditions has made important contributions to safety management. For example, Mahmoodzadeh et al. (2021) [140] showed that their proposed LSTM-based prediction model could predict water inflow into tunnels with higher accuracy than other ML techniques; thus, this model could help ensure safety and support scheduling during the underground construction process.

Management Issues
Safety management, a method of applying on-site safety policies, procedures, and practices involving a construction project, is one of the most frequently used techniques to regulate construction activities and control risks [146]. Various studies related to construction safety confirmed that most accidents at construction sites could have been reduced and prevented by establishing a proper and consistent safety management process or program of planning, education/training, and inspection [147]. In general, common safety management activities in the construction industry include monitoring, controlling safety rules, planning, training, and managing the practice process to ensure safety at the construction site. According to the extant literature [79,80] and in the context of DL applications for construction safety, the categories of safety management identified include (1) safety management plan, (2) accident investigation and analysis, and (3) hazard identification and risk management. Table 3 lists previous studies regarding applications of DL in handling safety management issues in the construction industry. Note: The mAP represents mean average precision. The accident types in parentheses were judged by the authors' assessment and not specified in the paper.

Safety Management Plan
With the presence of cost and time pressures and the frequent need to perform unplanned work (e.g., rework), people tend to take risks to make their work more efficient [158][159][160]. The upshot of this case is that people tend to commit unsafe actions, especially when they know they are not being supervised [152]. Therefore, safety management plans regarding publishing safety policies, objectives, and requirements; proposing plans; making decisions; and monitoring safety play an important role. The purpose of health and safety monitoring is to ensure effective measurement and management of construction workers' safety practices against existing safety plans and standards [19]. Visual information related to construction activity scenes is becoming increasingly important for construction management [161,162]. The scene of construction activity in images can be defined as an integral overview of the activity in pictures that synchronously contain objects (e.g., workers, equipment, and materials), their interrelationships (e.g., cooperation between objects or coexistence of objects), and other vital scenario elements (e.g., earthmoving and concrete pouring) [151]. Thus, with the development of DL, automatically manifested construction activity scenes [151] provide managers with information for making decisions and safety management plans [148].
Recent research has focused on providing site managers with the status of construction sites by detecting construction objects to assist in planning safety management at construction sites. For example, various studies proposed DL models such as faster R-CNN [148,149], YOLO [150], LSTM [151], and the CNN-based method [153] to provide supervisors with more insight into the real-time status of large-scale construction jobsites, which could assist them in inspecting construction safety and processes [148,150,153]. In addition, as discussed above, workers sometimes have the proclivity to commit unsafe actions, especially when they know they are not being supervised [152], so it is important to provide direct feedback to people committing unsafe actions so that they can modify their future behavior. In a notable study by Wei et al. (2019) [152], a novel DL approach was developed to automatically determine a person's identity, which can be utilized by site managers to automatically recognize individuals engaging in unsafe behavior; therefore, it can be used to provide immediate feedback about their actions and possible consequences.

Accident Investigation and Analysis
Accidents and incidents should be analyzed for better implementation and continuous improvement of safety management systems [79]. Collecting and organizing accident reports, regulations, and laws, and then presenting them publicly, are considered good practices for improving the safety management of construction sites [38]. Safety reports are an extremely valuable information source that can be used by site managers to learn about the conditions and events contributing to the occurrence of accidents [158,163]. Such reports therefore enhance managers' safety awareness and urge them to prevent accidents or related construction work issues [38]. Nowadays, using DL, accident documents are processed to provide useful information for safety management under two main tasks: information extraction and text classification. Information extraction is the task of finding structured information from unstructured or semistructured text [164], which is essential for handling the continuously growing data published online, especially in the Big Data era [165]. For example, Feng and Chen (2021) [38] adopted the BiLSTM-CRF model to automatically extract information from accident reports, so this model could help to raise workers' security awareness and prevent hazards and accidents. Similarly, Baker et al. (2020) [16] compared two state-of-the-art DL architectures, CNN and hierarchical attention networks (HAN) based on GRU, to automatically learn injury precursors from raw construction accident reports. The results illustrated that HAN outperformed CNN almost everywhere with a mean performance of 0.87; thus, the HAN model can extract useful information, which not only allows the exploration of empirical relationships for postanalysis and project statistics, but can also be used proactively during typical work planning, job risk analyses, prejob meetings, and audits.
Another application of DL is text classification, which is a fundamental task in the natural language processing area where one needs to assign one or multiple predefined labels to a text sequence [166]. For example, previous studies proposed DL-based models to classify and analyze the narrative surrounding accidents and to better understand their causal nature from accident reports [18,39,154]. In addition, the study in [155] proposed a DL-based method for the collection and automatic generation of video highlights from construction videos. The proposed CNN-based approach was validated through two case studies: a gate scenario and an earthmoving scenario. With a score of 0.89 for precision and 0.93 for recall, the proposed model showed that it could offer potential benefits to construction management in terms of significant reduction in video storage space and efficient indexing of construction video footage, which was beneficial for project management tasks such as safety control.

Hazard Identification and Risk Management
Dynamic and complex construction environments have caused significant risks during construction. Unfortunately, studies across the world have reported that a substantial portion (approximately 50%) of hazards remain unrecognized [167][168][169]. These unrecognized hazards expose construction workers to unanticipated risks and potential injuries [168]. Therefore, identifying hazards and managing risks play an important role in construction safety management. DL has been used to identify risks with notable achievements. For example, the study in [156] integrated computer vision algorithms with ontology models to develop a knowledge graph that can automatically and accurately recognize hazards while complying with safety regulations, even when they are subjected to change. Therein, Mask R-CNN was adopted for entity detection. The results showed that the proposed approach could successfully detect falls from height (FFH) hazards in varying contexts from images. Similarly, a Mask R-CNN-based framework was developed by Jeelani et al. (2021) [157] for an automated system that detects hazardous conditions and objects in real time with over 93% accuracy; therefore, this model can assist workers and safety managers in identifying risks in complex and dynamic construction environments.

Overall Research Trends in Safety Management: Summary of Contributions and Limitations
In this study, three safety management factors, including behaviors, physical conditions, and management issues, were identified in the context of applying DL models to construction safety. This section provides an overview of the research trends from technical and managerial aspects (e.g., data types, algorithms, and safety issues) (Figures 15 and 16). Table 4 shows the accuracy of the studies using DL for construction safety. Overall, CNNs are the most commonly used method in these studies and are applied mainly to image data, and unsafe behavior is the main research direction, with high reported performance and a variety of contributions to safety management.

Table 4 (partial; earlier rows were lost in extraction, leaving only the citation [16,38,151] from the preceding row). GNN: accuracy 0.87 [39]; precision 0.51 [39]; recall 0.54 [39]. mAP: mean average precision.

Recognition of Unsafe Behavior
The advancement of DL has opened up significant opportunities for examining unsafe behaviors in construction. Among the five categories of behaviors that DL has focused on, construction workers (21 of 44 papers) and PPE (17 of 44 papers) are the main objects of interest. For these objects, different algorithms (e.g., object detection algorithms [35], object tracking [36], and activity recognition [37]) have demonstrated good performance in detecting and tracking workers. For example, by using DL-based object detection architectures, previous studies detected workers and PPE successfully with an accuracy exceeding 0.90 [85,87,107,111,118]. In addition, recognizing equipment operations (e.g., dump trucks and excavators) has also attracted much attention from researchers, mainly for examining the interaction between entities. For example, researchers proposed DL-based models to monitor and analyze the interaction between workers and equipment with an accuracy range of 0.65 to 1.00 [2,97–99,102].
As various DL methods based on CNNs, RNNs, and GNNs have been applied, different formats of data (e.g., videos, images, and signals) have been used to detect unsafe behaviors. In particular, detection and tracking of unsafe behaviors were performed mainly using videos and images (85%). This is partly because collecting videos and images at construction sites is easier and more common than collecting other types of data (e.g., signals). According to Daniel and Chen (2003) [170], with digital camcorders, video conferencing, digitized movies, and video emails making their way into everyday life, it is almost certain that the use of video data will multiply several times over in the coming years. Moreover, there are now various publicly available data sources, such as Microsoft's Common Objects in Context (MS COCO) [171], ImageNet [172], and Pascal VOC [173], which researchers can easily access.
From an algorithmic perspective, recent neural networks, especially CNNs, have achieved considerable success in various areas, including image/video understanding, processing, and compression [174]. A trained CNN can be used to handle classification, recognition, and prediction tasks on test data with highly efficient adaptability [174]. Therefore, CNNs were predominantly applied in detecting unsafe behaviors using image data sources (34 of 44 papers). For videos and other sequence data such as signals (e.g., time-series data), RNNs, designed for sequence learning [175], were also used with high performance. For example, various studies have utilized RNN models to detect unsafe behaviors from videos with an accuracy exceeding 0.9 [15,37,95].

Physical Condition Identification
Previous research on unsafe physical conditions has focused mainly on structural defects and site layout status at construction sites. The main objects of interest in such research include structures (e.g., guardrails and diaphragm walls) and equipment (e.g., cranes, wheel loaders, and construction vehicles). For example, various studies proposed CNN-based methods to detect structural defects such as guardrail defects [14], crane cracks [135], and diaphragm wall deformations [136] from images and signals with an accuracy of up to 0.97. In addition, to evaluate whether the site layout is appropriate, entities in the construction sites need to be detected precisely. Therefore, in a construction environment involving a wide range of heavy equipment (e.g., tower cranes, dump trucks, and excavators), recognizing equipment operations has also attracted much attention from researchers (50% of the total number of papers regarding physical conditions), and these DL studies can achieve an accuracy of over 0.9 [135,137–139].
For detecting unsafe physical conditions, images were the most used type of data in DL models (62.5% of total papers). By applying the image classification task, the status of physical conditions regarding guardrails [14], the surface of crane cracks [135], and dense multiple construction vehicles [139] was detected and located to ensure safety at construction sites. Moreover, because of the growing interest in CNNs, the most common tool used for image analysis and image classification [34], they have been applied the most in handling issues related to physical conditions. In addition, to predict other physical conditions such as diaphragm wall deformation [136] and water inflow [140] during the underground construction process, time-series data were used to describe properties related to deformation and inflow over time. For this application, an RNN is commonly used, with a high accuracy reaching 0.99 [136,140].

Safety Management
DL has been used effectively to support construction safety management. By using image datasets, the CNN method was utilized the most (8 of 14 papers) to provide managers with real-time statuses of large-scale construction jobsites, which can assist in improving their decision-making regarding safety and planning [148][149][150][151]. These studies mainly focused on workers (six of eight papers) and equipment (e.g., pump trucks, excavators, rollers, and tower cranes) (seven of eight papers). For example, various studies applied CNN models to detect workers and construction equipment from images with an accuracy range of 0.55-1.0. In addition, to minimize safety risks in construction, data are recorded in various formats (e.g., video, photographs, and safety reports), which researchers have used to monitor safety [134]. Thus, various studies have used videos and images (64%) and accident reports (36%) to aid the investigation and analysis of risks at construction sites. For example, previous studies applied DL models for NLP tasks (e.g., text classification and information extraction) with reported scores ranging from 0.54 to 0.87 [16,18,38,39,154].
Besides recognizing individuals committing unsafe actions from images, determining the person's identity also plays an important role in supporting safety management. Once a person's identity can be determined, site managers can provide specific feedback regarding their unsafe behaviors [152]. However, very little research has focused on this issue (one of 66 papers). In a notable study conducted by Wei et al. (2019) [152], a DL model was applied to determine a person's identity by computing the distance between the identity feature and previously saved features of other people's identities; however, this study reported practical limitations such as the limited number of activities (e.g., people walking) and the possibility of delay in recognizing a person's identity in real time because of the computation requirements placed on the attention network to extract representations from videos.
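The distance-based matching step described above can be sketched as a nearest-neighbor lookup over stored identity features. This is only a minimal illustration: it assumes Euclidean distance and a fixed rejection threshold, neither of which is confirmed by the source for [152], and the function names, worker labels, and two-dimensional features are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def identify(query_feature, gallery, max_distance):
    """Return the gallery identity closest to the query feature.

    gallery: dict mapping an identity label to its stored feature vector.
    Returns None when even the closest stored feature is farther than
    max_distance, i.e. the person is treated as unknown.
    """
    best_label, best_distance = None, float("inf")
    for label, feature in gallery.items():
        distance = euclidean(query_feature, feature)
        if distance < best_distance:
            best_label, best_distance = label, distance
    return best_label if best_distance <= max_distance else None
```

In practice the feature vectors would come from the trained network rather than being set by hand, and the threshold would need to be tuned on validation data.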

The Summary of Contributions and Limitations of Deep Learning on Safety Management
This study reviewed the contributions and limitations specified in previous papers and reports the key contributions with limitations, as summarized and outlined in Table 5. In terms of contributions, by detecting unsafe physical conditions, construction workers and equipment, as well as their behaviors, the multiple contributions of DL models include monitoring safety and proactively preventing hazards, evaluating proactive safety risk levels, strategizing effective training solutions, designing effective hazard recognition and management practices, and applying operator assistance systems in construction machinery to achieve active safety. The investigation and analysis of safety reports can not only be used proactively during typical work planning, job hazard analyses, prejob meetings, and audits, but also raise the safety awareness of workers and professionals. However, applying DL in construction safety still has challenges such as the limitation of the dataset, the influence of performance due to the presence of occlusions, blurriness, and background patches, and the lack of consideration of an individual's identity during action recognition.

Contributions | Limitations | References

Detecting workers and equipment, estimating, recognizing, and analyzing their behaviors.
DL models can support monitoring safety and proactively preventing hazards by sending early warning information combined with the on-site alarm equipment to the management staff so they can provide instant feedback concerning unsafe behavior, and appropriate actions can be put in place to prevent reoccurrence. | The dataset was limited. | [36,89,94,99]
 | Cases of the on-site experiment failed due to visual obstacles. | [92]
 | Not mentioned. | [85,86,88,90,104]
DL models support strategizing effective training solutions and designing effective hazard recognition and management practices. | The dataset was limited. | [93,99,100]
 | The accuracy of the method is affected by the presence of occlusions. | [93]
 | Individual workers need to be identified. | [157]
The proposed method can be applied to operator assistance systems in construction machinery to achieve active safety. | The dataset was limited. | [2]

Detecting unsafe physical conditions.
DL models can support monitoring safety and early warning, so managers can provide the appropriate solutions to prevent or control risks. | The dataset was limited. | [14,140]
 | Occlusion was not addressed. | [14]
 | Not mentioned. | [65,135–139]
Before the predicted deformation reaches the threshold limit, control strategies can be implemented to avoid excessive deformation and the corresponding risks to the engineering project and surrounding environment. | Not mentioned. | [136]

Investigating and analyzing safety reports.
The results can be used proactively during typical work planning, job hazard analyses, prejob meetings, and audits. | Not mentioned. | [16,39]
DL models raise the security awareness of workers and professionals to better understand and prevent hazards and accidents, and aid in educating workers about "what not to do" and "what to do".

Future Research Directions
Despite recent technical advances in DL, there are still challenges in its practical applications. Based on the limitations identified and summarized, directions for future research are discussed to resolve these issues and further expand the applications of DL. These directions include (1) expanding a comprehensive dataset, (2) improving technical restrictions due to occlusions, and (3) identifying individuals who performed unsafe behaviors.

Expanding a Comprehensive Dataset
In a dynamic and complex construction environment involving many human resources, diverse types of equipment, and many types of actions of humans and equipment, larger and more comprehensive datasets are important for improving the performance of DL models. According to Ding et al. (2018) [15], some worker actions could not be recognized due to the training sample size and the limited number of unsafe actions considered. Therefore, with larger datasets, the model may further improve and provide more accurate results. However, there is currently no comprehensive and common dataset publicly available, not only for specific tasks such as object detection, pose detection, and activity recognition but also for a variety of construction sites, different viewpoints, lighting, and occlusion conditions. Although several studies such as Xuehui et al. (2021) [149] presented the Moving Objects in Construction Sites (MOCS) image dataset for detecting objects at construction sites, its use may be limited by the size and type of the dataset. Therefore, further research is needed to generate and share a comprehensive dataset for the research community. Potential solutions may include generating publicly available datasets by developing a DL-based methodology to automatically create safety reports in natural language based on construction site imagery and using models to collect and amalgamate reports across the industry through continuous updates as new data arrive. In a study on DL in generating radiology reports, Monshi et al. (2020) [176] reported that CNNs used for image analysis could be integrated alongside RNNs for NLP and natural language generation (NLG), generating coherent radiology paragraphs in the medical field. Thus, if this approach were applied in construction, automatically generating safety reports from construction site images could increase the amount of available data. A platform then needs to be built for public access so that researchers can easily share and upload datasets.

Improving Technical Restrictions Due to Occlusions
In dynamic and continuously changing construction environments, where image and video data are mostly used, DL models have faced challenges such as occlusion [84], poor illumination and blurriness [105], and background clutter [97]. For example, Fang et al. (2019) [93] reported that the DL model could not detect all people traversing structural supports due to the presence of occlusions. However, previous studies often ignored occlusions by assuming that none were present (e.g., the guardrail is always visible for detection in [14]). To handle these issues, potential solutions may include the following. First, a method is needed to search for and identify the optimum placement of cameras (e.g., position and distance of a camera, the effect of occlusion, and lighting conditions) where full or maximum coverage of resources (e.g., workers, materials, and machines) can be achieved. Second, to handle the self-occlusion of projected objects in 2D vision, the 3D bounding boxes of these objects can be reconstructed using DL models that estimate depth and reconstruct scenes as a global 3D model from monocular images. Finally, another method for coping with occlusions is to combine vision-based approaches with sensor-based methods (e.g., the global positioning system), which can provide the location and motion of objects.
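The first proposal above, searching for camera placements that maximize coverage, can be illustrated with a simple greedy heuristic for the maximum-coverage problem. This is a swapped-in textbook technique offered as a sketch, not a method from the reviewed studies. It assumes that the site has been divided into zones and that, for each candidate camera position, the set of zones visible without occlusion is already known; the function name and input format are hypothetical.

```python
def greedy_camera_placement(candidate_coverage, num_cameras):
    """Pick up to num_cameras positions that greedily maximize covered zones.

    candidate_coverage: dict mapping a candidate camera position to the set
    of zone identifiers it can see (assumed to be precomputed, including
    occlusion and lighting effects).
    Returns (chosen_positions, covered_zones).
    """
    remaining = dict(candidate_coverage)
    covered, chosen = set(), []
    for _ in range(num_cameras):
        # Choose the position that adds the most zones not yet covered.
        best = max(remaining, key=lambda pos: len(remaining[pos] - covered),
                   default=None)
        if best is None or not (remaining[best] - covered):
            break  # no candidates left, or nothing new can be covered
        chosen.append(best)
        covered |= remaining.pop(best)
    return chosen, covered
```

The greedy choice is not guaranteed to be optimal, but it is a common baseline for coverage problems and makes the trade-off between the number of cameras and the covered area explicit.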

Identifying Individuals Who Performed Unsafe Behaviors
Providing feedback to individuals regarding the likelihood of their unsafe actions can result in immediate behavior modification and targeted safety training [93]. Therefore, in addition to identifying unsafe actions at construction sites, it is necessary to identify who performed these unsafe actions. Based on this, site managers can automatically identify unsafe behavior in real time and provide feedback to individuals about their unsafe behaviors. However, previous studies have not focused on the identification of workers (e.g., [156,157]). To achieve this goal, several solutions can be used in the future. First, sensors can determine a person's identity and location [177]. Thus, future research can combine an individual's identity obtained from sensors with the action monitoring of the DL model to identify those who perform unsafe actions. Second, this issue can be addressed by developing a DL approach to identify individuals from videos by integrating temporal and spatial information. Wei et al. (2019) [152] provided an example of this approach, which uses a spatial attention network to extract spatial feature maps, temporal attention networks to extract temporal information, and the distance between features to recognize a person's identity. In addition, a person's identity can also be recognized by face recognition models based on a CNN [178], so future research can combine face recognition and action recognition to identify workers performing unsafe behaviors.
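The first solution above, combining sensor-derived identities with DL-based action monitoring, can be sketched as a nearest-in-space match within a time window. This is an illustrative assumption about how such fusion might work, not a method reported in the reviewed studies; the record fields (`t`, `x`, `y`, `worker_id`, `action`), the function name, and the thresholds are hypothetical.

```python
import math


def attribute_actions(unsafe_detections, tag_readings, max_gap_m, max_gap_s):
    """Attach a worker identity to each vision-detected unsafe action.

    unsafe_detections: list of dicts with keys "t", "x", "y", "action"
        (time in seconds, site coordinates in meters).
    tag_readings: list of dicts with keys "t", "x", "y", "worker_id"
        from a location sensor worn by each worker.
    A detection is attributed to the spatially closest tag reading taken
    within max_gap_s seconds, provided it lies within max_gap_m meters;
    otherwise the identity is None.
    """
    results = []
    for detection in unsafe_detections:
        best_id, best_distance = None, float("inf")
        for reading in tag_readings:
            if abs(reading["t"] - detection["t"]) > max_gap_s:
                continue  # reading is too far away in time
            distance = math.hypot(reading["x"] - detection["x"],
                                  reading["y"] - detection["y"])
            if distance < best_distance:
                best_id, best_distance = reading["worker_id"], distance
        identity = best_id if best_distance <= max_gap_m else None
        results.append((detection["action"], identity))
    return results
```

Detections left with a `None` identity could be flagged for manual review rather than being attributed to the wrong worker.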

Conclusions
This study synthesized and reviewed the current DL studies applied to safety management in the construction industry. It was found that DL studies had paid attention to three main research directions, including behaviors, physical conditions, and management issues. By providing detailed summaries of DL applications in each category, this paper aims to support researchers and managers in the field of construction safety with a specific overview regarding what type of method has achieved highly accurate results, along with the type and amount of data that has been used for a certain safety task, as well as the actions managers can take from the result of DL models for improving safety management. In general, detecting unsafe behaviors was the main research direction of previous studies (67%) with high performance, which has contributed to safety management in the construction industry. Moreover, the results indicated that CNN modeling was the most common method used in these studies (75%) and achieved high accuracy, which could reach up to approximately 1.0, from the primary data of images (73%). In addition to providing the overall trends of DL applications, this review also presents limitations and future directions for applying DL in construction safety. In a dynamic and complex construction environment involving many human and equipment resources, developing larger and more comprehensive datasets is important for improving DL model performance. In addition, the presence of occlusions, which causes challenges for DL studies using image and video data, should be addressed in future studies. Another direction is to identify individuals who performed unsafe behaviors for immediate behavior modification and targeted safety training. DL is an emerging area of construction safety and is still developing, so outlining key challenges and proposing corresponding research can aid in developing DL applications in the future. We expect that this paper will provide not only new lines of advanced methods for researchers working on safety management but also opportunities to apply DL in practice.