A C-BiLSTM Approach to Classify Construction Accident Reports

The construction sector is widely recognized as having the most hazardous working environment among the various business sectors, and many studies have focused on injury prevention strategies for construction sites. Risk-based theory emphasizes the analysis of accident causes extracted from accident reports in order to understand, predict, and prevent construction accidents. The first step in this analysis is to classify the incidents from a massive number of reports into different cause categories, a task usually performed manually by domain experts. The research described in this paper proposes a convolutional bidirectional long short-term memory (C-BiLSTM)-based method to automatically classify construction accident reports. The proposed approach was applied to a dataset of construction accident narratives obtained from the Occupational Safety and Health Administration website, and the results indicate that the model performs better than the classic machine learning models commonly used in classification tasks, including support vector machine (SVM), naïve Bayes (NB), and logistic regression (LR). The results of this study can help safety managers develop risk management strategies.


Introduction
Workplace health and safety is a significant concern in all countries [1]: according to the International Labour Organization, more than 2.78 million deaths are caused by occupational accidents every year [2]. The construction industry is recognized as the most hazardous among all industries [3]. In the United States, construction accounts for approximately one-sixth of fatal accidents while employing only 7% of the national workforce, and four injuries are recorded per 100 full-time construction workers.
Construction accidents usually result in both health and safety consequences and financial loss [4], and the alarming injury and fatality rates have motivated abundant research. Research on construction safety is mainly conducted from two perspectives: management-driven or technology-driven [5]. In general, it is assumed that enhanced construction safety management can effectively improve on-site safety performance and reduce the number of accidents. Research from the management perspective usually addresses either safety management processes, such as safety education and training, or individual and organizational characteristics, such as workers' attitudes towards safety. However, the effect of traditional injury prevention strategies has been limited due to their reactive, regulation-based nature [6]. Esmaeili and Hallowell [7] indicate that the construction industry has reached saturation with respect to these injury prevention strategies. Along with advances in information and communication technology, various innovative technologies have been investigated to assist and improve existing management-driven safety practices. These technical approaches aim to enhance rather than replace management efforts [8].
Besides technological assistance, some new injury prevention strategies have been developed for the construction industry. Risk analysis is one such strategy used in safety programs to improve safety performance. For example, Baradan and Usmen [9] compared the risk of different building trades, Hallowell and Gambatese [10] quantified the safety risk of the various activities required to construct concrete formwork, and Shapira and Lyachin [11] studied the impact of tower cranes on job site safety. However, most of these risk-based studies are limited to specific application fields and do not translate well to the construction industry as a whole.
To address this limitation in the previous literature, Esmaeili and Hallowell [12] proposed an attribute-based risk identification and analysis method that helps designers to identify and model the safety risk independently of specific activities or trades. In this method, accidents are considered the outcome of interaction among physical conditions of the jobsite, environmental factors, administrative issues, and human error. Although this method shows promise, it requires the analysis of large numbers of construction injury reports to first classify the causes and then see patterns and trends that emerge from the data. Such manual content analysis is laborious and resource-intensive [13].
It is vitally important to analyze past accidents and understand the causes to prevent the occurrence of similar accidents and promote workplace safety [14] by removing or reducing the identified causes. Construction injury reports contain a wealth of empirical knowledge that could be used to better understand, predict, and prevent the occurrence of construction accidents. Some major construction companies and federal agencies, for example, the Occupational Safety and Health Administration (OSHA), possess those reports in the form of huge digital databases. Because different companies have different requirements for the forms of accident reports, these reports are often unstructured or semi-structured.
The first step towards the effective analysis of construction injury reports is rigorous classification based on accident causes. In the construction industry, early text classification was performed manually, which not only demands professional knowledge but also requires substantial human and material resources. Furthermore, the consistency of the classification results is difficult to ensure. Therefore, it is important to investigate methods for the automatic classification of texts written in natural language [15]. However, studies applying text mining, natural language processing (NLP), and deep learning (DL) techniques to the analysis of construction accident narratives are very limited [16]. To fill this gap, using accident narrative data obtained from the official OSHA website, this paper presents a novel and unified architecture that combines a bidirectional long short-term memory (BiLSTM) model with a convolutional layer for the classification of construction accident causes. The proposed architecture is called convolutional BiLSTM (C-BiLSTM). This construction accident report classification model was compared on a set of OSHA data with advanced methods from previous research and showed superior results.
The rest of this paper is organized as follows. Related work is presented in the next section, including text mining and machine learning techniques, existing studies on accident narrative classification, and performance metrics. The research approach is then presented in detail in Section 3, along with the method for data pre-processing. Before the conclusion, the paper discusses the results of applying the proposed approach to the OSHA accident narratives and compares its performance with state-of-the-art approaches in text classification.

Text Mining and Machine Learning
Text mining refers to obtaining valuable information and knowledge from text data and is a branch of data mining [17]. With the rapid rise of big data, the use of massive data has been reshaping every industry and business, becoming an important production factor [18]. In many cases, text is one of the most easily generated data forms, with the typical features of unstructured data. Although unstructured text is easily perceived and handled by humans, it is hard for machines to understand. The most basic yet important application of text mining is automatic classification based on text content [19]. Text classification is one of the common tasks in NLP, which concerns programming computers to process and analyze large amounts of natural language data. In text classification, a mathematical model is trained on a set of input texts with associated classification tags so that it acquires a certain generalization ability and can predict the categories of other texts in the same domain. Measuring and calculating the similarity of texts is essential to this task.
Currently, as an effective method for text information management, text classification has been widely applied in multiple fields such as information classification [20], recommendation systems [21], and sentiment analysis [22]. Traditional text classification methods generally adopt a machine learning approach [23]. Machine learning is the scientific study of algorithms and statistical models that perform a specific task without explicit instructions, relying instead on patterns and inference. A mathematical model is built from sample data (also known as training data) to make predictions. Various types of models have been used and researched for machine learning systems, including support vector machines (SVM) [24], naïve Bayes (NB) [25], and k-nearest neighbors [26].
DL is a class of machine learning algorithms that uses multiple layers to progressively extract higher-level features from the raw input [27]. For example, in image processing, lower layers may identify edges, while higher layers may identify concepts relevant to a human, such as faces. DL methods have proved effective for feature extraction [28]. A series of DL algorithms, such as the convolutional neural network (CNN) and the recurrent neural network (RNN), have been extensively used by researchers in various fields for text classification [29,30].
The Transformer architecture was proposed to address the difficulty of parallelizing the training of LSTM-based models; it replaces LSTMs entirely with the so-called attention mechanism [31]. With attention, an entire sequence is treated as a whole, which makes parallel training much easier. There are many variants of attention mechanisms, such as the co-attention mechanism [32] and self-attention [33]. As a Transformer-based approach, BERT (Bidirectional Encoder Representations from Transformers) [34] has achieved impressive results in many language understanding tasks, including text classification. However, these advanced models usually have a large size with many parameters, which makes training more costly.

Existing Studies on Accident Narrative Classification
There are some existing studies on accident classification using machine learning approaches. Bertke et al. [35] used an NB-based model to classify the causes of insurance claims for work-related injuries. The overall accuracy of the model is approximately 90%, but the accuracy decreases somewhat for the category of minor-injury claims. Tanguy et al. [36] classified aviation safety reports with an SVM-based model and obtained accuracy rates of 60-96%. Wellman et al. [37] proposed a fuzzy Bayesian model to classify injury reports obtained from the National Health Interview Survey (NHIS) and achieved an accuracy rate of 87%. Abdat et al. [38] extracted highly recurrent scenarios of Occupational Accidents with Movement Disturbance (OAMD) from narrative texts using a Bayesian network; however, this method requires expert knowledge to pre-process the data and is time-consuming. Zhong et al. [29] classified building quality complaint reports using a CNN model, achieving a weighted average F1 of 0.73, which is superior to the results of traditional machine learning algorithms such as NB and SVM.
In the application field of construction accident classification, related studies are very limited. Tixier et al. [39] proposed a rule-based automatic content analysis system that extracts attributes and safety outcomes from unstructured injury reports. This system achieved an accuracy rate of 95%, but it performed poorly when dealing with unexpected situations; moreover, it requires an external dictionary of professional terminology. Goh [40] used data obtained from the OSHA website to compare the performance of several machine learning algorithms, including SVM, NB, decision trees (DT), logistic regression (LR), random forest (RF), and k-nearest neighbors (KNN), in the classification of construction accident reports. The results showed that the SVM-based classifier generated a better F1 score than the other classifiers. Zhang et al. [16] further proposed a sequential quadratic programming (SQP)-based integrated algorithm building on Goh et al.'s work. This combined method achieved a weighted F1 of 0.68, which is better than the result of any single machine learning algorithm.
Although there are some studies on the classification of construction accidents using traditional machine learning algorithms, research on the application of DL algorithms in this field is still lacking. Therefore, this study aims to evaluate the performance of DL algorithms in the automatic classification of construction accident narratives.

LSTM, BiLSTM, and C-BiLSTM
In recent years, LSTM has been applied more and more widely, and researchers have proposed many methods to improve its performance on tasks involving variable-length sequences. Combining LSTM or its variants with other network structures is an important current research direction. Lu et al. [41] proposed a new emotion classification model called P-LSTM; by introducing a phrase factor mechanism, the P-LSTM model can extract more accurate information from text. Wang et al. [42] used a BiLSTM model to perform sequence analysis of microblog conversations to capture the long-distance dependence of the emotional semantic field; experiments show that the BiLSTM model with context information is superior to other algorithms. Wei et al. [43] proposed a transfer learning framework, ConvL, based on CNN and LSTM, which was used to automatically identify whether online comments express confusion, determine their degree of urgency, and classify the polarity of emotions. Le et al. [44] introduced multi-view recurrent neural networks (MV-RNN) for 3D shape segmentation; this framework combines CNN and a double-layer LSTM and can output the edge image of each defined view. Harish et al. [45] used a model combining CNN and BiLSTM to automatically identify inappropriate query suggestions, and its performance is better than that of multiple benchmark models trained on the same dataset. The model used in this study builds on the above research and is further improved according to the research needs; its specific content is presented in detail in the following sections.

Performance Metrics
In most existing studies, the F1 score proposed by Buckland and Gey [46] has been widely used as a performance indicator for evaluating classification models. This indicator considers both the Precision and the Recall of the test: Precision is the number of correct positive results divided by the number of all positive results returned by the classifier, and Recall is the number of correct positive results divided by the number of all relevant samples (all samples that should have been identified as positive). The F1 score is a comprehensive evaluation indicator combining Precision and Recall. Equation (1) lists the calculations of Precision, Recall, and the F1 score:

$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}, \qquad F1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{1}$$

where TP refers to the number of positive samples classified correctly, FP refers to the number of negative samples classified as positive, and FN refers to the number of positive samples classified as negative.
However, in the case of unbalanced categories, where some categories have many instances and others far fewer, the F1 score does not account for the difference in the number of instances per category. Therefore, to better compare overall performance, the weighted average F1 score (Equation (2)) is used as a performance indicator:

$$F1_{\text{weighted}} = \sum_{i=1}^{n} \frac{S_i}{T} \times F1_i \tag{2}$$

where n represents the number of categories, S_i represents the number of instances of the i-th category, T indicates the total number of instances, and F1_i denotes the F1 score of the i-th category.
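As an illustration, the per-category metrics of Equation (1) and the weighted average F1 of Equation (2) can be computed with scikit-learn. The sketch below is a minimal example using hypothetical label lists y_true and y_pred, not the study's actual predictions.

```python
# Minimal sketch of Equations (1) and (2) with scikit-learn.
# y_true and y_pred are hypothetical gold and predicted category labels.
from sklearn.metrics import precision_recall_fscore_support, f1_score

y_true = ["Falls", "Electrocution", "Falls", "Collapse of object"]
y_pred = ["Falls", "Electrocution", "Collapse of object", "Collapse of object"]

# Per-category Precision, Recall, and F1 (Equation (1)).
precision, recall, f1, support = precision_recall_fscore_support(
    y_true, y_pred, average=None, zero_division=0
)

# Weighted average F1 (Equation (2)): each category's F1 is weighted
# by its share of instances, S_i / T.
weighted_f1 = f1_score(y_true, y_pred, average="weighted", zero_division=0)
print(weighted_f1)
```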

The Proposed C-BiLSTM Framework

Figure 1 illustrates the framework of the proposed C-BiLSTM-based method for classifying construction accident narratives. The framework consists of two modules, i.e., model training and model application. In the model training module, labeled training data are first pre-processed (e.g., word segmentation and stop word removal) and then passed to the C-BiLSTM classifier for training. In the model application module, raw data are pre-processed and passed to the trained model, which outputs the corresponding classification labels to complete the classification task. Figure 2 shows the structure of the C-BiLSTM model, which mainly comprises two parts, i.e., CNN and BiLSTM. In the model, the convolutional layer extracts n-gram features from the text for sentence modeling. BiLSTM then obtains the forward and backward context features via the combination of forward and backward LSTMs and passes the result to the softmax classifier.


Data Pre-Processing
In real applications, raw data often contain a great deal of noise, which affects both the accuracy and the efficiency of data mining. Therefore, a series of pre-processing steps needs to be performed before the data are used. The main tasks of text pre-processing are word segmentation and stop word removal. Word segmentation refers to the division of long text data into individual words or phrases. NLTK (Natural Language Toolkit) is the most commonly used toolkit in the NLP field and provides a set of professional English word segmentation tools [47]. Therefore, in this study, the tokenization package provided by NLTK was used directly to convert the text data into a word-level dictionary. In text datasets, the most common words, such as "in," "the," and "a," may appear many times without providing valuable information. Stop word removal means removing these words, which can effectively reduce the size and dimensionality of the data and make training faster and better [48]. Other pre-processing steps include the conversion of uppercase letters to lowercase, the removal of numbers and special symbols, and lemmatization. All these steps effectively clean the data to improve the accuracy and speed of data mining.
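The pre-processing steps described above can be sketched with NLTK as follows; this is a minimal illustration of the pipeline, not the authors' exact code, and the stop-word list and lemmatizer settings are assumptions.

```python
# Minimal NLTK pre-processing sketch: lowercasing, removal of numbers and
# special symbols, word segmentation, stop word removal, and lemmatization.
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    text = text.lower()                                  # uppercase -> lowercase
    text = re.sub(r"[^a-z\s]", " ", text)                # drop numbers and symbols
    tokens = word_tokenize(text)                         # word segmentation
    tokens = [t for t in tokens if t not in stop_words]  # stop word removal
    return [lemmatizer.lemmatize(t) for t in tokens]     # lemmatization

print(preprocess("Employee #1 was found dead after exposure to chlorine."))
```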

Word Embedding
Traditional methods of word representation, such as one-hot vectors, suffer from two problems: loss of word order and excessive dimensionality. In this paper, the distributed word vector representation [49] with automatic parameter tuning was employed to replace the one-hot sparse matrix used in traditional machine learning models; it performs better at capturing the semantic and syntactic information of the words in each text. The core of the study lies in text-level classification. Suppose a text contains L words and w_i stands for the i-th word in the text; the word representation a_i can then be mapped to the vector representation x_i via the embedding matrix W, as shown in Equation (3):

$$x_i = W \cdot a_i \tag{3}$$

The Word2vec method proposed by Mikolov et al. [50] is used in this paper for word embedding. Specifically, the Skip-gram model of Word2vec is used, which trains semantic embeddings by predicting context words from a target word and thereby captures the semantic relations between words.
Although the Word2vec model has achieved good results in many fields, it cannot handle polysemy well because it uses a single word vector to represent all senses of a word. To better deal with the polysemy issue, this study also uses the BERT model to pre-train texts. Devlin et al. [34] pre-trained a multi-layer, bidirectional Transformer encoder on more than 3 billion words from BooksCorpus and English Wikipedia to obtain the BERT pre-training model. To apply BERT to the text classification of construction accidents, this study directly employed the BERT-base model.
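As a sketch, 300-dimensional Skip-gram embeddings can be trained with gensim (assuming gensim 4.x; corpus_tokens is a hypothetical list of pre-processed token lists, e.g., the output of the pre-processing step above):

```python
# Sketch: training 300-dimensional Skip-gram word vectors with gensim 4.x.
# corpus_tokens is a hypothetical stand-in for the tokenized narratives.
from gensim.models import Word2Vec

corpus_tokens = [
    ["employee", "fell", "scaffold", "died"],
    ["worker", "exposed", "chlorine", "hotel"],
]

model = Word2Vec(
    sentences=corpus_tokens,
    vector_size=300,  # embedding dimension used in this study
    sg=1,             # 1 = Skip-gram, 0 = CBOW
    window=5,
    min_count=1,
)

vector = model.wv["employee"]  # a 300-dimensional word vector x_i
```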

One Dimension Convolutional Layer
The local features of a text can be extracted via CNN, a kind of feedforward neural network whose structure mainly includes an input layer, a convolutional layer, a pooling layer, a fully connected layer, and an output layer. In the C-BiLSTM model, a single convolutional layer is used to reduce data dimensionality and extract sequential information; its structure is shown in Figure 3. Max-over-time pooling or dynamic k-max pooling is generally used to choose the most important (or k most important) features after convolution. However, the input to BiLSTM must be a serialized structure, and pooling would destroy the sequential organization of the text. Therefore, the data produced by the convolution operation do not undergo any pooling operation.

Take X ∈ R^{L×d}, the text representation after word embedding, as the input, where the vector of each word in the text is x_i ∈ R^d, L is the maximum length of the text, and d is the dimension of the word vectors (300 in this study). Convolution is mainly used for feature extraction: the features of the input text are extracted by sliding a filter m ∈ R^{k×d} (where k is the length of the filter) over the input sequence. At every position i in the sentence, there is a window vector w_i that contains k consecutive word vectors (x_i, x_{i+1}, ..., x_{i+k−1}). The eigenvalue c_i is obtained by convolving the filter m with the window vector w_i, as shown in Equation (4):

$$c_i = f(m \cdot w_i + b) \tag{4}$$
where b is the bias term, whose value is adjusted during training, and f(·) is the nonlinear activation function, rectified linear units (ReLU). ReLU outperforms other activation functions in terms of the number of iterations required for network convergence. By convolving every window vector in the text, the feature sequence C = (c_1, c_2, ..., c_{L−k+1}) is obtained. In C-BiLSTM, 128 filters of the same size are used to acquire multiple feature sequences, so after the convolutional layer the data become a new feature representation O, as shown below:

$$O = [C_1; C_2; \ldots; C_{128}]$$
To be specific, the semicolon stands for column vector concatenation; this new feature representation is fed into BiLSTM as input.
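A minimal Keras sketch of this pooling-free convolutional layer is given below; the filter length k and the text length L are illustrative assumptions, while the 128 filters and the 300-dimensional word vectors follow the description above.

```python
# Sketch of the pooling-free 1D convolution: 128 filters slide over the
# embedded text X (shape L x d) and output O of shape (L-k+1, 128),
# which preserves sequence order and is fed to BiLSTM.
import tensorflow as tf

L, d, k = 100, 300, 3  # illustrative max length, embedding dim, filter length

inputs = tf.keras.Input(shape=(L, d))
conv = tf.keras.layers.Conv1D(
    filters=128,            # 128 same-size filters, as in C-BiLSTM
    kernel_size=k,
    activation="relu",      # the nonlinearity f(.) in Equation (4)
    padding="valid",        # no padding, so the output length is L-k+1
)(inputs)                   # note: no pooling layer follows

print(conv.shape)           # (None, 98, 128), i.e., (L-k+1, 128) per text
```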

BiLSTM
The LSTM model can be used to address the difficulty that traditional machine learning models have in extracting high-level semantics from texts during classification. The model takes a text sequence matrix composed of pre-trained distributed word vectors as input and extracts feature expressions containing context information using its unique memory structure. The LSTM model structure is shown in Figure 4a. The standard LSTM network can only leverage the historical context; the lack of future context may lead to an incomplete understanding of the meaning of the text. BiLSTM is the combination of a forward LSTM layer and a backward LSTM layer, so the context can be fully used by summarizing the information both before and after each word. The model structure is shown in Figure 4b.
The key idea of RNN is to use sequential information [51]. Text classification can be treated as a sequence modeling task, and because of this sequential nature, RNN models have been widely used in text classification tasks [52,53]. The LSTM model is a special type of RNN [54] that overcomes the vanishing-gradient issue encountered during RNN training. Its key idea is to find and establish long-term dependencies between input values through a specially designed memory unit, so that the model can use more contextual information to extract high-level abstract features from texts. The memory unit structure of the BiLSTM model is shown in Figure 5.

The most important part of the memory unit is the memory state C, which is transmitted directly along the entire structure chain and undergoes only a small amount of linear computation, so that information can easily be kept unchanged during transmission. At the same time, the memory unit has a "gate" structure for adding or deleting the information contained in the memory state. A "gate" is a method of selecting information that combines the pointwise multiplication of vectors with the sigmoid function. A complete memory unit mainly includes the following parts: the memory state C_{t−1} at time t−1, the output h_{t−1} at time t−1, the forget gate f_t, the input gate i_t, and the output gate O_t, where the values of the three gates all lie between 0 and 1. The memory state C_{t−1} records the historical information of all previous time steps (the long-term memory of the model), while h_{t−1} records the information of the time step immediately before the current one (the short-term memory of the model). The memory state of the j-th memory unit at time t, C_t^j, is determined by the input gate i_t^j, the forget gate f_t^j, and the previous memory state C_{t−1}^j. Referring to the version of Zaremba et al. [55], the memory unit is calculated as shown in Equation (7):

$$\begin{aligned}
f_t &= \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) \\
i_t &= \sigma(W_i \cdot [h_{t-1}, x_t] + b_i) \\
\widetilde{C}_t &= \tanh(W_C \cdot [h_{t-1}, x_t] + b_C) \\
C_t &= f_t \odot C_{t-1} + i_t \odot \widetilde{C}_t \\
O_t &= \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)
\end{aligned} \tag{7}$$

In Equation (7), W represents the weight matrix corresponding to each control gate, b denotes the bias parameter, σ denotes the sigmoid activation function, tanh represents the hyperbolic tangent function, and x_t represents the input of the model at time t. The input gate i_t and the forget gate f_t control the addition of new information and the deletion of old information, respectively.
When the memory unit is updated, the hidden layer computes the current hidden state h_t from the result of the current output gate O_t:

$$h_t = O_t \odot \tanh(C_t)$$

From the data-processing procedure of the memory unit, it can be seen that the core idea of the memory unit structure is to continuously update the long-term and short-term information in the model according to the input information of the current word, that is, to continuously obtain the context features of the text. Let the hidden state output by the forward LSTM at time t be $\overrightarrow{h}_t$ and that output by the backward LSTM be $\overleftarrow{h}_t$; the hidden state h_t output by BiLSTM is then

$$h_t = [\overrightarrow{h}_t; \overleftarrow{h}_t]$$

The BiLSTM model finally predicts the category of the text through the softmax classification layer, as shown in Figure 2. Softmax refers to the softmax regression model, a commonly used multi-class classification algorithm. It computes the probability that the text to be classified belongs to each category by passing the output of the BiLSTM hidden layer to the softmax classification layer, and the final classification result is the category with the maximum probability.
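Putting the pieces together, the C-BiLSTM architecture described above can be sketched in Keras as follows. The convolutional layer (128 filters) and the softmax output over the 11 categories follow the paper; the LSTM hidden size and filter length are illustrative assumptions, since not every hyperparameter is reported.

```python
# Minimal Keras sketch of C-BiLSTM: embedded text -> pooling-free Conv1D
# -> BiLSTM (forward + backward LSTM) -> softmax over accident categories.
import tensorflow as tf

L, d, num_classes = 100, 300, 11  # L is an illustrative max text length

model = tf.keras.Sequential([
    tf.keras.Input(shape=(L, d)),                       # pre-trained word vectors
    tf.keras.layers.Conv1D(128, 3, activation="relu"),  # n-gram feature extraction
    tf.keras.layers.Bidirectional(                      # forward + backward context
        tf.keras.layers.LSTM(128)                       # hidden size assumed
    ),
    tf.keras.layers.Dense(num_classes, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```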

Data Description
The original data of construction accident narratives were downloaded free of charge from the Occupational Safety and Health Administration (OSHA) website, which contains more than 16,000 construction accident reports from 1983 to the present. Each report comes with a detailed description of the accident, including the cause and the final outcome. Unfortunately, these data are not explicitly labeled. Therefore, this study used an open-source dataset published in the earlier work of Goh et al. [40], which contains 4470 narratives downloaded from OSHA, of which 1000 are labeled. Table 1 shows a labeled example in which the title and the narrative are combined into a single piece of text data to make full use of the header information. The dataset was labeled according to the labels used by the Institute for Workplace Safety and Health [56]. Table 1. Example of accident narrative (with label).

Title: Employee is Found Dead after Exposure to Chlorine

Summary: On 27 June 2008, employee #1 and a coworker were performing mold inspections in an army barrack. A contractor was spraying a 6-to-1 mixture of bleach and water. Employee #1 complained of chest pains and the odor of chlorine later that evening. He was found dead in his hotel room the following day.

Label: Exposure to chemical substances

By carefully checking the labeled dataset, obvious imbalances were found among the samples of different categories, which tends to result in over-fitting of categories with many samples and under-fitting of categories with few samples during model training [57]. Therefore, to make the dataset more suitable for classification experiments, this research manually labeled additional data to balance the number of samples across categories. The dataset now contains 1863 instances and has been published on GitHub [58]. The 11 accident categories and their sample distributions are shown in Table 2. To train the model and evaluate its performance, a separate test dataset needs to be reserved to evaluate the generalizability of the model on complex texts [59]. Therefore, the labeled dataset was randomly divided into two groups: a training set for optimizing the model and a test set for evaluating model performance. The training set contains 1490 instances (80% of the total data) and the test set contains 373 instances (20% of the total data).
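The random 80/20 split can be reproduced with scikit-learn as sketched below; texts, labels, and the random seed are hypothetical placeholders, since the paper only states that the split was random.

```python
# Sketch of the random 80/20 train/test split of the labeled narratives.
from sklearn.model_selection import train_test_split

# Hypothetical placeholders for the 1863 narratives and their labels.
texts = ["employee fell from scaffold", "worker exposed to chlorine"] * 4
labels = ["Falls", "Exposure to chemical substances"] * 4

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels,
    test_size=0.2,     # 373 of the 1863 instances reserved for testing
    random_state=42,   # illustrative seed, not specified in the paper
)
print(len(X_train), len(X_test))
```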

Baseline Models
In the field of text classification, a number of machine learning algorithms have shown good performance. However, no single algorithm can always be superior to other algorithms and is applicable in all fields [23]. To evaluate the performance of the proposed C-BiLSTM-based method, this study selected three baseline classifiers, namely SVM, NB, and LR, for comparison. These three classifiers are not only widely used, but also state-of-the-art in the field of classification of construction-related documents. These algorithms are well established and are introduced in detail by Bishop [60]. In addition, the results of the C-BiLSTM model were compared with three deep learning models, namely CNN, LSTM, and BiLSTM. To compare the effects of different pre-training models on the experimental results, BERT and Word2Vec were used to process the data as inputs to the C-BiLSTM model.
SVM [61] is established based on statistics and the principle of structural risk minimization, with the ultimate goal of finding the optimal classification boundary under the given conditions, i.e., the optimal separating hyperplane that classifies the test samples optimally. The optimal separating hyperplane not only correctly divides the dataset but also ensures that the margin between the partitions is maximized.
NB [62] has a simple structure and is extensively used. It models the classification of documents through a probability model under the assumption that the features are independent of each other and identically distributed. The idea of the NB classifier is that, for a given item to be classified, the probability of each category given the item's features is computed, and the item is assigned to the category with the maximum probability.
LR [63], also known as logistic regression analysis, is used to deal with regression problems in which the dependent variable is categorical. It commonly copes with binary classification or binomial distribution problems but can also handle multi-class problems. It predicts the probability of future outcomes from the patterns in historical data.
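For comparison, one baseline can be sketched as a scikit-learn pipeline. The sparse TF-IDF representation is an illustrative choice standing in for the one-hot sparse matrix mentioned in the Word Embedding section; the NB and LR baselines are obtained by swapping the final estimator.

```python
# Sketch of an SVM baseline over a sparse text representation.
# Replace LinearSVC with MultinomialNB or LogisticRegression for NB/LR.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

svm_baseline = Pipeline([
    ("vectorizer", TfidfVectorizer(stop_words="english")),
    ("classifier", LinearSVC()),
])

# Using the X_train/y_train and X_test from the split sketched earlier:
# svm_baseline.fit(X_train, y_train)
# predictions = svm_baseline.predict(X_test)
```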

Experiment Results
As discussed in Section 2, the F1 score is used as the indicator to evaluate the performance of a classifier of construction accident narratives. Table 3 summarizes the F1 results of the proposed C-BiLSTM model and the baseline models, with the highest F1 score (i.e., the best classification performance) underlined; Figure 6 provides a visual illustration. It can be seen from Table 3 that the proposed C-BiLSTM-BERT-based model is generally superior to the other methods in terms of both the per-category F1 scores and the weighted average F1 score, with a maximum weighted average F1 of 0.81, which is better than 0.80 for BERT, 0.78 for C-BiLSTM-Word2vec, 0.76 for BiLSTM, 0.75 for LSTM, 0.71 for CNN, 0.71 for SVM, 0.58 for NB, and 0.69 for LR. In addition, the proposed model achieves the highest F1 score for most categories, except for "Exposure to chemical substances," "Struck by moving objects," and "Struck by falling objects," where the BERT method performs better. It is worth noting that the category "Electrocution" shows the best classification results across all methods, especially in the C-BiLSTM-BERT-based model with a score of 0.96, while the category "Struck by moving objects" has the worst classification result, with an F1 score of 0.65 in the proposed method. In general, different classifiers perform differently across categories, and the proposed C-BiLSTM-based method is better in aggregate performance.

Discussions and Future Work
It is obvious that the C-BiLSTM-based method outperforms a single model in terms of classification effect. As the convolutional layer can capture the local correlation of spatial or temporal structures, it can extract n-gram features at different positions of the text from the word vectors via the convolutional filters. What is more, the BiLSTM network can capture long-range features from the preceding text and also obtain features from the following text. Therefore, BiLSTM has more advantages than LSTM in making full use of the semantic and word order information of the text. The C-BiLSTM-based method combines the strengths of both models to improve classification accuracy. It is worth noting that, compared with the SVM model, a single CNN does not significantly improve the text classification results, while the BiLSTM model improves the accuracy of classification more. For the text classification task, word order features combined with context may influence the classification result more than local features do. It is also noted that the results of the BERT model are much better than those of the C-BiLSTM model that uses Word2vec for data pre-training, and the results of the C-BiLSTM model using BERT as the pre-training method are also clearly improved. These results indicate that the generalization ability of BERT and its extraction of context features have obvious advantages, but the complexity of the model and the data-processing time are much higher than those of the Word2vec model.
It can be found in Table 3 that, although the overall performance of the proposed C-BiLSTM model is better than the other baseline methods, the results are not ideal for some specific categories, which greatly affects the weighted F1 score. One example is the category "Collapse of object," which only gets an F1 score of 0.57. One possible reason is the unclear definition of the classification label. By carefully examining the dataset, it was found that some labels are difficult even for human readers to classify, especially the three types "Collapse of objects," "Falls," and "Struck by falling objects": in accident narratives, the collapse of objects is often accompanied by workers falling and being hit by falling objects. The imprecise definition of these classification labels may affect the accuracy of model identification. On the other hand, the accident narratives often contain a great deal of description of the working environment and of the rescue and care after the accident, while there are not enough words describing the direct cause of the accident. These redundant texts also pose a challenge to the accuracy of the classification results.
The experimental results show that no classifier achieves the best performance consistently across all categories. As such, the performance of different DL models can be further investigated in future work. In addition, a common problem in text classification tasks is the large amount of terminology in the texts. To reduce the number of features and improve performance, professional ontologies and dictionaries can be used during pre-processing [39] to help remove detailed descriptions unrelated to the events and to reduce data dimensionality. This work usually requires experts to build an ontology and/or a dictionary for a specific domain.
Finally, the proposed C-BiLSTM-based model was evaluated only on the text dataset obtained from OSHA. As such, the consistency of the model across different datasets can be further examined in future work.

Conclusions
Accidents can occur every day on a construction site and cannot be ignored by construction industry practitioners. To make full use of the information and knowledge in existing construction accident reports to prevent similar future accidents, this paper proposed an automatic classification method for the causes of accidents based on a C-BiLSTM model. By extracting semantic features from the contextual information in the text, this method can automatically classify construction accident narratives according to their causes. The proposed method shows significant improvement in classification performance, measured by the F1 score, compared to baseline methods used in the construction industry such as SVM, NB, and LR classifiers. The more accurate classification results obtained by this C-BiLSTM-based method also provide a data basis for further applications of construction accident reports, such as the prediction of construction accidents [64]. The classification model also provides a reference for text classification in other specific fields of the construction industry, for example, quality inspection.
In addition, this work still requires the manual labeling of construction accidents to form datasets for model training. In future work, unsupervised models that do not require labels can be considered for experiments and comparison of results. On the other hand, the analysis of construction reports shows that real accidents are often caused by multiple causes. Therefore, treating the data as a multi-label classification task or further refining the classification categories can be considered in future research.