Integrating Information Entropy and Latent Dirichlet Allocation Models for Analysis of Safety Accidents in the Construction Industry

Liu, Yipeng; Wang, Junwu; Tang, Shanrong; Zhang, Jiaji; Wan, Jinyingjun

doi:10.3390/buildings13071831

by Yipeng Liu ^1,2, Junwu Wang ^1,2, Shanrong Tang ^1,2, Jiaji Zhang ^1,2 and Jinyingjun Wan ^3,*

¹ Hainan Institute, Wuhan University of Technology, Sanya 572025, China
² School of Civil Engineering and Architecture, Wuhan University of Technology, Wuhan 430070, China
³ Office of Infrastructure Development, Wuhan University of Technology, Wuhan 430070, China
^* Author to whom correspondence should be addressed.

Buildings 2023, 13(7), 1831; https://doi.org/10.3390/buildings13071831

Submission received: 13 June 2023 / Revised: 14 July 2023 / Accepted: 15 July 2023 / Published: 20 July 2023

(This article belongs to the Section Construction Management, and Computers & Digitization)

Abstract: Construction accident investigation reports contain critical information, but extracting useful insights from the voluminous Chinese text is challenging. Traditional methods rely on expert judgment, which leads to time-consuming and potentially inaccurate results. To overcome this problem, we propose a novel approach that combines text mining techniques and latent Dirichlet allocation (LDA) models to analyze standardized accident investigation reports in the Chinese construction industry. The proposed method integrates an information entropy term frequency-inverse document frequency (TF-IDF) weighting scheme to evaluate term importance and accounts for word and model uncertainty. The method was applied to a set of construction industry accident reports to identify the key factors leading to safety accidents. The results show that the causal factors of accidents in Chinese accident investigation reports consist of keywords and negative expressions, including “failure to timely identify safety hazards” and “inadequate site safety management”. Failure to timely identify safety hazards is the most common factor in accident investigation reports, and the negative expressions commonly used in the reports include “not timely” and “not in place”. The information entropy TF-IDF method is superior to traditional methods in terms of accuracy and efficiency, and the LDA model that considers word frequency and feature weights is better able to capture the underlying themes in the Chinese corpus; the subject terms that make up these themes also contain more information about the causes of accidents. This approach helps site managers understand, more quickly and effectively, the causal factors and key messages in incident reports. It gives site managers insight into common patterns and themes associated with safety incidents, such as unsafe practices, hazardous work environments, and non-compliance with safety regulations, enabling them to make informed decisions to improve safety management practices.

Keywords: text mining; information entropy TF-IDF; latent Dirichlet allocation (LDA); accident investigation report; construction industry

1. Introduction

In the construction industry, identifying the critical factors that contribute to accidents, and distinguishing the different causal factors that lead to hazards and accidents on construction sites, is essential to improving construction safety [1,2]. Traditionally, the identification of safety-related factors has relied on the expertise of domain experts or on manual safety inspections in which managers record safety hazards in textual form [3]. The investigation of accident causes has also involved literature research, semi-structured interviews with construction industry practitioners, and court case analysis [4]. However, these activities are time-consuming and labor-intensive, and they often lack relevance and validity for analyzing safety incidents, because they mainly record hazards when no safety incident has occurred.

With the advance of digitization, researchers have turned to automated methods to obtain critical information, primarily through data mining and data analysis [5,6,7]. Automation allows computers to analyze large amounts of unstructured textual data in a short period of time [8], reducing the time and cost associated with manual analysis [9] and significantly increasing research efficiency. The construction industry generates a substantial amount of unstructured textual data that contains key information about accidents [10,11], and text mining techniques have proven effective in analyzing the causes and prevention of accidents in this industry [12,13]. However, a significant amount of textual information, particularly accident investigation reports produced by government agencies, remains underutilized [11]. These reports provide credible and standardized textual information that describes the causes and responsibilities of accidents throughout their development, but there is a lack of comprehensive analytical studies in this area.

Accident investigation reports play a vital role in enhancing safety management practices within the construction industry. These reports are valuable because they not only describe the sequence of events leading to an accident but also provide a comprehensive analysis of its causes [14]. Accident reports may be extensive and contain large amounts of textual data, making it difficult to extract relevant and meaningful insights [15]. In addition, information may be structured in different ways or use different terminology, further complicating the extraction process [3]. Safety management needs to focus on the key factors that lead to accidents and identify trends or patterns that can help prevent future accidents. However, within these multi-dimensional datasets, it can be challenging to determine what information is most relevant and important for effective safety management. Although text mining approaches can extract semantic features from the text, achieving a deep analysis of text content is difficult. Traditional methods involve manual feature extraction by experts or rely on techniques such as word frequency and term frequency-inverse document frequency (TF-IDF). However, word frequency-based extraction methods often treat a large number of common words with low information content as important, while TF-IDF-based methods account for the importance of low-frequency words but overlook the distribution of keywords within the text, so the extracted keywords lack readability. These limitations hinder the analysis of complex texts, such as accident investigation reports, which contain a wealth of information. Despite some partial improvements in this area [16], there is a lack of research on keyword weighting based on the semantic information content of the text and on the integration of information entropy weighting with TF-IDF weighting.
In addition, the varying information content of words in short texts makes it difficult for simple improved methods to determine word importance accurately. Current keyword determination approaches do not address this problem, which hinders information extraction from complex Chinese texts. To overcome these challenges, this study proposes an improved text mining method that combines TF-IDF and information entropy. By incorporating information entropy as a weight, the resulting composite weights provide a more comprehensive representation of the importance and relevance of each word in the text. This enables more effective identification and extraction of important information, thereby improving the accuracy of text analysis and understanding. The research thus addresses the limitations of the traditional TF-IDF method and presents an improved method for determining keywords in text mining.

The main contributions of this paper are:

(1): Text mining is performed on standardized Chinese accident investigation report titles and causes texts to identify high-frequency accident types, key causal factors, and corresponding responsibilities. This information helps to clarify causes and responsibilities, leading to more accurate and efficient construction safety management.
(2): The study addresses the issue of information content in word-based text mining and proposes an improved feature extraction method called information entropy TF-IDF. By combining TF-IDF with information entropy, the method takes into account the importance and distribution of keywords. Furthermore, by combining TF-IDF information entropy with the LDA topic model, the key factors contributing to accidents are visualized. This visualization helps safety managers better understand the causes of accidents and take appropriate preventive measures more clearly and intuitively.

The article is organized into the following sections. Section 2 provides a review of the current state of research in text mining as well as the existing methods for keyword extraction and LDA topic modeling. Section 3 outlines the main steps involved in text mining and topic modeling, along with the derivation process of the improved information entropy TF-IDF method proposed in this paper. In Section 4, the collection and initial analysis of accident investigation reports are described, along with the visualization of the results obtained through the improved approach for keyword extraction and LDA topic modeling. These findings are further discussed in Section 5, and the limitations of the study are presented in the concluding part of the discussion. Finally, Section 6 highlights the research contributions and the key results achieved in this paper.

2. Literature Review

Text mining is a powerful technique for extracting valuable information from large amounts of textual data. Using computer processing, text mining converts unstructured text data into structured data and automatically extracts textual features and information [17]. This information and knowledge can be used to analyze, process, and understand large amounts of safety-related text data. It enables the identification and resolution of potential safety issues, the prediction and prevention of similar accident events, and the improvement of the efficiency and accuracy of construction safety management [18]. As a result, text mining gives safety managers the ability to quickly access a large amount of information, identify hidden safety problems, and implement timely preventive measures, thereby reducing the likelihood of construction accidents and protecting workers’ lives.

2.1. Status of Text Mining

Text mining techniques have been widely applied to analyze unstructured accident text data, enabling researchers to study the causes of safety accidents in the construction process [19,20]. For instance, Liu et al. employed text mining in risk assessment to construct knowledge graphs from accident reports [21], while Zhang et al. used text mining and natural language processing (NLP) to conduct a comprehensive study of construction safety accident investigation reports [9], applying the TF-IDF method to optimize the weights and extract the main factors that led to the accidents. Zhou et al. utilized the LDA model and related explanatory clustering to identify and analyze the significant areas of research in the field of construction safety and health [22]. Zhong et al. [23] proposed an intelligent text mining method combining deep learning and LDA modeling and tested its effectiveness on the text of construction site accidents published on the OSHA website; the extracted textual knowledge was used to describe the accident scenarios visually. However, they also pointed out that the study was limited to text from the OSHA website and lacked analysis of text information from other sources. In addition, the subject words extracted by the LDA model in that study were descriptive words about accident types that are often found in accident reports, and keyword extraction for more specialized technical terms was lacking. Additionally, Tixier et al. designed an automated, rule-based text information extraction procedure for the accurate scanning of injury reports [20]. This approach allowed for the extraction of safety-related structured information from large-scale datasets, which in turn facilitated the prediction of safety outcomes and enhanced safety management.

In the context of accident investigation reports in Chinese, Zhang et al. applied accident causation theory and a systems thinking approach to extract 26 key factors from 571 accident investigation reports in China [24]. Li et al. constructed a lexicon and used document frequency to identify three frequently occurring safety risk factors from 15 accident reports [25]. In addition, Li used text mining techniques to identify 78 safety risk factors from 726 coal mining accident investigation reports, intending to establish an effective automatic information mining scheme for non-standardized coal mining accident investigation reports [26]. Xu et al. conducted an analysis of 188 subway accident reports using a modified TF-IDF method [27]. However, this study included manual intervention to set thresholds and eliminate low-frequency items, potentially reducing the information content of the text. In addition, Xu highlighted the association between different construction activities and different safety risks, emphasizing the need to analyze multiple categories of accident reports to uncover additional mechanisms of workplace accidents. It is evident that current research mainly focuses on analyzing accident characteristics based on a single accident investigation report in the context of construction activities. As a result, the study of accident report texts in construction accident investigation reports lacks depth, and there is a lack of effective methods for extracting more valid information from Chinese texts.

The aforementioned text mining methods demonstrate the effectiveness of machine learning and deep learning-based approaches in extracting relevant information from text. These methods enable fast and efficient retrieval and extraction of information from a significant amount of unstructured text content. However, most existing research has focused on construction process safety hazard texts or OSHA accident reports, overlooking the non-standard nature of accident reports. Few studies have used machine learning methods to develop models for the automated extraction and analysis of safety text information from standardized Chinese construction accident investigation reports. Since accidents in the construction industry can result in fatalities and have significant consequences, the analysis of accident cases is crucial, as they serve as illustrative examples that provide persuasive evidence.

2.2. Key Methods for Text Mining

The design and extraction of features play a pivotal role in text mining using machine learning techniques, especially in enhancing performance and accuracy in short text classification [6,23]. In recent studies, various approaches have been applied to extract textual information and features from safety incident reports using machine learning. For instance, Sun et al. applied an improved TF-IDF method that considered word length, lexicality, and lexical position to extract informational keywords from construction reports [16], which were then visualized as a network graph. Based on 2.3 million public comments related to smart cities, Yue constructed topic mining and sentiment analysis models using LDA and CNN-BiLSTM algorithms. These models aimed to extract public perceptions of the concept of smart cities from a large volume of text [28].

Furthermore, many researchers have combined multiple methods to analyze text features extracted using the TF-IDF method. They have applied these features for information extraction and topic mining using the LDA model on large text datasets [29]. Na utilized text mining techniques to identify safety risk factors from a vast collection of accident investigation reports [27]. They introduced an improved terminology importance assessment method, TF-H, which integrated the frequency and distribution of the key factors in the report text. Tian et al. focused on identifying technical terms in the safety domain based on semantic similarity and information association [3]. They constructed a corresponding terminology library and used the TF-IDF method for mining and analyzing technical terms to determine deeper insights into safety hazard information. Additionally, Suh employed case data from the OSHA website and utilized text mining along with latent Dirichlet allocation (LDA) to extract key textual information from accident reports [30].

It is worth noting that while machine learning text mining methods have improved the efficiency of extracting and managing text information, accurately identifying text features, such as keywords, from a large corpus remains a challenge. One significant drawback of TF-IDF is its disregard for word semantics, resulting in ineffective keyword extraction [31]. Given the abundance of specialized terminology in the field of building construction, traditional TF-IDF methods may assign higher weights to rare words, and word frequency-based LDA models may treat many common words with little meaning as significant terms. This interference from irrelevant information within the corpus hinders the extraction of key information from the text. Moreover, Chinese text mining differs from English text mining because Chinese is written as continuous text without explicit word delimiters and individual Chinese characters carry weaker meanings on their own, which makes text segmentation necessary.

2.3. Research Gaps and Contributions

Existing research on text mining of accident investigation reports in the construction industry, especially Chinese reports, reveals significant research gaps. Previous studies have primarily focused on characterizing accidents based on individual reports, resulting in limited depth of analysis and a lack of comprehensive understanding of construction accidents. In addition, these studies have overlooked the non-standard nature of accident reports, making it difficult to extract valid and meaningful information. Another research gap lies in the limitations of traditional text mining methods, such as TF-IDF, which fail to accurately identify relevant keywords and extract key information from accident reports. TF-IDF neglects word semantics and may assign higher weights to rare words, resulting in unreliable keyword extraction. Thus, there is a clear need to develop automated methods for extracting and analyzing safety-related textual information in standardized Chinese construction accident investigation reports.

To address these research gaps, this study makes several important contributions. First, we have compiled a comprehensive corpus of Chinese accident investigation reports from authoritative sources in the construction industry. By constructing a specialized lexicon and applying text segmentation techniques, the dimensionality can be reduced, synonyms and irrelevant features can be eliminated, and interference in the analysis process can be minimized. Second, the study proposes an improved TF-IDF method that accounts for the inherent ambiguity, uncertainty, and semantics of words in accident reports. This method, which incorporates an information entropy-based TF-IDF feature extraction approach combined with the LDA model, enables a more thorough analysis of standardized Chinese construction accident investigation reports.

3. Materials and Methods

This paper presents a comprehensive study of the challenges associated with extracting information from Chinese accident investigation reports in the construction industry. The study involves several key steps, including data collection, text pre-processing, feature extraction, topic model training, and application. Data collection mainly involves obtaining information from the official websites of government departments and construction industry authorities. In the text pre-processing stage, special attention is paid to merging synonyms and identifying key causal factors. In the feature extraction stage, an improved information entropy TF-IDF method is used to extract meaningful features from the text. Finally, the text is subjected to thematic analysis using LDA topic modeling to uncover implicit themes and key information within the accident reports. By following these steps, the study extracts critical information and insights from accident investigation reports and provides support for accident investigation and prevention efforts. The specific implementation framework is shown in Figure 1.

3.1. Text Pre-Processing

Data pre-processing is a crucial step in Chinese text mining due to the unique characteristics of the Chinese language. Chinese texts often present challenges such as unclear word boundaries, a large number of synonyms, and subjective expressions used by writers [26,27]. To address these challenges, various data pre-processing techniques are used to clean and improve the quality of the text before analysis. This typically involves several steps, including word segmentation, stop word removal, part-of-speech tagging, and normalization.

(1): Stop word removal: Removing stop words helps refine the data set by focusing on essential and informative vocabulary. This process reduces noise and unnecessary clutter in the text, allowing for a more accurate analysis of accident investigation reports. By excluding stop words, subsequent analysis can focus on the key factors, relationships, and patterns present in the data, allowing for a more meaningful interpretation of the results.
(2): Removal of irrelevant information: This step involves eliminating irrelevant content, such as numbers, special characters, URLs, or any other information that does not contribute to the analysis. By removing these extraneous elements, the text is simplified, and the structure of the data is streamlined.

Overall, the pre-processing of Chinese text data is critical to ensuring that the data is ready for analysis, as it enables the removal of irrelevant information and ensures the consistency and accuracy of the data. By following these steps, researchers can improve the reliability and validity of the analysis conducted on accident investigation reports.
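
The two cleaning steps above can be illustrated with a minimal sketch. It assumes the text has already been segmented into tokens where needed; the stop word set, the regular expressions, and the function names `clean_text` and `remove_stop_words` are illustrative assumptions, not the authors' implementation.

```python
import re

# Hypothetical stop word list; a real study would load a full Chinese stop word file.
STOP_WORDS = {"的", "了", "在", "和", "是"}

# URLs are removed first so their punctuation does not leave stray fragments.
URL_RE = re.compile(r"https?://\S+")
# Digits (ASCII and full-width) and any character that is neither a word
# character nor whitespace (punctuation, symbols) are treated as noise.
NOISE_RE = re.compile(r"\d+|[^\w\s]")


def clean_text(text: str) -> str:
    """Strip URLs, numbers, and special characters, then collapse whitespace."""
    text = URL_RE.sub(" ", text)
    text = NOISE_RE.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()


def remove_stop_words(tokens):
    """Drop empty tokens and tokens found in the stop word list."""
    return [t for t in tokens if t and t not in STOP_WORDS]
```

For example, `clean_text("Crane overload 50% at site, see https://example.org/r1!")` yields `"Crane overload at site see"`. Note that stripping all digits also discards quantities such as "50%", which is acceptable only if the analysis targets the keyword rather than the magnitude.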

3.2. Add Custom Dictionaries and Word Frequency Statistics

The construction industry is a highly specialized field with a variety of unique and technical terms. When investigating construction safety accidents, it is crucial to consider various details such as site conditions, accident causes, and response measures, which often include a large number of specialized terms such as high fall, external frame, pit, and enclosure structure. However, the standard Chinese dictionaries used for word segmentation and stop word removal lack professional terminology related to the construction industry. As a result, when these terms are broken down into smaller units during text mining, they may lose their original meanings [27], leading to misinterpretation and inaccurate modeling results. Therefore, it is necessary to identify the specialized vocabulary of the construction industry domain before identifying relevant keywords and excluding irrelevant or interfering words, resulting in more effective and accurate text mining.

In order to better understand the causes of and responses to construction accidents, it is essential to add custom dictionaries for the analysis of construction safety accident investigation reports [32]. There are two primary sources for constructing these custom dictionaries. The first is the selection of words related to safety production and engineering in the Chinese input method, Sogou’s Thesaurus, as the research focus of this paper is construction safety accidents in the construction industry. The second source is research reports and papers on construction safety management, which contain a wealth of technical terms and keywords used in the process of risk indexing and risk assessment. By systematically collecting and organizing relevant professional terms and keywords, a more accurate and complete custom dictionary can be constructed, which helps to improve the quality and efficiency of topic modeling.

3.3. Word Segmentation and Word Frequency Statistics

Chinese text poses unique challenges to word segmentation due to the lack of explicit spaces between words. This lack of clear word boundaries makes it difficult to determine where one word ends and another begins, a phenomenon known as word boundary ambiguity [25,33]. The accuracy of word segmentation directly affects subsequent text mining tasks, such as keyword extraction and topic modeling, and influences the overall quality and reliability of the analysis. Consequently, the initial step involves segmenting the text into individual words or characters. Various methods can be employed for this purpose, including rule-based approaches, statistical methods, and machine learning algorithms. Among these, the “jieba” word segmentation tool is commonly used for Chinese text. It supports Chinese word segmentation, word frequency statistics, and the removal of stop words.

Before performing word segmentation on accident investigation reports, several challenges need to be addressed using different processing methods:

(1): Standardized Terminology and Thesaurus Construction: Upon collecting the index system for this study, it was observed that many causal factors have standardized expressions. By employing the aforementioned word segmentation and stop word removal processes, more precise accident causal factors can be obtained. However, there may still be variations in phrases that convey the same meaning. For instance, while “low safety awareness” describes a lack of safety awareness among individuals, some accident reports may use synonymous phrases like “insufficient safety awareness” or “poor safety awareness” interchangeably. To address this issue, it is essential to construct a thesaurus of synonyms to streamline and consolidate the causal factors.
(2): Extraction of Key Terms from Non-Standardized Expressions: Accident investigation reports often contain key terms for cause analysis that lack standardized expressions. These terms may not be directly obtained from previous research reports or academic papers. For example, the term “overload” may have various expressions, such as “overload 50%”, “overload 1.7 times”, or “overload lifting”. Although the expressions differ, they convey the same meaning. Therefore, in the process of mining cause-related information from text, it is important to extract the keyword “overload” to obtain the relevant key information.
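
As a minimal sketch of these two steps, the snippet below maps variant phrases to a canonical causal factor and pulls a key term out of non-standardized expressions. The thesaurus entries and key term list reuse the examples from the text; the data structures and the function names `normalize_phrase` and `extract_key_terms` are assumptions made for illustration only.

```python
# Hypothetical thesaurus: variant phrase -> canonical causal factor.
SYNONYMS = {
    "insufficient safety awareness": "low safety awareness",
    "poor safety awareness": "low safety awareness",
}

# Hypothetical list of key terms to recover from non-standardized expressions.
KEY_TERMS = ["overload"]


def normalize_phrase(phrase: str) -> str:
    """Return the canonical form of a phrase, or the phrase itself if unknown."""
    phrase = phrase.strip().lower()
    return SYNONYMS.get(phrase, phrase)


def extract_key_terms(phrase: str):
    """Return every known key term that appears inside the phrase."""
    lowered = phrase.lower()
    return [term for term in KEY_TERMS if term in lowered]
```

With these definitions, "overload 50%", "overload 1.7 times", and "overload lifting" all reduce to the single key term "overload". Simple substring matching is used here for brevity; a real pipeline would more likely match on segmented tokens.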

Key causal factors can be derived from a combination of keywords and negative descriptions. These negative descriptions often appear in accident reports with terms like “not in place”, “not seriously implemented”, “not strictly implemented”, “insufficient”, and so on. They lack standardized expressions and are typically dispersed throughout the text. Even when extracting key technical terms, it is important to recognize that complete causal factors consist of key terms accompanied by negative descriptions.
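
One way to picture the pairing of keywords with negative descriptions is the sketch below. It assumes the text is already segmented into tokens and that the negative description follows the keyword within a small window; the word lists, the window size, and the function name `causal_factors` are hypothetical and are not taken from the paper.

```python
# Hypothetical word lists; a real study would derive them from the corpus.
KEYWORDS = {"safety management", "safety training", "hazard identification"}
NEGATIVE_EXPRESSIONS = {"not in place", "not timely", "insufficient"}


def causal_factors(tokens, window=2):
    """Pair each keyword with the first negative expression within `window` tokens after it."""
    factors = []
    for i, token in enumerate(tokens):
        if token not in KEYWORDS:
            continue
        # Only the tokens immediately after the keyword are inspected.
        for following in tokens[i + 1 : i + 1 + window]:
            if following in NEGATIVE_EXPRESSIONS:
                factors.append((token, following))
                break
    return factors
```

A keyword with no nearby negative description is skipped, which reflects the point that a keyword alone does not constitute a complete causal factor.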

By segmenting the text into words, it becomes possible to split it into multiple lexical units, which can then be used for word frequency statistics. This process aids in extracting information such as keywords, key phrases, and causal factors that are directly composed of keywords and negative expressions. In the headlines of Chinese accident reports, word frequency statistics can be employed to quickly obtain vital information such as the time, location, and type of the accident. This assists managers in gaining a better understanding of the accident events. Furthermore, word frequency statistics serve as the foundation for generating word clouds, which visually highlight the key points and critical information in the text by displaying words or phrases with high word frequency.
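
Word frequency statistics over segmented report titles can be sketched with the standard library alone. The example titles are invented for illustration, and the function name `word_frequencies` is an assumption.

```python
from collections import Counter


def word_frequencies(token_lists, top_n=3):
    """Count tokens across all segmented texts and return the most frequent ones."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(tokens)
    return counts.most_common(top_n)


# Invented, already-segmented report titles.
titles = [
    ["Wuhan", "scaffold", "collapse", "accident"],
    ["Sanya", "fall from height", "accident"],
    ["Wuhan", "fall from height", "accident"],
]
top_terms = word_frequencies(titles)
```

The resulting (word, count) pairs are the kind of input a word cloud generator typically expects.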

3.4. Calculation of Information Entropy TF-IDF

A crucial step before running LDA modeling is text feature extraction, and selecting an appropriate method for feature extraction can enhance the performance and efficiency of the model. By reducing the dimensionality of the features and identifying keywords, LDA models can be trained more efficiently, and the resulting topics may exhibit greater coherence and interpretability. Furthermore, feature selection aids in eliminating noisy or irrelevant features that could adversely impact the performance of the LDA model. Various approaches exist for text feature selection, among which is the TF-IDF method. TF-IDF is a widely employed technique in text mining for extracting keywords and essential feature information. It measures the significance of a word in a document collection by multiplying the word’s term frequency (TF) in a particular document by its inverse document frequency (IDF) across the entire corpus [34], as depicted in Equation (1).

\mathrm{TF\text{-}IDF} = tf_{ij} \times idf_{i}

(1)
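
Equation (1) can be sketched in a few lines of Python. The text does not spell out the exact definitions of the two factors, so this sketch assumes the common forms: tf is the count of word i in document j divided by the document length, and idf is the natural logarithm of the number of documents divided by the number of documents containing word i. The function name `tf_idf` is hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {word: score} dict per document, with score = tf * idf."""
    n_docs = len(docs)
    # Document frequency: in how many documents each word appears.
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))

    scores = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        scores.append({
            # Assumed forms: tf = count / length, idf = ln(N / df).
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores
```

Under these assumptions, a word that appears in every document receives a score of zero, while a word concentrated in a single document is weighted up.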

The TF-IDF method assigns higher scores to words that are more relevant to the documents, but it also has some limitations. When applied to safety risk extraction, TF-IDF does not consider the semantic relationship between words or the distribution of key causal factors in documents [35]. In contrast, information entropy, which is grounded in information theory, is widely used for data dimensionality reduction and feature selection in statistical text semantics [36]. In natural language processing, a word’s entropy can indicate its level of ambiguity or uncertainty in a given dataset, and it can effectively measure the information content of a word in a text, as shown in Equation (2).

H = - \sum p_{i} \log p_{i}

(2)

Therefore, to address the limitations and shortcomings of TF-IDF and consider the fuzzy uncertainty in accident reports and the importance of words, it is essential to take into account the semantic information of words in the text. In this regard, the TF-IDF matrix is enhanced by incorporating information entropy weighting. This approach considers the significance of words by measuring their information entropy TF-IDF scores. Specifically, based on the word weights calculated by the traditional TF-IDF, the information entropy weights are further calculated, and then the two weights are combined to balance the importance of words.

(1): Calculate the TF-IDF score

Calculate the TF-IDF score of each word in the document using Equation (1). This score captures the importance of a word within the specific document and the entire corpus. In this step, the traditional weights of the words are calculated to measure their relevance to the documents. Subsequently, the information entropy weights of the words are computed to gauge the information content and uncertainty associated with each word.

(2): Calculate word information entropy

According to the information entropy formula in information theory, the calculation of word information entropy is based on treating each sentence as an event and the occurrence or non-occurrence of a word as the result of that event. The information entropy of a word is determined by the probability of the event occurring.

To calculate the probability of a word occurring in the data set, we divide the number of sentences in which the word occurs by the total number of sentences, denoted as p(w). Similarly, the probability that a word does not occur in the data set is calculated as 1 minus p(w).

Based on the definition of information entropy, if the probability is 1 or 0, the corresponding information entropy is 0. This is because the outcome of the event is already determined and there is no uncertainty. When the probability is between 0 and 1, the information entropy is the negative sum, over the two outcomes (occurrence and non-occurrence), of each outcome's probability multiplied by the logarithm of that probability. This allows us to quantify the degree of uncertainty associated with the word.

Therefore, the expression for word information entropy, given in Equation (3), combines these concepts and provides a measure of the information content and uncertainty of the word in the data set. A word with higher information entropy is considered more informative. Information entropy allows us to gain insight into the meaning and distinctiveness of individual words, which can further contribute to the accurate analysis and interpretation of textual data. Equation (3) is used to calculate the information entropy:

Entropy = \begin{cases} - p(w) \log p(w) - (1 - p(w)) \log (1 - p(w)), & p(w) \neq 0, 1 \\ 0, & p(w) = 0, 1 \end{cases}

(3)

In this equation, p(w) represents the number of texts containing the word w in the text collection, divided by the total number of texts in the collection.
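Equation (3) can be sketched in a few lines. The function name `word_entropy`, the toy texts, and the use of the natural logarithm are assumptions made for illustration; the text does not specify the base of the logarithm.

```python
import math

def word_entropy(word, texts):
    """Binary entropy of `word` occurring in a text, following Equation (3).

    `texts` is a list of tokenized texts; p is the share of texts that
    contain the word. The natural logarithm is an assumption.
    """
    p = sum(1 for text in texts if word in text) / len(texts)
    if p in (0.0, 1.0):
        # The outcome is certain, so there is no uncertainty.
        return 0.0
    return -p * math.log(p) - (1 - p) * math.log(1 - p)

# Hypothetical tokenized texts.
texts = [
    ["worker", "fall", "scaffold"],
    ["worker", "scaffold"],
    ["worker", "crane"],
    ["worker", "fall"],
]
h_worker = word_entropy("worker", texts)  # occurs in every text
h_fall = word_entropy("fall", texts)      # occurs in half of the texts
h_crane = word_entropy("crane", texts)    # occurs in a quarter of the texts
```

Because this is a two-outcome entropy, it peaks when a word occurs in half of the texts and drops to zero for words that occur everywhere or nowhere.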

(3): TF-IDF and information entropy normalization

The TF-IDF weights assign higher weights to uncommon words, emphasizing their importance in the text. On the other hand, the information entropy weights measure the importance of words based on their information content. After obtaining both the TF-IDF weights and information entropy weights, it is essential to integrate them, taking into account their respective contributions.

To ensure a fair integration process, normalization of the weights is necessary. This step addresses the issue of weight equalization caused by the disparity in quantity between the two weight sets. Normalization involves adjusting the magnitudes of the weights to comparable ranges, which allows for a balanced integration.

The normalization step scales the values of TF-IDF and information entropy to the range [0, 1] so that both weights contribute to the final result in a balanced way.

tfidf_{normalized} = \frac{tfidf_{i} - \min(tfidf)}{\max(tfidf) - \min(tfidf)}

(4)

Here, tfidf_i is the TF-IDF value of the word calculated according to Equation (1), and min(tfidf) and max(tfidf) denote the minimum and maximum TF-IDF values of the word among all documents in the data set, respectively. Similarly, Equation (5) represents the process of normalizing the information entropy values. Information entropy is a measure of the amount of uncertainty or randomness in a data set.

Entropy_{normalized} = \frac{Entropy_{i} - \min(Entropy)}{\max(Entropy) - \min(Entropy)}

(5)

In this equation, Entropy_i represents the original information entropy values of each word, while min(Entropy) and max(Entropy) denote the minimum and maximum entropy values of the word among all documents in the data set, respectively.
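The min-max scaling in Equations (4) and (5) might be implemented as follows. The helper name `min_max_normalize` is hypothetical, and returning zeros when all values are equal is an assumption, since the equations are undefined in that case.

```python
def min_max_normalize(values):
    """Scale a list of scores to [0, 1], as in Equations (4) and (5)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: with no spread, every score maps to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical raw scores for three words.
normalized = min_max_normalize([2.0, 4.0, 6.0])
```

The smallest score maps to 0 and the largest to 1, so both weight sets end up on the same scale regardless of their original magnitudes.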

(4): Calculate the improved information entropy TF-IDF score

Finally, Equation (6) calculates the information entropy TF-IDF score, which takes into account both the normalized TF-IDF and entropy values. The equation is defined as the sum of the normalized TF-IDF and entropy values:

Entropy\_tfidf = tfidf_{normalized} + Entropy_{normalized}

(6)

The improved information entropy TF-IDF score takes into consideration not only the frequency of words but also the overall characteristics and complex semantic information of the text collection. By ranking the words according to this score, we can obtain the most important words or phrases in the text, which can be used in keyword extraction and feature extraction for topic modeling. By using the information entropy TF-IDF method, we can address the limitations of TF-IDF and improve the reliability and effectiveness of these applications.
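To make the ranking step concrete, the sketch below adds the two normalized scores per word, as in Equation (6), and returns the highest-scoring words. It assumes one normalized TF-IDF value and one normalized entropy value per word; the function name `rank_by_entropy_tfidf` and the sample values are hypothetical.

```python
def rank_by_entropy_tfidf(tfidf_norm, entropy_norm, top_k=30):
    """Sum normalized scores per word (Equation (6)) and rank the words.

    Both arguments map word -> score already scaled to [0, 1].
    A word missing from `entropy_norm` is treated as having entropy 0
    (an assumption for this sketch).
    """
    combined = {
        word: score + entropy_norm.get(word, 0.0)
        for word, score in tfidf_norm.items()
    }
    ranked = sorted(combined.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]

# Hypothetical normalized scores.
tfidf_norm = {"harness": 0.75, "scaffold": 0.10, "company_x": 1.00}
entropy_norm = {"harness": 0.80, "scaffold": 1.00, "company_x": 0.00}
ranking = rank_by_entropy_tfidf(tfidf_norm, entropy_norm, top_k=2)
```

In this example a rare proper name ("company_x") has the highest TF-IDF value but falls behind words that also carry high entropy, which mirrors the motivation for combining the two weights.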

3.5. Construction of LDA Model

LDA is a probabilistic generative model based on Bayesian inference. It assumes that each document is composed of a mixture of multiple topics, where each topic is represented as a probability distribution over a set of words. The primary goal of LDA is to uncover hidden topics in a corpus by analyzing the co-occurrence patterns of words within documents [37]. To achieve this, LDA first constructs a bag-of-words model, which represents each document as a vector of word counts. Subsequently, relevant features are extracted from the bag-of-words model using feature extraction techniques. The resulting feature matrix is then utilized as input to train the LDA model using the Gibbs sampling algorithm [38]. After the training process, the LDA model can be employed to infer the topic categories of each document and the lexical clustering of each topic, providing the results of the topic modeling. By identifying the most significant topics in the corpus and their associated words, LDA offers insights into the underlying structure and themes of the corpus. The modeling process of LDA can be summarized as follows:

(1): Construct the document-word matrix V to represent the corpus. Each row corresponds to a document and each column to a word, so that with i documents and j words the element a_{i,j} denotes the frequency of word j in document i. For the product in Equation (9) to be defined, the number of columns j must equal m, the number of words weighted in Equation (8).

V = \begin{pmatrix} a_{1,1} & a_{1,2} & \dots & a_{1,j} \\ a_{2,1} & a_{2,2} & \dots & a_{2,j} \\ \vdots & \vdots & \ddots & \vdots \\ a_{i,1} & a_{i,2} & \dots & a_{i,j} \end{pmatrix}

(7)

(2): The information entropy weight matrix can be expressed as

E = \begin{pmatrix} e_{1} & 0 & \dots & 0 \\ 0 & e_{2} & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & e_{m} \end{pmatrix}

(8)

(3): Weight the document-word matrix to obtain the weighting matrix W

W = V \times E

(9)

where the element w_ij of the matrix W represents the word frequency in the document weighted by the information entropy TF-IDF value.

(4): The weighting matrix W is used as input for training and classification with the LDA model.

In this paper, the weighted word frequency matrix is constructed by combining the word frequency matrix with the information entropy TF-IDF weighting. This weighted matrix is then fed into the LDA model to enable better differentiation among commonly used words.
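Because E in Equation (8) is diagonal, the product in Equation (9) simply scales each word column of V by that word's weight. A minimal sketch, assuming rows of V are documents and columns are words; the function name `weight_matrix` and the numbers are illustrative.

```python
def weight_matrix(V, e):
    """Compute W = V x diag(e): scale column j of V by the weight e[j].

    V is a list of rows (documents); e holds one weight per word column.
    """
    if any(len(row) != len(e) for row in V):
        raise ValueError("each row of V needs one entry per weight in e")
    return [[count * weight for count, weight in zip(row, e)] for row in V]

# Hypothetical word counts (2 documents x 3 words) and word weights.
V = [
    [2, 0, 1],
    [0, 3, 1],
]
e = [0.5, 1.0, 1.5]
W = weight_matrix(V, e)
```

Multiplying by a diagonal matrix never mixes columns, so each entry of W is just a word count rescaled by the weight of its own word.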

In practical applications, perplexity is a common evaluation metric for assessing LDA models and the topics they generate [39]. The lower the perplexity, the better the model’s performance. The model perplexity can be calculated with Equation (10) [40].

perplexity(P_{D}) = \exp\left(- \frac{\sum_{d = 1}^{M} \log p(K_{d})}{\sum_{d = 1}^{M} N_{d}}\right)

(10)

where P_D represents the perplexity value of the model, M is the total number of documents, p(K_d) is the probability that the model assigns to the words K_d contained in document d, and N_d represents the word count in document d.
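Equation (10) can be checked numerically with a small sketch. The function name `perplexity` and its inputs (one log-likelihood and one word count per document) are assumptions about how the quantities would be supplied, and the natural logarithm is assumed.

```python
import math

def perplexity(doc_log_likelihoods, doc_lengths):
    """Perplexity as in Equation (10).

    doc_log_likelihoods[d] is log p(K_d), the log-probability the model
    assigns to the words of document d; doc_lengths[d] is N_d.
    """
    return math.exp(-sum(doc_log_likelihoods) / sum(doc_lengths))

# Sanity check with a hypothetical model that gives every word
# probability 1/4: the perplexity should equal 4.
log_likelihoods = [3 * math.log(0.25), 2 * math.log(0.25)]
lengths = [3, 2]
value = perplexity(log_likelihoods, lengths)
```

A model that spreads probability uniformly over four words has perplexity 4, so a lower value means the model is, on average, less "surprised" by each word.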

4. Results

4.1. Data Sources

During the construction process in China, the person in charge of a construction unit must report any accidents to the relevant industry department and safety supervision department within a specified timeframe. Subsequently, the respective competent departments are responsible for conducting investigations and issuing accident investigation reports, which are made available on government department websites. Depending on the severity of the accident, different levels of government departments oversee the preparation of these investigation reports. For this study, a total of 183 accident investigation reports were collected from the websites of provincial emergency management bureaus, municipal emergency management bureaus, and the people’s governments of various districts. These investigation reports are specifically related to production safety responsibility accidents that occurred within the past five years.

4.2. Statistical Analysis of Accident Types

Chinese accident reports consistently include a title that provides specific details about an accident, encompassing information such as the location, project, time, and nature of the incident. These titles exhibit a well-defined semantic organization that establishes relationships between time, space, and events. Extracting and analyzing this textual information can enhance our understanding of the accident type and context, enabling industry supervisors to swiftly grasp the safety conditions within the construction industry of a particular region. In this study, Chinese text segmentation and word frequency statistics were applied to the document titles. The results of the accident type statistics according to the title word frequency are shown in Figure 2.

Of the 183 accident investigation reports analyzed, construction accidents consisted of four main types: falls from height, struck-by events, electrocutions, and collapses. Falls from height were the most common, accounting for 50.8 percent of all accident types. The construction industry is particularly susceptible to falls from height, which pose a significant safety risk to workers. Other injury types mentioned in the reports relate to falls that do not meet the criteria for working at height (2 m or more), including falls from lower platforms and vertical supports, which still result in safety accidents. In addition, three notable construction accidents were identified, including one case of poisoning and asphyxiation and two collapse accidents. The poisoning accident occurred as a result of working in a confined space without appropriate protective measures. The analysis and collection of statistics on these types of accidents and their background information can facilitate a more rapid understanding of the safety conditions in the construction industry in specific regions, thereby helping industry regulators implement targeted safety measures.

4.3. Extraction of Key Information on the Cause of the Accident

4.3.1. Data Pre-Processing and Word Frequency Statistics

After extracting the accident investigation report titles, further mining is necessary to identify the direct and indirect causes that contributed to each accident. Building on the text preprocessing, segmentation, and word frequency analysis described in Section 3.3, the literature on accident causal factors was analyzed, and a customized synonym lexicon of accident report causal factors was developed, as shown in Table 1.

Based on Table 1, a synonym database of key terms was constructed and added to the custom lexicon used to segment the documents. Word frequency statistics were then computed on the segmented words so that construction risk information can be read efficiently and clearly and the textual features of the accident risk information can be given an initial analysis. Combining the keyword information with the possible negative expressions yields the accident causative factors, as shown in Figure 3.
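The role of the synonym lexicon can be illustrated with a short sketch that maps variant phrasings to one canonical causal factor before counting. The dictionary entries, the function name `count_causal_factors`, and the sample phrases are illustrative assumptions and are not taken from Table 1.

```python
from collections import Counter

# Hypothetical synonym lexicon: variant phrase -> canonical causal factor.
SYNONYMS = {
    "safety management is not in place":
        "site safety management is not in place",
    "safety management responsibilities are not in place":
        "site safety management is not in place",
    "hidden danger was not eliminated in time":
        "failure to timely identify safety hazards",
}

def count_causal_factors(phrases, synonyms=SYNONYMS):
    """Count phrases after replacing known variants by their canonical form."""
    return Counter(synonyms.get(phrase, phrase) for phrase in phrases)

# Hypothetical phrases extracted from several reports.
phrases = [
    "safety management is not in place",
    "site safety management is not in place",
    "safety management responsibilities are not in place",
    "hidden danger was not eliminated in time",
]
counts = count_causal_factors(phrases)
```

Without the mapping, the three management-related phrasings would be counted as three separate factors with a frequency of one each.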

According to the accident frequency statistics, workers are often found to be the key subjects. Their lack of safety awareness, unsafe behavior, failure to follow the rules and regulations, and failure to wear or properly wear safety equipment are some of the critical factors that lead to accidents. On the other hand, failure to identify and eliminate safety hazards in a timely manner is the most frequently cited cause of accidents in various accident reports. In addition, inadequate safety management, insufficient safety management responsibilities, inadequate safety education and training, failure to implement safety operating procedures, and incomplete implementation of safety technical briefings are all significant factors attributed to the occurrence of accidents. As a result, accidents are the result of the collective responsibility of workers, contractors, construction units, supervisory units, and other responsible parties. The top 20 causal factors were identified based on high-frequency keywords, as shown in Table 2.

In the table, a single keyword may correspond to multiple causal factors or synonymous expressions of the same factor. For example, in an accident report, the phrase “site safety management is not in place” could be expressed as “unified and coordinated management is not in place”, “safety management is not in place” or “safety management responsibilities are not in place”. These phrases all encompass the lack of management responsibility and indicate a broad cause that encompasses key aspects that contribute to accidents. For example, the phrases “management is not in place” and “supervision is not in place” suggest that construction managers or supervisors failed to fulfill their responsibilities to identify and correct potential safety hazards in a timely manner.

To illustrate, consider two accident investigation reports that describe the same factor in different ways. Report 1 states: “Supervision of the site inspection is not in place; the outer frame stop bar of the work platform was missing, and violations of high workers not wearing safety harnesses were not identified”. Report 2 states: “Pre-sent site safety management is not in place; the hidden danger of the accident was not detected and eliminated in time”. Each causal factor may be attributed to the workers, the construction unit, or the supervision unit. Thus, the same causal factor may have different responsible subjects, and two responsible subjects may simultaneously fail to identify and eliminate safety hazards, resulting in different responsible subjects for the same keyword.

4.3.2. Text Feature Extraction to Obtain Key Causal Factors

Due to variations in the length of each text, the keywords of the causal factors in each text may be repeated several times. Therefore, the frequency of accident causality alone does not accurately reflect the importance of risk factors. To address this, the common approach is to calculate the TF-IDF score of words, where a higher score is considered more important, and the importance of accident causal factors is determined by ranking the words. However, the TF-IDF method lacks semantic features and often struggles to yield satisfactory results when extracting key causal factors for input into LDA modeling.

In this study, the limitations of TF-IDF scores are considered, and the importance of causal factor keywords is calculated using the information entropy TF-IDF approach proposed in Section 3.4. Both traditional TF-IDF and information entropy TF-IDF algorithms are utilized to calculate scores for all accident causal factors. The resulting technical terms are sorted in descending order, and the top 30 statistical results for each score are taken for comparative analysis. This approach enables a more comprehensive and precise evaluation of the importance of accident causal factors, as shown in Figure 4.

The keywords obtained through the TF-IDF method consist of more professional equipment names, individuals, and company names in the construction field. These words appear less frequently across all texts but are more prevalent in specific texts. However, the TF-IDF method alone fails to extract the key causal factors. In contrast, the information entropy TF-IDF method, which considers the key causal factors, proves to be more effective in identifying important elements. For instance, the top 10 keywords obtained through information entropy TF-IDF include “working at height” and “scaffolding”, suggesting that accidents are more likely to occur in these scenarios.

The information entropy TF-IDF method effectively considers both the significance of words in distinguishing documents and their distribution within individual documents. By combining the frequency of occurrence and uniqueness calculations, the resulting keywords can offer valuable insights for effective management suggestions and risk control measures in the construction industry.

Based on the keyword scores, it is evident that low worker safety awareness and a lack of safety equipment are significant factors contributing to accidents, which is consistent with the findings of previous studies [59]. The researchers attribute these factors to the failure of construction and work units to provide adequate safety education and training. Inadequate safety education and training results in low safety awareness among workers and their involvement in irregularities. This lack of awareness was identified as a key factor that hinders the promotion of safety awareness and hinders the ability of workers to identify and eliminate hazards. In contrast to the critical factors extracted from accident investigation reports in this study, Zhang et al. identified poor site working conditions, inadequate safety training, and a lack of safety awareness as the most critical factors leading to accidents [24]. The information entropy TF-IDF score of safety education and training did not stand out compared to the previous factors. Thus, the role of safety education and training may not be as important as originally thought. In an environment with poor site safety management, it is unlikely that workers will consistently adhere to safety protocols. In addition, Newaz et al. highlighted the limited number of studies evaluating the effectiveness of safety training programs [55], and Perlman et al. concluded that there is no apparent correlation between safety training and hazard identification/assessment skills, making it difficult to determine the extent to which it improves worker safety awareness [60].

The information entropy TF-IDF scores also indicate that work at height, scaffolding, and tower crane operations are critical areas of operations that contribute to accidents. In these areas, it is imperative that worker safety be prioritized and that safety hazards be promptly identified and corrected. In addition, qualification issues are emerging as important factors identified in accident investigation reports that lead to accidents. However, current studies have largely overlooked the issue of worker qualification, which is a prevalent problem in the Chinese construction industry, especially in small-scale projects. Scholarly attention has focused primarily on high-impact and large-scale construction projects while neglecting common small-scale general construction or renovation projects. These projects often have lower levels of site safety management, which poses greater challenges for managers in identifying and correcting safety hazards. In addition, both workers and contractors often lack the necessary qualifications for the job. The lack of skills translates into a lack of safety awareness and an inability to meet the stringent standards of site safety management. As a result, it significantly increases the likelihood of accidents occurring.

4.4. LDA Modeling

After calculating the information entropy TF-IDF matrix, the weighted word frequency matrix is computed using Equation (9). The LDA modeling is performed using the sklearn library in Python. The LDA topic model effectively identifies relevant topics and their associated feature words from the textual data of the accident reports. Moreover, it clusters the feature words that exhibit a strong connection with each topic, particularly those related to risks. The higher the probability, the more likely it is that a feature word belongs to a particular topic. Consequently, the top 20 feature words with strong associations for each topic are arranged in descending order of frequency and visualized in a word cloud as shown in Figure 5, and Table 3 and Table 4 show the specific 20 keywords under each topic in Figure 5, respectively.
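A minimal sketch of this step with scikit-learn is shown below. The matrix values, vocabulary, number of topics, and random seed are hypothetical. Note that scikit-learn's `LatentDirichletAllocation` is fitted with variational Bayes rather than Gibbs sampling, so this is an approximation of the workflow and not necessarily the exact configuration used in the study.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical weighted document-word matrix W (4 documents x 5 words).
vocab = ["scaffold", "harness", "fall", "crane", "signal"]
W = np.array([
    [1.8, 1.2, 2.0, 0.0, 0.0],
    [2.1, 0.9, 1.5, 0.0, 0.0],
    [0.0, 0.0, 0.3, 2.4, 1.7],
    [0.0, 0.2, 0.0, 1.9, 2.2],
])

# Two topics and a fixed seed are arbitrary choices for this sketch.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(W)  # one topic distribution per document

# Collect the three highest-weighted words of each topic.
top_words = []
for weights in lda.components_:
    order = np.argsort(weights)[::-1][:3]
    top_words.append([vocab[j] for j in order])

model_perplexity = lda.perplexity(W)
```

The rows of `doc_topics` give the topic mixture of each document, and `top_words` plays the role of the feature-word lists that the word clouds visualize.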

The LDA model based on information entropy TF-IDF weighting generated more comprehensive and meaningful topics than the standard LDA model. It successfully identifies more informative and representative keywords specific to the study area within the dataset. An illustrative example is Topic 4, which reveals that accidents occurred while working on window cleaning machine platforms. The factors contributing to the accident include “fatigue fracture of the screw”, “inadequate testing”, and “overloading”. Another significant finding in Topic 13 pertains to accidents related to the construction of tower cranes. It highlights factors such as errors in signal transmission by the slave signal worker, failure to strictly implement construction programs or prepare special programs, and the lack of timely detection and elimination of safety hazards.

In contrast, the traditional frequency-based LDA model only identified a limited number of information nouns, as they frequently appeared in accident reports. Furthermore, several causative factors with high word frequency, such as “inadequate safety protection”, “lack of safety equipment”, and “failure to timely eliminate safety hazards”, could be obtained through word frequency statistics. However, the standard LDA model failed to capture a substantial amount of more valuable information.

In addition, to further evaluate the performance of both models, perplexity was calculated to measure how well the models fit after inputting the text data of the accident investigation report. The results of the model perplexity are shown in Figure 6.

The findings demonstrate that the LDA model employing the information entropy TF-IDF weighting scheme outperforms the standard LDA model in terms of perplexity. Specifically, the information entropy TF-IDF-LDA model exhibits a significant reduction in perplexity compared to the standard LDA model. This reduction indicates that the information entropy TF-IDF-LDA model fits better, and the weighting scheme enhances the model’s ability to capture the underlying themes in the corpus. Furthermore, the topics generated by the weighted model exhibit greater cohesiveness and meaningfulness compared to those generated by the standard LDA model.

Based on the aforementioned experimental results, it is evident that the improved information entropy TF-IDF method represents an effective approach for text feature representation. In comparison to the traditional word frequency matrix, this method takes into consideration the information entropy of words and their distribution in the text, thereby providing a better reflection of word importance and text characteristics. The application of this method in the LDA model further confirms its effectiveness in identifying key topics and risk feature words from accident investigation reports.

5. Discussion

5.1. Discussion of the Research Findings

Chinese accident investigation reports often consist of long texts that contain multiple levels of information, making it difficult to extract useful information from them. In this study, we mainly focus on analyzing the title, cause, and type of accident sections of the accident report to extract key information.

Through title analysis, Chinese accident investigation reports usually provide direct information about the location of the accident, the time of occurrence, the type of accident, and the severity of the accident. Using the location information, statistical analysis can be performed to determine the number of accidents and their types within a specific area and time period. Company-specific information can also be obtained to understand the number of safety incidents that occurred within a particular company during a given year. This data can be used to build a safety responsibility information database and facilitate the development of a comprehensive production safety accident investigation and management information system.

Compared with the traditional TF-IDF method, keyword extraction using the information entropy TF-IDF approach is more effective. It intuitively reveals the causes of accidents and identifies the types of accidents with high frequency, avoiding the extraction of only rare technical terms. This indicates the effectiveness of the improved method in determining keywords. In addition, the perplexity of the LDA model constructed with the improved TF-IDF is reduced, suggesting that the information entropy TF-IDF method improves the fitting effect of the model to some extent. By considering the information entropy of words and their distribution in the text, this method better reflects the importance of words and the characteristics of the text. The application of this method in the LDA model further confirms its effectiveness.

In summary, the improved information entropy TF-IDF method is a viable approach for text feature representation. Its integration with the LDA model enhances the model’s fitting effect and yields richer subject words that better reflect the key information in Chinese accident investigation reports.

The identified keywords related to accident information indicate that the most common safety hazards on construction sites are the primary causes of accidents. These safety hazards include pre-existing hazards and hazards resulting from unsafe behaviors of workers during construction. These unsafe behaviors include workers working in violation of rules and regulations and a lack of protective equipment. This suggests that despite the repeated emphasis on construction safety management in the current Chinese construction industry, a significant proportion of construction site workers often work in unsafe conditions, leading to inevitable accidents. While weak safety awareness and worker non-compliance contribute to accidents, it is not enough to blame workers alone. According to accident causation theory, the primary cause of the accident lies in safety management [61]. Inadequate safety management at the construction site, coupled with a failure to comply with safety production responsibilities, led to an inability to effectively identify and address safety hazards, which ultimately served as the underlying factors responsible for the accident. Effective safety management should prioritize hazard identification and elimination as a primary requirement. Failure to do so may allow multiple hazards to accumulate on site, ultimately leading to accidents [24]. Unfortunately, in reality, site managers often fail to identify existing safety hazards and allow them to persist. Many construction sites can be likened to sieves full of loopholes, where it is uncertain when workers will fall victim to accidents. Therefore, it is imperative to adopt high-standard and efficient safety management technologies or new management approaches, such as automated inspection methods based on artificial intelligence, to promptly identify, detect, and eliminate safety hazards. These advances would replace inadequate site managers, greatly improve site safety management, and eliminate hazards in the early stages of accident development.

Among the existing studies, there have been scholars who used the LDA method to mine public opinion on building health and safety [62]. In their study, they adopted a big data analysis approach and conducted visual, content, user, and sentiment analysis of Instagram posts to investigate public understanding of building health and safety concerns. In contrast, our study makes a significant contribution to the existing body of knowledge through the methodology and results presented.

First, concerning text processing, the aforementioned studies do not address the data preprocessing stage, nor do they explicitly discuss the handling of high-frequency, low-information words. Extracting critical information from non-standardized text is a major challenge in improving design safety. However, our study focuses on standardized accident investigation report texts issued by the government, which are rich in information and provide detailed descriptions of accident processes and causal analyses at construction sites. We focus on extracting features and key information from these texts to support safety management research.

Second, our study improves the keyword extraction method by considering word frequency statistics, TF-IDF weighted information, and information entropy distribution. By incorporating text semantics and word importance, we provide a more comprehensive approach to describing building safety-related texts. This enhanced method significantly improves the efficiency of keyword extraction in text mining.

In contrast, the previous study mainly focused on public opinion analysis using big data analysis methods and social media data. Our study, on the other hand, focuses on key information extraction and safety management research using the texts of government-issued accident investigation reports.

By comparing these two studies, it becomes clear that our research focuses on the specific domain of construction safety and emphasizes improved methods for extracting key information from government-issued accident investigation report texts. Consequently, our study has important implications for improving the efficiency and accuracy of construction safety management.

5.2. Limitations of the Study and Future Research Perspectives

This study still has some limitations that need to be further explored in future research.

In terms of research content, although this paper proposes a novel method for keyword extraction and identification of important words, there are still challenges in reproducing a complete technical term structure between them. The Chinese language, as a semantically rich corpus, faces limitations due to word segmentation techniques. Chinese word segmentation primarily relies on two-character combinations as benchmarks, which often results in overly fine-grained word segmentation. In many cases, single-word segmentation does not convey effective information, and a significant proportion of the segmented words in the text are invalid, which may affect the accuracy of the results. This paper has addressed this problem to some extent by introducing an improved method. However, in order to extract more important information from the text, further processing of the word segmentation results is necessary to achieve in-depth information extraction from accident report texts [63]. This involves the reintegration and re-extraction of key information. Future research should focus on the integration of keywords and key information derived from the word segmentation results to better filter out irrelevant or redundant information. This will lead to a more comprehensive and automatic information extraction process during word segmentation.

Despite the contributions and findings of our study, it is essential to acknowledge the limitations that may affect the generalizability and reliability of our findings. These limitations prompt the need for further research and provide valuable directions for future studies in the field of construction safety.

In terms of data collection, a limitation of our study is the size and potential underrepresentation of the data set. Relying on a specific set of accident investigation reports collected from various government and emergency management agencies may not capture the full range of incidents across the construction industry. It is important to note that the primary focus of our research was to demonstrate the effectiveness of text mining techniques in extracting key information from accident investigation reports. Consequently, our study primarily contributes to the advancement of analytical methods in the field of construction safety rather than providing a comprehensive analysis of all accidents in the country. While our study provides valuable insights into the application of text mining techniques and the identification of key factors in construction safety, further research using larger and more diverse datasets is needed to validate and extend these findings.

To address these limitations and improve the reliability and applicability of future research, we suggest the following:

(1): Collaborate with local governments or contractors: Future research in this area could collaborate with local governments or construction companies to obtain a broader and more representative dataset. By partnering with these stakeholders, researchers will have access to a wider range of accident investigation reports, increasing the comprehensiveness and representativeness of the data. This will allow for more robust and reliable conclusions.
(2): Expand data collection efforts: To obtain a more comprehensive data set, future studies should explore additional data sources and collection methods. This could include using existing databases maintained by relevant government agencies, establishing a centralized accident reporting system, or implementing standardized data collection protocols across regions. Such studies would allow researchers to conduct more robust statistical analyses, gain a broader understanding of construction safety in China, explore accident patterns and causal factors in greater depth, and generate more generalizable findings.

Our study sheds light on the potential of text mining techniques in the analysis of accident investigation reports in the construction industry. However, the limitations in terms of dataset size, representativeness, and data sources highlight the need for future research to overcome these challenges. By collaborating with stakeholders, expanding data collection efforts, and addressing sample size limitations, future studies can provide more comprehensive insights into construction safety in China. These insights will facilitate evidence-based decision-making for industry practitioners, policymakers, and researchers, ultimately leading to improved construction safety practices.

6. Conclusions

The construction industry is known for its high accident rate, and inadequate safety management poses a significant threat to the personal safety of workers. Extracting key information that leads to accidents through automated mining of the standardized text semantics currently used in the industry is a challenging task. However, achieving this goal is essential for more efficient safety management. In this study, key accident information is extracted from the title of the report and the definition of the cause of the accident, leading to the following main conclusions:

First, an analysis of the titles of 183 accident investigation reports revealed that the accidents were of 12 different types, including falls from height, object strikes, electrocution, and collapse. Falls accounted for more than 50% of the accidents. The majority of accidents were general safety responsibility accidents, with collapses and confined space poisoning resulting in three major responsibility cases. Working at height and in confined spaces are high-risk activities that require strengthened safety management and improved worker safety awareness.

Next, by analyzing the keywords and structure of negatively expressed causal factors in Chinese accident investigation reports, common causal factors and responsible parties were identified based on word frequency statistics. However, word frequency information alone is not sufficient to identify key factors. Therefore, a new TF-IDF weighting scheme based on information entropy is proposed to improve the performance of the LDA model in this study. Experimental results show that the information entropy TF-IDF weighting scheme improves the model’s ability to capture fundamental topics in the corpus, resulting in more coherent and meaningful topics. The method presented in this paper has the potential to be applied to other topic modeling approaches and different types of text corpora.
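To make the idea of an entropy-adjusted TF-IDF weight concrete, the sketch below shows one plausible formulation in Python. It is an assumption for illustration rather than the exact scheme proposed in this study: each term's smoothed TF-IDF score is scaled by one minus the normalized entropy of its distribution across documents, so terms spread evenly over all reports are down-weighted. The function name `entropy_tfidf` and the smoothing of the IDF term are likewise hypothetical.

```python
import math
from collections import Counter


def entropy_tfidf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed weighting (not taken from the paper):
    weight = tf * smoothed_idf * (1 - normalized entropy of the term).
    """
    n_docs = len(docs)
    counts = [Counter(doc) for doc in docs]
    # Number of documents that contain each term.
    doc_freq = Counter(term for c in counts for term in c)
    # Total occurrences of each term over the whole corpus.
    total = Counter()
    for c in counts:
        total.update(c)

    # Normalized entropy (0..1) of each term's distribution over documents.
    entropy = {}
    for term, tot in total.items():
        h = -sum((c[term] / tot) * math.log(c[term] / tot) for c in counts if c[term])
        entropy[term] = h / math.log(n_docs) if n_docs > 1 else 0.0

    weights = []
    for doc, c in zip(docs, counts):
        length = len(doc) or 1
        row = {}
        for term, freq in c.items():
            idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0
            row[term] = (freq / length) * idf * max(0.0, 1.0 - entropy[term])
        weights.append(row)
    return weights


# Toy corpus of segmented reports (illustrative tokens only).
reports = [["hazard", "worker", "worker"], ["hazard", "scaffold"], ["hazard", "crane"]]
weights = entropy_tfidf(reports)
```

In this toy corpus, "hazard" occurs once in every report, so its normalized entropy is close to one and its weight is close to zero, whereas "worker" is concentrated in a single report and keeps its full TF-IDF score. Weights of this kind could then be used to reweight the document-term matrix passed to a topic model.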

Finally, based on the results of the information entropy TF-IDF and the extracted topic terms from the LDA model, it is evident that workers are the main subjects of accidents. Failure to detect and eliminate safety hazards and a lack of safety equipment are the most important factors contributing to accidents. Other important factors include violations of safety rules, poor site safety management, and inadequate safety education and training. Working on scaffolds, at heights, and with tower cranes poses potential risks for accidents. Failure to follow construction plans, inadequate safety training, and worker non-compliance also play a significant role in accidents. Although low worker safety awareness is often mentioned in Chinese accident investigation reports, it is not the key factor. In addition, attention should be paid to the qualification of personnel for special operations and the qualification audit process for project contracting in the Chinese construction industry.

Future research can focus further on the extraction and analysis of specialized terms. It can also consider extracting complete keywords together with their negative expressions as a whole, by reintegrating the information obtained after word segmentation, so that key information is extracted from the text more effectively.

Author Contributions

Conceptualization, J.W. (Junwu Wang) and J.W. (Jinyingjun Wan); data curation, J.W. (Junwu Wang) and J.W. (Jinyingjun Wan); formal analysis, Y.L.; funding acquisition, J.W. (Junwu Wang), Y.L. and S.T.; investigation, S.T. and J.W. (Jinyingjun Wan); methodology, S.T. and Y.L.; software, Y.L. and J.Z.; supervision, J.W. (Junwu Wang) and J.W. (Jinyingjun Wan); validation, S.T. and J.Z.; visualization, J.W. (Jinyingjun Wan) and Y.L.; writing—original draft, Y.L.; writing—review and editing, J.W. (Junwu Wang), J.W. (Jinyingjun Wan) and J.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This study was supported by Major science and technology projects in Hainan Province (Funding: Science and Technology Department of Hainan Province; Funding number: ZDKJ2021024), the Hainan Special Ph.D Scientific Research Foundation of Sanya Yazhou Bay Science and Technology City (Funding: Sanya Yazhou Bay Science and Technology City; Funding number: HSPHDSRF-2022-03-006), and the Ph.D. Scientific Research and Innovation Foundation of Sanya Yazhou Bay Science and Technology City (Funding: Sanya Yazhou Bay Science and Technology City; Funding number: HSPHDSRF-2023-03-010).

Data Availability Statement

The case analysis data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

1. Choudhry, R.M. Behavior-based safety on construction sites: A case study. Accid. Anal. Prev. 2014, 70, 14–23.
2. Ansari, R.; Dehghani, P.; Mahdikhani, M.; Jeong, J. A Novel Safety Risk Assessment Based on Fuzzy Set Theory and Decision Methods in High-Rise Buildings. Buildings 2022, 12, 2126.
3. Tian, D.; Li, M.; Shen, Y.; Han, S. Intelligent mining of safety hazard information from construction documents using semantic similarity and information entropy. Eng. Appl. Artif. Intell. 2023, 119, 105742.
4. Li, R.Y.M.; Chau, K.W.; Zeng, F.F. Ranking of Risks for Existing and New Building Works. Sustainability 2019, 11, 2863.
5. Salama, D.M.; El-Gohary, N.M. Semantic Text Classification for Supporting Automated Compliance Checking in Construction. J. Comput. Civ. Eng. 2016, 30, 04014106.
6. Zhong, B.; Xing, X.; Love, P.; Wang, X.; Luo, H. Convolutional neural network: Deep learning-based classification of building quality problems. Adv. Eng. Inf. 2019, 40, 46–57.
7. Beach, T.H.; Rezgui, Y.; Li, H.; Kasim, T. A rule-based semantic approach for automated regulatory compliance in the construction sector. Expert Syst. Appl. 2015, 42, 5219–5231.
8. Ferrari, A.; Gori, G.; Rosadini, B.; Trotta, I.; Bacherini, S.; Fantechi, A.; Gnesi, S. Detecting requirements defects with NLP patterns: An industrial experience in the railway domain. Empir. Softw. Eng. 2018, 23, 3684–3733.
9. Zhang, F.; Fleyeh, H.; Wang, X.; Lu, M. Construction site accident analysis using text mining and natural language processing techniques. Autom. Constr. 2019, 99, 238–248.
10. Soibelman, L.; Wu, J.; Caldas, C.H.; Brilakis, I.K.; Lin, K.-Y. Management and analysis of unstructured construction data types. Adv. Eng. Inf. 2008, 22, 15–27.
11. Tian, D.; Li, M.; Shi, J.; Shen, Y.; Han, S. On-site text classification and knowledge mining for large-scale projects construction by integrated intelligent approach. Adv. Eng. Inf. 2021, 49, 101355.
12. Lukic, D.; Littlejohn, A.; Margaryan, A. A framework for learning from incidents in the workplace. Saf. Sci. 2012, 50, 950–957.
13. Luo, X.; Liu, Q.; Qiu, Z. A Correlation Analysis of Construction Site Fall Accidents Based on Text Mining. Front. Built Environ. 2021, 7, 690071.
14. Shuang, Q.; Zhang, Z. Determining Critical Cause Combination of Fatality Accidents on Construction Sites with Machine Learning Techniques. Buildings 2023, 13, 345.
15. Yan, H.; Ma, M.; Wu, Y.; Fan, H.; Dong, C. Overview and analysis of the text mining applications in the construction industry. Heliyon 2022, 8, e12088.
16. Sun, J.; Lei, K.; Cao, L.; Zhong, B.; Wei, Y.; Li, J.; Yang, Z. Text visualization for construction document information management. Autom. Constr. 2020, 111, 103048.
17. Kim, J.-S.; Kim, B.-S. Analysis of Fire-Accident Factors Using Big-Data Analysis Method for Construction Areas. KSCE J. Civ. Eng. 2018, 22, 1535–1543.
18. Love, P.E.D.; Smith, J.; Teo, P. Putting into practice error management theory: Unlearning and learning to manage action errors in construction. Appl. Ergon. 2018, 69, 104–111.
19. Liu, G.; Boyd, M.; Yu, M.; Halim, S.Z.; Quddus, N.A. Identifying causality and contributory factors of pipeline incidents by employing natural language processing and text mining techniques. Process Saf. Environ. 2021, 152, 37–46.
20. Tixier, A.J.P.; Hallowell, M.R.; Rajagopalan, B.; Bowman, D. Automated content analysis for construction safety: A natural language processing system to extract precursors and outcomes from unstructured injury reports. Autom. Constr. 2016, 62, 45–56.
21. Liu, C.; Yang, S. Using text mining to establish knowledge graph from accident/incident reports in risk assessment. Expert Syst. Appl. 2022, 207, 117991.
22. Zhou, K.; Wang, J.; Ashuri, B.; Chen, J. Discovering the Research Topics on Construction Safety and Health Using Semi-Supervised Topic Modeling. Buildings 2023, 13, 1169.
23. Zhong, B.; Pan, X.; Love, P.E.D.; Ding, L.; Fang, W. Deep learning and network analysis: Classifying and visualizing accident narratives in construction. Autom. Constr. 2020, 113, 103089.
24. Zhang, W.; Zhu, S.; Zhang, X.; Zhao, T. Identification of critical causes of construction accidents in China using a model based on system thinking and case analysis. Saf. Sci. 2020, 121, 606–618.
25. Li, J.; Wang, J.; Xu, N.; Hu, Y.; Cui, C. Importance Degree Research of Safety Risk Management Processes of Urban Rail Transit Based on Text Mining Method. Information 2018, 9, 26.
26. Li, S.; You, M.; Li, D.; Liu, J. Identifying coal mine safety production risk factors by employing text mining and Bayesian network techniques. Process Saf. Environ. 2022, 162, 1067–1081.
27. Xu, N.; Ma, L.; Liu, Q.; Wang, L.; Deng, Y. An improved text mining approach to extract safety risk factors from construction accident reports. Saf. Sci. 2021, 138, 105216.
28. Yue, A.; Mao, C.; Chen, L.; Liu, Z.; Zhang, C.; Li, Z. Detecting Changes in Perceptions towards Smart City on Chinese Social Media: A Text Mining and Sentiment Analysis. Buildings 2022, 12, 1182.
29. Du, Y.; Yi, Y.; Li, X.; Chen, X.; Fan, Y.; Su, F. Extracting and tracking hot topics of micro-blogs based on improved Latent Dirichlet Allocation. Eng. Appl. Artif. Intell. 2020, 87, 103279.
30. Suh, Y. Sectoral patterns of accident process for occupational safety using narrative texts of OSHA database. Saf. Sci. 2021, 142, 105363.
31. Forman, G. BNS feature scaling: An improved representation over tf-idf for svm text classification. In Proceedings of the 17th ACM Conference on Information and Knowledge Management, Napa Valley, CA, USA, 26–30 October 2008; pp. 263–270.
32. Zhou, P.; El-Gohary, N. Ontology-Based Multilabel Text Classification of Construction Regulatory Documents. J. Comput. Civ. Eng. 2016, 30, 04015058.
33. Gao, J.; Li, M.; Huang, C.-N.; Wu, A. Chinese Word Segmentation and Named Entity Recognition: A Pragmatic Approach. Comput. Linguist. 2005, 31, 531–574.
34. Curiskis, S.A.; Drake, B.; Osborn, T.R.; Kennedy, P.J. An evaluation of document clustering and topic modelling in two online social networks: Twitter and Reddit. Inf. Process Manag. 2020, 57, 102034.
35. Cheng, L.; Yang, Y.; Zhao, K.; Gao, Z. Research and Improvement of TF-IDF Algorithm Based on Information Theory. In Proceedings of the 8th International Conference on Computer Engineering and Networks (CENet2018), Shanghai, China, 17–19 August 2018; pp. 608–616.
36. Wang, Y. Unsupervised Representative Feature Selection Algorithm Based on Information Entropy and Relevance Analysis. IEEE Access 2018, 6, 45317–45324.
37. Blei, D.M.; Ng, A.; Jordan, M.I. Latent Dirichlet Allocation. J. Mach. Learn. Res. 2003, 3, 993–1022.
38. Griffiths, T.L.; Steyvers, M. Finding scientific topics. Proc. Natl. Acad. Sci. USA 2004, 101, 5228–5235.
39. Lee, N.; Kim, E.; Kwon, O. Combining TF-IDF and LDA to generate flexible communication for recommendation services by a humanoid robot. Multimed. Tools Appl. 2018, 77, 5043–5058.
40. Yang, G.; Wen, D.; Kinshuk; Chen, N.-S.; Sutinen, E. A novel contextual topic model for multi-document summarization. Expert Syst. Appl. 2015, 42, 1340–1352.
41. Raheem, A.A.; Issa, R.R.A. Safety implementation framework for Pakistani construction industry. Saf. Sci. 2016, 82, 301–314.
42. Goncalves Filho, A.P.; Waterson, P.; Jun, G.T. Improving accident analysis in construction—Development of a contributing factor classification framework and evaluation of its validity and reliability. Saf. Sci. 2021, 140, 105303.
43. Chi, S.; Han, S. Analyses of systems theory for construction accident prevention with specific reference to OSHA accident reports. Int. J. Proj. Manag. 2013, 31, 1027–1041.
44. Rafindadi, A.D.U.; Shafiq, N.; Othman, I.; Ibrahim, A.; Aliyu, M.M.; Mikić, M.; Alarifi, H. Data mining of the essential causes of different types of fatal construction accidents. Heliyon 2023, 9, e13389.
45. Tam, C.M.; Zeng, S.X.; Deng, Z.M. Identifying elements of poor construction safety management in China. Saf. Sci. 2004, 42, 569–586.
46. Ayob, A.M.; Shaari, A.; Zaki, M.; Munaaim, M.A.C. Fatal occupational injuries in the Malaysian construction sector–causes and accidental agents. In Proceedings of the 4th International Conference on Civil and Environmental Engineering for Sustainability (IConCEES 2017), Langkawi, Malaysia, 4–5 December 2017; Volume 140, p. 012095.
47. Yap, J.B.H.; Lee, W.K. Analysing the underlying factors affecting safety performance in building construction. Prod. Plan. Control 2020, 31, 1061–1076.
48. Mosly, I. Factors influencing safety performance in the construction industry of Saudi Arabia: An exploratory factor analysis. Int. J. Constr. Eng. Manag. 2022, 28, 901–908.
49. Winge, S.; Albrechtsen, E.; Mostue, B.A. Causal factors and connections in construction accidents. Saf. Sci. 2019, 112, 130–141.
50. Wong, L.; Wang, Y.; Law, T.; Lo, C.T. Association of Root Causes in Fatal Fall-from-Height Construction Accidents in Hong Kong. J. Constr. Eng. Manag. 2016, 142, 04016018.
51. Hinze, J.; Pedersen, C.; Fredley, J. Identifying Root Causes of Construction Injuries. J. Constr. Eng. Manag. 1998, 124, 67–71.
52. Meng, X.; Chan, A.H.S.; Lui, L.K.H.; Fang, Y. Effects of individual and organizational factors on safety consciousness and safety citizenship behavior of construction workers: A comparative study between Hong Kong and Mainland China. Saf. Sci. 2021, 135, 105116.
53. Jeong, G.; Kim, H.; Lee, H.-S.; Park, M.; Hyun, H. Analysis of safety risk factors of modular construction to identify accident trends. J. Asian Archit. Build. Eng. 2022, 21, 1040–1052.
54. Lu, Y.; Yin, L.; Deng, Y.; Wu, G.; Li, C. Using cased based reasoning for automated safety risk management in construction industry. Saf. Sci. 2023, 163, 106113.
55. Newaz, M.T.; Ershadi, M.; Jefferies, M.; Davis, P. Assessing safety management factors to develop a research agenda for the construction industry. Saf. Sci. 2021, 142, 105396.
56. Huang, X.; Hinze, J. Analysis of Construction Worker Fall Accidents. J. Constr. Eng. Manag. 2003, 129, 262–271.
57. Chi, S.; Han, S.; Kim, D.Y. Relationship between Unsafe Working Conditions and Workers’ Behavior and Impact of Working Conditions on Injury Severity in U.S. Construction Industry. J. Constr. Eng. Manag. 2013, 139, 826–838.
58. Chan, A.P.C.; Wong, F.K.W.; Chan, D.W.M.; Yam, M.C.H.; Kwok, A.W.K.; Lam, E.W.M.; Cheung, E. Work at Height Fatalities in the Repair, Maintenance, Alteration, and Addition Works. J. Constr. Eng. Manag. 2008, 134, 527–535.
59. Maiti, S.; Choi, J.-h. An evidence-based approach to health and safety management in megaprojects. Int. J. Constr. Manag. 2019, 21, 997–1010.
60. Perlman, A.; Sacks, R.; Barak, R. Hazard recognition and risk perception in construction. Saf. Sci. 2014, 64, 22–31.
61. Yang, X.; Haugen, S. Implications from major accident causation theories to activity-related risk analysis. Saf. Sci. 2018, 101, 121–134.
62. Zeng, L.; Li, R.Y.M.; Yigitcanlar, T.; Zeng, H. Public Opinion Mining on Construction Health and Safety: Latent Dirichlet Allocation Approach. Buildings 2023, 13, 927.
63. Huang, L.; Dou, Z.; Hu, Y.; Huang, R. Textual Analysis for Online Reviews: A Polymerization Topic Sentiment Model. IEEE Access 2019, 7, 91940–91945.

Figure 1. Specific implementation framework.

Figure 2. Accident-type statistics.

Figure 3. Accident causation keywords word frequency statistics.

Figure 4. Comparison of information entropy TF-IDF score and standard TF-IDF score.

Figure 5. Word frequency weighted LDA and standard LDA subject terms.

Figure 6. Word frequency weighted LDA and standard LDA model perplexity.

Table 1. Table of custom synonyms for causal factors of accident reports.

Accident Causes Key Terms	Description of Causal Factors in the Literature and Possible Chinese Synonymous Expressions in Incident Reports	Citation
Safety training and education	Lack of safety knowledge and training, lack of a safety training system, and lack of implementation of safety training.	[41,42,43,44]
Employment of unskilled personnel	Workers are not licensed to work and lack operational qualifications.	[45,46]
Safety and protective equipment	Defective protective equipment, lack of safety protection equipment, Personal protective equipment.	[41,42,43]
Security Check	Inadequate safety inspection, inadequate site safety inspection.	[24,41,43,44,47,48]
Site Safety Management	Lack of strict on-site safety supervision and failure to detect and stop workers from working illegally in time.	[24,46]
Construction Plan	Poor implementation of construction plans; non-standard construction plans.	[24,42,44,49,50]
Site safety protection	Inadequate site safety protection and inadequate safety protection measures.	[42,51]
Worker behavior	Workers violate regulations, and workers are inexperienced; Violation of rules and regulations.	[49,52]
Security Awareness	Thin/weak/Insufficient/not strong/low safety awareness.	[47,52]
Safety Prevention	Lack of safety warning signs; lack of site protection; no risk indication.	[53,54]
Safety Hazards	Untimely handling of safety hazards and failure to identify safety hazards; Hidden danger inspection.	[43,49]
Security risk identification	Safety risk identification; Hidden danger inspection; insufficient safety awareness; and misjudgment of dangerous scenarios.	[24,43]
Materials and Equipment	Defective materials/equipment, tool failure, and equipment failure.	[43,49,53]
Environment	Poor site conditions, poor housekeeping, and site organization; lifting or crane equipment or machine failure.	[43,55]
Weather conditions	Adverse weather conditions.	[42,43,48]
Security enforcement and regulations	Violation of safety enforcement and regulations; Violation of production safety law.	[42,56,57]
Falling from a height	Falling from an unprotected edge; making a fall from a height; scaffolding collapse; falling from a height.	[51,58]
Trench collapse	Trench collapse under excavation; Collapse.	[51]
Emergency Response Plan	Failure to develop emergency rescue plans promptly; emergency rescue.	[24,44]
Safety operating procedures	Improper arrangement of construction procedures and improper construction procedures; Improper operation.	[24]

Table 2. Top 20 keywords and causal factors.

Keyword	Causal Factors
Safety accident hazards	Failure to eliminate accident hazards in a timely manner; Failure to identify and eliminate safety hazards in a timely manner
Workers	Workers are the main victims of the accident
Site safety management	Inadequate site safety management; Safety management responsibilities are not in place
Safety production management	Ineffective performance of production safety management responsibilities; Failure to conscientiously implement production safety management responsibilities
Violation	Workers violate safety regulations
Safety protective equipment	Lack of safety protective equipment; Protective measures are not in place; No lack of personal protective equipment was found
Safety education and training	Inadequate safety education and training
Weak safety awareness	Low awareness of worker safety
Work in violation of regulations	Operating violations; Violation of operating rules
Identify and eliminate	Failure to identify and eliminate accident hazards in a timely manner
Safety technical handover	No safety technical briefing for workers
Supervision is not in place	Supervision is not in place
Safety protection measures	Site safety protection measures are not in place; Protection measures are not in place
Construction plan	The special construction program is not implemented
Supervision and inspection	Failure to supervise the inspection unit safety work
Rectification	Failure to promptly supervise subcontracting units for rectification
Safety operating procedures	Failure to strictly implement the safety operation procedures; Failure to supervise workers to implement safety procedures
Qualification	Lack of qualification of special operators; Failure to obtain construction qualification; Failure to review operational qualifications
Safety inspection	The safety inspection is not in place; The safety inspection is not implemented
Risk	Inadequate safety risk prevention; Inadequate risk identification; Failure to detect the risk of a high fall

Table 3. Weighted LDA model with 20 subject terms per topic.

Topic	Subject Terms per Topic
Topic1	‘decoration’, ‘safety procedures’, ‘violation’, ‘modest blue’, ‘elimination’, ‘decoration works’, ‘inspection’, ‘power line’, ‘placement’, ‘strictly enforced’, ‘Baiheng’, ‘Quqianlan’, ‘technical’, ‘rectified’, ‘director’, ‘construction project’, ‘identify and eliminated’, ‘general manager’, ‘Huang’, ‘Jin Jia Ya’
Topic2	‘insulation’, ‘electrical’, ‘wiring’, ‘leakage’, ‘gloves’, ‘laying’, ‘electrician license’, ‘protection’, ‘protection device’, ‘power’, ‘terminal block’, ‘wear use’, ‘contracting’, ‘floor’, ‘power switch’, ‘a dragon’, ‘detection’, ‘lighting’, ‘man-ladder’, ‘Wiring’
Topic3	‘trench’, ‘excavation’, ‘risk’, ‘famous’, ‘safety measures’, ‘treatment recommendation’, ‘municipal works’, ‘collapse’, ‘earthwork’, ‘Ex un’, ‘Supervisory unit’, ‘Over’, ‘Entry’, ‘Accident responsibility’, ‘Emergency rescue’, ‘Propping plate’, ‘Exclusion’, ‘Special’, ‘Soil’, ‘stop’
Topic4	‘Window cleaner’, ‘Sansin’, ‘Screw’, ‘Inspection’, ‘Break’, ‘Acceptance’, ‘Platform’, ‘Operation’, ‘Advance’, ‘Failure to observe’, ‘delivery’, ‘stress’, ‘overload’, ‘fracture’, ‘fatigue’, ‘URA’, ‘device’, ‘Chen’, ‘Article’, ‘load-bearing’
Topic5	‘Failure’, ‘Unloading’, ‘Operation violation’, ‘This rise’, ‘Safety accident hazards’, ‘Found and eliminated’, ‘Compliance’, ‘Duty’, ‘Legal representative’, ‘hoisting’, ‘safety inspection’, ‘hoisting’, ‘this’, ‘exclusion’, ‘lattice’, ‘regular’, ‘forty-sixth’, ‘safety production Status’, ‘Failure to timely’, ‘Supervision and inspection’
Topic6	‘construction plan’, ‘basket’, ‘qualification’, ‘special work’, ‘pouring’, ‘found and eliminated’, ‘hazardous’, ‘large’, ‘formwork’, ‘Safety management’, ‘Sub-projects’, ‘Pipe jacking’, ‘Safety technical delivery’, ‘Division’, ‘Tall’, ‘Lifting’, ‘Safety production management’, ‘Demolition’, ‘Supervision’, ‘Support system’
Topic7	‘Pit’, ‘Safety accident potential’, ‘Support’, ‘Found and eliminated’, ‘CCB’, ‘Not timely’, ‘Production safety accident’, ‘Specification’, ‘Cover’, ‘violation’, ‘three bureaus’, ‘wellhead’, ‘fall’, ‘measures’, ‘accident scene’, ‘Yixing’, ‘Shenzhen’, ‘future’, ‘Garden’, ‘Project’
Topic8	‘Scaffolding’, ‘Safety supplies’, ‘Steel structure’, ‘Recovery’, ‘Shenzhen’, ‘Construction plan’, ‘Safety accident hazards’, ‘Violation’, ‘Rectification’, ‘wear’, ‘GD’, ‘erect’, ‘erector’, ‘licensed’, ‘correct’, ‘worker’, ‘eliminated’, ‘supervised’, ‘under construction’, ‘untimely’
Topic9	‘cleanup’, ‘a construction’, ‘working face’, ‘pit’, ‘side’, ‘shift’, ‘supervisor’, ‘earthwork’, ‘facilities’, ‘handover’, ‘backfill’, ‘missing’, ‘State Council order’, ‘Article’, ‘violation’, ‘proximity protection’, ‘yue peng’, ‘protective facilities’, ‘proximity’, ‘Facing’
Topic10	‘workers’, ‘safety accident hazards’, ‘not in place’, ‘safety protective equipment’, ‘not timely’, ‘found and eliminated’, ‘safety production management’, ‘violation’, ‘Site safety management’, ‘Work at height’, ‘Safety education and training’, ‘Work in violation of regulations’, ‘Safety technical instructions’, ‘Falls’, ‘Rectification’, ‘Supervision not in place’, ‘Concrete’, ‘Poor safety awareness’, ‘Inspection’, ‘Measures’
Topic11	‘safety measures’, ‘holes’, ‘safety hazards’, ‘construction plan’, ‘reserved’, ‘done’, ‘eliminated’, ‘cover’, ‘protection’, ‘area’, ‘violation’, ‘safety manager’, ‘manager’, ‘coordination’, ‘safety technical briefing’, ‘hazardous work area’, ‘contract’, ‘configuration’, ‘safety warning’, ‘stop’
Topic12	‘Backfill’, ‘Reported’, ‘Sanjiang’, ‘Enterprise’, ‘Aerospace’, ‘Earth’, ‘Town and Country’, ‘Project Management’, ‘Internal’, ‘Stop’, ‘Tire mold’, ‘Construction plan’, ‘Outside’, ‘Jiansheng’, ‘Examination’, ‘Safety permit’, ‘Awareness’, ‘Driver’, ‘City Construction Committee’, ‘Bystander’
Topic13	‘tower crane’, ‘not in place’, ‘construction plan’, ‘tower crane’, ‘special equipment’, ‘hoisting’, ‘work permit’, ‘found and eliminated’, ‘lifting’, ‘Sisol’, ‘signal’, ‘untimely’, ‘violation’, ‘dismantling’, ‘involved’, ‘inspection’, ‘high temperature’, ‘hazardous work area’, ‘purlins’, ‘penetration’
Topic14	‘Work at height’, ‘Waterproofing’, ‘Reinforcement’, ‘Edge protection’, ‘Safety supplies’, ‘Ceiling’, ‘Construction plan’, ‘Roof’, ‘Specification’, ‘qualification’, ‘platform’, ‘professional’, ‘adjacent’, ‘deep look’, ‘Beijing’, ‘approval’, ‘warning’, ‘Jiangsu’, ‘elimination’, ‘Team’

Table 4. Standard LDA model with 20 subject terms per topic.

Topic	Subject Terms per Topic
Topic1	‘safety protective equipment’, ‘rectification’, ‘not in place’, ‘work violation’, ‘Identify and eliminate’, ‘safety production management’, ‘stoppage’, ‘site’, ‘period’, ‘Weak safety awareness’, ‘safety accident hazards’, ‘protection’, ‘wearing use’, ‘failure to be timely’, ‘violation’, ‘hazards’, ‘fall’, ‘measures’, ‘On-site safety management’, ‘workers’
Topic2	‘Workers’, ‘Safety management’, ‘Weak safety awareness’, ‘Safety accident hazards’, ‘On-site safety management’, ‘Illegal’, ‘Work in violation of rules’, ‘Not in place’, ‘safety education and training’, ‘safety technical briefing’, ‘Identify and eliminate’, ‘Shenzhen’, ‘measures’, ‘supervision not in place’, ‘trial’, ‘supervision and inspection’, ‘punishment’, ‘severe’, ‘several’, ‘ninety-second’
Topic3	‘steel pipe’, ‘support frame’, ‘safety hazard’, ‘demolition wall’, ‘violation’, ‘support’, ‘concrete’, ‘pump truck’, ‘bottom’, ‘worker’, ‘hollowing’, ‘method’, ‘untimely’, ‘violation’, ‘erector’, ‘qualification’, ‘exclusion’, ‘space’, ‘nominal’, ‘wall’
Topic4	‘not in place’, ‘safety management’, ‘safety education and training’, ‘Identify and eliminate’, ‘safety protective equipment’, ‘identified’, ‘management’, ‘system’, ‘operating procedures’, ‘Work at height’, ‘Hidden danger investigation’, ‘Safety inspection’, ‘Safety production law’, ‘Sound’, ‘Inspection’, ‘Violation’, ‘Treatment recommendation’, ‘Untimely’, ‘On-site safety management’, ‘General’
Topic5	‘Not timely’, ‘Identify and eliminate’, ‘concrete’, ‘cross work’, ‘safety accident hazards’, ‘falling’, ‘pouring’, ‘dumping’, ‘workers’, ‘safety education and training’, ‘violation’, ‘plastering’, ‘striking’, ‘main work’, ‘tertiary’, ‘suspected’, ‘establishment’, ‘education and training’, ‘archive’, ‘region’
Topic6	‘workers’, ‘Safety accident hazards’, ‘not timely’, ‘Identify and eliminate’, ‘On-site safety management’, ‘assurance’, ‘weak safety awareness’, ‘inadequate supervision’, ‘Safety management’, ‘Supervision’, ‘Item’, ‘Purpose’, ‘Safety management’, ‘Safety technical handover’, ‘Industry’, ‘Safety operation procedures’, ‘China Construction’, ‘Safety Education and Training’, ‘Measures’, ‘Excavator’
Topic7	‘Scaffolding’, ‘Safety protection’, ‘Safety accident hazards’, ‘Shenzhen’, ‘Erection’, ‘Erector’, ‘Construction plan’, ‘Elimination’, ‘Housing’, ‘platform’, ‘wear’, ‘steel pipe’, ‘management’, ‘acceptance’, ‘supervision’, ‘untimely’, ‘preparation’, ‘long-term’, ‘supervised’, ‘erected’
Topic8	‘not in place’, ‘safety education and training’, ‘safety technical briefing’, ‘On-site safety management’, ‘Identify and eliminate’, ‘unauthorized work’, ‘safety protective equipment’, ‘CCB’, ‘operation certificate’, ‘entry’, ‘lax’, ‘management’, ‘untimely’, ‘three bureaus’, ‘safety production management responsibility’, ‘safety production responsibility’, ‘Safety supervision’, ‘Inspection’, ‘Supervision not in place’, ‘Training’
Topic9	‘Safety accident hazards’, ‘Safety precautions’, ‘failure’, ‘holes’, ‘Identify and eliminate’, ‘duties’, ‘this case’, ‘exclusion’, ‘unauthorized work’, ‘compliance’, ‘this’, ‘unloading’, ‘On-site safety management’, ‘legal representative’, ‘reserved’, ‘failure to be timely’, ‘doing well’, ‘Safety protection’, ‘Protection’, ‘Elimination’
Topic10	‘Hainan’, ‘Safety production management’, ‘Special operations’, ‘On-site safety management’, ‘Violation’, ‘Not in place’, ‘Safety accident hazards’, ‘Branch’, ‘operation’, ‘worker’, ‘maintenance’, ‘imposed’, ‘responsible’, ‘failed to timely’, ‘Weak safety awareness’, ‘meran’, ‘vehicle’, ‘Administrative’, ‘Safety protective equipment’, ‘Violation of work rules’
Topic11	‘Construction plan’, ‘Pit’, ‘Hazard’, ‘Identify and eliminate’, ‘Trench’, ‘Excavation’, ‘Large’, ‘Safety accident hazards’, ‘Support’, ‘not timely’, ‘division’, ‘violation’, ‘sub-project’, ‘specification’, ‘rectification’, ‘taken effective’, ‘slope’, ‘On-site safety management’, ‘Safety management’, ‘Safety technical briefing’
Topic12	‘Identify and eliminate’, ‘Safety accident hazards’, ‘steel’, ‘production safety accidents’, ‘wellheads’, ‘untimely’, ‘three bureaus’, ‘concealed’, ‘engineering projects’, ‘fall’, ‘not in place’, ‘accident scene’, ‘safety precautions’, ‘cover’, ‘edge’, ‘manager’, ‘involved’, ‘Special operations’, ‘Supervision not in place’, ‘Elevator shaft’
Topic13	‘not in place’, ‘restored’, ‘steel structure’, ‘not timely’, ‘On-site safety management’, ‘Identify and eliminate’, ‘violation’, ‘general contract’, ‘inspection’, ‘rectification’, ‘production safety’, ‘unauthorized’, ‘liable’, ‘hoisting’, ‘violation’, ‘legal’, ‘internal’, ‘supervisory’, ‘safety management’, ‘emergency management’
Topic14	‘basket’, ‘safety protection’, ‘wire rope’, ‘status’, ‘far’, ‘a country’, ‘in’, ‘left’, ‘a rich’, ‘On-site safety management’, ‘Shenzhen’, ‘Liable’, ‘Failure to timely’, ‘Safety management responsibility’, ‘Two persons’, ‘Locking’, ‘Falling’, ‘Zhonghai’, ‘Violation’, ‘not in place’
Topic15	‘safety protective equipment’, ‘workers’, ‘working at height’, ‘violation’, ‘safety accident hazards’, ‘safety education and training’, ‘wearing’, ‘safety production management’, ‘Weak safety awareness’, ‘On-site safety management’, ‘qualification’, ‘Identify and eliminate’, ‘not in place’, ‘safety precautions’, ‘waterproof’, ‘eliminated’, ‘unified coordinated management’, ‘untimely’, ‘accident responsibility’, ‘platform’
Topic16	‘pit’, ‘not in place’, ‘not timely’, ‘wuhan’, ‘qualification’, ‘special’, ‘violation’, ‘management’, ‘contracting’, ‘qualification’, ‘procedures’, ‘links’, ‘slopes’, ‘municipal construction’, ‘support’, ‘certificates’, ‘safety supervision’, ‘competence’, ‘technology’, ‘Time’
Topic17	‘risk’, ‘treatment recommendation’, ‘ceiling’, ‘safety equipment’, ‘fall’, ‘determination’, ‘investigation team’, ‘responsible person’, ‘production safety’, ‘scaffolding’, ‘specification’, ‘exempt’, ‘protection’, ‘safety precautions’, ‘responsible accident’, ‘general’, ‘wearing’, ‘dumping’, ‘Hold responsible’, ‘Safety operation procedures’
Topic18	‘tower crane’, ‘Identify and eliminate’, ‘not in place’, ‘insulation’, ‘construction plan’, ‘safety accident hazards’, ‘lifting’, ‘On-site safety management’, ‘Weak safety awareness’, ‘Sizo’, ‘Violation’, ‘Safety education and training’, ‘Gloves’, ‘Command’, ‘Signal’, ‘Tower crane’, ‘Stop’, ‘untimely’, ‘aerospace’, ‘glass’
Topic19	‘decoration’, ‘violation’, ‘Identify and eliminate’, ‘decoration works’, ‘safety operating procedures’, ‘charged’, ‘safety accident hazards’, ‘On-site safety management’, ‘Issued’, ‘Shenzhen’, ‘Conductor’, ‘Not in time’, ‘Not in place’, ‘Renovation’, ‘Rectification’, ‘Proximity protection’, ‘Qualification’, ‘modesty blue’, ‘electrocution’, ‘breakage’
Topic20	‘cleanup’, ‘rectification’, ‘supervision’, ‘safety procedures’, ‘safety management’, ‘edge protection’, ‘violation’, ‘prevention’, ‘zhejiang’, ‘weather’, ‘work at height’, ‘untimely’, ‘management’, ‘nanshan’, ‘report’, ‘handover’, ‘facilities’, ‘safety accident hazards’, ‘pit’, ‘Identify and eliminate’
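
Many terms in the table above recur across topics, often with different capitalization (for example ‘Not in place’ and ‘not in place’). As a minimal sketch of how such recurrence could be tallied, the following Python snippet counts how many topics list each term after case normalization. It is not part of the authors' pipeline: the helper names `normalize` and `term_topic_counts` are hypothetical, and the `topics` dictionary holds only the first five terms of Topics 1, 2 and 4 as an illustrative subset, so the counts do not describe the full table.

```python
from collections import Counter

# Illustrative subset only: the first five terms of Topics 1, 2 and 4,
# copied from the table above. The full table has 20 topics of 20 terms.
topics = {
    "Topic1": ["safety protective equipment", "rectification", "not in place",
               "work violation", "Identify and eliminate"],
    "Topic2": ["Workers", "Safety management", "Weak safety awareness",
               "Safety accident hazards", "On-site safety management"],
    "Topic4": ["not in place", "safety management",
               "safety education and training", "Identify and eliminate",
               "safety protective equipment"],
}


def normalize(term: str) -> str:
    """Lower-case and collapse whitespace so casing variants compare equal."""
    return " ".join(term.lower().split())


def term_topic_counts(topic_terms: dict[str, list[str]]) -> Counter:
    """Count, for each normalized term, how many topics list it at least once."""
    counts: Counter = Counter()
    for terms in topic_terms.values():
        # A set ensures a term repeated within one topic is counted once.
        counts.update({normalize(t) for t in terms})
    return counts


counts = term_topic_counts(topics)
# Terms that appear in more than one topic of the subset, alphabetically.
shared = sorted(term for term, n in counts.items() if n > 1)
print(shared)
```

Applied to all 20 topics, the same tally would indicate which expressions (such as ‘Identify and eliminate’ or ‘not in place’) are spread across many themes rather than being specific to one accident type.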

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

