Identification of Safety Risk Factors for Shield Construction in Urban Drainage Deep Tunnel Based on Text Mining

Kai Hu; Junwu Wang; Xuetao Hu; Zhiyuan Cheng

doi:10.3390/pr13092782

¹ School of Architecture and Material Engineering, Hubei University of Education, Wuhan 430205, China

² School of Civil Engineering and Architecture, Wuhan University of Technology, Wuhan 430070, China

* Author to whom correspondence should be addressed.

Processes 2025, 13(9), 2782; https://doi.org/10.3390/pr13092782

This article belongs to the Section Process Control and Monitoring

Abstract

Shield construction of deep tunnels for urban drainage involves many risk factors, and potential safety hazards are difficult to monitor and identify directly. In order to improve the risk management level of shield construction in urban drainage deep tunnel, this study proposes a method for identifying risk factors by combining text mining technology and the entropy weight method. By using this method, 34 safety risk factors were successfully extracted from the safety accident reports of urban drainage deep tunnel shield construction and the related text data. The results of this study show that the text mining method could play an important role in the risk management of urban drainage deep tunnel shield construction; the introduction of the entropy weight method further improved the accuracy of risk factor identification. The results of this study not only enrich the research content of risk management in urban drainage deep tunnel shield construction but also provide theoretical guidance for managers to formulate risk management measures and optimize risk management procedures.

Keywords: urban deep drainage tunnel; shield construction; text mining; risk factor identification

1. Introduction

The pipe network system is an important lifeline of a city, and it is also a crucial infrastructure for ensuring the normal operation of the city [1]. In 2024, 154 cities in China suffered from urban flooding due to heavy rainstorms, causing extensive damage to urban infrastructure. The number of affected people reached 2.55 million, resulting in significant casualties and economic losses [2]. This indicates that the traditional construction mode of the pipe network system can no longer meet the current urban development needs. Deep tunnel drainage is an important means to solve urban waterlogging and improve urban drainage capacity [3], and many cities have begun to build deep tunnel drainage systems to ensure their smooth operation. The shield construction method has been widely applied in the construction of urban drainage deep tunnels due to its characteristics of high efficiency, safety, and environmental protection [4]. The diameter of drainage tunnels is small, and the small-diameter shield construction method is generally adopted [5]. Compared with the construction of large- and medium-diameter shield machines, small-diameter shield machines are smaller in size and prone to tunneling deviations. Small-diameter shield machines have insufficient adaptability to complex geological conditions, and there are considerable difficulties in equipment selection and control. Due to the narrow space of the tunnel, the transportation and installation of building materials in the construction of small-diameter shield machines are relatively difficult, and the ventilation and heat dissipation conditions are poor [1,3,5]. Moreover, urban drainage deep tunnel projects are characterized by large burial depth, small tunnel diameter, long route, complex geological conditions, and high environmental protection requirements [6], which further increase the difficulty and risk of small-diameter shield construction. 
Managing project risk effectively and reducing the probability of safety accidents is therefore a key issue that urgently needs to be solved.

Identification of risk factors is an important step in risk management and an important basis for risk assessment and the formulation of risk-control measures [7]. Many experts and scholars have studied the identification of safety risk factors in shield construction from different perspectives. He et al. [8] combined the methods of literature review and expert group evaluation to identify the safety risk factors in the tunnel shield construction process under the condition of crossing the bridge. Xu et al. [9] adopted the work breakdown structure–risk breakdown structure method to identify the safety risk factors across different stages of tunnel shield construction and then established a risk evaluation index system. Liu et al. [10] proposed an automatic identification method for tunnel shield construction safety risks based on ontology. Li et al. [11] used knowledge graph analysis to examine the connections between the safety risk factors in tunnel shield construction, thereby improving the efficiency of risk factor identification. Wu et al. [12] used text mining techniques to identify the risk factors in tunnel shield construction and then used association rule mining to further analyze the interrelationships among these factors. Tang et al. [13] used text mining techniques to identify the risk factors and reuse knowledge in tunnel shield construction and constructed a corpus for risk identification in subway shield construction, thereby improving the efficiency of word segmentation.

From the above analysis, it can be seen that research on identifying risk factors in shield construction has shifted from traditional empirical analysis and literature analysis to the application of big data analysis technology. The existing research results showed that the introduction of big data technologies, such as text mining, knowledge graphs, and knowledge base construction, could enhance the efficiency and accuracy of identifying risk factors in shield tunneling. However, the degree of standardization of basic data plays a significant role in the application effect of big data technology. The standardization level of safety management materials for the shield construction of the urban drainage deep tunnel project is not high, and some accident reports also lack a unified format and standardized wording, which increases the difficulty of processing basic data and the amount of manual work required. Constructing a dedicated vocabulary database and enhancing the effectiveness of feature item selection are effective means to improve the applicability of traditional text mining techniques [13]. Therefore, this study improved the traditional text mining techniques from these two aspects. Based on the risk characteristics of the shield construction in the urban drainage deep tunnel project and the relevant norms and standards, a safety risk dictionary was constructed to improve the efficiency of word segmentation. The entropy-weighted frequency had advantages such as dynamic weight adjustment, document content sensitivity, and anti-interference ability. This method was adopted to extract the characteristic items from the risk management data. Based on this improved text mining technology, a method for identifying the safety risks in the shield construction of urban drainage deep tunnels is proposed, which provides theoretical guidance for identifying key risk factors and formulating safety management plans.

2. Literature Review

2.1. Identification of Safety Risk Factors in Tunnel Shield Construction

The accuracy of risk factor identification directly affects the overall effect of risk management [14], and it is a hot issue in the risk management research field. Xu et al. [9] and Fu et al. [15] adopted the WBS–RBS method to identify the potential risks of tunnel shield construction. Zhang et al. [16] established a risk assessment index system for shield construction from the aspects of technology, geology, equipment, management, and safety accidents, drawing on their engineering experience. Liu et al. [10] combined ontology and rule reasoning techniques and utilized semantic reasoning to achieve risk identification in shield construction. Isah and Kim [7] built a system that could automatically identify and respond to risks in tunnel construction based on knowledge graphs and generative AI. Wu et al. [17] identified the risk of water gushing in the tunnel through theoretical and practical analysis of engineering cases. Shelake and Gogate [18] analyzed the risk factors affecting the progress of tunnel construction and developed a risk identification framework. Zhou et al. [19] proposed a framework based on BIM for automatically identifying and classifying environmental risks in the metro design stage.

2.2. Text Mining in the Risk Management of Tunnel Shield Construction

Text mining refers to the process of extracting valuable information from a large amount of text data by using technologies such as computer science, statistics, and machine learning [20,21]. Tang et al. [13] adopted the text mining method, developed a professional vocabulary bank, and identified the safety risk factors of metro shield construction through the analysis of risk reports. Vagnoli and Remenyte-Prescott [22] proposed a data mining method for identifying risks in tunnel construction; the key areas and occurrence times where risks exist in tunnel construction were identified via data analysis. Compared with traditional data mining techniques, text mining has several major advantages:

(1) Understanding semantics. With the aid of natural language processing technology, text mining does not merely count the key words and their frequencies in the text but can also provide a deeper understanding of the meaning of the text data.
(2) Revealing complex relationships among risk factors. Traditional data mining techniques have difficulty capturing the correlations between words in text data and the similarities among different documents, while text mining can further extract the connections between words.
(3) Handling unstructured data. Most traditional data mining techniques can only handle structured data, whereas most engineering materials, such as accident investigation reports and construction plans, are unstructured. Text mining technology can handle these text data, which not only greatly saves data processing time but also increases the objectivity of text analysis to a certain extent.

2.3. Gap in the Existing Relevant Research

The existing related studies adopted methods such as expert investigation, case analysis, literature analysis, work decomposition, and laboratory decision-making to identify the risk factors of shield construction in tunnel engineering. Most of the shield construction data for urban drainage deep tunnels are unstructured text, and the shield construction process is dynamic, yet many existing studies analyzed structured data and had difficulty achieving dynamic data updates [12,23]. Some studies have applied the concept of big data analysis and data mining techniques to the research on safety risk identification in the shield construction of tunnel engineering, which improved the accuracy and objectivity of risk factor identification to a certain extent. These studies provided references and guidance for this research in terms of methodology. However, compared with general tunnel projects, urban drainage deep tunnels have specific particularities: small-diameter shield construction methods are adopted, the transportation of materials is difficult, the working surface for personnel operations is narrow, and, because the tunnel lies at the very bottom of the underground space, the surrounding geological environment is difficult to monitor. The current lists of tunnel safety risk factors and index systems therefore do not meet the needs of risk control for shield construction in urban drainage deep tunnels.

In terms of the risk factor identification method, traditional text mining technology requires the standardization of basic data, and the selection of feature items has a significant impact on the accuracy of risk factor identification. At present, there is no lexicon available regarding the safety risks associated with the shield construction of urban drainage deep tunnel projects. Meanwhile, there are numerous methods for extracting features in text mining technology, such as the LDA topic model [24], BERT text embedding [25], and inverse document frequency [12]. These methods have their own distinct advantages and application scenarios. However, urban drainage deep tunnel projects are located in different cities, regions, and countries [5], and the safety of their shield construction involves numerous and diverse risk factors, which requires the risk identification method to have strong adaptability. Compared with other methods, entropy-weighted frequency can dynamically adjust the weights based on the content of a single document, which enables it to better reflect the importance of words within the document [26]. This method also pays more attention to the distribution of words in a single document and can reflect the specific content of the document [27]. Furthermore, it assigns different weights to the words in each document based on their distribution, making the weight allocation more diverse and detailed. Finally, because it takes into account the entropy of the word distribution, it is more resistant to interference from noise in the text [28]. Few studies have applied entropy-weighted frequency analysis to extract the safety risk characteristics of the shield construction of urban drainage deep tunnels.

3. Methodology

3.1. Research Framework for Risk Factor Identification Based on Text Mining

In order to identify the key risk factors in shield construction of urban deep tunnel drainage projects, this study used text mining technology. The text mining process mainly includes data collection, text preprocessing, data mining, and result visualization [21]. First, the identification of risk factors based on text mining required basic text data. We obtained the safety management materials for the shield construction of the urban drainage deep tunnel project through various means, such as online searches and on-site investigations. Since these safety management materials did not have a unified format or expression, a corpus was established by screening the risk statements. Then, the Jieba toolkit in Python 3.12 was used to construct the word bank, perform word segmentation, and develop a professional word database to improve the accuracy of word segmentation. Then, entropy-weighted frequency was used to conduct parameterized extraction of frequent words, thereby enhancing the accuracy and efficiency of feature item extraction. Finally, from the most frequently used words, words with risk-related meanings were selected to establish a set of safety risk factors. The key risk factors were classified according to the 4M1E theory. The research framework for risk factor identification is shown in Figure 1.

Figure 1. Research framework for identifying safety risk factors in shield construction of deep-buried urban drainage tunnels.

3.2. Feature Selection Model of Risk Factors

Feature selection of the initial feature items was carried out by combining the word frequency analysis method and manual screening. First, high-frequency words were screened out from the initial feature items according to the feature value threshold. Then, through manual identification, words carrying the meaning of safety risk factors were extracted from the high-frequency words.

The characteristic parameters of the case text of safety accidents in shield construction of urban drainage deep tunnels mainly included Term Frequency (TF), Document Frequency (DF), and Term Frequency–Inverse Document Frequency (TF–IDF). Furthermore, in order to improve the accuracy of the research results, information entropy was introduced into the text mining method, and Term Frequency–Information Entropy (TF–H) was also used as a feature selection index.

The high-frequency vocabulary definition formula was adopted to set the threshold of high-frequency vocabulary.

$T$ represents the threshold of high-frequency vocabulary, and $I_{1}$ represents the number of words related to risk factors that appeared only once in all safety accident investigation reports. The threshold was calculated as follows:

$$T = \frac{-1 + \sqrt{1 + 8 I_{1}}}{2} \tag{1}$$
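As a quick check of Formula (1), the threshold can be computed directly from the number of words that occur only once. The sketch below is illustrative only: the function name and the sample value of $I_{1}$ are assumptions and are not taken from this study's data.

```python
import math


def high_frequency_threshold(i1: int) -> float:
    """Threshold T from Formula (1), given I_1 (number of words that occur exactly once)."""
    if i1 < 0:
        raise ValueError("I_1 must be non-negative")
    return (-1 + math.sqrt(1 + 8 * i1)) / 2


# Hypothetical example: 10 words occur exactly once in the corpus
print(high_frequency_threshold(10))  # 4.0
```

Because T grows with the square root of $I_{1}$, a corpus with many single-occurrence words yields a high threshold and therefore a short list of high-frequency words.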

Term frequency represents how often a given word occurs in a given case text and is calculated as follows:

$$tf_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}} \tag{2}$$

where $n_{i,j}$ represents the number of occurrences of the i-th word in the j-th case text, and $\sum_{k} n_{k,j}$ represents the total number of occurrences of all feature items in the j-th case text. $n_{i}$ represents the number of occurrences, across all safety accident investigation reports, of the characteristic item representing the i-th safety risk factor of shield construction in the urban drainage deep tunnel, and $TF_{i} = n_{i}$. A higher TF value indicates that this risk factor characteristic item contributes more to the occurrence of safety accidents.

$DF_{i}$ represents the document frequency of the characteristic item representing the i-th safety risk factor of the shield construction of the urban drainage deep tunnel. $DF_{i} = |\{j : t_{i} \in d_{j}\}|$ is the number of documents containing the i-th characteristic item among all safety accident investigation reports.

TF–IDF was used to assess the significance of a risk factor in the investigation report of safety accidents in the shield construction of urban drainage deep tunnels. If this risk factor appears more frequently in all accident investigation reports, it indicates that the discrimination of this risk factor is poorer and its importance is lower.

$|D|$ represents the number of documents in the document collection. The TF–IDF value is then calculated as follows:

$$IDF_{i} = \log \frac{|D|}{|\{j : t_{i} \in d_{j}\}|} \tag{3}$$

$$\mathrm{TF\text{-}IDF}_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}} \times \log \frac{|D|}{|\{j : t_{i} \in d_{j}\}|} \tag{4}$$
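Formulas (2)–(4) can be traced on a toy corpus. The following is a minimal sketch, assuming each report has already been segmented into a list of words; the function name and the two sample reports are hypothetical, and the natural logarithm is an assumption because the base of the logarithm is not stated.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: TF-IDF} dict per document, following Formulas (2)-(4)."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        total = sum(counts.values())  # all feature-item occurrences in this document
        scores.append({
            term: (n / total) * math.log(n_docs / df[term])
            for term, n in counts.items()
        })
    return scores


# Hypothetical, already-segmented accident reports
docs = [
    ["collapse", "grouting", "collapse"],
    ["grouting", "leakage"],
]
scores = tf_idf(docs)
print(scores[0])
```

In this toy corpus "grouting" appears in every report, so its IDF and hence its TF–IDF value is zero, which illustrates why a term that occurs in all reports has poor discrimination.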

TF–H reflects the distribution of the characteristic items representing risk factors in all safety accident investigation reports. If the words representing risk factors are distributed more evenly in the accident investigation report, it indicates that such risk factors occur frequently.

$p_{i,j}$ represents the share of the occurrences of the i-th feature item that falls in the j-th safety accident investigation report. Its calculation formula is as follows:

$$p_{i,j} = \frac{TF_{j}^{i}}{\sum_{l=1}^{m} TF_{l}^{i}} \tag{5}$$

where $TF_{j}^{i}$ represents the frequency with which the characteristic item of the i-th safety risk factor of shield construction in the urban drainage deep tunnel appears in the j-th safety accident investigation report, and $m$ is the number of reports.

Information entropy is denoted by $H_{i}$ and represents how widely the characteristic item of the i-th safety risk factor is distributed across the safety accident investigation reports. A larger value indicates greater uncertainty in the occurrence of the risk factor. Its calculation formula is as follows:

$$H_{i} = -\sum_{j=1}^{m} p_{i,j} \log p_{i,j} \tag{6}$$

By integrating information entropy and term frequency, the functional equation of TF–H is obtained:

$$\mathrm{TF\text{-}H}_{i} = n_{i} \times H_{i} = -n_{i} \times \sum_{j=1}^{m} p_{i,j} \log p_{i,j} \tag{7}$$

Drawing on previous research results, the initial feature items with cumulative TF–H values within the range of 0% to 90% were recorded as high-frequency words, and the rest were regarded as low-frequency words.
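The TF–H index of Formulas (5)–(7) can be sketched for a single feature item as follows. This is a minimal illustration under stated assumptions: the function name and sample counts are hypothetical, the natural logarithm is assumed, and terms with zero count are skipped on the usual convention that 0 log 0 = 0.

```python
import math


def tf_h(counts_per_report):
    """TF-H of one feature item, given its occurrence count in each report."""
    n_i = sum(counts_per_report)        # TF_i = n_i, total occurrences over all reports
    if n_i == 0:
        return 0.0
    entropy = 0.0
    for count in counts_per_report:
        if count > 0:                   # skip zero counts (0 * log 0 treated as 0)
            p = count / n_i             # Formula (5)
            entropy -= p * math.log(p)  # Formula (6)
    return n_i * entropy                # Formula (7)


# Hypothetical counts of one term in three reports
print(tf_h([2, 2, 0]))   # evenly spread over two reports
print(tf_h([4, 0, 0]))   # concentrated in one report, entropy is zero
```

Two terms with the same total count therefore receive different scores: the term spread evenly across reports scores higher than the term concentrated in a single report.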

4. Data Analysis

4.1. Sample

In order to improve the generalizability of the method and the external validity of the results, this study collected cases of shield construction safety accidents and accident investigation reports of urban drainage tunnel projects in different countries and regions around the world over the past ten years through channels such as news reports, announcements, safety management-related websites, and on-site investigations. A total of 176 accident cases were collected. The types of accidents include object strike accidents, collapse accidents, water inrush accidents, explosion accidents, poisoning and asphyxiation accidents, falls from heights accidents, fire accidents, electric shock accidents, mechanical injury accidents, and other injury accidents. To ensure the authenticity and authority of the collected samples, the cases of shield construction safety accidents collected were verified through contacting the parties involved and relevant personnel from the safety management department. At the same time, experts in the industry were invited to review the time, completeness, and relevance of the collected samples, eliminating those with incomplete information, missing information, or an inability to fully reflect the characteristics of the accidents. For example, samples with duplicate data, excessive subjective descriptions, abnormal data, and those not having typical significance were excluded. Finally, 93 valid samples were obtained. In terms of accident types, there were 14 object impact accidents, 35 collapse accidents, 18 mechanical injury accidents, and 26 water gushing accidents.

In addition, most of the investigation reports on safety accidents in the shield construction of urban drainage tunnel projects were unstructured data. Because of writing errors, deviations, or omissions in the records, some incorrect or inconsistent words could be mixed into the investigation report text, which would degrade the text mining results. Through text preprocessing, meaningless data, duplicate data, and defective data were processed; for example, typos were corrected and inconsistent terms were unified, such as changing “shield excavator” and “shield drilling rig” to “shield machine” and “accident” to “safety accident”.
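The term unification step described above can be sketched as a simple replacement pass. The mapping below contains only the three examples given in the text; the function name, the English sample sentence, and the use of a regular expression to avoid producing "safety safety accident" are assumptions made for illustration.

```python
import re

# Variant term -> canonical term (examples taken from the text above)
SYNONYMS = {
    "shield excavator": "shield machine",
    "shield drilling rig": "shield machine",
}


def normalize(text: str) -> str:
    """Unify variant terms so that each concept is counted under one word."""
    for variant, canonical in SYNONYMS.items():
        text = text.replace(variant, canonical)
    # Expand a bare "accident" to "safety accident" without doubling the prefix
    text = re.sub(r"(?<!safety )\baccident\b", "safety accident", text)
    return text


print(normalize("The shield excavator stalled before the accident."))
# The shield machine stalled before the safety accident.
```

Without this step the same concept would be split across several feature items, lowering each item's term frequency.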

4.2. Word Bank and Word Segmentation

4.2.1. Establishment of the Word Bank

This study adopted a dictionary-based word segmentation method to construct a safety risk word bank for shield construction in urban drainage deep tunnels. The word bank for the safety risks of urban drainage deep tunnel shield construction was mainly divided into special word banks and stop word banks, as shown in Figure 2.

Figure 2. The composition of the word bank.

In order to establish the safety risk lexicon, the general dictionary of “Civil Construction” was downloaded from the Sogou input method, and “Safety Engineering”, “engineering construction”, and “Safety management” were downloaded from the Baidu lexicon. Furthermore, a text file, userdict.txt, was created to store the entries of the safety risk lexicon, and Jieba's load_userdict() function was called to load this custom dictionary file before segmentation.

In terms of establishing the custom vocabulary database, since the general vocabulary database cannot meet the requirements for the text mining of safety factors in the shield construction of urban drainage deep tunnels, relevant terms extracted from industry standards and specifications, such as “Urban Rail Transit Engineering Basic Terminology Standard GB/T 50833-2012” [29] and “Shield Tunnel Construction Technical Specifications” [30], were used to form a custom vocabulary database, for example, “Shield Tail Brush Wear” and “Synchronous Grouting Pressure”.

In terms of establishing the stop word bank, words with no mining value can be filtered out to further enhance the efficiency of text mining. The stop word bank mainly consists of three categories. Firstly, some descriptive words were included in the stop word bank. Secondly, punctuation marks and numerical forms of words were included. Thirdly, the HIT Stop Words and Baidu Stop Words lists were selected as supplementary dictionaries. In addition, in order to reduce the influence of construction sites, the “interval name” and “engineering name” were extracted, and the project names in the custom word bank were included in the stop word bank to improve the efficiency of text mining.

4.2.2. Text Word Segmentation

Based on the collected samples, a corpus was established. Using the established safety risk word bank, custom word bank, and stop word bank, the Jieba segmentation package was employed to segment the safety risk statements of the shield construction for the urban drainage deep tunnel project into individual words. The results of partial word segmentation are shown in Table 1.

Table 1. The results of partial word segmentation.
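To illustrate why a custom word bank and a stop word bank matter for dictionary-based segmentation, the sketch below uses plain forward maximum matching in pure Python. This is a deliberately simplified stand-in and not Jieba's own algorithm, which is more elaborate; the function name, the `max_len` parameter, and the Chinese sample sentence and vocabularies are hypothetical.

```python
def segment(text, vocab, stop_words, max_len=8):
    """Forward maximum matching: always take the longest dictionary word available."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, falling back to a single character
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                break
        if piece not in stop_words:   # drop words with no mining value
            tokens.append(piece)
        i += size
    return tokens


sentence = "盾尾刷磨损导致漏浆"   # hypothetical risk statement
stop_words = {"导致"}

general_vocab = {"盾尾", "导致", "漏浆"}
custom_vocab = general_vocab | {"盾尾刷磨损"}   # adds the professional term

print(segment(sentence, general_vocab, stop_words))
print(segment(sentence, custom_vocab, stop_words))
```

With only the general vocabulary the professional term is broken into fragments, whereas adding it to the custom word bank keeps it as a single feature item.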

4.3. Risk Factor Identification

The concept of cumulative word frequency was combined with the ABC classification method as the criterion for defining the threshold. The threshold for high-frequency words was determined based on the curve distribution of word frequency, word quantity, and the proportion of cumulative word frequency. The result of threshold definition is shown in Figure 3.

Figure 3. High-frequency threshold determination based on TF–H.
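The cumulative-share cut-off can be sketched as follows: terms are sorted by score in descending order and kept while the running share of the total stays within the chosen proportion. The function name and sample scores are hypothetical, and excluding the term that pushes the cumulative share past 90% is an assumed reading of the rule.

```python
def select_high_frequency(scores, share=0.90):
    """Keep top-scoring terms while their cumulative share of the total is <= share."""
    total = sum(scores.values())
    if total <= 0:
        return []
    selected, running = [], 0.0
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    for term, value in ranked:
        running += value
        if running / total > share:   # this term would exceed the cut-off
            break
        selected.append(term)
    return selected


# Hypothetical TF-H scores
scores = {"a": 50, "b": 30, "c": 15, "d": 5}
print(select_high_frequency(scores))  # ['a', 'b']
```

The same function applies to cumulative TF by passing raw term frequencies instead of TF–H values.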

To extract high-frequency words from the initial feature items, three methods were adopted: the high-frequency word definition formula, the cumulative TF value, and the cumulative TF–H value. The threshold of high-frequency vocabulary was calculated by Formula (1), and the TF value and the TF–H value were calculated by Formulas (2)–(7). The calculation results are shown in Table 2.

Table 2. The calculation result of feature items selection.

When the high-frequency vocabulary definition formula was adopted, the number of words that appeared only once was relatively large, so only 28 high-frequency words were obtained, and the safety risk factors selected by this method could be incomplete. When the cumulative TF value reached 90%, the number of high-frequency words was 1216, and there was lexical redundancy. When the cumulative TF–H value reached 90%, there were 204 high-frequency words, which basically conformed to the proportion of 90% of the important factors in the ABC classification method. Compared with the other methods, the high-frequency words extracted by the TF–H method were more accurate and effective. Table 3 shows the words with TF–H values greater than 80.

Table 3. TF–H values greater than 80, based on the high-frequency vocabulary list.

Among these extracted high-frequency words, some had similar meanings, such as geological conditions, construction environment, and environmental impact. Meanwhile, some terms had little correlation with risk factors, such as construction progress, construction noise, laws and regulations, energy consumption, and lighting conditions. Thus, when identifying risk factors, it is also necessary to select the high-frequency words that truly carry the meaning of risk factors based on context and semantics. This study adopted manual screening, comprehensively considering the characteristics of shield construction in urban drainage deep tunnels, and further screened out 34 high-frequency words containing the semantics of safety risk factors. The initial set of risk factors is shown in Table 4.

Table 4. The initial set of safety risk factors for urban drainage deep tunnel shield construction.

4.4. Risk Factor List

Through the application of text mining technology, 34 safety risk factors for shield construction in urban deep tunnel drainage projects were identified and screened from the collected valid samples. This study further used the principal component analysis method to reduce the dimensionality of these risk factors. The scatter plot is shown in Figure 4.

Figure 4. Cluster scatter plot of safety risk factors for shield construction in the urban drainage deep tunnel project.

Combining the results of the cluster analysis with the 4M1E theory, these risk factors were classified into four aspects: personnel management, mechanical equipment and materials, construction techniques, and surrounding environment, forming a list of safety risk factors for shield construction in urban drainage deep tunnel projects, as shown in Table 5.

Table 5. List of safety risk factors for urban drainage deep tunnel shield construction.

5. Verification and Implication

5.1. Verification of Risk Factor Identification Methods

Based on the principles of text mining technology, this study established a risk word bank for the shield construction of urban drainage deep tunnels, extracted the characteristic items of risk factors using entropy-weighted frequency, and identified the safety risk factors of shield construction in urban drainage deep tunnels. In order to verify the effectiveness of the proposed method, the risk factor identification results were compared with the analysis results of multiple urban drainage deep tunnel project shield construction safety accident reports collected from different regions. The results showed that most of the risk factors identified in this study appeared in the cause analysis of the safety accident reports.

In addition, in order to further verify the advantages of the method proposed in this study, the ROC curve was used to compare the risk identification effects of entropy-weighted frequency, inverse document frequency, the LDA topic model, and the BERT text embedding method. The ROC curves of these four methods are shown in Figure 5. According to the application principle of the ROC curve method, the evaluation indicators for assessing the effectiveness of risk factor identification mainly included accuracy, precision, recall rate, and F1–score. Based on the risk factor identification results of the four methods, these four evaluation indicators were calculated. The results are shown in Table 6.

Figure 5. The ROC curves of four risk-identifying methods.

Table 6. The calculation results of the evaluation indicators for risk identification effectiveness.
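For reference, the four evaluation indicators follow from the counts of a binary confusion matrix in the standard way. The sketch below uses the usual textbook definitions; the function name and the sample counts are hypothetical and do not reproduce the values in Table 6.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Accuracy, precision, recall, and F1-score from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts: 8 risk terms found, 2 false alarms, 2 missed, 8 correctly rejected
metrics = classification_metrics(tp=8, fp=2, fn=2, tn=8)
print(metrics)
```

Here a "positive" would be a word judged to be a genuine safety risk factor, so precision penalizes redundant words and recall penalizes missing factors.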

Compared with the other methods, the entropy-weighted frequency method achieves the highest area under the ROC curve (AUC = 0.82) in extracting the safety risk factors for shield construction in urban drainage deep tunnel projects. This indicates that the method performs well in identifying the safety risk factors of shield construction in urban drainage deep tunnel projects.

5.2. Management Implication

The TF–H value of the risk feature items indicated the significance of the corresponding risk factor in the text. The importance of these risk factors could be ranked according to the TF–H value, thereby providing a basis for the formulation of safety management measures for the shield construction of the urban drainage deep tunnel project.

Based on the list of safety risk factors for the shield construction of the urban drainage deep tunnel project, the importance of the risk types and risk factors could be ranked. In order to rank the importance of risk types, the TF–H values of risk factors in each risk type were added up, and the calculation results are as follows: The score for personnel management is 1689.56; the score for mechanical equipment and materials is 1060.51; the score for construction techniques is 1834.23; and the score for the surrounding environment is 1941.28. According to the calculation results, in the safety risk management of the shield construction of the urban drainage deep tunnel project, the risk of the surrounding environment > the risk of the construction technology > the risk of personnel management > the risk of the mechanical equipment and materials. Table 5 shows the importance ranking of risk factors in each of the different risk types.
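The ranking of risk types can be reproduced from the four category totals reported above. In the sketch below, the totals are the values stated in the text, while the helper `sum_by_category` and its per-factor inputs are hypothetical and only show how such totals would be accumulated.

```python
from collections import defaultdict


def sum_by_category(factor_scores, factor_category):
    """Add up the TF-H values of the risk factors that belong to each risk type."""
    totals = defaultdict(float)
    for factor, score in factor_scores.items():
        totals[factor_category[factor]] += score
    return dict(totals)


# Category totals as reported in the text
category_scores = {
    "personnel management": 1689.56,
    "mechanical equipment and materials": 1060.51,
    "construction techniques": 1834.23,
    "surrounding environment": 1941.28,
}

# Rank risk types from most to least important
ranking = sorted(category_scores, key=category_scores.get, reverse=True)
for rank, name in enumerate(ranking, start=1):
    print(rank, name, category_scores[name])
```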

Based on the above analysis, the importance ranking of the safety risk factors and their types for the shield construction of the urban drainage deep tunnel project was obtained. Accordingly, surrounding environment risks should be given priority in the safety management of shield construction for urban drainage deep tunnels. Before project planning and construction, the surrounding environment, including geological conditions, underground pipelines, and surface buildings, should be comprehensively assessed to minimize the impact of environmental risks on construction. Although construction technology risks rank below those of the surrounding environment, they remain a significant factor affecting construction safety. It is necessary to continuously optimize the construction process, adopt advanced construction technologies and methods, improve construction quality and efficiency, and reduce uncertainties during construction. At the same time, personnel management risks cannot be ignored, as the operations and behaviors of personnel directly affect construction safety. It is crucial to strengthen the training and management of construction personnel, enhance their safety awareness and skills, and ensure that they work in accordance with standard operating procedures. Furthermore, although the risks associated with mechanical equipment and materials are relatively low, equipment and materials form the foundation for the smooth progress of construction. Maintaining machinery properly and controlling the quality of materials are important measures for reducing construction risks.

6. Conclusions

The shield construction of deep tunnels for urban drainage involves many risk factors that are difficult to manage. This study proposed a text-mining-based method for identifying the safety risk factors in urban drainage deep tunnel shield construction. In the practical application, 34 key risk factors were extracted from 93 safety accident reports and classified into four risk types through cluster analysis, forming a safety risk list for the shield construction of the urban drainage deep tunnel project. The ROC curve method was then used to compare the risk identification performance of the entropy-weighted term frequency (TF–H), TF–IDF, LDA topic model, and BERT text embedding methods. The results showed that the entropy-weighted term frequency method proposed in this study performed better in identifying the safety risk factors of shield construction in urban drainage deep tunnel projects. Finally, by analyzing the importance of the risk factors and risk types, a priority ranking of the safety risk factors and their types was proposed, providing theoretical guidance for formulating safety risk management measures.

This study has several limitations that should be addressed in future research. Firstly, the amount of data was relatively small, which may limit the generalizability of the results. Secondly, the safety risk lexicon constructed for text preprocessing had insufficient coverage, which affected the effectiveness of risk factor identification. In addition, some words or phrases that are useful for risk identification were removed during preprocessing, which could reduce the comprehensiveness of the mining results. Thirdly, the risk identification results were affected to some extent by factors such as the language of the texts, the report writing style, the quantity and quality of the texts, and the time of writing, so the generalization ability of the model still needs to be strengthened and verified. Finally, given that different data sources may contain valuable information, future research should consider combining data from multiple sources to supplement the list of accident attributes.

Author Contributions

Conceptualization, K.H.; methodology, K.H. and J.W.; validation, J.W.; formal analysis, K.H.; data curation, K.H. and X.H.; writing—original draft preparation, K.H.; writing—review and editing, Z.C. and K.H.; project administration, J.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the 2022 Annual Research Plan Project of the Education Department of Hubei Province (B2022215).

Data Availability Statement

Data generated or analyzed during the study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Hu, K.; Wang, J.W.; Wu, H. Construction safety risk assessment of large-sized deep drainage tunnel projects. Math. Probl. Eng. 2021, 2021, 7380555.
2. Sun, J.Y.; Wu, X.W.; Wang, G.H.; He, J.G.; Li, W.T. The governance and optimization of urban flooding in dense urban areas utilizing deep tunnel drainage systems: A case study of Guangzhou, China. Water 2024, 16, 2429.
3. Hu, K.; Wang, J.W.; Wu, D.H.; Wang, Y.A. Risk assessment of small-diameter shield construction in a deep drainage tunnel based on an ISM-CRITIC-cloud model. Buildings 2024, 14, 3920.
4. Hyun, K.C.; Min, S.; Choi, H.; Park, J.; Lee, I.M. Risk analysis using fault-tree analysis (FTA) and analytic hierarchy process (AHP) applicable to shield TBM tunnels. Tunn. Undergr. Space Technol. 2015, 49, 121–129.
5. Ding, L.; Sun, Y.J.; Zhang, W.Z.; Bi, G.; Xu, H.Z. Stress monitoring of segment structure during the construction of the small-diameter shield tunnel. Sensors 2023, 23, 8023.
6. Jiang, J.; Liu, G.Y.; Ou, X.D. Risk coupling analysis of deep foundation pits adjacent to existing underpass tunnels based on dynamic Bayesian network and N-K model. Appl. Sci. 2022, 12, 10467.
7. Isah, M.A.; Kim, B.S. Question-answering system powered by knowledge graph and generative pretrained transformer to support risk identification in tunnel projects. J. Constr. Eng. Manag. 2025, 151, 04024193.
8. He, K.; Zhu, J.; Wang, H.; Huang, Y.L.; Li, H.J.; Dai, Z.S.; Zhang, J.X.; Stathopoulos, T. Safety risk evaluation of metro shield construction when undercrossing a bridge. Buildings 2023, 13, 2540.
9. Xu, N.; Guo, C.R.; Wang, L.; Zhou, X.Q.; Xie, Y. A three-stage dynamic risk model for metro shield tunnel construction. KSCE J. Civ. Eng. 2024, 28, 503–516.
10. Liu, P.; Jin, X.Q.; Shang, Y.T. Ontology-based automated knowledge identification of safety risks in metro shield construction. J. Asian Archit. Build. Eng. 2025, 1–30.
11. Li, X.W.; Li, S.C.; Yuan, J.F.; Wan, Z.; Liu, X. A data-driven and knowledge graph-based research on safety risk-coupled evolution analysis and assessment in shield tunneling. Tunn. Undergr. Space Technol. 2025, 162, 106657.
12. Wu, K.P.; Zhang, J.S.; Huang, Y.L.; Wang, H.; Li, H.J.; Chen, H.H. Research on safety risk transfer in subway shield construction based on text mining and complex networks. Buildings 2023, 13, 2700.
13. Tang, C.; Shen, C.X.; Zhang, J.J.; Guo, Z. Identification of safety risk factors in metro shield construction. Buildings 2024, 14, 492.
14. Koc, K.; Gurgun, A.P. Stakeholder-associated life cycle risks in construction supply chain. J. Manag. Eng. 2021, 37, 4020107.
15. Fu, T.; Shi, K.B.; Shi, R.Y.; Lu, Z.P.; Zhang, J.M. Risk assessment of TBM construction based on a matter-element extension model with optimized weight distribution. Appl. Sci. 2024, 14, 5911.
16. Zhang, Z.X.; Wang, B.; Wang, X.F.; He, Y.T.; Wang, H.X.; Zhao, S.B. Safety-risk assessment for TBM construction of hydraulic tunnel based on fuzzy evidence reasoning. Processes 2022, 10, 2597.
17. Wu, B.; Chen, H.H.; Huang, W.; Meng, G.W. Dynamic evaluation method of the EW-AHP attribute identification model for the tunnel gushing water disaster under interval conditions and applications. Math. Probl. Eng. 2021, 2021, 6661609.
18. Shelake, A.G.; Gogate, N.G. An integrated risk prioritization and determination of activity-wise delay (IRPAD) framework for enhancing schedule management in tunnel projects. Eng. Constr. Archit. Manag. 2024, ahead-of-print.
19. Zhou, M.K.; Tang, Y.G.; Jin, H.; Zhang, B.; Sang, X.W. A BIM-based identification and classification method of environmental risks in the design of Beijing subway. J. Civ. Eng. Manag. 2021, 27, 500–514.
20. Lu, L.Y.; Ji, M.L.; Wen, X.; Xiang, Y. An empirical study on construction emergency disaster management and risk assessment in shield tunnel construction project with big data analysis. Int. J. Data Min. Bioinform. 2024, 28, 406–425.
21. Brown, C.K.; Cameron, B.G. Assessing changes in reliability methods over time: An unsupervised text mining approach. Qual. Reliab. Eng. Int. 2024, 40, 3597–3619.
22. Vagnoli, M.; Remenyte-Prescott, R. An ensemble-based change-point detection method for identifying unexpected behaviour of railway tunnel infrastructures. Tunn. Undergr. Space Technol. 2018, 81, 68–82.
23. Luan, T.T.; Zhang, X.; Li, H.R.; Wang, K.; Li, X.Y. Dynamic risk analysis of hazardous materials highway tunnel transportation based on fuzzy Bayesian network. J. Loss Prev. Process Ind. 2024, 92, 105443.
24. Zhou, Z.Y.; Guo, J.H.; Huang, J.H. Chemical safety risk identification and analysis based on improved LDA topic model and Bayesian networks. Appl. Sci. 2025, 15, 6197.
25. Macêdo, J.B.; Moura, M.D.; Aichele, D.; Lins, I.D. Identification of risk features using text mining and BERT-based models: Application to an oil refinery. Process Saf. Environ. Prot. 2022, 158, 382–398.
26. Xu, N.; Ma, L.; Liu, Q.; Wang, L.; Deng, Y.L. An improved text mining approach to extract safety risk factors from construction accident reports. Saf. Sci. 2021, 138, 105216.
27. Liu, Y.P.; Wang, J.W.; Tang, S.R.; Zhang, J.J.; Wan, J.Y.J. Integrating information entropy and latent Dirichlet allocation models for analysis of safety accidents in the construction industry. Buildings 2023, 13, 1831.
28. Song, W.Y.; Rong, W.; Tang, Y.Q. Quantifying risk of service failure in customer complaints: A textual analysis-based approach. Adv. Eng. Inform. 2024, 60, 102377.
29. GB/T 50833-2012; Basic Terminology Standard for Urban Rail Transit Engineering. General Administration of Quality Supervision, Inspection and Quarantine: Beijing, China, 2012.
30. GB 50446-2008; Specifications for Construction and Acceptance of Shield Tunneling. Institute of Standards and Quantities of Ministry of Housing and Urban-Rural Development: Beijing, China, 2008.

Figure 1. Research framework for identifying safety risk factors in shield construction of deep-buried urban drainage tunnels.

Figure 2. The composition of the word bank.

Figure 3. High-frequency threshold determination based on TF–H.

Figure 4. Cluster scatter plot of safety risk factors for shield construction in the urban drainage deep tunnel project.

Table 1. The results of partial word segmentation.

No.	Segmentation Result
1	The geological conditions/of/the construction route/were/not adequately estimated/, and/the presence/of/hard strata/was/not promptly/identified/./
2	The design/of/the lining structure/was/unreasonable/, and/during/the construction process/, the defects/in/the lining/were/not promptly detected/./
3	The maintenance/of/mechanical equipment/was/inadequate/, failing/to/promptly detect/and/eliminate/potential faults/./
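The slash-delimited segmentation results in Table 1 can be turned into token lists for frequency counting. The Python sketch below is one simple way to do this; the stop-word set is a small hypothetical example and does not reproduce the stop-word list or word bank used in this study.

```python
import string

# Hypothetical stop words for illustration only.
STOP_WORDS = {"of", "the", "and", "was", "were", "to", "in", "during"}


def tokens_from_segmented(line):
    """Split a slash-delimited segmentation result into cleaned tokens."""
    tokens = []
    for piece in line.split("/"):
        # Trim surrounding punctuation and whitespace, then normalize case.
        cleaned = piece.strip(string.punctuation + string.whitespace).lower()
        if cleaned and cleaned not in STOP_WORDS:
            tokens.append(cleaned)
    return tokens


# Row 3 of Table 1.
row3 = ("The maintenance/of/mechanical equipment/was/inadequate/, failing/"
        "to/promptly detect/and/eliminate/potential faults/./")
tokens = tokens_from_segmented(row3)
print(tokens)
```

Multi-word segments such as "mechanical equipment" are kept whole, so domain terms from the word bank survive as single feature items.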

Table 2. The calculation result of feature items selection.

No.	Method	Threshold	Number of High-Frequency Words	Assessment
1	High-frequency word definition formula	T = 36	28	There might be omissions.
2	TF	Cumulative TF value ≥ 90%	1216	There might be redundant items.
3	TF–H	Cumulative TF–H value ≥ 90%	204	Relatively reasonable.
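The cumulative thresholds in Table 2 can be read as follows: rank the candidate terms by weight and keep the shortest leading run whose cumulative share reaches 90%. The Python sketch below follows that reading, which is an assumption about the selection rule rather than a restatement of it. The term weights are hypothetical toy values, and `select_by_cumulative_share` is a name introduced here.

```python
def select_by_cumulative_share(weights, share=0.90):
    """Keep top-ranked terms until their cumulative weight share reaches `share`."""
    ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)
    total = sum(weight for _, weight in ranked)
    if total <= 0:
        return []
    selected, running = [], 0.0
    for term, weight in ranked:
        selected.append(term)
        running += weight
        if running / total >= share:
            break  # threshold reached; remaining terms are low-weight tail
    return selected


# Hypothetical weights (not values from the study).
weights = {
    "geological conditions": 50.0,
    "water gushing": 25.0,
    "collapse": 15.0,
    "noise": 6.0,
    "lighting": 4.0,
}
selected = select_by_cumulative_share(weights)
print(selected)
```

Applied to raw TF, such a rule keeps a long tail of frequent but uninformative words; applied to an entropy-weighted score, the same cut-off yields a much shorter list, which matches the contrast between rows 2 and 3 of Table 2.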

Table 3. TF–H values greater than 80, based on the high-frequency vocabulary list.

No.	Characteristic Item	TF–H	No.	Characteristic Item	TF–H
1	Geological conditions	996.21	21	Safety investment	194.92
2	Risk of water gushing	835.69	22	Safety awareness	184.56
3	Collapse accident	596.75	23	Protective measures	177.32
4	Shield machine failure	553.33	24	Environmental impact	171.15
5	Construction plan	505.55	25	Laws and regulations	155.12
6	Safety management	466.67	26	Quality control	152.21
7	Emergency response	426.37	27	Communication and coordination	147.79
8	Risk assessment	411.12	28	Construction machinery	144.42
9	Early warning system	388.53	29	Resource allocation	143.36
10	Safety training	365.89	30	Formulation of contingency plans	132.21
11	Monitoring and surveillance	355.46	31	Earthquake safety	111.56
12	Equipment maintenance	332.67	32	Construction noise	105.63
13	Construction environment	311.53	33	Dust pollution	100.87
14	Underground pipeline	294.45	34	Lighting conditions	95.34
15	Casualties	269.04	35	Energy consumption	94.28
16	Structural stability	251.54	36	Safe distance	91.11
17	Construction progress	244.48	37	Tunnel ventilation	88.86
18	Material quality	236.89	38	Foundation treatment	88.21
19	Construction technology	212.74	39	Soil improvement	85.34
20	Management system	202.18	40	Support structure	81.87

Table 4. The initial set of safety risk factors for urban drainage deep tunnel shield construction.

No.	High-Frequency Words	Risk Factor	TF–H	Interpretation of Risk Factors
R1	Geological conditions	The geological and hydrological conditions are poor.	996.21	The natural conditions, such as soil, rocks, and hydrology, at the construction site have affected the safety and stability of shield construction.
R2	Risk of water gushing	Water and sand gushing out of the tunnel	835.69	Groundwater suddenly rushed into the tunnel, causing waterlogging in the tunnel and interruption of construction.
R3	Collapse accident	Surface subsidence and collapse	596.75	Sudden subsidence of the ground or tunnel structure endangers construction safety and surrounding buildings.
R4	Shield machine failure	The shield machine equipment malfunctioned.	553.33	Mechanical failures that occur during the construction process of shield machines affect the construction progress and safety.
R5	Construction plan	The construction site was poorly organized.	505.55	The specific plans and methods of shield construction directly affect construction safety.
R6	Safety management	The investigation of potential safety hazards was inadequate.	466.67	The safety management system of the shield construction site, including safety regulations and rules, safety training, etc.
R7	Monitoring and surveillance	Inadequate monitoring	355.46	Continuously observe and record the process and results of shield tunneling construction to improve the construction quality.
R8	Equipment maintenance	The correction of the shield machine’s advancement was not timely.	332.67	Regular inspection and maintenance of shield construction equipment should be carried out.
R9	Underground pipeline	Damage to underground pipelines	294.45	The impact and protection of underground pipelines during shield construction
R10	Material quality	Damaged segments	236.89	The quality standards and usage conditions of materials used in shield construction
R11	Safety awareness	The safety awareness of construction workers is insufficient.	184.56	The awareness and emphasis of shield construction personnel on construction safety
R12	Protective measures	The safety protection of construction workers is insufficient.	177.32	Specific protective measures taken to prevent accidents from happening
R13	Quality control	The installation accuracy of the starting base and rails is not high.	152.21	Measures and procedures to ensure that the construction quality of the base meets the standards
R14	Tunnel ventilation	The exhaust equipment is not set up reasonably.	88.86	The ventilation situation inside the tunnel during the tunneling process
R15	Soil improvement	Improper reinforcement of the soil at the cave entrance	85.34	Improve the properties of soil by physical, chemical, or biological methods to enhance its bearing capacity.
R16	Support structure	The installation accuracy of the reaction frame is not high.	81.87	Structures used to support soil in shield construction, such as steel pipes, wooden braces, etc.
R17	Operating procedures	The tunneling parameters are set improperly.	79.21	Operating procedures for shield machines
R18	Base	The base is damaged.	76.53	The installation status of the shield machine base
R19	Construction waste soil	The efficiency of construction waste transportation is low.	67.75	The soil generated during the shield machine’s tunneling process and its transportation conditions
R20	Segment	The installation accuracy of the negative ring tube sheet is not high.	55.56	The installation status of segments during the shield machine’s tunneling process
R21	Assembly	The segments were assembled improperly.	50.21	The installation status of segments during the shield machine’s tunneling process
R22	Soil stability	The cave entrance is unstable.	43.33	The stability of the soil during shield construction
R23	Collapse	The starting well collapsed.	43.31	Sudden subsidence of the ground or tunnel structure
R24	Grouting	Improper grouting control	32.54	The grouting situation of the tunnel during the shield tunneling stage
R25	Shield cutting tool	The cutter head and cutting tools have worn out and failed.	25.69	Cutting tools for shield machines
R26	Ground uplift	The reinforcement of the working face is insufficient.	21.23	The surface rises due to underground construction.
R27	Axis	Receiving axis deviation	16.75	Whether the tunnel control axis meets the requirements
R28	Hoisting	Improper hoisting of the shield machine equipment	14.32	Hoisting of mechanical equipment during construction
R29	Seal	The sealing of the opening is not in place.	11.24	Whether the sealing condition of the opening meets the requirements
R30	Liquefaction	Soil loss at the cave entrance	10.54	The soil loses stability due to the rise of the groundwater level.
R31	Cave gate	The process of chiseling the cave door was unreasonable.	9.34	Construction procedures and quality of door openings
R32	Split	The separation of the shield machine equipment is not in place.	9.25	Shield machine equipment separation
R33	Receiving well	The receiving well collapsed.	7.69	The construction quality of the receiving well at the arrival stage
R34	Receiving base	The installation accuracy of the receiving base is not high.	7.26	The construction quality of the receiving base at the arrival stage

Table 5. List of safety risk factors for urban drainage deep tunnel shield construction.

No.	Risk Type	Risk Factor	TF–H
1	Personnel management	The construction site was poorly organized.	505.55
2		The investigation of potential safety hazards was inadequate.	466.67
3		Inadequate monitoring	355.46
4		The safety awareness of construction workers is insufficient.	184.56
5		The safety protection of construction workers is insufficient.	177.32
6	Mechanical equipment and materials	The shield machine equipment malfunctioned.	553.33
7		Damaged segments	236.89
8		The exhaust equipment is not set up reasonably.	88.86
9		The tunneling parameters are set improperly.	79.21
10		The base is damaged.	76.53
11		The cutter head and cutting tools have worn out and failed.	25.69
12	Construction technology	Water and sand gushing out of the tunnel	835.69
13		The correction of the shield machine’s advancement was not timely.	332.67
14		The installation accuracy of the starting base is not high.	152.21
15		Improper reinforcement of the soil at the cave entrance	85.34
16		The installation accuracy of the reaction frame is not high.	81.87
17		The efficiency of construction waste transportation is low.	67.75
18		The installation accuracy of the negative ring tube sheet is not high.	55.56
19		The segments were assembled improperly.	50.21
20		The starting well collapsed.	43.31
21		Improper grouting control	32.54
22		The reinforcement of the working face is insufficient.	21.23
23		Receiving axis deviation	16.75
24		Improper hoisting of the shield machine equipment	14.32
25		The sealing of the opening is not in place.	11.24
26		The process of chiseling the cave door was unreasonable.	9.34
27		The separation of the shield machine equipment is not in place.	9.25
28		The receiving well collapsed.	7.69
29		The installation accuracy of the receiving base is not high.	7.26
30	Surrounding environment	The geological and hydrological conditions are poor.	996.21
31		Surface subsidence and collapse	596.75
32		Damage to underground pipelines	294.45
33		The cave entrance is unstable.	43.33
34		Soil loss at the cave entrance	10.54

Table 6. The calculation results of the evaluation indicators for risk identification effectiveness.

Method	Accuracy	Precision	Recall	F1–Score
TF–H	0.82	0.83	0.86	0.83
TF–IDF	0.73	0.71	0.74	0.78
LDA	0.76	0.78	0.81	0.75
BERT	0.71	0.79	0.82	0.81

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
